Although the tree built from the MGD matrix provides an approximate clustering,
it is not sufficiently accurate in a biological sense
(fig. ).
Here, I propose a new definition of distance based on the frequencies of amino acids. The distance between sequences a and b, d(a,b), is defined as follows:
For example:
make_freq_matrix mhc.seq
generates the Amino acid Composition Distance (ACD) matrix for the myosin heavy chain genes and writes it to standard output. Using this distance, the whole set of myosin heavy chain data in PIR could be clustered as shown in fig. 1.3. This categorization is quite good in a biological sense. Fig. 1.3 can be regarded as a guide tree obtained without multiple alignment.
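To make the idea of a composition-based distance concrete, the following is a minimal sketch in Python. It assumes the Euclidean distance between amino acid frequency vectors; this metric is an assumption chosen for illustration, and the ACD definition used by make_freq_matrix may differ. The function names (`composition`, `composition_distance`, `distance_matrix`) are hypothetical and are not part of make_freq_matrix.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the relative frequency of each standard amino acid in seq."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No recognizable residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def composition_distance(a, b):
    """Euclidean distance between two composition vectors.

    ASSUMPTION: the Euclidean metric is illustrative only; it is not
    necessarily the ACD formula.
    """
    fa, fb = composition(a), composition(b)
    return sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))


def distance_matrix(seqs):
    """Return the full symmetric matrix of pairwise composition distances."""
    n = len(seqs)
    return [[composition_distance(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]
```

Because such a distance depends only on residue counts, two sequences with the same composition have distance zero regardless of residue order, and no alignment is needed to compute the matrix.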