
Tree of amino acid composition

Although the tree built from the MGD matrix provides an approximate clustering, it is not sufficiently accurate in a biological sense (fig. [*]).

Here, I propose a new definition of distance based on amino acid frequencies. The distance between sequences a and b, d(a,b), is defined as follows:


 \begin{displaymath}d(a,b) = \log(\sum_{i=1}^{20} ({a_i} - {b_i})^2 + 1),
\end{displaymath} (1.10)

where $a_i$ and $b_i$ are the frequencies of amino acid $i$ in sequences a and b, respectively. Note that $\sum_{i=1}^{20} ({a_i} - {b_i})^2$ in equation (1.10) is the squared Euclidean distance between the two vectors whose elements are the amino acid frequencies.
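
As a small worked illustration of equation (1.10), consider two toy sequences that are not taken from the data set: a = AAGG and b = AAAG. Two assumptions are made here that the text does not state: the frequencies are relative frequencies (fractions summing to 1), and the logarithm is the natural logarithm. Then $a_A = 0.5$, $a_G = 0.5$, $b_A = 0.75$, $b_G = 0.25$, and all other frequencies are zero, so

\begin{displaymath}d(a,b) = \log\left((0.5 - 0.75)^2 + (0.5 - 0.25)^2 + 1\right) = \log(1.125) \approx 0.118.
\end{displaymath}

Under the same assumptions, two sequences with identical composition give $d = \log(1) = 0$, which is the role of the added 1, and since the squared distance between two relative-frequency vectors is at most 2, $d$ never exceeds $\log 3$.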

For example:



make_freq_matrix mhc.seq



generates the Amino acid Composition Distance (ACD) matrix for the myosin heavy chain genes and writes it to standard output. Using this distance, the whole myosin heavy chain data set in PIR could be clustered as shown in fig. 1.3. This categorization is quite good in a biological sense. The tree in fig. 1.3 can be regarded as a guide tree obtained without multiple alignment.
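
The computation behind an ACD matrix can be sketched in a few lines of Python. This is only an illustration of equation (1.10), not the implementation of make_freq_matrix: the function names (composition, acd, acd_matrix) and the toy sequences are hypothetical, and the sketch assumes relative frequencies, the natural logarithm, and that characters other than the 20 standard residues are ignored.

```python
import math
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the relative frequency of each standard amino acid in seq.

    Assumption: characters outside the 20 standard residues are skipped.
    """
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def acd(seq_a, seq_b):
    """Distance of equation (1.10); natural logarithm assumed."""
    freq_a = composition(seq_a)
    freq_b = composition(seq_b)
    squared = sum((x - y) ** 2 for x, y in zip(freq_a, freq_b))
    return math.log(squared + 1.0)


def acd_matrix(seqs):
    """Return the symmetric matrix of pairwise distances as nested lists."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = acd(seqs[i], seqs[j])
    return matrix


if __name__ == "__main__":
    # Toy sequences for illustration only; not real myosin data.
    toy = {"s1": "AAGG", "s2": "AAAG", "s3": "KKLL"}
    names = list(toy)
    for name, row in zip(names, acd_matrix([toy[n] for n in names])):
        print(name, " ".join(f"{value:.4f}" for value in row))
```

Because the distance depends only on composition, no alignment is needed to fill the matrix, and the result can be passed to any distance-based tree-building method.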


  
Figure 1.3: A guide tree without multiple alignment. This tree was reconstructed using amino acid composition.


Satoshi OOta
1999-03-06