I developed a new method for clustering sequence data that uses differences in sequence length as a distance measure. Specifically, the distance between sequences i and j is defined by
where li and lj are the lengths (numbers of sites) of the ith and jth sequences, respectively. When sequences i and j have identical lengths, dij=0. When the lengths of the two sequences differ, it is assumed that deletion and/or insertion events occurred during their divergence. The difference between their lengths is therefore a lower limit on the number of deletion or insertion events. This distance is expected to reflect, at least roughly, the "true" distance matrix (substitution matrix) that is obtained after multiple alignment is performed. Of course, if both deletions and insertions occurred in the sequences, the number of deletion/insertion events is not equal to the difference in sequence lengths. Even so, these distances carry some information. We take the logarithm of the minimum number of deletion or insertion events in equation (1.9) because the log-transformation produced distinct clusters (at least for the myosin heavy chain family). We call this distance the Minimum Gap Distance (MGD).

Each group of sequences classified in this way was aligned, and the groups were then joined using the profile option of ClustalW. Finally, these results were combined sequentially and multiply aligned in an iterative manner until no further improvement was found. After reconstruction of the trees, we classified the sequences in terms of the tissue classes defined above.
The program make_gap_matrix constructs an MGD matrix, which is used to make a guide tree. For example:
make_gap_matrix mhc.seq
generates the Minimum Gap Distance (MGD) matrix and writes it to standard output.
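
As an illustration of what an MGD matrix contains, the following is a minimal Python sketch. It is not the source of make_gap_matrix. Because equation (1.9) is not reproduced in this text, the sketch assumes the distance is the natural logarithm of the absolute length difference, with the distance set to 0 when the two lengths are equal; the log base and the exact form are assumptions. The function name mgd_matrix and the example sequences are hypothetical.

```python
import math


def mgd_matrix(lengths):
    """Return a symmetric matrix of (assumed) Minimum Gap Distances.

    Assumption: d_ij = ln|l_i - l_j| when the lengths differ and
    d_ij = 0 when they are equal. The original equation (1.9) may differ.
    """
    n = len(lengths)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = abs(lengths[i] - lengths[j])
            # Equal lengths give distance 0; otherwise take the log
            # of the minimum number of inserted or deleted sites.
            d = math.log(diff) if diff > 0 else 0.0
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical sequences, used only to show the shape of the result.
seqs = ["MKT", "MKTAY", "MKTAYIAKQR"]
for row in mgd_matrix([len(s) for s in seqs]):
    print(" ".join(f"{d:.3f}" for d in row))
```

Only the sequence lengths enter the calculation, so the matrix can be built before any alignment is available, which is what makes it usable for a first grouping of the sequences.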