In sequence analyses, the following database operations are often required.
Sequence databases are usually provided as ``flat files'', that is, ordinary
text data files. For efficient database operations, structured files
in which records, fields, and lines are distinguished are preferable.
In DeepForest, database files are treated as nested lists such
as:
Database = [Record|Remained_Database],
Record = [Field|Remained_Record],
Field = [Line|Remained_Field].
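The nesting can be pictured with ordinary lists. The following is a minimal Python sketch, not DeepForest code: the variable names are illustrative, and the two small records reuse lines from the actin example in this section.

```python
# A line is a string, a field is a list of lines,
# a record is a list of fields, and a database is a list of records.
record_1 = [
    ["ENTRY ATAX #type complete"],
    ["TITLE actin - Acanthamoeba castellanii"],
    ["KEYWORDS methylated amino acid"],
]
record_2 = [
    ["ENTRY ATAXE #type complete"],
    ["TITLE actin - Entamoeba histolytica"],
    ["KEYWORDS cell motility; cytoskeleton; methylated amino acid"],
]
database = [record_1, record_2]

# Head/tail decomposition, mirroring Database = [Record|Remained_Database].
record, remained_database = database[0], database[1:]
field, remained_record = record[0], record[1:]
line, remained_field = field[0], field[1:]

print(line)                  # first line of the first field of the first record
print(len(remained_record))  # number of fields left after the first one
```

Each level is decomposed in the same way, so one traversal pattern serves for records, fields, and lines.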
The program make_records converts a flat file into a structured data file
composed of records, using ``//'' as the record delimiter. The program
extract_fields extracts all fields specified by the user. For example:
make_records actin.dat actin.pl
generates a structured actin database file (actin.pl) organized into records.
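The record-splitting step can be sketched as follows. This is an illustration in Python under stated assumptions, not the make_records program itself: the function name split_records is hypothetical, the delimiter ``//'' is assumed to stand alone on its line, and each record is kept as a plain list of lines.

```python
def split_records(flat_text):
    """Split flat-file text into records at lines consisting of '//'.

    Each record is returned as a list of its lines; the delimiter
    lines themselves are dropped.
    """
    records, current = [], []
    for line in flat_text.splitlines():
        if line.strip() == "//":
            if current:
                records.append(current)
            current = []
        else:
            current.append(line)
    if current:  # a final record that lacks a trailing delimiter
        records.append(current)
    return records


sample = """ENTRY ATAX #type complete
TITLE actin - Acanthamoeba castellanii
//
ENTRY ATAXE #type complete
TITLE actin - Entamoeba histolytica
//
"""
records = split_records(sample)
print(len(records))
```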
extract_fields actin.pl actin.entry.pl ENTRY TITLE
extracts the ENTRY and TITLE fields from the actin database file as follows:
[["ENTRY ATAX #type complete"],["TITLE actin - Acanthamoeba castellanii"]].
[["ENTRY ATAX #type complete"],["TITLE actin - Entamoeba histolytica"]].
[["ENTRY ATBO #type complete"],["TITLE actin beta - bovine (tentative sequence)"]].
[["ENTRY ATBO #type complete"],["TITLE actin gamma - bovine (tentative sequence)"]].
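Field extraction can be sketched in the same spirit. The Python below is a hedged illustration rather than the extract_fields program: the names group_fields and select_fields are hypothetical, and it assumes that a line beginning with a non-blank character opens a new field whose name is its first word, while indented lines continue the current field.

```python
def group_fields(record_lines):
    """Group a record's lines into fields (each field is a list of lines).

    Assumption: a line starting with a non-blank character opens a new
    field; indented or empty lines continue the current field.
    """
    fields = []
    for line in record_lines:
        if line[:1].strip() or not fields:
            fields.append([line])
        else:
            fields[-1].append(line)
    return fields


def select_fields(record_lines, names):
    """Keep only the fields whose first word is one of `names`."""
    wanted = set(names)
    selected = []
    for field in group_fields(record_lines):
        words = field[0].split()
        if words and words[0] in wanted:
            selected.append(field)
    return selected


record = [
    "ENTRY ATAX #type complete",
    "TITLE actin - Acanthamoeba castellanii",
    "REFERENCE A92886",
    " #authors Nellen, W.; Gallwitz, D.",
    "KEYWORDS methylated amino acid",
]
print(select_fields(record, ["ENTRY", "TITLE"]))
```

In this sketch the selected fields come back in the order in which they occur in the record, not in the order in which the names were given.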
We can also specify more fields, such as:
extract_fields actin.pl actin.entry.pl ENTRY TITLE REFERENCE KEYWORDS
extracts the ENTRY, TITLE, REFERENCE, and KEYWORDS fields from the actin database file as follows:
[["ENTRY ATAX #type complete"],["TITLE actin - Acanthamoeba castellanii"],["REFERENCE A92886"," #authors Nellen, W.; Gallwitz, D."," #journal J. Mol. Biol. (1982) 159:1-18"," #title Actin genes and actin messenger RNA in Acanthamoeba"," castellanii. Nucleotide sequence of the split actin gene I."," #cross-references MUID:83033627"," #accession A92886"," ##molecule_type DNA"," ##residues 1-375 ##label NEL"," ##cross-references GB:J01016"],["KEYWORDS methylated amino acid"]].
[["ENTRY ATAXE #type complete"],["TITLE actin - Entamoeba histolytica"],["REFERENCE A29877"," #authors Edman, U.; Meza, I.; Agabian, N."," #journal Proc. Natl. Acad. Sci. U.S.A. (1987) 84:3024-3028"," #title Genomic and cDNA actin sequences from a virulent strain of"," Entamoeba histolytica."," #cross-references MUID:87204260"," #accession A29877"," ##molecule_type mRNA"," ##residues 1-376 ##label EDM"," ##experimental_source strain HM1:IMSS"],["KEYWORDS cell motility; cytoskeleton; methylated amino acid"]].
However, this representation is hard to read. The beautify command reformats the above file as follows:
[
["ENTRY ATAX #type complete"],
["TITLE actin - Acanthamoeba castellanii"],
["REFERENCE A92886"," #authors Nellen, W.; Gallwitz, D."," #journal J. Mol. Biol. (1982) 159:1-18"," #title Actin genes and actin messenger RNA in Acanthamoeba"," castellanii. Nucleotide sequence of the split actin gene I."," #cross-references MUID:83033627"," #accession A92886"," ##molecule_type DNA"," ##residues 1-375 ##label NEL"," ##cross-references GB:J01016"],
["KEYWORDS methylated amino acid"]].
[
["ENTRY ATAXE #type complete"],
["TITLE actin - Entamoeba histolytica"],
["REFERENCE A29877"," #authors Edman, U.; Meza, I.; Agabian, N."," #journal Proc. Natl. Acad. Sci. U.S.A. (1987) 84:3024-3028"," #title Genomic and cDNA actin sequences from a virulent strain of"," Entamoeba histolytica."," #cross-references MUID:87204260"," #accession A29877"," ##molecule_type mRNA"," ##residues 1-376 ##label EDM"," ##experimental_source strain HM1:IMSS"],
["KEYWORDS cell motility; cytoskeleton; methylated amino acid"]].
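A layout with one field per line can be produced by a small serializer. The Python below is only a sketch of the idea, not the beautify command: the function names are hypothetical, and the backslash escaping of quotes inside lines is an assumption about how such text would be quoted.

```python
def quote(line):
    """Double-quote a line, escaping backslashes and double quotes (assumed convention)."""
    return '"' + line.replace("\\", "\\\\").replace('"', '\\"') + '"'


def beautify_record(record):
    """Render one record as a term, with each field on its own line."""
    fields = ["[" + ",".join(quote(line) for line in field) + "]"
              for field in record]
    # Opening bracket alone on the first line; the closing bracket and the
    # terminating period follow the last field directly.
    return "[\n" + ",\n".join(fields) + "].\n"


record = [
    ["ENTRY ATAX #type complete"],
    ["TITLE actin - Acanthamoeba castellanii"],
    ["KEYWORDS methylated amino acid"],
]
print(beautify_record(record), end="")
```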
To extract sequences, first extract the ENTRY and SEQUENCE fields and then pass the result to make_seq:
extract_fields actin.pl actin.entry.pl ENTRY SEQUENCE
make_seq actin.entry.pl actin.seq
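The output format of make_seq is not shown here, so the following Python sketch only illustrates one plausible way to recover a sequence from the extracted fields. Everything in it is an assumption: the function name sequence_of is hypothetical, the SEQUENCE field is assumed to hold a header line followed by lines mixing position numbers with one-letter residue codes, and the toy record is made up for the example (it is not a real database entry).

```python
def field_name(field):
    """Return the first word of a field's first line (the field name)."""
    words = field[0].split() if field else []
    return words[0] if words else ""


def sequence_of(record):
    """Return (entry_id, residues), or None if ENTRY or SEQUENCE is missing."""
    fields = {field_name(f): f for f in record}
    if "ENTRY" not in fields or "SEQUENCE" not in fields:
        return None
    entry_words = fields["ENTRY"][0].split()
    entry_id = entry_words[1] if len(entry_words) > 1 else ""
    # Keep letters only: position numbers, ruler lines, and blanks are dropped.
    residues = "".join(ch
                       for line in fields["SEQUENCE"][1:]
                       for ch in line
                       if ch.isalpha())
    return entry_id, residues


# Made-up toy record, for illustration only.
record = [
    ["ENTRY DEMO1 #type complete"],
    ["SEQUENCE",
     "                5        10",
     "      1 M G D E V Q A L V I",
     "     11 D N G S"],
]
print(sequence_of(record))
```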
If we specify all fields that appear in a database (ENTRY, TITLE, ORGANISM, DATE, ACCESSIONS, REFERENCE, COMMENT, GENETICS, CLASSIFICATION, KEYWORDS, FEATURE, SUMMARY, and SEQUENCE), the whole flat file is structured as nested lists with corresponding records, fields, and lines. Once the database has been converted in this way, searching the whole database for particular fields becomes more efficient.
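As a rough illustration of such a search, the Python sketch below looks up records by the content of a named field. It is not part of DeepForest: the function names are hypothetical, and the field name is assumed to be the first word of the field's first line.

```python
def field_name(field):
    """Return the first word of a field's first line (the field name)."""
    words = field[0].split() if field else []
    return words[0] if words else ""


def search(database, name, text):
    """Return the records whose field `name` contains `text` in any line."""
    hits = []
    for record in database:
        for field in record:
            if field_name(field) == name and any(text in line for line in field):
                hits.append(record)
                break  # one matching field is enough for this record
    return hits


database = [
    [["ENTRY ATAX #type complete"],
     ["KEYWORDS methylated amino acid"]],
    [["ENTRY ATAXE #type complete"],
     ["KEYWORDS cell motility; cytoskeleton; methylated amino acid"]],
]

for record in search(database, "KEYWORDS", "cytoskeleton"):
    print(record[0][0])
```

Because fields are already separated, the search inspects only the lines of the requested field instead of scanning every line of every record.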