
Searching database

In sequence analyses, the following database operations are often required.

1. Searching for records in a database, using strings as queries.

2. Searching for fields in a database, using strings as queries.

Sequence databases are usually provided as ``flat files'', that is, ordinary text data files. For efficient database operation, structured files in which records, fields, and lines are distinguished are preferable. In DeepForest, database files are treated as nested lists such as:



Database = [Record|Remained_Database],
Record = [Field|Remained_Record], 
Field = [Line|Remained_Field].
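
As an illustration only, the same nesting can be mirrored with Python lists. This is a minimal sketch, not DeepForest code: the variable names are hypothetical, and the sample strings are copied from the output shown in this section.

```python
# A database is a list of records, a record is a list of fields,
# and a field is a list of lines (plain strings).
database = [
    [   # first record
        ["ENTRY      ATAX   #type complete"],
        ["TITLE      actin - Acanthamoeba castellanii"],
    ],
    [   # second record
        ["ENTRY      ATBO   #type complete"],
        ["TITLE      actin beta - bovine (tentative sequence)"],
    ],
]

# Head/tail decomposition, mirroring Database = [Record|Remained_Database].
record, *remained_database = database
field, *remained_record = record
line, *remained_field = field

print(line)                     # first line of the first field of the first record
print(len(remained_database))   # number of records left after taking the head
```

Each level is taken apart the same way, which is why one traversal pattern can reach records, fields, and lines.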



The program make_records converts a flat file into a structured data file composed of records, using ``//'' as the record delimiter. The program extract_fields extracts the fields specified by the user. For example:



 
make_records actin.dat actin.pl



generates a structured actin database file organized into records.
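
The record-splitting step can be sketched in Python as follows. This is a hedged illustration of the idea (group lines until a ``//'' line is seen), not the actual make_records program; the function below only borrows its name, and the handling of a missing final delimiter is an assumption.

```python
def make_records(lines):
    """Group flat-file lines into records, ending each record at a '//' line."""
    records, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "//":       # record delimiter
            if current:
                records.append(current)
            current = []
        else:
            current.append(line)
    if current:                        # assumption: keep a trailing record without '//'
        records.append(current)
    return records


flat = [
    "ENTRY      ATAX   #type complete",
    "TITLE      actin - Acanthamoeba castellanii",
    "//",
    "ENTRY      ATBO   #type complete",
    "TITLE      actin beta - bovine (tentative sequence)",
    "//",
]
records = make_records(flat)
print(len(records))
```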



 
extract_fields actin.pl actin.entry.pl ENTRY TITLE



extracts ENTRY and TITLE fields from the actin database file as follows:



 
[["ENTRY      ATAX   #type complete"],["TITLE      actin - Acanthamoeba castellanii"]].
[["ENTRY      ATAX   #type complete"],["TITLE      actin - Entamoeba histolytica"]].
[["ENTRY      ATBO   #type complete"],["TITLE      actin beta - bovine (tentative sequence)"]].
[["ENTRY      ATBO   #type complete"],["TITLE      actin gamma - bovine (tentative sequence)"]].
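
A minimal Python sketch of field extraction is given below. It assumes that a field starts on an unindented line whose first word is the field name and that indented lines continue the current field. That layout is inferred from the output in this section; it is not taken from the extract_fields source, and both function names are hypothetical.

```python
def split_fields(record_lines):
    """Split one record (a list of lines) into fields (lists of lines)."""
    fields = []
    for line in record_lines:
        if line[:1].isspace() and fields:   # indented line continues the current field
            fields[-1].append(line)
        else:                               # unindented line starts a new field
            fields.append([line])
    return fields


def extract_fields(record_lines, names):
    """Keep only the fields whose first word is one of the requested names."""
    wanted = set(names)
    return [f for f in split_fields(record_lines)
            if f[0].split() and f[0].split()[0] in wanted]


record = [
    "ENTRY            ATAX       #type complete",
    "TITLE            actin - Acanthamoeba castellanii",
    "REFERENCE        A92886",
    "   #authors      Nellen, W.; Gallwitz, D.",
    "KEYWORDS         methylated amino acid",
]
print(extract_fields(record, ["ENTRY", "REFERENCE"]))
```

The fields come back in the order in which they occur in the record, not in the order in which they were requested.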



We can also specify more fields, such as:



 
extract_fields actin.pl actin.entry.pl ENTRY TITLE REFERENCE KEYWORDS



extracts ENTRY, TITLE, REFERENCE, and KEYWORDS fields from the actin database file as follows:



 
[["ENTRY            ATAX       #type complete"],["TITLE
actin - Acanthamoeba castellanii"],["REFERENCE        A92886","
#authors      Nellen, W.; Gallwitz, D.","   #journal
J. Mol. Biol. (1982) 159:1-18","   #title        Actin genes and actin
messenger RNA in Acanthamoeba","
castellanii. Nucleotide sequence of the split actin gene I.","
#cross-references MUID:83033627","   #accession    A92886","
##molecule_type DNA","      ##residues      1-375 ##label NEL","
##cross-references GB:J01016"],["KEYWORDS         methylated amino
acid"]].
[["ENTRY            ATAXE      #type complete"],["TITLE
actin - Entamoeba histolytica"],["REFERENCE        A29877","
#authors      Edman, U.; Meza, I.; Agabian, N.","   #journal
Proc. Natl. Acad. Sci. U.S.A. (1987) 84:3024-3028","   #title
Genomic and cDNA actin sequences from a virulent strain of","
Entamoeba histolytica.","   #cross-references MUID:87204260","
#accession    A29877","      ##molecule_type mRNA","      ##residues
1-376 ##label EDM","      ##experimental_source strain
HM1:IMSS"],["KEYWORDS         cell motility; cytoskeleton; methylated
amino acid"]].



However, this representation is hard to read. The beautify command reformats the above file as follows:



 
[
["ENTRY            ATAX       #type complete"],
["TITLE            actin - Acanthamoeba castellanii"],
["REFERENCE        A92886","   #authors      Nellen, W.; Gallwitz,
D.","   #journal      J. Mol. Biol. (1982) 159:1-18","   #title
Actin genes and actin messenger RNA in Acanthamoeba","
castellanii. Nucleotide sequence of the split actin gene I.","
#cross-references MUID:83033627","   #accession    A92886","
##molecule_type DNA","      ##residues      1-375 ##label NEL","
##cross-references GB:J01016"],
["KEYWORDS         methylated amino acid"]].
[
["ENTRY            ATAXE      #type complete"],
["TITLE            actin - Entamoeba histolytica"],
["REFERENCE        A29877","   #authors      Edman, U.; Meza, I.;
Agabian, N.","   #journal      Proc. Natl. Acad. Sci. U.S.A. (1987)
84:3024-3028","   #title        Genomic and cDNA actin sequences from
a virulent strain of","                   Entamoeba histolytica.","
#cross-references MUID:87204260","   #accession    A29877","
##molecule_type mRNA","      ##residues      1-376 ##label EDM","
##experimental_source strain HM1:IMSS"],
["KEYWORDS         cell motility; cytoskeleton; methylated amino acid"]].
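
The layout rule visible above (an opening bracket on its own line, one field per line, and a closing ``].'') can be sketched in Python as follows. This is an illustration only: the function borrows the name of the beautify command but is not its implementation, and quoting the lines with json.dumps is an assumption that only approximates the quoting rules of the real file format.

```python
import json


def beautify(record):
    """Render one record with each field on its own line, ending with '].'."""
    field_texts = [
        "[" + ",".join(json.dumps(line, ensure_ascii=False) for line in field) + "]"
        for field in record
    ]
    return "[\n" + ",\n".join(field_texts) + "]."


record = [
    ["ENTRY            ATAX       #type complete"],
    ["TITLE            actin - Acanthamoeba castellanii"],
    ["KEYWORDS         methylated amino acid"],
]
print(beautify(record))
```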



To extract sequences, specify the SEQUENCE field:



extract_fields actin.pl actin.entry.pl ENTRY SEQUENCE



The make_seq command then converts the output of the above command into FASTA format:



 
make_seq actin.entry.pl actin.seq
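
A rough Python sketch of such a conversion is given below. It rests on several assumptions that this section does not confirm: that each record holds an ENTRY field and a SEQUENCE field, that the entry identifier is the second word of the ENTRY line, and that every letter on the lines after the SEQUENCE keyword line is a residue, with digits and spaces serving only as position markers. The function is hypothetical, and the residues in the example are toy data, not a real actin sequence.

```python
def make_seq(record, width=60):
    """Turn one [ENTRY, SEQUENCE] record into a FASTA entry (assumed layout)."""
    fields = {field[0].split()[0]: field for field in record}
    entry_id = fields["ENTRY"][0].split()[1]
    # Assumption: letters are residues; digits and blanks are layout only.
    residues = "".join(
        ch for line in fields["SEQUENCE"][1:] for ch in line if ch.isalpha()
    )
    chunks = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + entry_id + "\n" + "\n".join(chunks)


record = [
    ["ENTRY      ATAX   #type complete"],
    ["SEQUENCE",
     "                5        10",
     "      1 A C D E F G H I K L"],   # toy residues for illustration
]
print(make_seq(record))
```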



If we specify all fields that appear in a database (ENTRY, TITLE, ORGANISM, DATE, ACCESSIONS, REFERENCE, COMMENT, GENETICS, CLASSIFICATION, KEYWORDS, FEATURE, SUMMARY, and SEQUENCE), the whole flat file is structured as nested lists with corresponding records, fields, and lines. Once the database has been converted in this way, searching specific fields across the whole database becomes more efficient.
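
The two search operations listed at the start of this section can be sketched over the nested-list form as follows. This is a minimal Python illustration under the same layout assumptions as the output above (the first word of a field's first line is the field name). The function names are hypothetical, and the sketch is a plain linear scan, so it makes no claim about how DeepForest implements or optimizes searching.

```python
def search_records(database, query):
    """Return every record in which any line of any field contains the query."""
    return [record for record in database
            if any(query in line for field in record for line in field)]


def search_fields(database, field_name, query):
    """Return the fields with the given name that contain the query string."""
    return [field for record in database for field in record
            if field[0].split()[:1] == [field_name]
            and any(query in line for line in field)]


database = [
    [["ENTRY      ATAX   #type complete"],
     ["TITLE      actin - Acanthamoeba castellanii"]],
    [["ENTRY      ATBO   #type complete"],
     ["TITLE      actin beta - bovine (tentative sequence)"]],
]
print(search_records(database, "bovine"))
print(search_fields(database, "TITLE", "actin"))
```

Because fields are already separated, a field search only has to look at fields with the requested name instead of scanning every line of the flat file.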


Satoshi OOta
1999-03-06