Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!uunet!ig!daemon
From: daemon@ig.UUCP
Newsgroups: bionet.molbio.news
Subject: CSLG|COMMENTARY: From Ellis Golub (6)
Message-ID: <4265@ig.ig.com>
Date: Tue, 1-Dec-87 14:47:24 EST
Article-I.D.: ig.4265
Posted: Tue Dec  1 14:47:24 1987
Date-Received: Sat, 5-Dec-87 13:19:27 EST
Sender: daemon@presto.ig.com
Lines: 30

From: Sunil Maulik 

         Computer Applications in the Sequencing of Large Genomes

    The current DNA and protein databases are unidimensional and 
idiosyncratic. Future databases should be relational so that each sequence 
is linked to: 1) biologically related DNA sequences, 2) physically related 
DNA (map coordinates perhaps) and 3) derived sequence and other data 
including protein sequences, regulatory elements and other features. 
Moreover, these linkages will have to be standardized and pre-indexed, so 
that each database query does not have to begin from scratch. By heavily 
coding and indexing the data, search times can be brought to manageable 
limits, and relational patterns amongst sequences will become evident. In 
addition, considerable thought must be given to the mechanism of coding 
"features" associated with sequence data. The present method of including 
comments as keys to the structure of the sequence, as well as to the 
location of functional sites and chemical modifications, is not optimal for 
rapid searching and relational indexing. Two alternative schemes seem worth 
considering: 1) an "obligate" feature table for all sequences with a 
defined data structure which can be compactly coded (ASCII text tags waste 
space and time) and rapidly analyzed, or 2) sequence punctuation in the 
form of extended character sets or parenthetical signals. Sequence 
annotation schemes should be evaluated on the generality, openness and 
utility of the method, rather than on parochial considerations involving 
current methods. In the gigabase future, there is 
no place for random comments and ad hoc structure definitions. We must 
choose rational and utilitarian methods for sequence data storage and 
management or face the prospect of a modern Tower of Babel. 
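
To make the relational proposal concrete, here is a minimal sketch of sequences joined by pre-indexed links. Everything in it is an assumption for illustration: the table names (`sequence`, `link`), the relation labels, the helper `related`, and the toy records are hypothetical and do not describe any existing database.

```python
import sqlite3

def build_db():
    """Create an in-memory database with a sequence table and a link table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sequence (
            id       INTEGER PRIMARY KEY,
            name     TEXT NOT NULL,
            kind     TEXT NOT NULL,   -- 'dna' or 'protein'
            residues TEXT NOT NULL
        );
        CREATE TABLE link (
            src_id   INTEGER NOT NULL REFERENCES sequence(id),
            dst_id   INTEGER NOT NULL REFERENCES sequence(id),
            relation TEXT NOT NULL    -- hypothetical labels, e.g. 'homolog'
        );
        -- Pre-indexing the links so a query need not scan every row.
        CREATE INDEX link_by_src ON link (src_id, relation);
    """)
    return conn

def related(conn, seq_id, relation):
    """Return names of sequences linked to seq_id by the given relation."""
    rows = conn.execute(
        "SELECT s.name FROM link AS l "
        "JOIN sequence AS s ON s.id = l.dst_id "
        "WHERE l.src_id = ? AND l.relation = ? ORDER BY s.name",
        (seq_id, relation),
    )
    return [name for (name,) in rows]

conn = build_db()
# Toy records, invented purely for this example.
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?)", [
    (1, "geneA", "dna", "ATGGCCTAA"),
    (2, "geneA_homolog", "dna", "ATGGCTTAA"),
    (3, "geneA_product", "protein", "MA"),
])
conn.executemany("INSERT INTO link VALUES (?, ?, ?)", [
    (1, 2, "homolog"),       # biologically related DNA
    (1, 3, "translation"),   # derived protein sequence
])
print(related(conn, 1, "homolog"))
```

Map coordinates and regulatory elements could be added as further relation types or tables in the same way; the point of the sketch is only that the linkage is stored and indexed once rather than recomputed for each query.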

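The first scheme, an "obligate" feature table with a defined and compactly coded structure, might look roughly like the following. This is only a sketch under stated assumptions: the feature codes, the 9-byte record layout (one-byte type code, two four-byte coordinates), and the function names are hypothetical choices, not a proposed standard.

```python
import struct

# Hypothetical numeric codes standing in for ASCII text tags.
FEATURE_CODES = {"promoter": 1, "exon": 2, "methylation_site": 3}
CODE_NAMES = {code: name for name, code in FEATURE_CODES.items()}

# Assumed fixed-width record: 1-byte type code, 4-byte start, 4-byte end,
# big-endian with no padding (9 bytes per feature).
RECORD = struct.Struct(">BII")

def pack_features(features):
    """Encode (kind, start, end) tuples into one compact byte string."""
    return b"".join(
        RECORD.pack(FEATURE_CODES[kind], start, end)
        for kind, start, end in features
    )

def unpack_features(blob):
    """Decode a byte string produced by pack_features."""
    return [
        (CODE_NAMES[code], start, end)
        for code, start, end in RECORD.iter_unpack(blob)
    ]

# Toy feature table, invented for this example.
features = [
    ("promoter", 1, 35),
    ("exon", 120, 480),
    ("methylation_site", 300, 300),
]
blob = pack_features(features)
print(len(blob), unpack_features(blob) == features)
```

Because every record has the same width, the n-th feature can be located by offset arithmetic alone, without parsing free-text comments.
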
-------