Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!uunet!ig!daemon
From: daemon@ig.UUCP
Newsgroups: bionet.molbio.news
Subject: CSLG|COMMENTARY: From Ellis Golub (6)
Message-ID: <4265@ig.ig.com>
Date: Tue, 1-Dec-87 14:47:24 EST
Article-I.D.: ig.4265
Posted: Tue Dec 1 14:47:24 1987
Date-Received: Sat, 5-Dec-87 13:19:27 EST
Sender: daemon@presto.ig.com
Lines: 30

From: Sunil Maulik

Computer Applications in the Sequencing of Large Genomes

The current DNA and protein databases are unidimensional and idiosyncratic. Future databases should be relational so that each sequence is linked to: 1) biologically related DNA sequences, 2) physically related DNA (map coordinates perhaps) and 3) derived sequence and other data including protein sequences, regulatory elements and other features. Moreover, these linkages will have to be standardized and pre-indexed, so that each database query does not have to begin from scratch. By heavily coding and indexing the data, search times can be brought to manageable limits, and relational patterns amongst sequences will become evident.

In addition, considerable thought must be given to the mechanism of coding "features" associated with sequence data. The present method of including comments as keys to the structure of the sequence, as well as the location of functional sites and chemical modifications is not optimal for rapid searching and relational indexing. Two alternative schemes seem worth considering: 1) an "obligate" feature table for all sequences with a defined data structure which can be compactly coded (ASCII text tags waste space and time) and rapidly analyzed, or 2) sequence punctuation in the form of extended character sets or parenthetical signals. The criteria for evaluation of sequence annotation should be focussed on the generality, openness and utility of the method, rather than on parochial considerations involving current methods.
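To make the first scheme concrete, here is a minimal sketch of what a compactly coded "obligate" feature table might look like. It is only an illustration, not a format proposed in this post: the feature names, the integer codes, and the 9-byte record layout (1-byte code, 4-byte start, 4-byte end) are all assumptions chosen for the example.

```python
import struct

# Hypothetical feature vocabulary: small integer codes replace ASCII text tags.
FEATURE_CODES = {"promoter": 1, "exon": 2, "modified_base": 3}
CODE_NAMES = {code: name for name, code in FEATURE_CODES.items()}

# Assumed fixed-width record: 1-byte feature code, 4-byte start, 4-byte end,
# all unsigned and big-endian, with no padding (9 bytes per feature).
RECORD = struct.Struct(">BII")


def pack_features(features):
    """Encode (name, start, end) tuples as one contiguous byte string."""
    return b"".join(
        RECORD.pack(FEATURE_CODES[name], start, end)
        for name, start, end in features
    )


def unpack_features(blob):
    """Decode a packed feature table back into (name, start, end) tuples."""
    return [(CODE_NAMES[code], start, end)
            for code, start, end in RECORD.iter_unpack(blob)]


def find_features(blob, name):
    """Return the (start, end) of every feature of one type.

    Only integers are compared; no free-text comment is parsed.
    """
    wanted = FEATURE_CODES[name]
    return [(start, end)
            for code, start, end in RECORD.iter_unpack(blob)
            if code == wanted]


features = [("promoter", 1, 35), ("exon", 120, 480), ("exon", 900, 1310)]
table = pack_features(features)
exons = find_features(table, "exon")
```

Because every record has the same width, a search for one feature type is a linear scan over integers, and the same records could be sorted or indexed by code and position ahead of time so that queries do not begin from scratch.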

In the gigabase future, there is no place for random comments and ad hoc structure definitions. We must choose rational and utilitarian methods for sequence data storage and management or face the prospect of a modern Tower of Babel.

-------