Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Path: utzoo!utgpu!water!watmath!uunet!ig!daemon
From: daemon@ig.UUCP
Newsgroups: bionet.molbio.news
Subject: CSLG|COMMENTARY: From Ellis Golub (2)
Message-ID: <4260@ig.ig.com>
Date: Tue, 1-Dec-87 14:39:45 EST
Article-I.D.: ig.4260
Posted: Tue Dec  1 14:39:45 1987
Date-Received: Sat, 5-Dec-87 13:17:08 EST
Sender: daemon@presto.ig.com
Lines: 30

From: Sunil Maulik 

         Computer Applications in the Sequencing of Large Genomes

    At present, the GenBank database consists of approximately 14,000
sequences comprising ~15 Mb. Searching this entire database with IFIND
on BIONET is already impractical, and searches are often restricted to
subsets of the total database. For example, the mammalian and
unannotated sequences comprise about 7,000 entries totaling more than
6 Mb. A recent search of this segment of GenBank using a 1.6 kb probe
required approximately 3 hours of CPU time on the BIONET computer in
batch mode. The same program running on a VAX (about 5 times faster on
the Sieve of Eratosthenes benchmark) required about 45 minutes of CPU
time.
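
For readers unfamiliar with the benchmark: the Sieve of Eratosthenes
is a simple prime-marking loop. The sketch below, in Python purely for
illustration, shows the kind of computation such a benchmark times; it
is not the code either machine actually ran.

    # Sieve of Eratosthenes: cross off multiples of each prime.
    # Illustrative only; the actual benchmark code is not shown here.
    def sieve(n):
        is_prime = [True] * (n + 1)
        is_prime[0] = is_prime[1] = False
        for p in range(2, int(n ** 0.5) + 1):
            if is_prime[p]:
                for m in range(p * p, n + 1, p):
                    is_prime[m] = False
        return [p for p in range(2, n + 1) if is_prime[p]]

    print(len(sieve(10000)))    # 1229 primes below 10,000
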
Using the faster Lipman and Pearson algorithm, XFASTN on BIONET
searched 1.7 Mb in 1,565 sequences in 9 minutes, while another
implementation of the Wilbur and Lipman method on a VAX searched the
mammalian and unannotated lists in about 20 minutes. Because search
time depends strongly on probe size (smaller is faster) and on word
size (larger is faster), these searches were run with approximately
comparable parameters. It was also somewhat distressing that several
of these searches returned different lists of similar sequences.
Clearly, attempts to apply these techniques to the complete human
genome (~30 Gb; 2000 times larger than the current GenBank database)
will strain all available facilities beyond the breaking point. The
recent proposal to begin accumulating sequence data of this magnitude
poses a clear challenge to the molecular biology software community to
develop new and faster algorithms for new and faster hardware, in
order to provide tools capable of practical use of gigabase-scale
databases.
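
A rough extrapolation from the timings above makes the point concrete.
Assuming, optimistically, that search time scales linearly with
database size (a back-of-envelope assumption that ignores I/O and
memory limits, and so understates the problem):

    # Back-of-envelope scaling from the timings quoted above.
    genbank_mb = 15.0          # current GenBank, ~15 Mb
    genome_mb = 30.0 * 1000    # projected ~30 Gb target
    print("scale factor: %.0fx" % (genome_mb / genbank_mb))

    xfastn_rate = 1.7 / 9.0    # XFASTN: 1.7 Mb in 9 minutes
    minutes = genome_mb / xfastn_rate
    print("one probe search: ~%.0f days" % (minutes / 60 / 24))

On these assumptions, even XFASTN would need on the order of a hundred
CPU-days for a single probe search of a 30 Gb database, which
underlines the need for new algorithms and hardware.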

-------