Path: utzoo!utgpu!watmath!uunet!bionet!lanl.gov!cb%intron
From: cb%intron@LANL.GOV (Christian Burks)
Newsgroups: bionet.molbio.genbank
Subject: genbank update cycle and submitted data lag time
Message-ID: <8811201524.AA03358@intron.lanl.gov>
Date: 20 Nov 88 15:24:55 GMT
Sender: kristoff@NET.BIO.NET
Lines: 114

Dear Dr. Smith:

I gather that Dave Benton answered some of your questions; I've
made a stab at answering the others.

Thanks for your concern and interest.

Christian Burks

> Return-Path: 
> Received: from rutgers.edu by BIONET-20.BIO.NET with TCP; Sat 5 Nov 88 02:34:36-PST
> Received: by rutgers.edu (5.59/1.15) 
> 	id AA06289; Sat, 5 Nov 88 05:34:19 EST
> Received: by phri.phri (5.51/5.17)
> 	id AA11252; Fri, 4 Nov 88 14:05:43 EST
> Received: by alanine.phri (3.2/5.17)
> 	id AA24967; Fri, 4 Nov 88 14:07:05 EST
> Date: Fri, 4 Nov 88 14:07:05 EST
> From: phri!alanine.phri!roy@nyu.edu (Roy Smith)
> Message-Id: <8811041907.AA24967@alanine.phri>
> To: benton@bionet-20.bio.net, nucall@bionet-20.bio.net
> Subject: Re: Missing entries in GenBank (cooperation with EMBL)
> Cc: roy@alanine.phri
> 
> Dr. Benton,
> 
> 	Thank you for checking this out for me.  While I am glad to know it
> will be in the next release, I do wonder if a 10 month (end of January to
> start of December) lag between publication and entry into the data base is
> too long.  Granted, we're looking at a worst case because we just missed a
> release and thus incurred an extra 3-month delay, but even 7 months seems
> like a long time.  Keep in mind that this 7 or 10 month delay starts
> counting from the date the paper is published; given that most papers take
> 6 months or more from submission to when they hit the presses, you're
> talking over a year from the time a sequence is known to when it's on-line.
> 
> 	What is a typical amount of time between publication of a sequence
> in a journal and when it goes out on a GenBank tape?  What amount of time
> is considered "good" by the GenBank staff (i.e. the target delay, beyond
> which subscribers should feel justified in complaining)?
> 

Most data are now getting into a public release of GenBank within
3-5 months of receipt date at LANL.  The major exception is for
data received prior to publication which the author requests we
withhold from public release until some future date (e.g., date
of publication in a journal article); in this case the data are
queued into a public release as soon as that date is reached.

(This may still fall short of the ideal, but it should be compared
with 2 years ago when the average time from receipt at LANL to
appearance in the database was 12-14 months.  That was clearly
unacceptable, and we've put much effort and resources into
turning that around.)

When should a correspondent feel concerned enough to follow up?
At this point I would suggest that if more than one public
release has gone by since LANL received and acknowledged receipt
of the data (and if the author released the data for public
consumption at that point), they should definitely contact us.
If one submits data initially and doesn't receive an acknowledgement
within two weeks, that should be followed up on immediately.

> 	Are there any plans to have more frequent updates in-between the
> major quarterly releases?  I could envision once a week ftping the latest
> stuff from bionet-20 (or wherever the master copy is maintained, or perhaps
> there could be several repository sites around the country to reduce system
> load and network congestion).  These intermediate updates could be
> unannotated to get them out faster.  I could envision three levels an entry
> would go through.  First, as soon as possible, an unannotated entry made
> available for ftp.  Second, each time a quarterly tape goes out, all those
> entries in the ftp area which are still not yet fully ready would be put on
> the tape as part of the current unannotated section.  Lastly, when entries
> are fully annotated, checked, indexed, and otherwise massaged into their
> final form, they would be merged into the main data base.

There are many schemes (including the one you suggest) for getting
incremental data out earlier.  In fact, until a year ago we did
distribute an interim (six weeks) release that included only
"new" data.  This was dropped because very few people requested
subscriptions to it, and those that did admitted (with only 1-2
exceptions) that they didn't use it anyway.
Given the way that we were maintaining the data at that time, these
interim releases were very time consuming for us with little -- as
far as we could see -- benefit reaching the user community.

Over the next year, we will be shifting over to a data maintenance scheme
that will allow for the continuous or almost-continuous updating of
the database for internal maintenance.  We hope by the end of the
year to have established some reflection of this continuity in
the distributed data, perhaps even with weekly updates being
available in some distributed form.

> 
> 	The idea is to get the sequence data out as fast as possible to the
> scientists who want to see it.  From what I see, I classify GenBank (and
> the same comments go pretty much for Dayhoff and other similar databases)
> usage into two categories.  First is "I want to know the sequence of XXXX".
> This is straight-forward and if XXXX is not yet in the database, you find
> out fast.  If it's critical that you know about XXXX, you can always call
> the author or something like that.  The second one is the shot-in-the-dark
> search.  This latter one is where you really get killed by slow updates,
> because if you don't find something, you don't know what you missed.  These
> searches are often for sequence homologies; but people just as often say
> "give me all the erythromycin resistance genes" or something like that, for
> which the same comments about the dangers of slow updates apply.
> -------

We share these concerns, and although we've made great strides in
this regard over the past 18 months, we believe the release cycle time
will improve much further over the coming year.