Path: utzoo!attcan!uunet!ginosko!usc!ucsd!hub!tangello!dz
From: dz@tangello.ucsb.edu (Daniel James Zerkle)
Newsgroups: comp.sys.next
Subject: Re: NeXT Database Prowess
Keywords: indexing digital librarian
Message-ID: <2211@hub.UUCP>
Date: 15 Aug 89 05:33:00 GMT
References: <19350@vax5.CIT.CORNELL.EDU>
Sender: news@hub.UUCP
Reply-To: dz@cornu.ucsb.edu (Daniel James Zerkle)
Organization: University of California, Santa Barbara
Lines: 63

In article <19350@vax5.CIT.CORNELL.EDU> fqoj@vax5.cit.cornell.edu () writes:
> ...they'd
>like to have that database "NeXTized" or whatever the process is
>called. A similar situation is for a unit studying the works of Plato.
>What exactly is the process going on "under" the Shakespear icon, it
>can't be just a glorified fgrep. How does the cube, burdened with the
>unix file-system, get such good recall on that large database?

It uses a fairly straightforward implementation of inverted indices.
That is, keywords are sifted out from the original text, sorted, and
hashed. When the Digital Librarian looks for a word, it has three files
(set up previously) that are exceedingly fast to search, due to the way
they are arranged (hashed and sorted). Once a keyword is found there,
it references the individual files and the locations in the original
text. And actually, it is possible to turn off the indexes and use
fgrep, which is necessary to search for certain sophisticated patterns
(parts of words) that the indexes can't handle.

This is similar to the REFER database system already implemented on any
Berkeley system (and maybe System V; I don't know). It is a bit more
sophisticated, as there are systems for indexing many different kinds
of files, and more information about the matched objects is available
after a key is found and before the original text is looked up.

>Is there a way Cornell could send the disk data to NeXT, or even a third
>party, and have them put the data on an OD with the proper
>cross-indexing?

Not necessary. Just drag the folder (i.e. directory) from the directory
browser to an empty icon well in the Digital Librarian. You can index
the files from a menu selection (I forget which), but be careful, as DL
has a bug that makes it think it has an indexed directory when it isn't
really indexed.

>We'd want to do the front-end ourselves in IB
>(obviously, where's the fun without that chance? :-) )

I am planning on immediately starting a similar project. Perhaps we
should share our work. I need capabilities that DL just doesn't
provide (such as displaying troff text properly).

>Is there a way to licence the underlying software that drives such
>cross-referenced databases? Is this a NeXT-developed technology or third
>party? Obviously the potential is great for any field to have their "hot
>topics" ready and on-line in such a fashion. Will it be part of a future
>OS release. Maybe something like AppKit only this would be called
>DataBaseKit?

You already have the software. There are a bunch of poorly documented
function calls (well, not all THAT poorly documented) to handle all the
indexing stuff. It is not Objective-C, just ordinary function calls.
Search in the Digital Librarian for "index" and "indexing" under the
release notes and the manual pages, and you'll find all sorts of stuff.
I recommend you start from a terminal with "man 1 index", and follow
the cross references.

I responded here because I thought some of this stuff is of general
interest, but I would really like to work with you, as I think we could
help each other out a lot. Please send mail.

| Dan Zerkle  home:(805) 968-4683  morning:961-2434  afternoon:687-0110 |
| dz@cornu.ucsb.edu  dz%cornu@ucsbuxa.bitnet  ...ucbvax!hub!cornu!dz    |
| Snailmail: 6681 Berkshire Terrace #5, Isla Vista, CA 93117            |
| Disclaimer: If it's wrong or stupid, pretend I didn't do it.          |
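
The inverted-index idea described in the reply above can be illustrated
with a small sketch. This is not the Digital Librarian's actual code or
file format, which the post does not describe in detail. The function
names (build_index, lookup, substring_scan) and the sample file names are
made up for illustration, and an in-memory Python dict stands in for the
hashed, sorted index files on disk.

```python
import re
from collections import defaultdict


def build_index(files):
    """Map each lowercased keyword to a sorted list of (filename, offset).

    `files` is a hypothetical {filename: text} mapping standing in for
    a directory of documents.
    """
    index = defaultdict(list)
    for name, text in files.items():
        # Sift keywords out of the original text, remembering where each occurs.
        for match in re.finditer(r"[A-Za-z']+", text):
            index[match.group().lower()].append((name, match.start()))
    # The dict gives hashed lookup; each posting list is kept sorted.
    return {word: sorted(postings) for word, postings in index.items()}


def lookup(index, word):
    """Whole-word search: one hash lookup, no scan of the original text."""
    return index.get(word.lower(), [])


def substring_scan(files, fragment):
    """fgrep-style fallback for parts of words, which the index cannot answer."""
    hits = []
    for name, text in sorted(files.items()):
        start = text.find(fragment)
        while start != -1:
            hits.append((name, start))
            start = text.find(fragment, start + 1)
    return hits


if __name__ == "__main__":
    files = {
        "hamlet.txt": "To be, or not to be",
        "plato.txt": "The unexamined life is not worth living",
    }
    index = build_index(files)
    print(lookup(index, "not"))           # whole word: answered by the index
    print(lookup(index, "exam"))          # part of a word: the index finds nothing
    print(substring_scan(files, "exam"))  # the linear scan does find it
```

The tradeoff matches the one in the post: the index answers whole-word
queries without touching the original text, while a partial-word search
has to fall back to a linear scan of every file.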