Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site brl-tgr.ARPA
Path: utzoo!watmath!clyde!burl!ulysses!allegra!mit-eddie!genrad!panda!talcott!harvard!seismo!brl-tgr!gwyn
From: gwyn@brl-tgr.ARPA (Doug Gwyn )
Newsgroups: net.unix
Subject: Re: Unix text files
Message-ID: <2837@brl-tgr.ARPA>
Date: Tue, 5-Nov-85 00:29:30 EST
Article-I.D.: brl-tgr.2837
Posted: Tue Nov  5 00:29:30 1985
Date-Received: Thu, 7-Nov-85 04:33:55 EST
References: <23@pixel.UUCP> <2235@brl-tgr.ARPA> <2333@flame.warwick.UUCP>
Organization: Ballistic Research Lab
Lines: 52

> Does anyone out there want to show those of us with weak knees how one
> would use this kind of data structure [used loosely] in a program?
> (In other words, as if the data were within the program not without.)
> Without additional support information, like keeping track of the number
> and lengths of lines.

Most data processing algorithms are (or should be) driven by the
structure of the data that they process; this is normally taught
these days in the "data structures" CS course.  It should be obvious
from the grammar how to structure code that, e.g., gets a line
of text, processes it, and writes out the resulting line.  (There
is no need to bring in line numbering or "length of line".)  If
the definition of "line of text" is missing or only fuzzy,
then it is not obvious how to get or put one, and the programmer
ends up making some arbitrary choice.  (Which is what started
this discussion.)
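
A minimal sketch in C of code driven by the data's structure.  It
assumes the grammar boils down to "a file is a sequence of lines; a
line is zero or more characters followed by a newline".  The names
process_char, copy_line and copy_file are hypothetical, and folding
to upper case merely stands in for whatever processing is wanted.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Hypothetical per-character processing step: fold to upper case. */
static int process_char(int c)
{
    return toupper(c);
}

/* line = { char }* newline
 * Copies one line from in to out, processing each character.
 * Returns 1 if a complete line (ending in newline) was copied, else 0. */
static int copy_line(FILE *in, FILE *out)
{
    int c;

    while ((c = getc(in)) != EOF) {
        putc(process_char(c), out);
        if (c == '\n')
            return 1;
    }
    return 0;   /* end of file; any unterminated tail was still copied */
}

/* file = { line }*
 * Returns the number of complete lines copied. */
static long copy_file(FILE *in, FILE *out)
{
    long lines = 0;

    assert(in != NULL && out != NULL);
    while (copy_line(in, out))
        ++lines;
    return lines;
}
```

Calling copy_file(stdin, stdout) from main() would turn this into a
filter.  Each function mirrors one rule of the assumed grammar, and
neither a line number nor a stored line length is needed to do the
work.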

For simplicity, I left out of the grammar one important constraint,
which is a limit of no more than 510 characters in a line of text
(exclusive of newline).  I had already stretched the notation a bit
and didn't want to invent yet another notation like { char }*510 .
This limit is actually important in allowing efficient get-line
implementations.
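
One way the limit can pay off, sketched in C: 510 characters plus the
newline plus a terminating NUL come to 512 bytes, so a get-line
routine can presumably use one fixed-size buffer and never allocate.
The names MAXTEXT, LINEBUF and get_line, and the return-code
convention, are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXTEXT 510             /* limit from the text, newline excluded */
#define LINEBUF (MAXTEXT + 2)   /* + newline + terminating NUL = 512 */

/* Reads one line, including its newline, into buf.
 * Returns  1 if a complete line was read,
 *          0 at end of file,
 *         -1 if the input breaks the rules (line too long, missing
 *            final newline, or a NUL as the first byte read). */
static int get_line(FILE *in, char buf[LINEBUF])
{
    size_t len;

    assert(in != NULL && buf != NULL);
    if (fgets(buf, LINEBUF, in) == NULL)
        return 0;
    len = strlen(buf);
    if (len == 0 || buf[len - 1] != '\n')
        return -1;
    return 1;
}
```

A caller declares char buf[LINEBUF]; and loops while get_line()
returns 1.  After a -1 the rest of the overlong line is still waiting
in the stream, so a real program would have to decide whether to skip
it or give up.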

> I think it would be a good example to the young of inherent complexity.

There is nothing complex about that grammar.  It is a remarkably
simple one, which was the point.  Note that it was decomposed
into meaningful subunits -- this is important!  Just having a
formal grammar (syntax) is not sufficient for good semantic
processing.  (People often forget this.)

> And I thought we were trying to make life simple!  The main problem here
> is that we are trying to impose structure on unstructured data, which
> is probably not the best approach.

Text files certainly are structured, although it's a rather
flexible structure.  One might argue that dividing text into
lines is artificial, but the concept of a "line of text" is
useful in many text-processing programs (e.g., "grep").

> Sentinels are a wonderful way of implementing lists, but a terrible way
> of implementing strings.  Hint, hint.

Oh, foo.  Both the count+data and NUL-terminated representations
for character strings have good and bad points.  I've used both
and prefer C's approach for most routine programming.
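
The trade-off can be sketched in C as follows.  The struct counted
type and its one-byte length field are hypothetical; real count+data
schemes differ in how wide the count is.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical count+data string: a length byte, then the characters. */
struct counted {
    unsigned char len;      /* one-byte count caps the length at 255 */
    char data[255];
};

/* Stores n bytes of src; returns 0 on success, -1 if n is too large. */
static int counted_set(struct counted *s, const char *src, size_t n)
{
    assert(s != NULL && src != NULL);
    if (n > sizeof s->data)
        return -1;
    memcpy(s->data, src, n);
    s->len = (unsigned char)n;
    return 0;
}

/* Length is stored, so this is constant time and NUL bytes are data. */
static size_t counted_length(const struct counted *s)
{
    return s->len;
}

/* Length is found by scanning for the sentinel; the first NUL ends it. */
static size_t sentinel_length(const char *s)
{
    return strlen(s);
}
```

The counted form knows its length at once and can hold a NUL byte,
but the width of the count limits the length.  The NUL-terminated
form has no built-in limit and needs no bookkeeping, but must be
scanned to find its end and cannot contain the sentinel itself.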

If the point of the correspondent was that FILES-11 variable-
length record format is easier to work with, he deserves a
large horse laugh.  See "Software Tools" for examples of the
use of UNIX-like text file formats in programs.