Path: utzoo!attcan!uunet!lll-winken!netsys!lamc!well!shf
From: shf@well.UUCP (Stuart H. Ferguson)
Newsgroups: comp.sys.amiga.tech
Subject: Re: New IFF format details (long).
Summary: Lots of stuff about IFF in general (also long).
Keywords: IFF, standard, parsing, context-free grammar
Message-ID: <7167@well.UUCP>
Date: 21 Sep 88 07:24:03 GMT
References: <3450@crash.cts.com>
Reply-To: shf@well.UUCP (Stuart H. Ferguson)
Organization: The Blue Planet
Lines: 366


Wade Bickel describes what he sees as problems with the current IFF 
standard and advantages of going with a different structure.

|         The problem with the current IFF is that it is not generic.

The gist of Wade's argument here is that the types of chunks allowed
under IFF 85 are limited and should be made more general by making the
rules simpler.  With IFF 85, a chunk is defined as a four-byte identifier
plus a longword byte count followed by that many bytes of data (plus an
optional pad byte if the length is odd).  This can be represented as

	ID { data }

where ID is a four-character identifier (conforming to certain rules, 
like no leading or embedded spaces, etc.) and the { data } construct 
gets replaced by "# data [0]", where "#" is (long)sizeof(data) and [0]
is the optional pad byte.  This is all well known, and Wade doesn't want 
to change it significantly except to add a status word containing bit
flags and a checksum at the end of each chunk.
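As a concrete sketch of that layout, here is what the ID/count/data/pad
rule looks like in C.  This is purely illustrative -- the function names
are mine, not from any IFF library:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* IFF stores the length as a big-endian 32-bit count of the data
   bytes; the optional pad byte (present when the count is odd) is
   NOT included in that count. */
static uint32_t chunk_disk_size(uint32_t data_len)
{
    /* 4 ID bytes + 4 length bytes + data + optional pad byte */
    return 8 + data_len + (data_len & 1);
}

/* Emit an 8-byte chunk header into buf: the ID, then the length
   in big-endian byte order (the 68000's native order). */
static void put_chunk_header(uint8_t *buf, const char id[4], uint32_t len)
{
    memcpy(buf, id, 4);
    buf[4] = (uint8_t)(len >> 24);
    buf[5] = (uint8_t)(len >> 16);
    buf[6] = (uint8_t)(len >> 8);
    buf[7] = (uint8_t)(len);
}
```

Note that a three-byte chunk and a four-byte chunk occupy the same
twelve bytes on disk, because of the pad byte.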

'Data' can be any block of bytes, but in particular for IFF 85, if the
ID of the chunk is "FORM", "LIST", "CAT ", or "PROP", then 'data' is
defined as another four-character identifier followed by a series of
*chunks*.  Thus the format is recursively defined.  An IFF file is a
single FORM, LIST or CAT chunk. 

For those into grammars:

	IFF File   ::=  FORM | LIST | CAT
	FORM       ::=  "FORM" { ID Chunk* }
	CAT        ::=  "CAT " { ID Chunk* }
	LIST       ::=  "LIST" { ID PROP* Chunk* }
	PROP       ::=  "PROP" { ID LocalChunk* }
	Chunk      ::=  FORM | LIST | CAT | LocalChunk
	LocalChunk ::=  ID { data }
	ID         ::=  four-character identifier

Wade suggests that this structure is limiting because you can only
specify groups of chunks ("Chunk*") within a grouping chunk, namely one
of type FORM, LIST or CAT. He wants a simpler grammar, one which allows
nested chunks within any chunk:

	File   ::=  Chunk
	Chunk  ::=  ID { data | Chunk* }
	ID     ::=  four-character identifier

Wade gives the example of an ILBM.  An IFF 85 ILBM looks like:

	"FORM" {
		"ILBM"
		"BMHD" { bitmap header	}
		"CMAP" { color map	}
		"BODY" { bitplane data	}
	}

Wade's proposed format looks like this:

	"FORM" {
		"ILBM" {
			"BMHD" { bitmap header	}
			"CMAP" { color map	}
			"BODY" { bitplane data	}
		}
	}

(I don't know why he retained the "FORM" identifier.  It seems 
redundant.)

Since the grammar description makes this appear to be a simplification
of the IFF standard, one which must have been obvious to its designers,
the question arises: why isn't IFF like this to begin with?  Why did
the developers of the IFF 85 standard do it the way they did rather
than this apparently simpler way? 

The answer is not trivial and trips at times into the vague and bizarre, 
but if I try to rectify some of the difficulties that result from using
Wade's "new" design, I find that I end up re-inventing IFF 85.  The
driving consideration here is that we want to be able to extract from a
file whatever type of data we may be interested in.  For example,
consider the case of the ANIM format.  The first frame of an ANIM is
stored internally as an ILBM, like so:

	"FORM" {
		"ANIM"
		"FORM" {
			"ILBM"
			"BMHD" { ... }
			... rest of the ILBM ...
		}
		... rest of the ANIM ...
	}

Now suppose you have a paint program which doesn't have any 
understanding of the ANIM format -- that is, it does not recognize the 
formtype ANIM nor any of its internal chunks.  Such a program, if 
properly written, can still retrieve the first frame of the ANIM as an
ILBM by parsing the standard part of the IFF grammar.  To the paint 
program, the ANIM file looks like this:

	"FORM" {
		xxxx
		"FORM" {
			"ILBM"
			"BMHD" { ... }
			... etc. ...
		}
		... xxxx ...
	}

where "xxxx" represents parts of the file not recognized by the paint 
program parser.  In contrast, Wade's style of ANIM would look like this:

	"ANIM" {
		"ILBM" {
			"BMHD" { ... }
			... etc. ...
		}
		...
	}

And the ILBM-understanding paint program reader would see this:

	xxxx { xxxx }

In other words, a parser which didn't understand the ANIM identifier 
would not be able to look inside the ANIM chunk, because it cannot know 
whether the chunk contains more chunks or just data.  A grammar like this
is said to be "context-sensitive" and is undesirable for obvious
reasons.  A way to make the grammar "context-free" would be to add a bit
to the status word (now part of Wade's file format) to flag whether this 
chunk has sub-chunks or not.  That way, if a reader doesn't understand
the ANIM identifier, it can still look inside this chunk for other
chunks that it may understand. 

But what we've just done is to distinguish between grouping and
non-grouping chunks, just like IFF 85 does in distinguishing between the 
FORM, LIST, CAT and PROP IDs and all other chunk IDs.  One can argue
that Wade's method would be more general, since now any chunk can be a
grouping chunk.  But there is also a danger of someone setting the
grouping bit inconsistently with the nature of the chunk.  This danger
does not exist with IFF 85, where a chunk either has one of the
identifiers FORM, LIST, CAT or PROP or else it's not a grouping chunk. 
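As a concrete illustration of that last point, here is a rough sketch
in C (names and layout mine, not from any shipping iff.library) of a
context-free scanner that decides grouping-ness from the ID alone and
descends through chunks it doesn't recognize -- the way the
ILBM-hunting paint program above could find the ILBM inside an
unrecognized ANIM:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a big-endian 32-bit chunk length. */
static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* In IFF 85 the "grouping bit" is carried by the ID itself: only
   these four chunk types may contain nested chunks (after their
   own 4-byte contents ID). */
static int is_grouping_id(const uint8_t *id)
{
    return !memcmp(id, "FORM", 4) || !memcmp(id, "LIST", 4) ||
           !memcmp(id, "CAT ", 4) || !memcmp(id, "PROP", 4);
}

/* Scan the chunks in buf[0..len) and return a pointer to the first
   FORM whose formtype matches `want` (e.g. "ILBM"), descending into
   grouping chunks even when their contents IDs are unrecognized.
   Returns NULL if no such FORM is present.  Illustrative only: a
   real reader would also validate sizes against the file length. */
static const uint8_t *find_form(const uint8_t *buf, uint32_t len,
                                const char want[4])
{
    uint32_t pos = 0;
    while (pos + 8 <= len) {
        const uint8_t *ck = buf + pos;
        uint32_t size = be32(ck + 4);
        if (!memcmp(ck, "FORM", 4) && size >= 4 &&
            !memcmp(ck + 8, want, 4))
            return ck;
        if (is_grouping_id(ck) && size >= 4) {
            /* skip the contents ID, then scan the nested chunks */
            const uint8_t *hit = find_form(ck + 12, size - 4, want);
            if (hit)
                return hit;
        }
        pos += 8 + size + (size & 1);   /* skip data plus pad byte */
    }
    return NULL;
}
```

Because the grouping decision never depends on understanding the
contents ID, the scanner works even when the enclosing formtype (ANIM,
say) is completely unknown to it.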

So, having provided a mechanism for parsers to examine the internals of 
Wade's format files without understanding the constituent identifiers, 
the next problem is that of scope and context.  Our ILBM-seeking paint 
program might delve deeply into the structure of some unknown file and 
locate an ILBM someplace deep inside:

	xxxx {
		xxxx { }
		xxxx { }
		xxxx {
			xxxx { }
			"ILBM" {

		... (rest of file unparsed) ...

The difficulty here is that the ILBM seeker has no way of determining 
anything about the context of the ILBM it has found.  It can't know, for
example, if some other part of the file is modifying this ILBM in any
way.  It also cannot be certain that the chunk "ILBM" isn't a specific 
internal chunk for another chunk.  The end result of this is that all
chunk identifiers must be treated the same wherever they may occur. 
While this might have some good side-effects, it is a generally
undesirable condition, primarily because of the inevitable collisions
that can occur when many concurrent developers use a flat name-space,
especially one as limited as four-character IDs.

To alleviate this problem, let's add another bit to the status word that
indicates whether this chunk is a root chunk -- that is, whether this
chunk can stand on its own, independent of its context.  Chunks without
this bit set would be dependent on their context and could not be
considered independently.  So if the internal ILBM chunk located above
had this bit set, then it would be safe to read it as its own bitmap
image.

If you've been astute, you'll see that I've just re-invented "formtypes" 
from the IFF 85 standard.  IFF provides a two-tier name-space 
that effectively eliminates the possibility of collision: formtypes,
and local chunk types that depend on their formtype.  So, for example,
the meaning of a CMAP chunk within a FORM of type ILBM is different
from a CMAP within a formtype DRAW, or any other formtype for that
matter.  The common name-space is that of the formtypes, and there are
many fewer formtypes than there are chunk types.  Also, the use of the
formtype ID within the FORM chunk gives the same result as the "root"
bit in the hypothetical Bickel file format.  When a parser sees a FORM,
it knows that this is a self-contained object being used as part of a
larger structure that it doesn't have to understand -- like the ILBM
form within the ANIM form.
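To make the two-tier name-space concrete: a reader effectively
dispatches on the (formtype, chunk ID) pair rather than on the chunk ID
alone.  A toy sketch in C -- the table entries and names here are
illustrative, not a real API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The pair (formtype, chunk ID) selects the handler, so "CMAP"
   inside an ILBM and "CMAP" inside a hypothetical DRAW form can
   mean entirely different things without colliding. */
struct handler {
    const char *formtype;   /* e.g. "ILBM" */
    const char *ckid;       /* e.g. "CMAP" */
    const char *what;       /* stand-in for a handler function */
};

static const struct handler table[] = {
    { "ILBM", "CMAP", "color map: RGB triples" },
    { "DRAW", "CMAP", "some other color representation" },
    { "ILBM", "BMHD", "bitmap header" },
};

/* Look up the handler for a chunk, given the formtype it occurs in.
   Returning NULL means "unknown chunk": the reader just skips it
   using the chunk's own size field. */
static const char *lookup(const char *formtype, const char *ckid)
{
    size_t i;
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (!strcmp(table[i].formtype, formtype) &&
            !strcmp(table[i].ckid, ckid))
            return table[i].what;
    return NULL;
}
```

Only the formtypes live in the shared, collision-prone name-space; the
local chunk IDs are private to each formtype.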

This hypothetical extension of Wade's format is formally equivalent to
the IFF 85 standard -- that is, anything you can do with one can be done
with the other.  * Gasp! *  Does that mean that IFF *doesn't* need to be
changed?  Yup, that's exactly what that means.  Consider, for example,
Wade's B-tree structure encoded into his proposed format:
 
	"FORM" {
		"23BT" {
			"NODE" {
				"NDAT"
				"NODE" { data and 3 node chunks }
				"NODE" { data and 3 node chunks }
				"NODE" { data and 3 node chunks }
			}
		}
	}

This could be encoded into IFF as:

	"FORM" {
		"23BT"
		"DATA" { data for this node }
		"FORM" {
			"23BT"
			"DATA" { more node data }
			"FORM" { another "23BT" FORM }
		}
		"FORM" { another "23BT" FORM }
		"FORM" { another "23BT" FORM }
	}

But the crux of Wade's proposal is this:

|         What I really want to do is create a purely Data driven mechanism, as
| opposed to the Code driven one in the current IFF.  Rather than having to 
| write code to handle each type of occurance, a structure would be initialized
| at run time, and this would be passed to the Reader or Writer parser to be
| handled.  In this way it would never be necessary to update the Library(s).

He's primarily interested in the mechanism for reading/writing such
files, and writes about it at great length in his article.  If this
mechanism were useful for reading and writing "IFF 88" files, then it
would be equally applicable to existing IFF files just by changing the
file grammar slightly, as I did above.

|         Like its' predecessor, IFF '88 is a recursive descendant
| parser design.  The primary differences between the old design and
| the new one is that while IFF '85 was code driven, IFF '88 is data driven.

I don't get this.  First of all, there is nothing inherently recursive-
descent about IFF.  In fact, the iff.library that Leo and I are
developing is a finite state machine design, rather than recursive-
descent.  (First attempts were recursive-descent because they are easier
to write, but this makes the client code messy ... anyway...)  Also,
Wade has used the phrase "data-driven" several times, but I don't see
what he's talking about. 

|                         The Writer Mechanism
|                       ------------------------

What Wade describes here is basically a tree-like data structure to
control writing an IFF file.  The tree would presumably be traversed by
the writer library code and user functions would be called to write the
actual bytes of data.  Wade refers to this as "data driven."

| Whereas IFF '85 reader/writers' require re-compilation of the
| source to accomodate format updates, IFF '88 will not.

But I don't get it.  Sure, the data structure controls the chunk
nesting, but the actual business of writing bytes gets handled by user
code, so where's the extensibility?  I still have to have the code to
write the chunks in my program, which means re-compilation when
something changes.

|         In order to write a file an implementation first creates and
| properly initializes a writer-structure, then calls the writer function
| which parses the structure and writes the file.

Wade details an example of writing an ILBM using his mechanism.  The
client program builds a data structure representing the structure of the
file to be written, which includes function pointers for writing each
chunk.  

|    {WLevel}
|     {WEntry}
|       ckID = "FORM";
|       WLev --------> {WLevel}
|      Next = NIL;      {WEntry}
|                         ckID = "ILBM";
|                         WLev -----> {WLevel}
|                        Next = NIL;   {WEntry}
|                                        ckID = "BMHD";
|                                        WrtAlg = ADR(WriteBytes());
|                                        WrtData ---> {WrtAlgParams}
|                                       Next            ADR(BitMapHdr);
|                                         |             TSIZE(BitMapHdr);
					... etc. ...

However, since the structure of the file and the code to write the
actual bytes are both provided by the client program, I fail to see how
creating this structure and passing it to a generic writer is any
different from just having the following piece of code in the client
program: 

	/* WriteILBM: bitmap, colormap */

	PushChunk (iff, ID_FORM, ID_ILBM);

		PushChunk (iff, ID_BMHD, 0);
		WriteBMHD (bitmap);
		PopChunk (iff);

		PushChunk (iff, ID_CMAP, 0);
		WriteCMAP (colormap);
		PopChunk (iff);

		PushChunk (iff, ID_BODY, 0);
		WriteBODY (bitmap);
		PopChunk (iff);

	PopChunk (iff);

(This is an actual example of the use of the iff.library.  RSN!)  It
seems that for either method I need to have the writing code in my
program, and I need to know the structure of the file I want.  If
anything, I would think that constructing a large tree data structure
would be more difficult than just having code to write the file
directly.  What's the advantage?

I'm genuinely interested in this *mechanism* for reading and writing
files, since it should work equally well for real IFF files.  If there
are real advantages to this, Wade, I missed them.  Could you provide an
example of a file format changing and user programs not needing to be 
recompiled?


On a separate issue, Wade talks about "dirty" chunks.

|  [Another problem with IFF]
|   is that there are no dirty chunk provisions.  I feel
| that dirty chunk tracking would be a valuable option.  Dirty chunks would 
| occur when, after finding some recognized chunks, unrecognized chunks are
| encountered.  IFF '85 discards these chunks.  I propose that as a user option
| unrecognized chunks be retained when a program modifies a partially understood
| IFF '88 file.
 [ ... ]
| When unrecognized chunks are written they're marked as dirty,
| and any chunks which have been modified are also noted.

This issue is discussed somewhat in the EA IFF 85 specification, on page
B-31 of the Exec RKM.  Their conclusion is that the data universe
encompassed by IFF is too large to allow for standardization of the
possible interactions of the various types of data involved.

While this is an interesting and valid idea, it really makes life 
miserable for programmers.  They have to retain chunks they don't need, 
don't understand and can't use, just so they can write them out again,
trying to preserve the original IFF file as much as possible, only to
fail much of the time.  It also means that all programs need to fully 
support standard chunks so that standard chunks will never be marked as 
dirty.  And it means that programs that use "non-standard" chunks need
to make some intelligent decisions about whether a chunk marked as 
"dirty" is good within the context of a specific file.  It might be
possible, but it could also be a real headache.  I'm just not convinced 
that the advantages are great enough to want to provide such a
mechanism.

This facility can be provided for any new IFF formtypes, however, by 
equipping them with a "MAP" chunk (or some such, but it should be
consistent across FORMs) which contains a list of the chunks in the
file and their status.  It is not possible, or even desirable, to
retrofit this capability into existing formats. 
-- 
		Stuart Ferguson		(shf@well.UUCP)
		Action by HAVOC		(shf@Solar.Stanford.EDU)