Path: utzoo!attcan!uunet!lll-winken!netsys!lamc!well!shf
From: shf@well.UUCP (Stuart H. Ferguson)
Newsgroups: comp.sys.amiga.tech
Subject: Re: New IFF format details (long).
Summary: Lots of stuff about IFF in general (also long).
Keywords: IFF, standard, parsing, context-free grammar
Message-ID: <7167@well.UUCP>
Date: 21 Sep 88 07:24:03 GMT
References: <3450@crash.cts.com>
Reply-To: shf@well.UUCP (Stuart H. Ferguson)
Organization: The Blue Planet
Lines: 366

Wade Bickel describes what he sees as problems with the current IFF
standard and advantages of going with a different structure.

| The problem with the current IFF is that it is not generic.

The gist of Wade's argument here is that the types of chunks allowed
under IFF 85 are limited and should be made more general by making the
rules simpler.

With IFF 85, chunks are defined as a four-byte identifier plus a
longword byte count followed by that many bytes of data (plus an
optional pad byte if the length is odd).  This can be represented as:

	ID { data }

where ID is a four-character identifier (conforming to certain rules,
like no leading or embedded spaces, etc.) and the { data } construct
gets replaced by "# data [0]", where "#" is (long)sizeof(data) and [0]
is the optional pad byte.  This is all well known, and Wade doesn't
want to change this significantly except to add a status word
containing bit flags and a checksum at the end of each chunk.

'Data' can be any block of bytes, but in particular for IFF 85, if the
ID of the chunk is "FORM," "LIST," "CAT ," or "PROP," then 'data' is
defined as another four-character identifier followed by a series of
*chunks*.  Thus the format is recursively defined.  An IFF file is a
single FORM, LIST or CAT chunk.
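The chunk layout just described can be sketched in C.  (This is an
illustration of mine, not code from any IFF library; the struct name
and helper function are hypothetical.)

```c
/* Minimal sketch (mine, not from any IFF library) of the chunk layout
 * described above: a four-character ID, a longword byte count, then
 * that many data bytes, plus one pad byte when the count is odd. */
struct ChunkHeader {
    char ckID[4];    /* e.g. "BMHD" -- no leading or embedded spaces */
    long ckSize;     /* byte count of the data, excluding the pad    */
};

/* Bytes a reader must skip past the header to reach the next chunk. */
long PaddedSize(long ckSize)
{
    return ckSize + (ckSize & 1);
}
```

The `(ckSize & 1)` term is the word-alignment rule: an odd-length
chunk is followed by one pad byte that is not counted in ckSize.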
For those into grammars:

	IFF File   ::= FORM | LIST | CAT
	FORM       ::= "FORM" { ID Chunk* }
	CAT        ::= "CAT " { ID Chunk* }
	LIST       ::= "LIST" { ID PROP* Chunk* }
	PROP       ::= "PROP" { ID LocalChunk* }
	Chunk      ::= FORM | LIST | CAT | LocalChunk
	LocalChunk ::= ID {}
	ID         ::=

Wade suggests that this structure is limiting because you can only
specify groups of chunks ("Chunk*") within a grouping chunk, namely
one of type FORM, LIST or CAT.  He wants a simpler grammar, one which
allows nested chunks within any chunk:

	File  ::= Chunk
	Chunk ::= ID { | Chunk* }
	ID    ::=

Wade gives the example of an ILBM.  An IFF 85 ILBM looks like:

	"FORM" { "ILBM"
		"BMHD" { bitmap header }
		"CMAP" { color map }
		"BODY" { bitplane data }
	}

Wade's proposed format looks like this:

	"FORM" {
		"ILBM" {
			"BMHD" { bitmap header }
			"CMAP" { color map }
			"BODY" { bitplane data }
		}
	}

(I don't know why he retained the "FORM" identifier.  It seems
redundant.)

Since the grammar description makes this appear to be a simplification
of the IFF standard, one which must have been obvious to its
designers, the question arises: why isn't IFF like this to begin with?
Why did the developers of the IFF 85 standard do it the way they did
rather than this apparently simpler way?

The answer is not trivial and trips at times into the vague and
bizarre, but if I try to rectify some of the difficulties that result
from using Wade's "new" design, I find that I end up re-inventing
IFF 85.

The driving consideration here is that we want to be able to extract
from a file whatever type of data we may be interested in.  For
example, consider the case of the ANIM format.  The first frame of an
ANIM is stored internally as an ILBM, like so:

	"FORM" { "ANIM"
		"FORM" { "ILBM"
			"BMHD" { ... }
			... rest of the ILBM ...
		}
		... rest of the ANIM ...
	}

Now suppose you have a paint program which doesn't have any
understanding of the ANIM format -- that is, it does not recognize the
formtype ANIM nor any of its internal chunks.
Such a program, if properly written, can still retrieve the first
frame of the ANIM as an ILBM by parsing the standard part of the IFF
grammar.  To the paint program, the ANIM file looks like this:

	"FORM" { xxxx
		"FORM" { "ILBM"
			"BMHD" { ... }
			... etc. ...
		}
		... xxxx ...
	}

where "xxxx" represents parts of the file not recognized by the paint
program parser.  In contrast, Wade's style of ANIM would look like
this:

	"ANIM" {
		"ILBM" {
			"BMHD" { ... }
			... etc. ...
		}
		...
	}

And the ILBM-understanding paint program reader would see this:

	xxxx { xxxx }

In other words, a parser which didn't understand the ANIM identifier
would not be able to look inside the ANIM chunk, because it cannot
know whether the chunk contains more chunks or just data.  A grammar
like this is said to be "context-sensitive" and is undesirable for
obvious reasons.

A way to make the grammar "context-free" would be to add a bit to the
status word (now part of Wade's file format) to flag whether this
chunk has sub-chunks or not.  That way, if a reader doesn't understand
the ANIM identifier, it can still look inside this chunk for other
chunks that it may understand.  But what we've just done is to
distinguish between grouping and non-grouping chunks, just like IFF 85
does in distinguishing between the FORM, LIST, CAT and PROP IDs and
all other chunk IDs.

One can argue that Wade's method would be more general, since now any
chunk can be a grouping chunk.  But there is also a danger of someone
setting the grouping bit inconsistently with the nature of the chunk.
This danger does not exist with IFF 85, where the identifier is either
FORM, LIST, CAT or PROP, or else it's not a grouping chunk.

So, having provided a mechanism for parsers to examine the internals
of Wade's format files without understanding the constituent
identifiers, the next problem is that of scope and context.
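To make the context-free property concrete, here is a small scanner
sketch -- my own illustration, not the iff.library interface.  It
recurses into a chunk's contents only when the ID is one of the four
grouping types, which is exactly how an ILBM reader can locate data
inside an ANIM it otherwise doesn't understand.

```c
/* Sketch (mine, not the iff.library interface): search an in-memory
 * IFF 85 image for a chunk ID, descending only into the four grouping
 * chunk types.  Everything else is treated as opaque data. */
#include <string.h>

static long GetLong(const unsigned char *p)   /* big-endian longword */
{
    return ((long)p[0] << 24) | ((long)p[1] << 16)
         | ((long)p[2] << 8)  |  (long)p[3];
}

static int IsGroup(const unsigned char *id)
{
    return !memcmp(id, "FORM", 4) || !memcmp(id, "LIST", 4)
        || !memcmp(id, "CAT ", 4) || !memcmp(id, "PROP", 4);
}

/* Return the header of the first chunk whose ID matches 'want', or
 * NULL.  Grouping chunks carry a formtype ID before their sub-chunks,
 * hence the 12-byte offset when descending. */
const unsigned char *FindChunk(const unsigned char *p, long len,
                               const char *want)
{
    while (len >= 8) {
        long size = GetLong(p + 4);
        if (!memcmp(p, want, 4))
            return p;
        if (IsGroup(p)) {
            const unsigned char *r = FindChunk(p + 12, size - 4, want);
            if (r)
                return r;
        }
        size += size & 1;                     /* optional pad byte */
        p   += 8 + size;
        len -= 8 + size;
    }
    return NULL;
}

/* A toy ANIM-like image: FORM(ANIM) wrapping FORM(ILBM) with a BMHD. */
static const unsigned char demo[] = {
    'F','O','R','M', 0,0,0,28, 'A','N','I','M',
    'F','O','R','M', 0,0,0,16, 'I','L','B','M',
    'B','M','H','D', 0,0,0,4,  1,2,3,4
};
```

Even though this scanner knows nothing about ANIM, it finds the BMHD
in the demo image, because FORM is a grouping ID it may always descend
into.  Under Wade's grammar, without the grouping bit, it would have
to stop at the unrecognized outer chunk.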
Our ILBM-seeking paint program might delve deeply into the structure
of some unknown file and locate an ILBM someplace deep inside:

	xxxx {
		xxxx { }
		xxxx { }
		xxxx {
			xxxx { }
			"ILBM" { ...
	(rest of file unparsed)

The difficulty here is that the ILBM seeker has no way of determining
anything about the context of the ILBM it has found.  It can't know,
for example, if some other part of the file is modifying this ILBM in
any way.  It also cannot be certain that the chunk "ILBM" isn't a
specific internal chunk for another chunk.  The end result is that all
chunk identifiers must be treated the same wherever they may occur.
While this might have some good side-effects, it is a generally
undesirable condition, primarily because of the inevitable collisions
that occur when many concurrent developers use a flat name-space,
especially one as limited as four-character IDs.

To alleviate this problem, let's add another bit to the status word
that indicates whether this chunk is a root chunk -- that is, whether
this chunk can stand on its own independent of its context.  Chunks
without this bit set would be dependent on their context and could not
be considered independently.  So if the internal ILBM chunk located
above had this bit set, then it would be safe to read it as its own
bitmap image.

If you've been astute, you'll see that I've just re-invented
"formtypes" from the IFF 85 standard.  IFF provides a two-tier
name-space that effectively eliminates the possibility of collision:
formtypes, and local chunk types that depend on their formtype.  So,
for example, the meaning of a CMAP chunk within a FORM of type ILBM is
different from a CMAP within a formtype DRAW, or any other formtype
for that matter.  The common name-space is that of the formtypes, and
there are many fewer formtypes than there are chunk types.  Also, the
use of the formtype ID within the FORM chunk gives the same result as
the "root" bit in the hypothetical Bickel file format.
When a parser sees a FORM, it knows that this is a self-contained
object being used as part of a larger structure that it doesn't have
to understand -- like the ILBM form within the ANIM form.

This hypothetical extension of Wade's format is formally equivalent to
the IFF 85 standard -- that is, anything you can do with one can be
done with the other.

* Gasp! *  Does that mean that IFF *doesn't* need to be changed?

Yup, that's exactly what that means.

As an example, consider Wade's example of a B-tree structure being
encoded into his proposed format:

	"FORM" { "23BT" {
		"NODE" { "NDAT"
			"NODE" { data and 3 node chunks }
			"NODE" { data and 3 node chunks }
			"NODE" { data and 3 node chunks }
		}
	} }

This could be encoded into IFF as:

	"FORM" { "23BT"
		"DATA" { data for this node }
		"FORM" { "23BT"
			"DATA" { more node data }
			"FORM" { another "23BT" FORM }
		}
		"FORM" { another "23BT" FORM }
		"FORM" { another "23BT" FORM }
	}

But the crux of Wade's proposal is this:

| What I really want to do is create a purely Data driven mechanism, as
| opposed to the Code driven one in the current IFF.  Rather than having to
| write code to handle each type of occurance, a structure would be initialized
| at run time, and this would be passed to the Reader or Writer parser to be
| handled.  In this way it would never be necessary to update the Library(s).

He's primarily interested in the mechanism for reading/writing such
files, and writes about it at great length in his article.  If this
mechanism were useful for reading and writing "IFF 88" files, then it
would be equally applicable to existing IFF files just by changing the
file grammar slightly, as I did above.

| Like its' predecessor, IFF '88 is a recursive descendant
| parser design.  The primary differences between the old design and
| the new one is that while IFF '85 was code driven, IFF '88 is data driven.

I don't get this.  First of all, there is nothing inherently
recursive-descent about IFF.
In fact, the iff.library that Leo and I are developing is a finite
state machine design, rather than recursive-descent.  (First attempts
were recursive-descent because they are easier to write, but this
makes the client code messy ... anyway...)  Also, Wade has used the
phrase "data-driven" several times, but I don't see what he's talking
about.

| The Writer Mechanism
| ------------------------

What Wade describes here is basically a tree-like data structure to
control writing an IFF file.  The tree would presumably be traversed
by the writer library code, and user functions would be called to
write the actual bytes of data.  Wade refers to this as "data driven."

| Whereas IFF '85 reader/writers' require re-compilation of the
| source to accomodate format updates, IFF '88 will not.

But I don't get it.  Sure, the data structure controls the chunk
nesting, but the actual business of writing bytes gets handled by user
code, so where's the extensibility?  I still have to have the code to
write the chunks in my program, which means re-compilation when
something changes.

| In order to write a file an implementation first creates and
| properly initializes a writer-structure, then calls the writer function
| which parses the structure and writes the file.

Wade details an example of writing an ILBM using his mechanism.  The
client program builds a data structure representing the structure of
the file to be written, which includes function pointers for writing
each chunk.

|   {WLevel}
|   {WEntry}
|     ckID = "FORM";
|     WLev --------> {WLevel}
|     Next = NIL;    {WEntry}
|                      ckID = "ILBM";
|                      WLev -----> {WLevel}
|                      Next = NIL; {WEntry}
|                                    ckID = "BMHD";
|                                    WrtAlg = ADR(WriteBytes());
|                                    WrtData ---> {WrtAlgParams}
|                                    Next           ADR(BitMapHdr);
|                                                   TSIZE(BitMapHdr);
|
|   ... etc. ...
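For concreteness, the quoted structure might look something like this
in C.  This is my own reconstruction from the diagram: Wade gives the
field names (ckID, WLev, Next, WrtAlg, WrtData) but not the full
declarations, so the exact types here are an assumption.

```c
/* A guess at C declarations for Wade's writer tree; the field names
 * come from his diagram, the exact types are my assumption. */
struct WrtAlgParams {
    void *data;                     /* e.g. ADR(BitMapHdr)          */
    long  size;                     /* e.g. TSIZE(BitMapHdr)        */
};

struct WEntry {
    char                 ckID[4];   /* chunk identifier             */
    struct WEntry       *WLev;      /* sub-chunks, or NULL          */
    struct WEntry       *Next;      /* next chunk at this level     */
    long               (*WrtAlg)(struct WrtAlgParams *); /* writer  */
    struct WrtAlgParams *WrtData;   /* parameters for the writer    */
};

/* The generic writer would walk this tree depth-first: each entry,
 * then its sub-level, then its siblings.  CountEntries traces the
 * same walk without writing anything. */
int CountEntries(const struct WEntry *e)
{
    int n = 0;
    for (; e; e = e->Next)
        n += 1 + CountEntries(e->WLev);
    return n;
}
```

A FORM -> ILBM -> BMHD chain built this way has three entries, and the
walk visits them in exactly the order the chunks appear in the file --
but note that WriteBytes() itself is still client code.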
However, since the structure of the file and the code to write the
actual bytes are both provided by the client program, I fail to see
how creating this structure and passing it to a generic writer is any
different from just having the following piece of code in the client
program:

	/* WriteILBM: bitmap, colormap */
	PushChunk (iff, ID_FORM, ID_ILBM);

	PushChunk (iff, ID_BMHD, 0);
	WriteBMHD (bitmap);
	PopChunk (iff);

	PushChunk (iff, ID_CMAP, 0);
	WriteCMAP (colormap);
	PopChunk (iff);

	PushChunk (iff, ID_BODY, 0);
	WriteBODY (bitmap);
	PopChunk (iff);

	PopChunk (iff);

(This is an actual example of the use of the iff.library.  RSN!)

It seems that for either method I need to have the writing code in my
program, and I need to know the structure of the file I want.  If
anything, I would think that constructing a large tree data structure
would be more difficult than just having code to write the file
directly.  What's the advantage?

I'm genuinely interested in this *mechanism* for reading and writing
files, since it should work equally well for real IFF files.  If there
are real advantages to this, Wade, I missed them.  Could you provide
an example of a file format changing and user programs not needing to
be recompiled?

On a separate issue, Wade talks about "dirty" chunks.

| [Another problem with IFF]
| is that there are no dirty chunk provisions.  I feel
| that dirty chunk tracking would be a valuable option.  Dirty chunks would
| occur when, after finding some recognized chunks, unrecognized chunks are
| encountered.  IFF '85 discards these chunks.  I propose that as a user option
| unrecognized chunks be retained when a program modifies a partially understood
| IFF '88 file.  [ ... ]
| When unrecognized chunks are written they're marked as dirty,
| and any chunks which have been modified are also noted.

This issue is discussed somewhat in the EA IFF 85 specification on
page B-31 of the Exec RKM.
Their conclusion is that the data universe encompassed by IFF is too
large to allow for standardization of the possible interactions of the
various types of data involved.

While this is an interesting and valid idea, it really makes life
miserable for programmers.  They have to retain chunks they don't
need, don't understand and can't use, just so they can write them out
again, trying to preserve the original IFF file as much as possible --
only to fail much of the time.  It also means that all programs need
to fully support standard chunks so that standard chunks will never be
marked as dirty.  It also means that programs that use "non-standard"
chunks need to make some intelligent decisions about whether a chunk
marked as "dirty" is good within the context of a specific file.  It
might be possible, but it could also be a real headache.  I'm just not
convinced that the advantages are great enough to want to provide such
a mechanism.

This facility can be provided for any new IFF formtypes, however, by
equipping them with a "MAP" chunk (or some such, but it should be
consistent across FORMs) which contains a list of the chunks in the
file and their status.  It is not possible or even desirable to
retro-fit this capability into existing formats.
--
Stuart Ferguson		(shf@well.UUCP)
Action by HAVOC		(shf@Solar.Stanford.EDU)