Path: utzoo!attcan!utgpu!watmath!att!bellcore!decvax!ima!esegue!compilers-sender From: heirich@cs.ucsd.edu (Alan Heirich) Newsgroups: comp.compilers Subject: Re: Error reporting in LR parsers Message-ID: <1989Aug8.131112.1081@esegue.uucp> Date: 8 Aug 89 13:11:12 GMT References: <1989Aug4.032053.753@esegue.uucp> <1989Aug6.024931.10014@esegue.uucp> Sender: compilers-sender@esegue.uucp Reply-To: heirich@cs.ucsd.edu (Alan Heirich) Organization: EE/CS Dept. U.C. San Diego Lines: 172 Approved: compilers@esegue.segue.bos.ma.us This is the second of two postings. The first posting dealt with objections to automatic error recovery in yacc. This posting describes modifications to DECUS yacc to permit automatic diagnostic generation. In an earlier message I mentioned that I have modified yacc to provide useful diagnostic messages to describe the parse (for debugging a grammar) and to describe the expected tokens when a syntax error occurs (to debug a user). I received many requests for the source code. I have been advised that the version of DECUS yacc which I modified is in fact pirate AT&T code, so I don't think I can redistribute it. But I will describe the mods, which are easy to make. [If you have AT&T source, these changes should be straightforward to apply. -John] The general goal is to encode the information from the parser description file, describing states, goto sets, and lookahead sets, in a compact form that can be accessed at run time. Once you have done this you can use any number of schemes of varying complexity to generate messages describing the parse process and the expectations of the parser. I will leave most of this to the ingenuity of the reader, and simply describe how to get the essential information out of the yacc data structures. The changes are nearly all in the routine "output". This routine writes out the parser description to the description file. You will want to modify it to write out five pieces of information: (1) a set of strings containing token names (2) a set of strings containing nonterminal names (3) a set of states containing items sets (4) a set of states containing lookahead sets (5) a set of states containing goto sets (1) and (2) are easy. The token names are in an array of records. tokset[i].name is the string containing the name of token #i. Use the TLOOP macro on i to print out all the token names, i.e. TLOOP(i) { printf ("\"%s\",\n", tokset [i].name); } Since TLOOP starts at 1, print a null string before you start this loop to account for token #0 (which apparently is nonexistent as far as yacc is concerned). The nonterminal names are in another array of records, nontrst. Use the NTLOOP macro, which starts at 0, as follows: NTLOOP(i) { printf ("\"%s\",\n", nontrst [i].name); } The item sets are useful so that the run time parser can show you a description of the current state, rather than just emit a message "in state x". You guessed it, the SLOOP macro steps through all of the states: SLOOP(i) { writeState (i); } the writeState function (which you must create) is a bit trickier than simply printing out a string. Copy the code from the beginning of output() which writes out the state description to the description file, then selectively delete to get what you want. You should end up with approximately the following: SLOOP(i) { nolook = !(tystate [i] == MUSTLOOKAHEAD); closure(i); nolook = 1; ITMLOOP(i,pp,qq) { writeItem (pp-> pitem); printf ("-99999,\n"); } if (tystate [i] == MUSTLOOKAHEAD) WSLOOP (wsets + (pstate[i+1]-pstate[i]), u) if (*(u->pitem) < 0) { writeItem (u->pitem); printf ("-99999,\n"); } /* if */ What is happening here? Well, we first recompute closure for the state. That sets important flags in the tystate array. We then loop through all the normal items in the state, thanks to ITMLOOP, and write them out (more on this below). Then, if the state requires lookahead we write out another group of items (exactly what they are escapes me at the moment. I think they are items that reduce on the empty string, or something like that). What must writeItem(pp) do? Approximately this (again, adapted from the code that writes out the state descriptions to the description file): writeItem (pp) int *pp; { int *p; for (p = pp; *p > 0; ++p) { } /* (skip ahead until *p > 0) */ p = prdptr [-*p]; printf ("%d,", *p); while (1) { p++; if (p == pp) printf ("-88888,"); if (*p <= 0) break; else printf ("%d,", *p); } /* while */ if (*pp < 0) { printf ("%d,", *pp); } } /* writeItem */ This writes out the elements of a single item, encoded as integers. -88888 represents the location of the . (or underscore) in the item, i.e. the current position of the parse. A positive number is the number of a token or nonterminal name; if the number is > NTBASE (a yacc constant) then it represents a nonterminal (and NTBASE must be subtracted to get the number of the nonterminal); if the number is < NTBASE then it can be used directly as an index into the token name table (described above). A negative number indicates a rule that is the be reduced; it always appears at the end of an item, and is negated (to get the actual rule number, just negate the value printed out). Notice that after writeItem is called, the value -99999 is printed out. This is simply an end-of-item marker. For (4), the lookahead sets: use SLOOP again, and for each state i, compute closure(i). Well, here is the code, including some important details: SLOOP(i) { nolook = !(tystate [i] == MUSTLOOKAHEAD); closure (i); nolook = 1; aryfil (temp1, ntokens + nnonter + 1, 0); WSLOOP (wsets,u) { c = *(u -> pitem); if (c > 1 && c < NTBASE && temp1 [c] == 0) { printf ("%d,", c); temp1 [c] = 1; } /* if */ } /* WSLOOP */ } /* SLOOP */ Finally, (5), the goto sets: SLOOP(i) { nolook = !(tystate [i] == NOLOOKAHEAD); closure (i); nolook = 1; aryfil (temp1, ntokens + nnonter + 1, 0); WSLOOP (wsets, u) { c = *(u -> pitem); if (c > NTBASE && temp1 [(c -= NTBASE) + ntokens] == 0) { temp1 [c + ntokens] = 1; printf ("%d,", c); } /* if */ } /* WSLOOP */ } /* SLOOP */ This basically does it. What you get from all this is a lot of printed information. Write this to a text file and use it to initialize a set of arrays which you compile into your parser. You can then access the information while your parser runs. This allows it to describe states in the same form as in the description file; describe the expected lookahead tokens or goto items (i.e. the expected nonterminals, which are often much more explanatory than the lookahead tokens); and describe the tokens as they are obtained from yylex (very helpful). This method certainly isn't perfect -- the biggest problem is that an end user can be overwhelmed with information, since a programming language like C or Pascal can generate states with 30 valid lookahead tokens. (Using the nonterminal names from the goto sets can help in this situation, although only partially). Still, it's a useful start, and does provide helpful information. Also, be aware that program errors tend to be static semantic errors rather than syntax errors, so the gravity of the situation is somewhat less than one might think... But it is true, what we all *really* want is a son-of-yacc (yayacc?) which allows us to specify textual error messages from within the grammar... and which doesn't generate stupid reduce/reduce conflicts when we put lots of error productions into the grammar! ------------------------- Alan Heirich Comp. Sci. & Eng., Cognitive Science C-014 University of California, San Diego 92093 heirich@cs.ucsd.edu aheirich@ucsd.bitnet -- -- Send compilers articles to compilers@ima.isc.com or, perhaps, Levine@YALE.EDU { decvax | harvard | yale | bbn }!ima. Meta-mail to ima!compilers-request. Please send responses to the author of the message, not the poster.