From: wsmith@m.cs.uiuc.edu
Newsgroups: comp.lang.misc
Subject: Dumb Lexical Analyzers are Smart
Message-ID: <5200026@m.cs.uiuc.edu>
Date: 19 Sep 88 15:32:00 GMT

What are the advantages and disadvantages of designing a language that
may be lexically analyzed with no resort to semantic information?

Pascal, Prolog, and Smalltalk are such languages because each may be
unambiguously parsed when the lexical value of every token is
determined by a finite-state machine (i.e., a regular expression).

C is not such a language because of the typedef construct. A typedef
changes the lexical class of the new type's identifier to avoid
horrendous ambiguity in the language. For example, "a * b;" declares b
as a pointer if a is a typedef name, but is a multiplication otherwise.
C++ has the same problem, only worse.

What I would like to do in a new language is give the lexical analyzer
more lexical categories. For example, TypeIdentifier and
variableidentifier would be lexically different because the first has
an initial uppercase letter while the second doesn't. If other
categories of identifiers are needed, they may be defined to be
identifiers-with-a-minus-sign, under_scored, or Under_Scored. Each
could belong to a different category if there were a valid semantic
reason to distinguish between them. (Prolog and Smalltalk already do
this.)

There are a lot of possibilities, and I don't mean to limit myself to
splitting only the lexical classes for identifiers. Other tokens could
also have several uses, but identifiers were the easiest to explain.

Advantages I can see:

1.  A person familiar with the lexical rules of the language can more
    easily understand a routine without consulting all of the
    declarations involved.

2.  The lexical analyzer and parser could be separated into two
    processes, possibly improving the performance of the compiler on a
    parallel architecture.

3.  Post-processors of the language, such as a pretty printer or
    cross-reference utility, can be written without the symbol table
    part of the compiler.

4.  It is easier to make a language-oriented editor wholly table
    driven.

Disadvantages I can see:

1.  An explosion in the number of lexical classes may make them too
    difficult to remember.

2.  Users may disagree with the language designer's set of lexical
    classes (or be just plain stubborn).

Bill Smith
uiucdcs!wsmith
wsmith@cs.uiuc.edu
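
To make the proposal above a little more concrete, here is a minimal
sketch in Python of a lexer that assigns identifier classes purely
from spelling, with no symbol table. The class names (TYPE_IDENT,
VAR_IDENT, and so on), the exact spelling rules, and the small
punctuation set are assumptions made for illustration only; they are
not a worked-out design for any real language.

```python
import re

# Hypothetical lexical classes (assumed for illustration). Order matters:
# Python's regex alternation takes the first alternative that matches,
# so the more specific underscored forms must be tried first.
TOKEN_SPEC = [
    ("UPPER_UNDERSCORED", r"[A-Z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)+"),
    ("LOWER_UNDERSCORED", r"[a-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)+"),
    ("TYPE_IDENT", r"[A-Z][A-Za-z0-9]*"),   # initial uppercase letter
    ("VAR_IDENT", r"[a-z][A-Za-z0-9]*"),    # initial lowercase letter
    ("NUMBER", r"[0-9]+"),
    ("PUNCT", r"[*;=(),]"),
    ("SKIP", r"\s+"),
]

# One combined regular expression; each class is a named group.
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (class, lexeme) pairs using spelling alone."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":  # drop whitespace
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    # The leading identifier's class is known from its spelling, so a
    # parser could tell a declaration from a multiplication directly.
    print(tokenize("Foo * bar;"))
    print(tokenize("foo * bar;"))
    print(tokenize("Under_Scored under_scored"))
```

Under these assumed rules, "Foo * bar;" and "foo * bar;" differ in the
class of their first token, which is the information a C lexer has to
get from the symbol table.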