Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!elroy!peregrine!ccicpg!nick
From: nick@ccicpg.UUCP (Nick Crossley)
Newsgroups: comp.lang.misc
Subject: Re: Dumb Lexical Analyzers are Smart
Summary: Check out Algol68
Message-ID: <25387@ccicpg.UUCP>
Date: 20 Sep 88 23:31:41 GMT
References: <5200026@m.cs.uiuc.edu>
Reply-To: nick@ccicpg.UUCP (Nick Crossley)
Organization: CCI CPG, Irvine CA
Lines: 65

In article <5200026@m.cs.uiuc.edu> wsmith@m.cs.uiuc.edu writes:
>
>What I would like to do in a new language is make the lexical analyzer have
>more lexical categories. For example, TypeIdentifier and variableidentifier
>would be lexically different because the first has an initial uppercase letter
>while the second doesn't. If other categories of identifiers are needed, they
>may be defined to be identifiers-with-a-minus-sign or under_scored or
>Under_Scored each could belong to a different category if there was a valid
>semantic reason to distinguish between them. (Prolog and Smalltalk already do
>this.)
>
>Bill Smith uiucdcs!wsmith wsmith@cs.uiuc.edu

I agree it is a good idea in language design to ensure that lexing is
independent of parsing.  Algol68 does this to some extent.

'Keywords' (such as BEGIN/END, etc.) are individual symbols, which could
be represented with a single character if an implementation had a large
enough character set and appropriate keyboards.  In practice, and in the
reference language, these symbols are normally represented using the
appropriate sequence of letters, but in a different alphabet from variable
names.  Type and operator identifiers are also spelled using this different
alphabet.  The language does not restrict how these two alphabets are
implemented; conventionally upper and lower case are used, but bold,
underlining, quoting, etc., are all possible.

This does leave one (nasty) ambiguity, between type and operator names.
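To make the two-alphabet point concrete, here is a minimal sketch in Python
(not Algol68, and not taken from any real compiler).  It assumes the
conventional upper/lower case representation of the two alphabets; the names
TOKEN and classify are made up for illustration.  The lexer can sort words
into the two alphabets with no help from the parser, but it can only say
"bold word", not "type" or "operator".

```python
import re

# Assumption: upper case stands for the "bold" alphabet (keywords, types,
# operators) and lower case for ordinary identifiers.  Anything else is
# lumped together as "other" to keep the sketch short.
TOKEN = re.compile(r"(?P<bold>[A-Z]+)|(?P<ident>[a-z][a-z0-9]*)|(?P<other>\S)")

def classify(text):
    """Return (category, spelling) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(classify("WORD a;"))
# [('bold', 'WORD'), ('ident', 'a'), ('other', ';')]
```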
Consider the fragment :-
	WORD a;
Is this a declaration of a REF WORD a, or is it a monadic formula, with
the monadic operator WORD?  This is very similar to the C typedef problem,
and is usually solved in a similar way.  The lexer initially assumes that
all uppercase words are type names, and the parser changes that for a given
word when it sees an operator or priority definition for it.  This often
implies an implementation restriction that a single compilation unit cannot
use an uppercase word as both a type and an operator (in different scopes).
This ambiguity could be solved by using a third alphabet (say italic) for
operators.

Problems with lexing operator tokens, when the user is allowed to define
additional operators, were avoided by splitting all possible operator
characters into two classes, monad and nomad.

Dyadic operators can be any of the combinations :-
	monad
	nomad
	monad nomad
	nomad nomad
Monadic operators can be either of :-
	monad
	monad nomad

The nomad characters are < > / = * x (a times-symbol, not the letter x).
All others are monad.

Note that this avoids all possible lexical ambiguities in building up
tokens from characters; it does not distinguish between dyadic and monadic
operators, as in Algol68 there is no need to do this at the lexical level.
It does ensure that the lexing of a sequence of operator characters is
fixed and does not depend on context.

Note that this scheme makes the C operator -- impossible in Algol68: the
lexer would return two separate tokens, since - is a monad and there is no
operator formed by 'monad monad'.
-- 

<<< standard disclaimers >>>
Nick Crossley, CCI, 9801 Muirlands, Irvine, CA 92718-2521, USA
Tel. (714) 458-7282,  uucp: ...!uunet!ccicpg!nick
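
The monad/nomad rule can be sketched as follows, again in Python and again
only as an illustration.  Assumptions: MONADS is a small sample (the rule
is simply "every operator character that is not a nomad"), the times-symbol
is written as U+00D7, and only the combinations listed above are handled.
lex_operators is a hypothetical name.  Every listed form is one operator
character optionally followed by one nomad, so a single character of
lookahead is enough and no context is needed.

```python
# The nomad characters from the list above (times-symbol as U+00D7).
NOMADS = set("<>/=*\u00d7")
# Assumption: a sample of monads only, not the full Algol68 set.
MONADS = set("+-!&%^~?")

def lex_operators(text):
    """Split a run of operator characters into operator tokens."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        if ch not in NOMADS and ch not in MONADS:
            raise ValueError(f"not an operator character: {ch!r}")
        j = i + 1
        # The only legal second character is a nomad.
        if j < len(text) and text[j] in NOMADS:
            j += 1
        tokens.append(text[i:j])
        i = j
    return tokens

print(lex_operators("--"))   # ['-', '-']
print(lex_operators("<="))   # ['<=']
print(lex_operators("+*-"))  # ['+*', '-']
```

The first call shows the C-style -- coming out as two tokens, because a
monad can never be the second character of an operator.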