Path: utzoo!attcan!uunet!lll-winken!lll-tis!ames!elroy!peregrine!ccicpg!nick
From: nick@ccicpg.UUCP (Nick Crossley)
Newsgroups: comp.lang.misc
Subject: Re: Dumb Lexical Analyzers are Smart
Summary: Check out Algol68
Message-ID: <25387@ccicpg.UUCP>
Date: 20 Sep 88 23:31:41 GMT
References: <5200026@m.cs.uiuc.edu>
Reply-To: nick@ccicpg.UUCP (Nick Crossley)
Organization: CCI CPG, Irvine CA
Lines: 65

In article <5200026@m.cs.uiuc.edu> wsmith@m.cs.uiuc.edu writes:
>
>What I would like to do in a new language is make the lexical analyzer have
>more lexical categories.  For example, TypeIdentifier and variableidentifier
>would be lexically different because the first has an initial uppercase letter 
>while the second doesn't.  If other categories of identifiers are needed, they
>may be defined to be identifiers-with-a-minus-sign or under_scored or
>Under_Scored each could belong to a different category if there was a valid
>semantic reason to distinguish between them.  (Prolog and Smalltalk already do 
>this.)
>
>Bill Smith		uiucdcs!wsmith		wsmith@cs.uiuc.edu

I agree it is a good idea in language design to ensure that lexing is
independent of parsing.

Algol68 does this to some extent.  'Keywords' (such as BEGIN/END, etc.) are
individual symbols, which could be represented with a single character if
an implementation had a large enough character set and appropriate keyboards.
In practice, and in the reference language, these symbols are normally
represented using the appropriate sequence of letters, but in a different
alphabet from variable names.  Type and operator identifiers are also spelled
using this different alphabet.

The language does not restrict how these two alphabets are implemented;
conventionally upper and lower case are used, but bold, underlining, quoting,
etc., are all possible.
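
As a rough illustration of the two-alphabet idea, here is a minimal Python
sketch, assuming the conventional upper/lower case representation.  The names
lex_words and TOKEN are mine, not from any Algol68 implementation, and the
identifier rule is simplified (no embedded spaces).  The point is only that
the category of a word follows from its spelling, with no parser state.

```python
import re

# Assumption: upper-case stropping.  Words in the "bold" alphabet are
# written in upper case, ordinary identifiers in lower case.
TOKEN = re.compile(
    r"\s*(?:(?P<bold>[A-Z]+)|(?P<ident>[a-z][a-z0-9]*)|(?P<other>\S))"
)

def lex_words(text):
    """Return (category, spelling) pairs using spelling alone."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:          # only trailing whitespace is left
            break
        # Exactly one named group matched; lastgroup gives its name.
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens
```

For example, lex_words("BEGIN INT count; END") puts BEGIN, INT and END in the
"bold" category and count in "ident" without ever looking at a symbol table.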

This does leave one (nasty) ambiguity, between type and operator names.
Consider the fragment :-
	WORD a;
Is this a declaration of a REF WORD a, or is it a monadic formula, with
the monadic operator WORD?  This is very similar to the C typedef problem,
and is usually solved in a similar way.  The lexer initially thinks all
uppercase words are type names, but the parser will change that when it
sees an operator or priority definition.  This often implies an implementation
restriction that a single compilation unit cannot use an uppercase word
as both a type and an operator (in different scopes).  This ambiguity
could be solved by using a third alphabet (say italic) for operators.
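
The feedback from parser to lexer might look something like the following
Python sketch.  It is a toy under stated assumptions, not how any particular
compiler does it: it only recognises the simple forms "OP NAME" and
"PRIO NAME", it needs the definition to appear before the use, and it keeps
one table for the whole compilation unit (which is exactly the restriction
mentioned above).  KEYWORDS and classify_bold are hypothetical names.

```python
KEYWORDS = {"BEGIN", "END", "OP", "PRIO"}   # tiny illustrative subset

def classify_bold(tokens):
    """Classify each spelling; upper-case words default to 'type'."""
    operators = set()      # one table per compilation unit, no scoping
    out = []
    prev = None
    for word in tokens:
        if not word.isupper():
            cat = "other"
        elif word in KEYWORDS:
            cat = "keyword"
        elif prev in ("OP", "PRIO"):
            # An operator or priority definition: reclassify this word.
            operators.add(word)
            cat = "operator"
        elif word in operators:
            cat = "operator"
        else:
            cat = "type"
        out.append((word, cat))
        prev = word
    return out
```

With no definition in sight, WORD in "WORD a;" comes back as a type; after
"OP WORD = ..." every later WORD in the unit comes back as an operator.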

Problems with lexing operator tokens, when the user is allowed to define
additional operators, were avoided by splitting all possible operator
characters into two classes, monad and nomad.  Dyadic operators can be
any of the combinations :-
	monad
	nomad
	monad nomad
	nomad nomad
Monadic operators can be either of :-
	monad
	monad nomad
The nomad characters are < > / = * and x (the times symbol, not the letter x).
All other operator characters are monads.

Note that this avoids all possible lexical ambiguities in building up tokens
from characters.  It does not distinguish between dyadic and monadic
operators, as in Algol68 there is no need to do this at the lexical level.
It does ensure that the lexing of a sequence of operator characters is fixed
and does not depend on context.  Note also that this scheme makes the C
operator -- impossible in Algol68: the lexer would return two separate
tokens, since - is a monad and there is no operator formed by 'monad monad'.
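
The monad/nomad rule can be sketched in a few lines of Python.  This is a
minimal sketch of the scheme as described in this article only.  The MONADS
set is my assumption (a plausible sample of the "all others" class), the
times symbol is written as the character ×, and lex_operators is a
hypothetical name.  An operator token is one operator character, optionally
followed by a single nomad.

```python
NOMADS = set("<>/=*×")       # × stands for the times symbol
MONADS = set("+-!?%^&~")     # assumption: a sample of "all others"

def lex_operators(chars):
    """Split a run of operator characters into operator tokens."""
    tokens = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if c not in MONADS and c not in NOMADS:
            raise ValueError("not an operator character: %r" % c)
        j = i + 1
        # Either class may start a token, but only a nomad may follow.
        if j < len(chars) and chars[j] in NOMADS:
            j += 1
        tokens.append(chars[i:j])
        i = j
    return tokens
```

So lex_operators("<=") gives one token, while lex_operators("--") gives two
separate - tokens, whatever the surrounding context.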
-- 

<<< standard disclaimers >>>
Nick Crossley, CCI, 9801 Muirlands, Irvine, CA 92718-2521, USA
Tel. (714) 458-7282,  uucp: ...!uunet!ccicpg!nick