Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!ncar!tank!uxc!uxc.cso.uiuc.edu!a.cs.uiuc.edu!m.cs.uiuc.edu!wsmith
From: wsmith@m.cs.uiuc.edu
Newsgroups: comp.lang.misc
Subject: Dumb Lexical Analyzers are Smart
Message-ID: <5200026@m.cs.uiuc.edu>
Date: 19 Sep 88 15:32:00 GMT
Lines: 50
Nf-ID: #N:m.cs.uiuc.edu:5200026:000:2170
Nf-From: m.cs.uiuc.edu!wsmith    Sep 19 10:32:00 1988


What are the advantages and disadvantages of designing a language that may
be lexically analyzed with no resort to semantic information?  

Pascal, Prolog, and Smalltalk are such languages because each may be
unambiguously parsed when the lexical class of every token is determined
by a finite-state machine (i.e. a regular expression).  C is not such
a language because of the typedef construct.  A typedef changes
the lexical class of the new type's identifier to avoid horrendous ambiguity in
the language.  C++ has the same problem, only worse.
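
To make the typedef problem concrete, here is a minimal sketch in Python
(not taken from any real C compiler).  The names classify, reading_of and
typedef_names are made up for illustration.  It only shows that the class
of the word T depends on declarations seen earlier, not on its spelling.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def classify(word, typedef_names):
    """Return the lexical class of an identifier in a C-like language.

    The answer depends on typedef_names, i.e. on declarations seen so far,
    so no regular expression over the spelling alone can decide it.
    """
    if not IDENT.fullmatch(word):
        raise ValueError(f"not an identifier: {word!r}")
    return "TYPE_NAME" if word in typedef_names else "IDENTIFIER"

def reading_of(first_word, typedef_names):
    """Say how the text 'T * x;' would be read, given the class of T."""
    if classify(first_word, typedef_names) == "TYPE_NAME":
        return "declaration of a pointer"
    return "multiplication expression"

# Same text "T * x;", two different readings:
print(reading_of("T", {"T"}))   # after 'typedef int T;'
print(reading_of("T", set()))   # when T is an ordinary variable
```

The set passed in stands for whatever symbol table the compiler keeps; the
lexer cannot do its job without it.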

What I would like to do in a new language is make the lexical analyzer have
more lexical categories.  For example, TypeIdentifier and variableidentifier
would be lexically different because the first has an initial uppercase letter 
while the second doesn't.  If other categories of identifiers are needed,
identifiers-with-a-minus-sign, under_scored, and Under_Scored could each
be assigned to a different category, provided there is a valid semantic
reason to distinguish between them.  (Prolog and Smalltalk already do
this.)
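
As a rough sketch of what such a lexer might look like, the Python below
assigns a category from spelling alone.  The category names and the exact
patterns are my assumptions for the sake of the example, not a worked-out
design.  The patterns are disjoint, so the order of the table does not
matter.

```python
import re

# Hypothetical categories, one regular expression each.
CATEGORIES = [
    ("TypeIdentifier",     re.compile(r"[A-Z][A-Za-z0-9]*")),
    ("variableidentifier", re.compile(r"[a-z][A-Za-z0-9]*")),
    ("Under_Scored",       re.compile(r"[A-Z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)+")),
    ("under_scored",       re.compile(r"[a-z][A-Za-z0-9]*(?:_[A-Za-z0-9]+)+")),
    ("with-a-minus-sign",  re.compile(r"[A-Za-z][A-Za-z0-9]*(?:-[A-Za-z0-9]+)+")),
]

def lexical_class(word):
    """Return the category whose pattern matches the whole word, or None."""
    for name, pattern in CATEGORIES:
        if pattern.fullmatch(word):
            return name
    return None

for w in ["TypeIdentifier", "variableidentifier", "under_scored",
          "Under_Scored", "identifiers-with-a-minus-sign"]:
    print(w, "->", lexical_class(w))
```

No symbol table appears anywhere; every decision is made by a finite-state
match on the characters of the token itself.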

There are a lot of possibilities, and I don't mean to limit myself to only
splitting the lexical classes for identifiers.  Other tokens could also 
have several uses, but identifiers were the easiest to explain.  

Advantages I can see:

	1.  A person familiar with the lexical rules of the language can 
		more easily understand a routine without consulting all of the
		declarations involved.

	2.  The lexical analyzer and parser could be separated into two 
		processes, possibly improving the performance of the compiler 
		on a parallel architecture.

	3.  Post-processors of the language such as a pretty printer or
		cross-reference utility do not need the symbol
		table part of the compiler in order to be written.
	
	4.  It is easier to make a language-oriented editor wholly 
		table-driven.
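
To illustrate advantage 3, here is a small Python sketch of a
cross-reference utility that classifies names by their first letter only.
It assumes the hypothetical uppercase/lowercase rule described earlier,
and it ignores keywords, strings and comments to stay short.

```python
import re
from collections import defaultdict

WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def cross_reference(source):
    """Map each identifier to (class, sorted line numbers), by spelling only."""
    lines_of = defaultdict(set)
    for lineno, line in enumerate(source.splitlines(), start=1):
        for word in WORD.findall(line):
            lines_of[word].add(lineno)
    return {
        # Assumed rule: initial uppercase means a type, otherwise a variable.
        word: ("type" if word[0].isupper() else "variable", sorted(nums))
        for word, nums in lines_of.items()
    }

sample = "Point origin\nPoint corner\norigin = corner\n"
for word, (kind, nums) in sorted(cross_reference(sample).items()):
    print(word, kind, nums)
```

The utility never parses a declaration, yet it can still report which
names are types and which are variables.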

Disadvantages I can see:

	1.   An explosion in the number of lexical classes may make the
		rules too difficult to remember.

	2.   Users may disagree with the language designer's set of lexical
		classes (or be just plain stubborn).

Bill Smith		uiucdcs!wsmith		wsmith@cs.uiuc.edu