Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83 SMI; site emacs.uucp
Path: utzoo!linus!decvax!cca!emacs!ray
From: ray@emacs.uucp (Ray Reeves)
Newsgroups: net.unix
Subject: Re: comments in lex
Message-ID: <122@emacs.uucp>
Date: Thu, 11-Jul-85 14:56:38 EDT
Article-I.D.: emacs.122
Posted: Thu Jul 11 14:56:38 1985
Date-Received: Mon, 19-Aug-85 23:10:30 EDT
References: <114@emacs.uucp>
Reply-To: ray@emacs.UUCP (Ray Reeves)
Organization: CCA Uniworks, Wellesley, MA
Lines: 66
Summary:

Thanks for all the contributions on this subject.

Many people recommended embedding C in Lex to solve my problem, pointing out some precedents for this. This, of course, is tantamount to saying that Lex can't hack it, and indeed amdahl!drivax!alan says I shouldn't expect a finite state machine to do so.

Paul Haahr of Princeton made a snappy answer which was:

    "/*"([^*]|"*"[^/]*"*/"

but Glen Dudek of Harvard pointed out that this fails for /***/, and it should have been:

    "/*"("/"|("*"*[^*/]))*"*"+"/"

Several people pointed out the hazard of enormous block comments, and McQueer steered me to the use of START transitions, which I decided was sound advice when I discovered the yymore function. I append the Lex code that I have arrived at, a program to which you all contributed.

My special problem is a lexical processor for a PL/1 pretty-printer, and in this environment people typically have enormous comment blocks with some sort of pattern or table in them. Thus, although code can be torn to shreds and reformatted, block comments must be left undisturbed.

The problem of large blocks is solved by entering a "comment" mode and tokenising each line separately. The residual problem is that the first line of such a block has to respect leading white space even before the comment starts. This is solved by tokenising the whole line, not just the comment part. Elsewhere, white space is discarded.
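As an aside, the language that Glen Dudek's pattern describes (an opening slash-star, a body, and the first star-slash at the very end) can also be recognised by a small hand-written state machine. The following is only a sketch in C, not part of the posted Lex program; the function name is_block_comment is hypothetical, and its agreement with the pattern is by inspection rather than proof.

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if s is exactly one block comment,
 * i.e. it starts with slash-star and the first star-slash in it is
 * also the end of the string. Returns 0 otherwise.
 */
int is_block_comment(const char *s)
{
    /* BODY: last character was not a star; STAR: inside a run of stars */
    enum { BODY, STAR } state = BODY;
    size_t i;

    /* The opener must be present; short-circuit keeps "" safe. */
    if (s[0] != '/' || s[1] != '*')
        return 0;

    for (i = 2; s[i] != '\0'; i++) {
        char c = s[i];

        if (state == BODY) {
            if (c == '*')
                state = STAR;       /* a closer may be starting */
        } else {
            if (c == '/')
                return s[i + 1] == '\0';  /* first closer: must end the string */
            if (c != '*')
                state = BODY;       /* the run of stars led nowhere */
            /* another star: stay in STAR */
        }
    }
    return 0;                       /* ran out of input before a closer */
}
```

A run of stars simply stays in the STAR state, which is why /***/ is accepted here; that is the case the shorter pattern was reported to miss.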
To my astonishment, a fault in Lex showed up which almost crippled me. It is impossible to recognise just one \n character under a START mode, although you can in normal mode. Thus, my last rule looks for [\n]+ followed by any character and then unputs that character back. Is this a known wart?

startcom \/\*
endcom \*\/
%START com maybecom
%%
{startcom} {yymore();BEGIN com;}
[\n]+ {printf("%s%u%c","nl(",yyleng,')');BEGIN maybecom;}
[ \t]+ ;
[\ \t]*{startcom} {yymore();BEGIN com;}
[^\*\n]*{endcom} {printf("%s%u%s%s%s","cm(",yyleng,",\"",yytext,"\")");BEGIN 0;}
[^\*\n]* printf("%s%u%s%s%s","cm(",yyleng,",\"",yytext,"\")");
[^\*\n]*\* yymore();
[\n]+. {unput(yytext[yyleng-1]);printf("%s%u%c","nl(",yyleng-1,')');}
%%
main() {while (2) yylex();}

--
Ray Reeves,
CCA-UNIWORKS,20 William St,Wellesley, Ma. 02181. (617)235-2600
emacs!ray@CCA-UNIX