Path: utzoo!utgpu!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!ames!pasteur!helios.ee.lbl.gov!sequoia.ee.lbl.gov!vern
From: vern@sequoia.ee.lbl.gov (Vern Paxson)
Newsgroups: comp.sources.d
Subject: Re: flex/lex and '\0' input
Keywords: lexical scanners, null
Message-ID: <1338@helios.ee.lbl.gov>
Date: 30 Nov 88 20:52:33 GMT
References: <1047@naucse.UUCP>
Sender: usenet@helios.ee.lbl.gov
Reply-To: vern@sequoia.ee.lbl.gov (Vern Paxson)
Organization: Lawrence Berkeley Laboratory, Berkeley
Lines: 47

In article <1047@naucse.UUCP> jdc@naucse.UUCP (John Campbell) writes:
>I spent a long time today learning that flex and lex won't deal
>with '\0' input....
>....
>If you have a thought on how to change flex (I don't have a source
>license to lex) so that it can handle '\0', I'd love to know. If you
>have a rationale regarding the current behavior I'd also like to know.

Rationale: there are two reasons why flex can't deal with nulls in its
input. The first is historical: flex was originally a Ratfor program
running under Software Tools, and that combination made nulls
problematic. The second is performance: for fast scanning you want to
eliminate the check for "are we at the end of the current input buffer"
from the inner loop. The way flex does this is to mark the end of the
input buffer with a null, and then each DFA state has a transition on
null into an accepting state which reads in the next buffer's worth of
data and restarts the scan. This method requires that one character
value be preempted to serve as the end-of-buffer marker. Null was
chosen because it's already burdened with an extra meaning as a C
end-of-string, making it in general difficult to treat properly.

Fixing it: it could be done, but with a fair amount of work. The
problem is that the internals of flex are sloppy and assume that 0 can
be used for marking unset values. Finding and eliminating these
assumptions would be tedious since they aren't apparent unless you
inspect the code line by line.
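
To make the end-of-buffer trick from the Rationale paragraph concrete,
here is a minimal sketch in C. It is not flex's skeleton: the names
count_digit_runs, char_class and next_state, the two-state DFA, and the
single fixed buffer with no refill are all assumptions made for
illustration. Treating a real null in the data as a separator is
likewise a choice made here, not flex's behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-state DFA that counts runs of decimal digits. */
enum { OUTSIDE, INSIDE, AT_NULL };      /* AT_NULL: entered on a null */
enum { C_NULL, C_DIGIT, C_OTHER };      /* character classes */

static int char_class(unsigned char c)
{
    if (c == '\0')
        return C_NULL;
    return (c >= '0' && c <= '9') ? C_DIGIT : C_OTHER;
}

/* Every scanning state has a transition on null into AT_NULL. */
static const int next_state[2][3] = {
    /* OUTSIDE */ { AT_NULL, INSIDE, OUTSIDE },
    /* INSIDE  */ { AT_NULL, INSIDE, OUTSIDE },
};

/*
 * Count digit runs in buf[0..len-1].  The caller must store a null at
 * buf[len]; that sentinel replaces a "p < end" test in the inner loop.
 */
size_t count_digit_runs(const char *buf, size_t len)
{
    const char *p = buf;
    const char *end = buf + len;
    int state = OUTSIDE;
    size_t runs = 0;

    assert(buf[len] == '\0');

    for (;;) {
        int next;

        /* Inner loop: table lookups only, no bounds check. */
        while ((next = next_state[state][char_class((unsigned char)*p)])
               != AT_NULL) {
            if (state == OUTSIDE && next == INSIDE)
                runs++;
            state = next;
            p++;
        }

        /* Slow path, taken only when a null has been seen. */
        if (p == end)
            break;          /* the sentinel: a real scanner would refill */
        state = OUTSIDE;    /* a null in the data: treated as a separator */
        p++;
    }
    return runs;
}
```

The only exit test in the inner loop is the state comparison; the
pointer is compared against the end of the buffer only on the slow
path, after a null has already been seen.
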

The scanner skeleton already detects real nulls versus fake
end-of-buffer ones, but it does so after it has already accepted an
input pattern. Continuing the state machine where it left off requires
enough bookkeeping that it's a slow process compared with inner-loop
scanning, so if you have a lot of nulls in your input, the performance
degradation would probably be comparable to preprocessing the input to
directly remove the nulls.

I hope you didn't lose too much time discovering this deficiency - it's
documented in the flex manual entry.

>I solved the problem, BTW, by replacing the input routine to lex
>and "squeezing" out any '\0's that appear. This means, however,
>that I had to scan the input one extra time before letting the
>scanner do it's job.

Yep, that's pretty much what you have to do. Sorry.

		Vern

	Vern Paxson				vern@lbl-csam.arpa
	Real Time Systems			ucbvax!lbl-csam.arpa!vern
	Lawrence Berkeley Laboratory		(415) 486-6411
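
The "squeezing" workaround quoted above might look something like the
following in C. This is only a sketch: squeeze_nulls is a hypothetical
helper, not part of lex or flex, and it assumes the input has already
been read into a buffer of known length.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Remove every null byte from buf[0..len-1] in place and return the
 * new length.  Hypothetical helper; not part of lex or flex.
 */
size_t squeeze_nulls(char *buf, size_t len)
{
    size_t in, out = 0;

    assert(buf != NULL || len == 0);

    for (in = 0; in < len; in++) {
        if (buf[in] != '\0')
            buf[out++] = buf[in];   /* keep non-null bytes, in order */
    }
    return out;
}
```

This is the one extra pass over the input mentioned in the quoted
text. In a flex scanner the natural place to call it would be a
redefined YY_INPUT, right after the read; that wiring is not shown
here.
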