Path: utzoo!attcan!uunet!super!udel!gatech!uflorida!mailrus!cornell!uw-beaver!uw-june!ka
From: ka@june.cs.washington.edu (Kenneth Almquist)
Newsgroups: comp.unix.wizards
Subject: Re: what should egrep '|root' print? (syntax/semantics)
Message-ID: <5847@june.cs.washington.edu>
Date: 27 Sep 88 09:12:06 GMT
References: <44414@beno.seismo.CSS.GOV> <68203@sun.uucp> <8202@alice.UUCP> <1988Sep20.043728.20198@utzoo.uucp>
Sender: uucp@super.ORG
Organization: U of Washington, Computer Science, Seattle
Lines: 55

henry@utzoo.uucp (Henry Spencer) writes:
> Well, personally, I'd dearly love to be able to use (| and |) as metasymbols,
> since (a) one highly desirable extension to my regexp package would be the
> beginning/end-of-identifier metasymbols found in many implementations,
> (b) I am deeply opposed to declaring more unbackslashed characters to be
> metasymbols, and (c) I am even more deeply opposed to declaring *any*
> backslashed characters to be metasymbols.  There are other possibilities,
> exploiting sequences that are syntax errors at the moment, but none of
> them is nearly as pretty.  (Not a trivial issue, given that users have to
> remember whatever sequence gets chosen.)  Alas, I am also sympathetic
> to the argument that (1) it would be an unfortunate inconsistency, and
> (2) programs that generate regexps might have to go out of their way to
> avoid generating these magic sequences.  Argh.  Any thoughts?

My solution (when I faced this problem a long time ago) was to make an
asterisk at the start of a regular expression require that the string
matched be neither preceded nor followed by a character which can
appear in a word.  The arguments pro and con seem to be:

 1) Word beginning and ending patterns are more flexible.  Can anyone
    come up with a use for this flexibility?  I can't.
 2) The asterisk convention is easier to type.
 3) The asterisk convention is easy to explain to a beginner on an
    intuitive level ("Place an asterisk in front of the expression to
    search for a word"), although a complete explanation of the
    semantics is about as complicated for either convention.
 4) Even after the user learns the word beginning and ending patterns,
    the user still has to type two of them to get a word search, which
    increases the cognitive complexity compared to typing one character
    to get a word search.
 5) Neither syntax is intuitively obvious, but (| and |) do have
    intuitively obvious interpretations (both consist of a parenthesis
    and a '|' operator) which differ from the interpretation that Henry
    suggests for them.

The basic problem with the word beginning and ending patterns is that
they are at the wrong level.  If they are *only* used as building blocks
for word searches, then a higher level feature like the asterisk
convention, which allows users to request word searches directly, is a
better choice.  And they are too high level to be used for much else
besides constructing word searches.  The rare cases where they are used
for something else (if such cases exist) can be handled by lower level
features from which word beginning and ending patterns can be
constructed.  I expect that Henry's regexp package (like egrep) already
has the required features.

In conclusion, I believe that including the (| and |) operators in a
regular expression package is a poor idea on two grounds.  The semantics
are wrong; if word searches are desired there are better ways to provide
them, such as the asterisk convention.  And (| and |) are a lousy choice
of operators, for reasons which Henry notes in his article, while the
asterisk convention has no such problems.
				Kenneth Almquist
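
For illustration only, the asterisk convention described above can be
sketched in a few lines.  This is a sketch in Python, not code from
either regexp package discussed in the post.  The function names and
the definition of a "word character" (letter, digit, or underscore) are
assumptions; the post does not specify them.  The second function shows
one way a word search might be built from lower level features
(alternation, anchors, and character classes), in the spirit of the
remark about egrep.

```python
import re

# Assumption: a "word character" is a letter, digit, or underscore.
# The original post does not define the set.
WORD_CHAR = "[A-Za-z0-9_]"
NON_WORD_CHAR = "[^A-Za-z0-9_]"


def compile_with_asterisk_convention(pattern):
    """Compile `pattern`; a leading '*' requests a whole-word search.

    A leading asterisk has nothing to repeat, so it is otherwise an
    error in Python's `re`, which leaves it free for this use.
    (Hypothetical helper, written for this sketch.)
    """
    if pattern.startswith("*"):
        body = pattern[1:]
        # The match must be neither preceded nor followed by a
        # character which can appear in a word.
        return re.compile(
            "(?<!" + WORD_CHAR + ")(?:" + body + ")(?!" + WORD_CHAR + ")"
        )
    return re.compile(pattern)


def build_from_lower_level(body):
    """Build a word search from anchors, alternation, and classes.

    Unlike the lookaround version above, the neighbouring non-word
    character is consumed as part of the match.
    (Hypothetical helper, written for this sketch.)
    """
    return re.compile(
        "(^|" + NON_WORD_CHAR + ")(" + body + ")(" + NON_WORD_CHAR + "|$)"
    )


if __name__ == "__main__":
    word_root = compile_with_asterisk_convention("*root")
    for line in ["root:x:0:0", "chroot /mnt", "rooted tree"]:
        print(repr(line), bool(word_root.search(line)))
```

With these definitions, "*root" finds "root" in "root:x:0:0" but not in
"chroot /mnt" or "rooted tree", while plain "root" still matches all
three.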