Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!cornell!uw-beaver!ssc-vax!cxsea!blm
From: blm@cxsea.UUCP (Brian Matthews)
Newsgroups: comp.editors
Subject: Re: pattern matches
Message-ID: <2424@cxsea.UUCP>
Date: 5 Jul 88 15:51:44 GMT
References: <427@grand.UUCP> <37200009@m.cs.uiuc.edu> <7618@watdragon.waterloo.edu> <11070@sol.ARPA>
Reply-To: blm@cxsea.UUCP (Brian Matthews)
Organization: Computer X Inc.
Lines: 82

Stuart Friedberg (stuart@cs.rochester.edu) writes:
[ discussion about matching the string abcdxxxghi against the pattern
abc(d.*e)^ghi, where the ^ means the preceding subexpression should match
strings that the subexpression alone wouldn't match ]
|In at least one interpretation, this pattern *does* match the input,
|because you can, in fact, find an "abc", followed by something that
|is not a "(d.*e)", followed by a "ghi".
|
|    abcdxxxghi
|    ^^^            matched by pattern "abc"
|       ^^^^        matched by pattern "(d.*e)^"
|           ^^^     matched by pattern "ghi"
|
|However, the conventional (but certainly not necessary) interpretation
|of regular expressions, at least in the Unix tool world, is that they
|match as much as possible.

Actually, this is also a problem with normal expressions (those without
the added ^).  Consider matching abcdef by abc.*def.  You can have:

    abcdef
    ^^^         matched by pattern "abc"
       ^^^      matched by pattern ".*"
                "def" doesn't match, so match fails.
or:
    abcdef
    ^^^         matched by pattern "abc"
                ".*" matches the null string
       ^^^      matched by pattern "def"

In Unix, this ambiguity is resolved by specifying that a subexpression
should match as much as possible while still allowing the following
subexpressions to match.  That means the second case above is what
happens, as well as things like:

    abcdefdef
    ^^^         matched by pattern "abc"
       ^^^      matched by pattern ".*"
          ^^^   matched by pattern "def"

|Under that interpretation, the pattern
|doesn't completely match the input, because the "(d.*e)^" matches
|too much, leaving nothing for the "ghi".
|
|    abcdxxxghi
|    ^^^            matched by pattern "abc"
|       ^^^^^^^     matched by pattern "(d.*e)^"

The additional clause means that (d.*e)^ matches the longest string it can
while still allowing ghi to match, namely dxxx.

|    abcdxxxghixxxghi
|    ^^^                  matched by "abc"
|       ^^^^              matched by "(d.*e)^"
|           ^^^           matched by "ghi"
|
|or, perhaps:
|
|    abcdxxxghixxxghi
|    ^^^                  matched by "abc"
|       ^^^^^^^^^^        matched by "(d.*e)^"
|                 ^^^     matched by "ghi"

The second case would be correct under the above convention, as
dxxxghixxx is the longest string (d.*e)^ matches while still allowing
ghi to match.

|Having said all this, I think that negated patterns could be quite
|useful.

I do too, and have been thinking about how it should properly be done.
A couple of other things I've been thinking about are being able to do
a case-insensitive match on a subexpression, and treating a subexpression
as a string instead of a pattern.

-- 
Brian L. Matthews  blm@cxsea.UUCP  ...{mnetor,uw-beaver!ssc-vax}!cxsea!blm
+1 206 251 6811    Computer X Inc. - a division of Motorola New Enterprises
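
For anyone who wants to experiment with the "longest match that still lets
the rest succeed" convention, here is a rough sketch in Python.  Python's re
module is used only because it gives the same answers as the convention for
the abc.*def examples; it is not the Unix tools' matcher.  The function
match_negated is hypothetical: it is just one possible reading of the
proposed ^ operator, and to keep it short it assumes the text before and
after the negated subexpression is a literal string rather than a pattern.

```python
import re

# Ordinary patterns: ".*" takes as much as it can, then gives characters
# back until the following "def" can match.
m1 = re.match(r"abc(.*)def", "abcdef")      # ".*" ends up with the null string
m2 = re.match(r"abc(.*)def", "abcdefdef")   # ".*" ends up with "def"


def match_negated(prefix, negated, suffix, text):
    """Hypothetical reading of prefix(negated)^suffix.

    Match the literal `prefix` at the start of `text`, then the longest
    span that does NOT match the regular expression `negated` as a whole,
    then the literal `suffix`.  Returns the span matched by the negated
    subexpression, or None if there is no match.
    """
    if not text.startswith(prefix):
        return None
    start = len(prefix)
    # Try the longest candidate span first, then progressively shorter ones.
    for end in range(len(text) - len(suffix), start - 1, -1):
        middle = text[start:end]
        if text.startswith(suffix, end) and re.fullmatch(negated, middle) is None:
            return middle
    return None


short = match_negated("abc", r"d.*e", "ghi", "abcdxxxghi")
longer = match_negated("abc", r"d.*e", "ghi", "abcdxxxghixxxghi")
rejected = match_negated("abc", r"d.*e", "ghi", "abcdxeghi")
```

With this reading, short is dxxx and longer is dxxxghixxx, which agrees with
the second of the two quoted possibilities, while rejected is None because
the only span that leaves room for ghi is dxe, which (d.*e) does match.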