Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!cornell!uw-beaver!ssc-vax!cxsea!blm
From: blm@cxsea.UUCP (Brian Matthews)
Newsgroups: comp.editors
Subject: Re: pattern matches
Message-ID: <2424@cxsea.UUCP>
Date: 5 Jul 88 15:51:44 GMT
References: <427@grand.UUCP> <37200009@m.cs.uiuc.edu> <7618@watdragon.waterloo.edu> <11070@sol.ARPA>
Reply-To: blm@cxsea.UUCP (Brian Matthews)
Organization: Computer X Inc.
Lines: 82

Stuart Friedberg (stuart@cs.rochester.edu) writes:

[ discussion about matching the string abcdxxxghi against the pattern
abc(d.*e)^ghi, where the ^ means the preceding subexpression should match
strings that the subexpression alone wouldn't match ]

|In at least one interpretation, this pattern *does* match the input,
|because you can, in fact, find an "abc", followed by something that
|is not a "(d.*e)", followed by a "ghi".
|
|  abcdxxxghi
|  ^^^		matched by pattern "abc"
|     ^^^^	matched by pattern "(d.*e)^"
|         ^^^	matched by pattern "ghi"
|
|However, the conventional (but certainly not necessary) interpretation
|of regular expressions, at least in the Unix tool world, is that they
|match as much as possible.

Actually, the same problem arises with ordinary expressions (without the
added ^).  Consider matching abcdef by abc.*def.  You can have:

    abcdef
    ^^^		matched by pattern "abc"
       ^^^	matched by pattern ".*"
		"def" doesn't match, so match fails.

or:

    abcdef
    ^^^		matched by pattern "abc"
		".*" matches the null string
       ^^^	matched by pattern "def"

In Unix, this ambiguity is resolved by an additional rule: a subexpression
should match as much as possible, but only so far as the following
subexpressions can still match.  That means the second case above is what
happens, as well as things like:

    abcdefdef
    ^^^		matched by pattern "abc"
       ^^^	matched by pattern ".*"
          ^^^	matched by pattern "def"
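
As a rough illustration of that rule, here is a minimal sketch in Python.
It assumes Python's re module, which is a backtracking matcher rather than
one of the Unix tools, but on these two inputs it gives the same split:

```python
import re

# ".*" is greedy: it first takes everything it can, then gives characters
# back until the trailing "def" is able to match.
pattern = r"(abc)(.*)(def)"

m1 = re.fullmatch(pattern, "abcdef")
m2 = re.fullmatch(pattern, "abcdefdef")

print(m1.groups())  # ('abc', '', 'def')     ".*" matches the null string
print(m2.groups())  # ('abc', 'def', 'def')  ".*" matches the first "def"
```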

|Under that interpretation, the pattern
|doesn't completely match the input, because the "(d.*e)^" matches
|too much, leaving nothing for the "ghi".
|
|  abcdxxxghi
|  ^^^		matched by pattern "abc"
|     ^^^^^^^	matched by pattern "(d.*e)^"

The additional clause means that (d.*e)^ matches the longest string it can
while still allowing ghi to match, namely dxxx.

|  abcdxxxghixxxghi
|  ^^^			matched by "abc"
|     ^^^^		matched by "(d.*e)^"
|         ^^^		matched by "ghi"
|
|or, perhaps:
|
|  abcdxxxghixxxghi
|  ^^^			matched by "abc"
|     ^^^^^^^^^^		matched by "(d.*e)^"
|               ^^^	matched by "ghi"

The second case would be correct under the above convention, as
dxxxghixxx is the longest string (d.*e)^ matches, while still allowing
ghi to match.
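
One way to picture this convention is the sketch below, in Python.  It is
only a sketch under stated assumptions: the function name match_negated and
its parameters are hypothetical, the match is anchored at the start of the
text, and the prefix and suffix are treated as literal strings.  It tries
the longest candidate for (d.*e)^ first and shrinks it until the candidate
is not matched by d.*e and ghi follows:

```python
import re

def match_negated(text, prefix="abc", negated=r"d.*e", suffix="ghi"):
    """Split text as prefix + middle + suffix (plus any leftover), where
    middle is the longest string NOT fully matched by `negated`.
    Returns (prefix, middle, suffix), or None if no such split exists."""
    if not text.startswith(prefix):
        return None
    start = len(prefix)
    # Longest middle first; shrink one character at a time.
    for end in range(len(text) - len(suffix), start - 1, -1):
        middle = text[start:end]
        if re.fullmatch(negated, middle) is None and text.startswith(suffix, end):
            return prefix, middle, suffix
    return None

print(match_negated("abcdxxxghi"))        # ('abc', 'dxxx', 'ghi')
print(match_negated("abcdxxxghixxxghi"))  # ('abc', 'dxxxghixxx', 'ghi')
print(match_negated("abcdxxeghi"))        # None
```

The last call fails because the only middle that leaves room for ghi is
dxxe, which d.*e does match.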

|Having said all this, I think that negated patterns could be quite
|useful.

I do also, and have been thinking about how it should be properly done.
A couple of other things I've been thinking about are being able to do a
case insensitive match on a subexpression, and treating a subexpression
as a string instead of a pattern.
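
To make those two ideas concrete, here is how they might look in one
existing notation.  This is a sketch that assumes Python's re module
(version 3.6 or later for the scoped flag), not a proposal for how an
editor should spell them:

```python
import re

# Case-insensitive match on one subexpression only: "(?i:...)" limits the
# ignore-case flag to the group it wraps.
ci = re.compile(r"abc(?i:def)ghi")
print(bool(ci.fullmatch("abcDeFghi")))   # True
print(bool(ci.fullmatch("ABCdefghi")))   # False

# Treat a subexpression as a string instead of a pattern: re.escape
# quotes the metacharacters so "d.*e" only matches those four characters.
lit = re.compile("abc" + re.escape("d.*e") + "ghi")
print(bool(lit.fullmatch("abcd.*eghi")))  # True
print(bool(lit.fullmatch("abcdxxeghi")))  # False
```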

-- 
Brian L. Matthews  blm@cxsea.UUCP   ...{mnetor,uw-beaver!ssc-vax}!cxsea!blm
+1 206 251 6811    Computer X Inc. - a division of Motorola New Enterprises