Path: utzoo!utgpu!water!watmath!watdragon!lion!smking
From: smking@lion.waterloo.edu (Scott M. King)
Newsgroups: comp.editors
Subject: Re: pattern matches
Summary: Negative patterns NOT well defined
Keywords: pattern matcher regular expression
Message-ID: <7648@watdragon.waterloo.edu>
Date: 6 Jul 88 20:59:11 GMT
References: <427@grand.UUCP> <37200009@m.cs.uiuc.edu> <7618@watdragon.waterloo.edu> <11070@sol.ARPA> <18838@cornell.UUCP>
Sender: daemon@watdragon.waterloo.edu
Reply-To: smking@lion.waterloo.edu (Scott M. King)
Organization: U. of Waterloo, Ontario
Lines: 61

In article <18838@cornell.UUCP> blandy@cs.cornell.edu (Jim Blandy) writes:
>I do think negation is well-defined; using the proposed syntax, (pat)^
>matches any string pat would not.  Since the set of strings matched by
>pat is (presumably) well-defined, the set for (pat)^ is too.

Wrong. You have already seen many different interpretations of how (pat)^ 
could be defined, some clearly wrong. It is not obvious what some patterns
(containing negation) should match on some strings. 

A definition for *what* gets matched by a normal regular expression (not
*how* it gets matched) could be stated as follows.
Given an arbitrary pattern and a line to search, there could
be many *possible* matches of the pattern. I.e., there may be many substrings
in the line that could be matched by the pattern. To choose what the ultimate
match is, first determine the first character in the line that is part of a
possible match. Then, discard any possible matches that do not start on this
character. Finally, choose the longest substring of the possible matches
that remain. This is the ultimate match.
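
The selection rule above can be sketched in a few lines of Python. This is
only an illustration of the definition, not an efficient matcher: the name
leftmost_longest is made up for this example, and Python's re module is
assumed as a stand-in for "a normal regular expression".

```python
import re

def leftmost_longest(pattern, line):
    """Return (start, end) of the ultimate match, or None if there is none."""
    rx = re.compile(pattern)
    # Collect every substring of the line that the pattern matches in full.
    possible = [
        (start, end)
        for start in range(len(line) + 1)
        for end in range(start, len(line) + 1)
        if rx.fullmatch(line, start, end)
    ]
    if not possible:
        return None
    # First character in the line that is part of a possible match.
    first = min(start for start, _ in possible)
    # Discard matches that start elsewhere, then choose the longest.
    return max((s, e) for s, e in possible if s == first)
```

Note that this differs from what re.search itself returns for alternations,
since Python takes the first alternative that works rather than the longest.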

Now, what happens if we apply this algorithm to negated patterns?
Take the pattern "(a.*b)^cd", and the line "awxybzcd". Well, the
possible matches are "awxybzcd", "wxybzcd", "xybzcd", "ybzcd", "bzcd", "zcd"
or "cd". The possible match that starts at the earliest character position
is the entire line. This matches because "awxybz" is a possible match
for "(a.*b)^". I.e., it is not "an a followed by anything followed by a b".
Using your wording, (a.*b)^ matches any string (a.*b) would not, and
(a.*b) would clearly not match all of "awxybz", so (a.*b)^ does match it.
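
To make the list of possible matches concrete, here is a rough Python sketch
of the "matches any string pat would not" reading. The function name
complement_matches is hypothetical, and for simplicity it assumes the part
after the negation is a literal string such as "cd".

```python
import re

def complement_matches(negated, tail, line):
    """Possible matches of (negated)^tail, where (negated)^ is taken to
    match every string that negated does not match in full."""
    neg = re.compile(negated)
    found = []
    for start in range(len(line) + 1):
        for end in range(start + len(tail), len(line) + 1):
            prefix = line[start:end - len(tail)]
            rest = line[end - len(tail):end]
            # The prefix must NOT be matched by the negated pattern,
            # and the remainder must be the literal tail.
            if rest == tail and neg.fullmatch(prefix) is None:
                found.append(line[start:end])
    return found

possible = complement_matches("a.*b", "cd", "awxybzcd")
```

Here possible holds the seven substrings listed above, starting with the
entire line, so the earliest-start rule picks "awxybzcd".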

I think the most desirable result from the previous example would be
no match. This follows if the definition of negated patterns
is made in terms of the pattern being negated: In (rexp)^, match rexp
as usual. Then, if rexp was found, cause (rexp)^ to fail. Otherwise,
if rexp was not found, cause (rexp)^ to succeed.
So, applying this definition, the pattern matcher would look for a.*b
as usual, and since it finds "awxyb", (a.*b)^ fails, and so this
branch of the state machine dies off. Then, a.*b cannot
be found starting at the second character in the line, so (a.*b)^
succeeds starting at "wxybzcd". Back to the question of *what*
this success of (a.*b)^ actually matches. I still maintain that the
only reasonable and consistent definition would cause
such a success to match zero characters. Any other arbitrary rule
defining what non-null set of characters gets matched is sure to
be wrong in many cases. And besides, if the negated pattern matches
zero characters, you can always explicitly say what to match.
E.g., using my definition, I can write the pattern "abc((d*e)^...|(.*b)^d..e)fg",
which means 'match abc. Then, if this is found, look for "d*e". If this fails,
match any three characters. If "(d*e)^..." was not found, then
look for ".*b". If this fails, match "d..e". If anything was matched
after the "abc", then match "fg". Else fail.'
From my first example, "(a.*b)^cd", the rule that someone suggested (look
for the "cd", and then test what comes in between) can be represented
with my definition by the pattern "(a.*b)^.*cd".
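
A test-and-invert construct that matches zero characters can be tried out
with a zero-width negative lookahead. The sketch below assumes that Python's
(?!rexp) is a fair stand-in for the proposed (rexp)^; the variable names
are only for illustration.

```python
import re

line = "awxybzcd"

# Assumption: (?!a.*b) plays the role of (a.*b)^.
negated = re.compile(r"(?!a.*b)")

# At the first character "a.*b" is found, so the negation fails (None).
at_first = negated.match(line, 0)
# At the second character "a.*b" is not found, so the negation succeeds
# and matches zero characters.
at_second = negated.match(line, 1)

# "(a.*b)^.*cd": the test succeeds at the second character and ".*cd"
# states explicitly what to match after it.
explicit = re.search(r"(?!a.*b).*cd", line)

# "abc((d*e)^...|(.*b)^d..e)fg" written with lookaheads.
compound = re.compile(r"abc((?!d*e)...|(?!.*b)d..e)fg")
three_chars = compound.fullmatch("abcxyzfg")    # first branch: any 3 chars
second_alt = compound.fullmatch("abcdxxefg")    # second branch: "d..e"
```

In this sketch explicit.group() is "wxybzcd", and the compound pattern
captures "xyz" and "dxxe" respectively in its group.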

Anyway, if (rexp)^ only tests for rexp and inverts the result, then
there should also be a pattern construct to test
for a positive pattern. (I previously thought that the "!" in (rexp)!
could mean "test for rexp". A slightly more mnemonic character is "?",
so "(rexp)?" would test for rexp rather than match it.)
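
For comparison, a positive test of this kind behaves like a zero-width
positive lookahead. The sketch assumes Python's (?=rexp) as the analogue of
the proposed "(rexp)?"; in Python itself a trailing "?" means "optional",
so the proposed syntax is not what is written here.

```python
import re

# Assumption: (?=a.*b) plays the role of the proposed "(a.*b)?".
tested = re.match(r"(?=a.*b)", "awxyb")        # succeeds, consumes nothing
consumed = re.match(r"(?=a.*b)...", "awxyb")   # test, then match 3 chars
missing = re.match(r"(?=a.*b)...", "awxyz")    # test fails, so no match
```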
--

Scott M. King