Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!rutgers!rochester!stuart
From: stuart@cs.rochester.edu (Stuart Friedberg)
Newsgroups: comp.editors
Subject: Re: pattern matches
Summary: problems with negated patterns and a suggestion
Message-ID: <11070@sol.ARPA>
Date: 5 Jul 88 03:37:38 GMT
References: <427@grand.UUCP> <37200009@m.cs.uiuc.edu> <7618@watdragon.waterloo.edu>
Organization: U of Rochester, CS Dept., Rochester, NY
Lines: 74

In article <7618@watdragon.waterloo.edu> Scott M. King writes:
>If I have the pattern "abc(d.*e)^ghi", and the line "abcdxxxghi", then
>what gets matched, if anything? The abc part is easy. Then, the pattern
>matcher tries to match a d.*e but it doesn't find one.
>That means the (d.*e)^ is successful, but since it didn't find anything,
>it has to start looking for the ghi at "dxxxghi", and doesn't find it.
>Thus the match fails.

Perhaps the original suggestion wasn't absolutely clear.  I can't
decide between a couple of interpretations, but I don't agree with King
when he says "That means the (d.*e)^ is successful, but since it didn't
find anything, ..."  The pattern DID find something.  It found a string
that was not a 'd' followed by anything followed by an 'e'.  The hard
part is deciding just *which* string it matched.

In at least one interpretation, this pattern *does* match the input,
because you can, in fact, find an "abc", followed by something that
is not a "(d.*e)", followed by a "ghi".

  abcdxxxghi
  ^^^		matched by pattern "abc"
     ^^^^	matched by pattern "(d.*e)^"
         ^^^	matched by pattern "ghi"

However, the conventional (but certainly not necessary) interpretation
of regular expressions, at least in the Unix tool world, is that they
match as much as possible.  Under that interpretation, the pattern
doesn't completely match the input, because the "(d.*e)^" matches
too much, leaving nothing for the "ghi".

  abcdxxxghi
  ^^^		matched by pattern "abc"
     ^^^^^^^	matched by pattern "(d.*e)^"
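
The two readings above can be checked mechanically.  What follows is a
minimal Python sketch, not part of the original proposal: it assumes
that "(d.*e)^" means "any substring that is not a complete match of
d.*e", and the helper name negated_spans is made up for illustration.

```python
import re

NEGATED = re.compile(r"d.*e")   # the pattern inside "(d.*e)^"

def negated_spans(text, start):
    """Every substring beginning at `start` that is NOT a full match of d.*e."""
    return [text[start:end]
            for end in range(start, len(text) + 1)
            if NEGATED.fullmatch(text[start:end]) is None]

line = "abcdxxxghi"
spans = negated_spans(line, len("abc"))

# Longest-match reading: the negated pattern takes everything that is left.
longest = max(spans, key=len)
rest_after_longest = line[len("abc") + len(longest):]

print(repr(longest), repr(rest_after_longest))
print("dxxx" in spans)   # the span used by the first reading is also legal
```

Under these assumptions the longest legal span is "dxxxghi", which
leaves the empty string for "ghi", while "dxxx" is also a legal span
and leaves exactly "ghi".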

This is a problem.  Under the usual interpretation, accepting anything
*but* a pattern accepts too much in almost all cases.  Negated
patterns should *not* match the longest possible string for the reason
given above, but neither should they match the shortest possible
string, because "(d.*e)^" would then be satisfied by trivially short
strings, such as any two-character string not equal to "de".

Well, we could treat negated patterns "(d.*e)^" by looking for the next
positive pattern "ghi", then testing the stuff we skip over for the
undesirable pattern.  That seems reasonable at first, but what do we
do for input like "abcdxxxghixxxghi"?  Do we have:

  abcdxxxghixxxghi
  ^^^			matched by "abc"
     ^^^^		matched by "(d.*e)^"
         ^^^		matched by "ghi"

or, perhaps:

  abcdxxxghixxxghi
  ^^^			matched by "abc"
     ^^^^^^^^^^		matched by "(d.*e)^"
               ^^^	matched by "ghi"

Do we find the first occurrence of "ghi"?  The last?  The first/last
such that the preceding negated pattern succeeds?  What do we do when a
negated pattern is followed by another negated pattern?  This starts to
smack too much of non-determinism, which is fine in its place, but
conventions enforcing determinism are quite useful in practice.
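
To make the ambiguity concrete, here is a small Python sketch that
lists every way of placing "ghi" so that the skipped text is not a
"d.*e".  It is an illustration only: it assumes the prefix is anchored
at the start of the line and that prefix and suffix are literal
strings, and the function name candidate_matches is hypothetical.

```python
import re

NEGATED = re.compile(r"d.*e")   # the pattern inside "(d.*e)^"

def candidate_matches(line, prefix, suffix):
    """Return (skipped_text, suffix_index) for each occurrence of `suffix`
    whose skipped text is not a full match of d.*e."""
    if not line.startswith(prefix):
        return []
    found = []
    pos = line.find(suffix, len(prefix))
    while pos != -1:
        skipped = line[len(prefix):pos]
        if NEGATED.fullmatch(skipped) is None:
            found.append((skipped, pos))
        pos = line.find(suffix, pos + 1)
    return found

candidates = candidate_matches("abcdxxxghixxxghi", "abc", "ghi")
for skipped, pos in candidates:
    print(repr(skipped), pos)
```

Both occurrences of "ghi" survive the test, so some extra convention
is needed to choose between them.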

Having said all this, I think that negated patterns could be quite
useful.  More than once, I have wanted to search for occurrences of
"foo(XXX)", where "XXX" is not "bar".  The proper *practical*
convention for negated patterns should be driven by how people intend
to use them, and I suggest that people submit proposed applications.
If we can't come up with more complicated examples, perhaps the
shortest span such that the following pattern succeeds, with adjacent
negated patterns illegal, is satisfactory.
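
One possible reading of that convention, sketched in Python.  The
assumptions are mine: prefix and suffix are literal strings, the span
runs to the *nearest* occurrence of the suffix, and only then is it
tested against the negated pattern.  If the test fails, the search
moves on to the next occurrence of the prefix.  The name match_negated
is hypothetical.

```python
import re

def match_negated(line, prefix, negated, suffix):
    """Return (start, skipped_text, end) for the first match, or None."""
    neg = re.compile(negated)
    start = line.find(prefix)
    while start != -1:
        body = start + len(prefix)
        pos = line.find(suffix, body)       # nearest suffix: shortest span
        if pos != -1:
            skipped = line[body:pos]
            if neg.fullmatch(skipped) is None:
                return start, skipped, pos + len(suffix)
        start = line.find(prefix, start + 1)
    return None

print(match_negated("foo(bar) foo(baz)", "foo(", "bar", ")"))
print(match_negated("abcdxxxghixxxghi", "abc", "d.*e", "ghi"))
```

With this reading the "foo(XXX)" search skips "foo(bar)" and reports
"foo(baz)", and the earlier example stops at the first "ghi".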

Stu Friedberg  stuart@cs.rochester.edu  {ames,cmcl2,rutgers}!rochester!edu