Path: utzoo!utgpu!water!watmath!clyde!att!osu-cis!tut.cis.ohio-state.edu!mailrus!rutgers!rochester!stuart
From: stuart@cs.rochester.edu (Stuart Friedberg)
Newsgroups: comp.editors
Subject: Re: pattern matches
Summary: problems with negated patterns and a suggestion
Message-ID: <11070@sol.ARPA>
Date: 5 Jul 88 03:37:38 GMT
References: <427@grand.UUCP> <37200009@m.cs.uiuc.edu> <7618@watdragon.waterloo.edu>
Organization: U of Rochester, CS Dept., Rochester, NY
Lines: 74

In article <7618@watdragon.waterloo.edu> Scott M. King writes:
>If I have the pattern "abc(d.*e)^ghi", and the line "abcdxxxghi", then
>what gets matched, if anything? The abc part is easy. Then, the pattern
>matcher tries to match a d.*e but it doesn't find one.
>That means the (d.*e)^ is successful, but since it didn't find anything,
>it has to start looking for the ghi at "dxxxghi", and doesn't find it.
>Thus the match fails.

Perhaps the original suggestion wasn't absolutely clear. I can't decide
between a couple of interpretations, but I don't agree with King when he
says "That means the (d.*e)^ is successful, but since it didn't find
anything, ..." The pattern DID find something. It found a string that
was not a 'd' followed by anything followed by an 'e'. The hard part is
deciding just *which* string it matched.

In at least one interpretation, this pattern *does* match the input,
because you can, in fact, find an "abc", followed by something that is
not a "(d.*e)", followed by a "ghi".

    abcdxxxghi
    ^^^           matched by pattern "abc"
       ^^^^       matched by pattern "(d.*e)^"
           ^^^    matched by pattern "ghi"

However, the conventional (but certainly not necessary) interpretation
of regular expressions, at least in the Unix tool world, is that they
match as much as possible. Under that interpretation, the pattern
doesn't completely match the input, because the "(d.*e)^" matches too
much, leaving nothing for the "ghi".

    abcdxxxghi
    ^^^           matched by pattern "abc"
       ^^^^^^^    matched by pattern "(d.*e)^"

This is a problem.
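
To make the two readings concrete, here is a rough sketch in Python. It
assumes that "(d.*e)^" means "any span that is not, taken as a whole, a
d.*e". The helper names are hypothetical, and Python's re module has no
such negation operator; this only imitates one on top of it.

```python
import re

# The pattern being negated, i.e. the "d.*e" inside "(d.*e)^".
NEGATED = re.compile(r"d.*e")


def is_negated_span(s):
    """True if s, taken as a whole, is NOT a match for d.*e."""
    return NEGATED.fullmatch(s) is None


def first_reading(text, prefix="abc", suffix="ghi"):
    """Try every end point for the negated span after the prefix and
    keep the spans that are immediately followed by the suffix."""
    if not text.startswith(prefix):
        return []
    start = len(prefix)
    return [
        text[start:end]
        for end in range(start, len(text) + 1)
        if is_negated_span(text[start:end]) and text.startswith(suffix, end)
    ]


def longest_negated_span(text, start):
    """Longest-match reading: the longest span from start that is not
    a d.*e, with no regard for what has to follow it."""
    for end in range(len(text), start - 1, -1):
        if is_negated_span(text[start:end]):
            return text[start:end]
    return None
```

On "abcdxxxghi", first_reading() returns ["dxxx"], which is the first
picture above. longest_negated_span() with start 3 returns "dxxxghi",
which is the second picture: nothing is left over for the "ghi".
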
Accepting anything *but* a pattern in almost all cases accepts too
much, under the usual interpretation. Negated patterns should *not*
match the longest possible string for the reason given above, but
neither should they match the shortest possible string, because
"(d.*e)^" would match any two-character string not equal to "de".

Well, we could treat negated patterns "(d.*e)^" by looking for the next
positive pattern "ghi", then testing the stuff we skip over for the
undesirable pattern. That seems reasonable at first, but what do we do
for input like "abcdxxxghixxxghi"? Do we have:

    abcdxxxghixxxghi
    ^^^                 matched by "abc"
       ^^^^             matched by "(d.*e)^"
           ^^^          matched by "ghi"

or, perhaps:

    abcdxxxghixxxghi
    ^^^                 matched by "abc"
       ^^^^^^^^^^       matched by "(d.*e)^"
                 ^^^    matched by "ghi"

Do we find the first occurrence of "ghi"? The last? The first/last
such that the preceding negated pattern succeeds? What do we do when a
negated pattern is followed by another negated pattern? This starts to
smack too much of non-determinism, which is fine in its place, but
conventions enforcing determinism are quite useful in practice.

Having said all this, I think that negated patterns could be quite
useful. More than once, I have wanted to search for occurrences of
"foo(XXX)", where "XXX" is not "bar". The proper *practical* convention
for negated patterns should be driven by how people intend to use them,
and I suggest that people submit proposed applications. If we can't
come up with more complicated examples, perhaps the shortest span such
that the following pattern succeeds, with adjacent negated patterns
illegal, is satisfactory.

Stu Friedberg  stuart@cs.rochester.edu
               {ames,cmcl2,rutgers}!rochester!edu
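
P.S. For anyone who wants to experiment with the convention suggested
above (shortest span such that the following pattern succeeds, adjacent
negated patterns illegal), here is a minimal sketch in Python. The
function name and the ("pos"/"neg", regex) representation are my own
assumptions, not an existing tool. It matches only at the start of the
text and never backs up over a choice once it is made, which is one way
to keep things deterministic but certainly not the only one.

```python
import re


def match_with_negation(parts, text):
    """Match parts at the start of text.

    parts is a list of ("pos", regex) or ("neg", regex) pairs. A "neg"
    part takes the shortest span that is not a full match for its regex
    and after which the next "pos" part matches. Returns the list of
    matched substrings, or None if there is no match.
    """
    for (kind_a, _), (kind_b, _) in zip(parts, parts[1:]):
        if kind_a == "neg" and kind_b == "neg":
            raise ValueError("adjacent negated patterns are not allowed")

    pos = 0
    spans = []
    for i, (kind, rx) in enumerate(parts):
        if kind == "pos":
            m = re.compile(rx).match(text, pos)
            if m is None:
                return None
            spans.append(m.group())
            pos = m.end()
            continue

        # Negated part: grow the span one character at a time and stop at
        # the first end point where the span is not the unwanted pattern
        # and the following positive part (if any) matches.
        neg = re.compile(rx)
        nxt = re.compile(parts[i + 1][1]) if i + 1 < len(parts) else None
        for end in range(pos, len(text) + 1):
            span_ok = neg.fullmatch(text[pos:end]) is None
            next_ok = nxt is None or nxt.match(text, end) is not None
            if span_ok and next_ok:
                spans.append(text[pos:end])
                pos = end
                break
        else:
            return None
    return spans


ABC_NOT_DE_GHI = [("pos", "abc"), ("neg", "d.*e"), ("pos", "ghi")]
FOO_NOT_BAR = [("pos", r"foo\("), ("neg", "bar"), ("pos", r"\)")]

print(match_with_negation(ABC_NOT_DE_GHI, "abcdxxxghixxxghi"))
# ['abc', 'dxxx', 'ghi']
print(match_with_negation(ABC_NOT_DE_GHI, "abcdxxeghixxxghi"))
# ['abc', 'dxxeghixxx', 'ghi']
print(match_with_negation(FOO_NOT_BAR, "foo(baz)"))
# ['foo(', 'baz', ')']
print(match_with_negation(FOO_NOT_BAR, "foo(bar)"))
# None
```

In the second call the first "ghi" is passed over because the text
skipped to reach it, "dxxe", is exactly the unwanted pattern, so the
span grows until the next "ghi".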