Path: utzoo!utgpu!water!watmath!watdragon!lion!smking
From: smking@lion.waterloo.edu (Scott M. King)
Newsgroups: comp.editors
Subject: Re: pattern matches
Summary: Negative patterns NOT well defined
Keywords: pattern matcher regular expression
Message-ID: <7648@watdragon.waterloo.edu>
Date: 6 Jul 88 20:59:11 GMT
References: <427@grand.UUCP> <37200009@m.cs.uiuc.edu> <7618@watdragon.waterloo.edu> <11070@sol.ARPA> <18838@cornell.UUCP>
Sender: daemon@watdragon.waterloo.edu
Reply-To: smking@lion.waterloo.edu (Scott M. King)
Organization: U. of Waterloo, Ontario
Lines: 61

In article <18838@cornell.UUCP> blandy@cs.cornell.edu (Jim Blandy) writes:
>I do think negation is well-defined; using the proposed syntax, (pat)^
>matches any string pat would not. Since the set of strings matched by
>pat is (presumably) well-defined, the set for (pat)^ is too.

Wrong. You have already seen many different interpretations of how
(pat)^ could be defined, some clearly wrong. It is not obvious what
some patterns (containing negation) should match on some strings.

A definition for *what* gets matched by a normal regular expression
(not *how* it gets matched) could be stated as follows. Given an
arbitrary pattern and a line to search, there could be many *possible*
matches of the pattern. I.e., there may be many substrings in the line
that could be matched by the pattern. To choose what the ultimate match
is, first determine the first character in the line that is part of a
possible match. Then, discard any possible matches that do not start on
this character. Finally, choose the longest substring of the possible
matches that remain. This is the ultimate match.

Now, what happens if we apply this algorithm to negated patterns? Take
the pattern "(a.*b)^cd", and the line "awxybzcd". Well, the possible
matches are "awxybzcd", "wxybzcd", "xybzcd", "ybzcd", "bzcd", "zcd" or
"cd". The possible match that starts at the earliest character position
is the entire line.
This matches because "awxybz" is a possible match for "(a.*b)^". I.e.,
it is not "an a followed by anything followed by a b". Using your
wording, (a.*b)^ matches any string (a.*b) would not, and (a.*b) would
clearly not match all of "awxybz", so (a.*b)^ does match it.

I think the most desirable result from the previous example would be no
match. This follows if the definition of negated patterns is made in
terms of the pattern being negated:

	In (rexp)^, match rexp as usual. Then, if rexp was found,
	cause (rexp)^ to fail. Otherwise, if rexp was not found,
	cause (rexp)^ to succeed.

So, applying this definition, the pattern matcher would look for a.*b
as usual, and since it finds "awxyb", (a.*b)^ fails, and so this branch
of the state machine dies off. Then, a.*b cannot be found starting at
the second character in the line, so (a.*b)^ succeeds starting at
"wxybzcd".

Back to the question of *what* this success of (a.*b)^ actually
matches. I still maintain that the only reasonable and consistent
definition would cause such a success to match zero characters. Any
other arbitrary rule defining what non-null set of characters gets
matched is sure to be wrong in many cases. And besides, if the negated
pattern matches zero characters, you can always explicitly say what to
match.

E.g., using my definition, I can write the pattern
"abc((d*e)^...|(.*b)^d..e)fg", which means 'match abc. Then, if this is
found, look for "d*e". If this fails, match any three characters. If
"(d*e)^..." was not found, then look for ".*b". If this fails, match
"d..e". If anything was matched after the "abc", then match "fg". Else
fail.'

From my first example, "(a.*b)^cd", the 'look for the "cd", and then
test what comes in between' rule that someone suggested can be
represented with my definition with the pattern "(a.*b)^.*cd".

Anyway, if (rexp)^ only tests for rexp and inverts the result, then
there should also be a pattern construct to test for a positive
pattern. (I previously thought that the "!"
in (rexp)! could mean "test for rexp". A slightly more mnemonic
character is "?", so "(rexp)?" would test for rexp rather than match
it.)
--
Scott M. King
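
The two readings of "(a.*b)^cd" on the line "awxybzcd" can be compared
mechanically. What follows is only a sketch in Python, under stated
assumptions: the helper names (naive_candidates, leftmost_longest,
negation_succeeds) are made up for illustration, the naive reading is
brute-forced over substrings with a literal tail, and Python's (?!...)
construct is assumed to be a fair stand-in for a zero-width (rexp)^.

```python
import re

LINE = "awxybzcd"

def naive_candidates(negated, tail, line):
    """Naive reading: (negated)^ matches any string `negated` would not.

    Returns (start, substring) for every substring head + tail of `line`
    where `negated` does not match all of head. `tail` is a literal.
    """
    found = []
    for start in range(len(line) + 1):
        for end in range(start + len(tail), len(line) + 1):
            sub = line[start:end]
            head = sub[: len(sub) - len(tail)]
            if sub.endswith(tail) and re.fullmatch(negated, head) is None:
                found.append((start, sub))
    return found

def leftmost_longest(candidates):
    """Earliest start wins; among those, the longest substring wins."""
    return min(candidates, key=lambda c: (c[0], -len(c[1])))

def negation_succeeds(rexp, line, start):
    """Zero-width reading: succeed, consuming nothing, if and only if
    rexp is not found starting at `start`."""
    return re.compile(rexp).match(line, start) is None

# Naive reading: seven possible matches, and the whole line is chosen.
cands = naive_candidates(r"a.*b", "cd", LINE)
print([sub for _, sub in cands])
print(leftmost_longest(cands))            # (0, 'awxybzcd')

# Zero-width reading: fails at the first character, succeeds at the second.
print(negation_succeeds(r"a.*b", LINE, 0))   # False
print(negation_succeeds(r"a.*b", LINE, 1))   # True

# Assumed analogue of "(a.*b)^.*cd" using Python's negative lookahead.
print(re.search(r"(?!a.*b).*cd", LINE).span())   # (1, 8)
```

Under the naive reading the leftmost-longest rule picks the entire
line. Under the zero-width reading the attempt at the first character
dies, and the analogue of "(a.*b)^.*cd" matches "wxybzcd", starting at
the second character. Python's (?=...) appears to play the role
suggested for "(rexp)?": it tests for rexp without consuming it.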