From: utzoo!decvax!duke!harpo!floyd!cmcl2!lanl-a!jhf
Newsgroups: net.unix-wizards
Title: Re: diff alg ref query
Article-I.D.: lanl-a.149
Posted: Wed Aug 4 10:09:17 1982
Received: Thu Aug 5 04:29:54 1982

I have a copy of a report, apparently from Bell Labs, entitled "An algorithm for differential file comparison," by J. W. Hunt and M. D. McIlroy, Computing Science Technical Report #41. Because it is a Xerox copy of the published report, without a cover, I'm not absolutely sure it was issued by Bell Labs. At any rate, this is the only detailed account of the diff algorithm I have seen.

The reason I am posting this information to the net, rather than replying to the request by mail, is that I want to follow up with another question about diff. It's a clever algorithm, and its greatest virtue is that it produces optimal results (except for perturbations caused by hashing collisions), but its limitation is said to be that it can't handle arbitrarily long files.

My question is this: what's wrong with running the algorithm piecewise over big files? That is, finding the differences in the first few thousand lines of the files, then restarting at the point of the last difference found, and so forth. I suppose that the results would be nonoptimal, but they would be piecewise optimal, and presumably the quality of the results would vary with the size of the pieces, giving one a nice tradeoff between memory requirements and quality of output. Notice that the usual dumb algorithm for file comparison is obtained as a degenerate case with a very small (1- or 2-line) piece size.

This all seems so obvious. Am I missing something?
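
In case the scheme isn't clear, here is a rough sketch of what I have in mind, written in Python purely for illustration. It is not the real diff: the window size, the names, and the use of SequenceMatcher as a stand-in for the optimal LCS step are all my own invention. The one wrinkle is that when a window fills up, you restart just after the last common run rather than consuming the whole window, so that lines near the edge still get a chance to match material in the next piece.

    import difflib
    import sys

    def read_chunk(f, buf, size):
        # Refill buf with lines from f until it holds `size` lines or EOF.
        while len(buf) < size:
            line = f.readline()
            if not line:
                break
            buf.append(line)

    def piecewise_diff(f1, f2, window=2000, out=sys.stdout):
        a, b = [], []          # current window of lines from each file
        a0, b0 = 1, 1          # 1-based line number of each window's first line
        while True:
            read_chunk(f1, a, window)
            read_chunk(f2, b, window)
            if not a and not b:
                return
            ops = difflib.SequenceMatcher(None, a, b).get_opcodes()
            if len(a) == window or len(b) == window:
                # More input may remain: keep only the opcodes up through the
                # last 'equal' run, holding back the tail for the next pass.
                eq = [k for k, op in enumerate(ops) if op[0] == 'equal']
                if eq:
                    ops = ops[:eq[-1] + 1]
            for tag, i1, i2, j1, j2 in ops:
                if tag != 'equal':
                    print("%s: a[%d..%d) -> b[%d..%d)"
                          % (tag, a0 + i1, a0 + i2, b0 + j1, b0 + j2), file=out)
            # Drop what was consumed; any held-back tail is re-examined
            # together with the next windowful of lines.
            _, _, i2, _, j2 = ops[-1]
            a0 += i2
            b0 += j2
            a, b = a[i2:], b[j2:]

Something like piecewise_diff(open("old"), open("new")) would report the differing line ranges. Note that if a window contains no common lines at all, the whole window is reported as changed and the comparison moves on; that is exactly the nonoptimality I am conceding, and shrinking the window down to 1 or 2 lines degenerates into the dumb comparison mentioned above.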