From: utzoo!decvax!duke!harpo!floyd!cmcl2!lanl-a!jhf
Newsgroups: net.unix-wizards
Title: Re: diff alg ref query
Article-I.D.: lanl-a.149
Posted: Wed Aug  4 10:09:17 1982
Received: Thu Aug  5 04:29:54 1982


I have a copy of a report, apparently from Bell Labs, entitled "An algorithm
for differential file comparison," by J. W. Hunt and M. D. McIlroy,
Computing Science Technical Report #41.  Because it is a Xerox copy of the
published report, without a cover, I'm not absolutely sure it was issued by
Bell Labs.  At any rate, this is the only detailed account of the diff
algorithm I have seen.

The reason I am posting this information to the net, rather than replying
to the request by mail, is that I want to follow up with another question
about diff.  It's a clever algorithm, and its greatest virtue is that it
produces optimal results (except for perturbations caused by hashing
collisions), but its limitation is said to be that it can't handle
arbitrarily long files.  My question is this:  What's wrong with running
the algorithm piecewise over big files?  That is, finding the differences
in the first few thousand lines of the files, then restarting just past
the last difference found, and so forth.  I suppose that the results
would be nonoptimal, but they would be piecewise optimal, and presumably
the quality of the results would vary with the size of the pieces, giving
one a nice tradeoff between memory requirements and quality of output.
Notice that the usual dumb algorithm for file comparison is obtained as a
degenerate case with a very small piece size (1 or 2 lines).
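To make the scheme concrete, here is a rough sketch in Python.  This is
not what diff actually does: difflib's SequenceMatcher merely stands in
for the Hunt-McIlroy longest-common-subsequence core, and the name
piecewise_diff and the piece size of 4000 are my own inventions.  One
detail: the sketch resyncs just past the last common run found in each
piece, rather than at the last difference itself, so a difference that
straddles a piece boundary is carried whole into the next piece instead
of being reported twice.

    import difflib
    import sys

    def piecewise_diff(a_lines, b_lines, piece=4000):
        # Run the comparison a piece at a time.  Memory is bounded by
        # the piece size; the output is optimal within each piece only.
        i = j = 0
        while i < len(a_lines) or j < len(b_lines):
            a_chunk = a_lines[i:i + piece]
            b_chunk = b_lines[j:j + piece]
            sm = difflib.SequenceMatcher(None, a_chunk, b_chunk)
            matches = [m for m in sm.get_matching_blocks() if m.size > 0]
            more = i + piece < len(a_lines) or j + piece < len(b_lines)
            if matches and more:
                # Resync just past the last common run in this piece;
                # the unmatched tails beyond it are re-diffed in the
                # next piece rather than reported now.
                stop_a = matches[-1].a + matches[-1].size
                stop_b = matches[-1].b + matches[-1].size
            else:
                # Last piece, or no common line at all in this piece.
                # (A common line just past the boundary is then missed:
                # this is where the output becomes nonoptimal.)
                stop_a, stop_b = len(a_chunk), len(b_chunk)
            for tag, i1, i2, j1, j2 in sm.get_opcodes():
                # Report only the differences that lie before the
                # resync point, in absolute line offsets.
                if tag != 'equal' and i2 <= stop_a and j2 <= stop_b:
                    print("%s a[%d:%d] b[%d:%d]" %
                          (tag, i + i1, i + i2, j + j1, j + j2))
            i += stop_a
            j += stop_b

    if __name__ == "__main__":
        a = open(sys.argv[1]).read().splitlines()
        b = open(sys.argv[2]).read().splitlines()
        piecewise_diff(a, b)

With a piece size of 1 or 2 lines this collapses to roughly the dumb
comparison described above, and with a piece size larger than both files
it is an ordinary full diff, so the memory/quality tradeoff really is
continuous in the piece size.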

This all seems so obvious.  Am I missing something?