Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site mmintl.UUCP
Path: utzoo!linus!philabs!pwa-b!mmintl!franka
From: franka@mmintl.UUCP (Frank Adams)
Newsgroups: net.math
Subject: Re: Data compression and information theory
Message-ID: <589@mmintl.UUCP>
Date: Mon, 12-Aug-85 18:02:03 EDT
Article-I.D.: mmintl.589
Posted: Mon Aug 12 18:02:03 1985
Date-Received: Wed, 14-Aug-85 20:34:15 EDT
References: <417@lasspvax.UUCP> <1010@mtgzz.UUCP> <854@mulga.OZ> <570@mmintl.UUCP> <858@mulga.OZ>
Reply-To: franka@mmintl.UUCP (Frank Adams)
Organization: Multimate International, E. Hartford, CT
Lines: 60
Summary: 


In article <858@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>In article <570@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>>>Are guesses really desirable in a compression system anyway?  There is
>>>no-one to say whether the guess was right or wrong ...
>>
>>Sure there is; there is the actual document.  You have a "guessing"
>>algorithm, which always produces the same guesses for the same text.
>
>The idea of compression is usually to save space [...]
>  As such, keeping the original document
>around is a) against the rules and b) not always possible (e.g. remote file
>transfer).  Guessing is not generally desirable in re-construction.

Let me clarify what a data compression algorithm based on "guessing"
might look like.  The basic encoding loop is:

	for each character in source;			/* guesses start over at each character */
		while (character != guess());		/* try candidates in order of likelihood */
			output '1' bit;			/* one '1' bit per wrong guess */
		output '0' bit;				/* a '0' bit marks the correct guess */

The guess function looks only at the document up to, but not including, the
current character.  At each position it returns the most likely next character
on its first call, the next most likely on its second, and so on, starting
over at the next position.  Because it is deterministic (the same preceding
text always produces the same sequence of guesses), the decoder can replay
exactly the guesses the encoder made.

The expansion algorithm is as follows:

	until end of file;
		while (bit '1' read);			/* each '1' bit skips one candidate */
			guess();
		output guess();				/* the '0' bit accepts the current guess */

Note that *any* compression algorithm will sometimes make the result longer
than the input: by a simple counting argument, there are not enough short
strings to give every possible input a shorter encoding.
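
To make the two loops concrete, here is a minimal sketch in C.  It is not
anyone's real compressor: the guess() model below is deliberately dumb (it
ranks candidates in one fixed order and ignores the preceding text), and the
"bits" are written out as the characters '0' and '1' so the stream can be
read by eye.  All of the names (ranking, guess_reset, encode, decode) are
mine, for illustration only.

	#include <stdio.h>

	/* Hypothetical fixed ranking; every character to be encoded must
	   appear here, or encode() will run off the end of the string. */
	static const char ranking[] = " etaoinshrdlcumwfgypbvkjxqz.,";

	static int rank;                          /* guess() state */
	static void guess_reset(void) { rank = 0; }
	static char guess(void)       { return ranking[rank++]; }

	/* Emit one '1' per wrong guess, then a '0' when guess() matches. */
	static void encode(const char *src, char *bits)
	{
		for (; *src; src++) {
			guess_reset();
			while (*src != guess())
				*bits++ = '1';            /* wrong guess */
			*bits++ = '0';                    /* right guess */
		}
		*bits = '\0';
	}

	/* Skip one candidate per '1' bit; a '0' bit accepts the current one. */
	static void decode(const char *bits, char *out)
	{
		guess_reset();
		for (; *bits; bits++) {
			if (*bits == '1')
				guess();                  /* rejected candidate */
			else {
				*out++ = guess();         /* accepted candidate */
				guess_reset();
			}
		}
		*out = '\0';
	}

	int main(void)
	{
		char bits[1024], back[64];

		encode("see the sea", bits);
		decode(bits, back);
		printf("%s\n%s\n", bits, back);           /* "back" equals the input */
		return 0;
	}

Swapping in a guess() that actually ranks candidates from the preceding text
is the only change a real version of the scheme would need; the two loops
stay the same.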

>However, after talking with Stavros Macrakis (seismo!harvard!macrakis) about
>this, my view of what Shannon was doing has changed a bit.  It is effectively
>finding probabilities of certain characters in certain contexts (I claim
this is basically the probabilities needed in Huffman-like codings, although
>I don't think Stavros believes me).  What his subjects were doing is clearly
>not feasible for a work-every-time compression system.

Yes, this is basically the set of probabilities needed in Huffman-like codings.
My sample algorithm assumes most-likely = 1/2, next-most-likely = 1/4, and so
on.  (As such, it averages two bits per character.  A better result would
require encoding multiple following characters with a single bit in some
instances.)
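
For what it's worth, that two-bit figure is just the expected value under the
geometric model: the character whose rank is k costs k bits (k-1 '1' bits plus
the terminating '0') and turns up with probability (1/2)^k, so the average is

	sum over k >= 1 of  k * (1/2)^k  =  2 bits per character.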

What his subjects were doing is not feasible for a compression system; the
point is that it suggests an optimal algorithm might be able to achieve a
one-bit-per-character average.  How to guess what comes next is the next
question, to which there is as yet no answer.

> but I still wouldn't want to be restricted
>to just 511 words.

You are still missing a key point.  You aren't restricted to ~500 words.
It's just that, in any context (preceding portion of the document), there
is a list of about 500 words, such that most of the time, your next word
will be on the list.  You *can* use any word (or any non-word, like "fhdias")
that you want.
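
(Presumably that is also where figures like 511 come from: a word that is on
such a list can be pointed to with roughly nine bits, since 2^9 = 512, while
the occasional word that is not on it simply costs more bits to spell out.)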