Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site mmintl.UUCP
Path: utzoo!linus!philabs!pwa-b!mmintl!franka
From: franka@mmintl.UUCP (Frank Adams)
Newsgroups: net.math
Subject: Re: Data compression and information theory
Message-ID: <570@mmintl.UUCP>
Date: Wed, 7-Aug-85 14:54:53 EDT
Article-I.D.: mmintl.570
Posted: Wed Aug  7 14:54:53 1985
Date-Received: Sun, 11-Aug-85 03:48:40 EDT
References: <417@lasspvax.UUCP> <1010@mtgzz.UUCP> <854@mulga.OZ>
Reply-To: franka@mmintl.UUCP (Frank Adams)
Organization: Multimate International, E. Hartford, CT
Lines: 36
Summary: Guesses


In article <854@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>I have two sentence beginnings for you to guess the next letter of:
>1) "A" and 2) "".  In comparison to simple Huffman encodings, I don't expect
>very many people to get it right within 5 guesses.

1) blank; "n"; "r"; "t"; "f"

2) "T"; "A"; "I"; "W"; "F"

These won't cover 90% of the sentences beginning as indicated, but I think
they will get a clear majority.  And, of course, these are among the hardest
cases.  Try, for example, to guess the next letter in these sentences:
1) "This is th" and 2) "Where are you g".

>Are guesses really desirable in a compression system anyway?  There is
>no-one to say whether the guess was right or wrong ...

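Sure there is; there is the actual document.  You have a "guessing"
algorithm, which always produces the same guesses for the same text.  The
compressor checks each guess against the document and records only which
guess was right; the decompressor runs the same algorithm and uses that
record to recover the text.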
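
To make this concrete, here is a minimal sketch in Python.  It is not any
particular compressor: the two-character context, the toy training text and
all the function names are assumptions made up for illustration.  The point
is only that the guess list is a deterministic function of the text so far,
so the guess numbers alone are enough to rebuild the document.

```python
from collections import Counter, defaultdict

ORDER = 2  # hypothetical context length: guess from the previous 2 characters

def build_model(training_text):
    """Count which character follows each short context in the training text."""
    model = defaultdict(Counter)
    for i, ch in enumerate(training_text):
        model[training_text[max(0, i - ORDER):i]][ch] += 1
    return model

def ranked_guesses(model, alphabet, history):
    """Deterministic guess list: most frequent follower first, ties alphabetical."""
    counts = model.get(history[-ORDER:], {})
    return sorted(alphabet, key=lambda c: (-counts.get(c, 0), c))

def encode(model, alphabet, text):
    """Replace each character by the number of guesses needed to hit it."""
    ranks = []
    for i, ch in enumerate(text):
        ranks.append(ranked_guesses(model, alphabet, text[:i]).index(ch) + 1)
    return ranks

def decode(model, alphabet, ranks):
    """Rerun the same guesser and pick the guess that the encoder recorded."""
    out = ""
    for r in ranks:
        out += ranked_guesses(model, alphabet, out)[r - 1]
    return out

# Toy data (assumed): both sides must share the model and the alphabet.
training = "this is the thing that this is about"
alphabet = sorted(set(training))
model = build_model(training)

message = "this is that"
ranks = encode(model, alphabet, message)
restored = decode(model, alphabet, ranks)
```

The better the guesser, the more of the recorded numbers are 1, and a stream
that is mostly 1s is exactly what a Huffman code handles well.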

>An obvious argument against one-bit-per-character goes something like this :
>The average word has (say) five characters, which would imply that its
>information content can be represented with 5 bits.  This in turn would
>imply that there are around 2^5, or 32, valid words.  Rather limited.
>This is my interpretation of what one-bit-per-character means; if I have
>missed something, please correct me.

Yes, you have missed quite a bit.  First of all, you aren't justified in
using the average word size.  A better estimate would be that most words
are not more than about eight letters, so at one bit per letter the total
number (counting every bit string of at most eight bits) is 2^9-1, or 511.
Again, this is not the number of valid words, but the typical number of
words needed to cover a majority of the possibilities in a particular
context.
This seems quite plausible to me.
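
The arithmetic can be checked with a few lines of Python.  This is only a
sketch of one reading of the count, assuming one bit per letter and counting
every bit string of length 0 through the maximum word length; the function
name is made up for illustration.

```python
def distinguishable_words(max_letters, bits_per_letter=1):
    """Number of bit strings available to words of 0..max_letters letters."""
    return sum(2 ** (bits_per_letter * k) for k in range(max_letters + 1))

eight_letter_bound = distinguishable_words(8)   # 2^0 + 2^1 + ... + 2^8
fixed_five_letters = 2 ** 5                     # the count in the quoted argument
```

Allowing every length up to eight, rather than fixing the length at five, is
what moves the count from 32 to 2^9-1.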

All this doesn't prove that one bit per letter is the appropriate estimate,
but it is much more plausible than your analysis would suggest.