Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site mmintl.UUCP
Path: utzoo!linus!philabs!pwa-b!mmintl!franka
From: franka@mmintl.UUCP (Frank Adams)
Newsgroups: net.math
Subject: Re: Data compression and information theory
Message-ID: <570@mmintl.UUCP>
Date: Wed, 7-Aug-85 14:54:53 EDT
Article-I.D.: mmintl.570
Posted: Wed Aug 7 14:54:53 1985
Date-Received: Sun, 11-Aug-85 03:48:40 EDT
References: <417@lasspvax.UUCP> <1010@mtgzz.UUCP> <854@mulga.OZ>
Reply-To: franka@mmintl.UUCP (Frank Adams)
Organization: Multimate International, E. Hartford, CT
Lines: 36
Summary: Guesses

In article <854@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>I have two sentence beginnings for you to guess the next letter of:
>1) "A" and 2) "". In comparison to simple Huffman encodings, I don't expect
>very many people to get it right within 5 guesses.

1) blank; "n"; "r"; "t"; "f"
2) "T"; "A"; "I"; "W"; "F"

These won't cover 90% of the sentences beginning as indicated, but I think
they will get a clear majority. And, of course, these are among the hardest
cases. Try, for example, to guess the next letter in these sentences:
1) "This is th" and 2) "Where are you g".

>Are guesses really desirable in a compression system anyway ? There is
>no-one to say whether the guess was right or wrong ...

Sure there is; there is the actual document. You have a "guessing"
algorithm, which always produces the same guesses for the same text.

>An obvious argument against one-bit-per-character goes something like this :
>The average word has (say) five characters, which would imply that its
>information content can be represented with 5 bits. This in turn would
>imply that there are around 2^5, or 32, valid words. Rather limited.
>This is my interpretation of what one-bit-per-character means; if I have
>missed something, please correct me.

Yes, you have missed quite a bit. First of all, you aren't justified in
using the average word size.
A better estimate would be that most words are not more than about eight letters, so the total number is 2^9-1, or 511. Again, this is not the number of valid words, but the typical number of words needed to cover a majority of the possibilities in a particular context. This seems quite plausible to me.

All this doesn't prove that one bit per letter is the appropriate estimate, but it is much more plausible than your analysis would suggest.
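
The point about a deterministic guesser can be made concrete. What follows is only a minimal sketch under stated assumptions, not anything proposed in the thread: it assumes a toy model that ranks candidate letters by how often each one followed the previous letter in some training text, and the names `build_model`, `guesses`, `encode` and `decode` are hypothetical. The encoder records how many guesses were needed for each letter; the decoder runs the same guesser and therefore recovers the text exactly, with the document itself serving as the judge of each guess.

```python
from collections import Counter, defaultdict

# Assumption for this sketch: lowercase letters plus blank only.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def build_model(training_text):
    """Count which character follows each single preceding character."""
    follows = defaultdict(Counter)
    prev = " "  # treat the start of the text as following a blank
    for ch in training_text:
        follows[prev][ch] += 1
        prev = ch
    return follows


def guesses(model, prev):
    """Return all of ALPHABET, most likely first.

    Ties are broken alphabetically, so the same context always
    yields the same list of guesses.
    """
    counts = model.get(prev, Counter())
    return sorted(ALPHABET, key=lambda c: (-counts[c], c))


def encode(model, text):
    """Replace each character by the 0-based position of the correct guess.

    Raises ValueError if text contains a character outside ALPHABET.
    """
    ranks = []
    prev = " "
    for ch in text:
        ranks.append(guesses(model, prev).index(ch))
        prev = ch
    return ranks


def decode(model, ranks):
    """Rebuild the text by replaying the same guesser on the decoded prefix."""
    out = []
    prev = " "
    for r in ranks:
        ch = guesses(model, prev)[r]
        out.append(ch)
        prev = ch
    return "".join(out)


if __name__ == "__main__":
    model = build_model("this is the thing that they think")
    ranks = encode(model, "this is that")
    print(ranks)
    print(decode(model, ranks))
```

With a good guesser most of the recorded positions are small, and a skewed sequence like that is what a Huffman or similar code can then store in few bits. The sketch does not attempt that last step, and it says nothing about whether one bit per letter is actually attainable.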