Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 (MU) 9/23/84; site mulga.OZ
Path: utzoo!linus!philabs!cmcl2!seismo!munnari!mulga!bjpt
From: bjpt@mulga.OZ (Benjamin Thompson)
Newsgroups: net.math
Subject: Re: Data compression and information theory
Message-ID: <858@mulga.OZ>
Date: Fri, 9-Aug-85 03:53:34 EDT
Article-I.D.: mulga.858
Posted: Fri Aug  9 03:53:34 1985
Date-Received: Mon, 12-Aug-85 04:00:14 EDT
References: <417@lasspvax.UUCP> <1010@mtgzz.UUCP> <854@mulga.OZ> <570@mmintl.UUCP>
Reply-To: bjpt@mulga.OZ (Benjamin Thompson)
Organization: Comp Sci, Melbourne Uni, Australia
Lines: 42

In article <570@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>>Are guesses really desirable in a compression system anyway?  There's
>>no one to say whether the guess was right or wrong ...
>
>Sure there is; there is the actual document.  You have a "guessing"
>algorithm, which always produces the same guesses for the same text.

The idea of compression is usually to save space (although it doesn't
always work that way - my supervisor was testing out a coding scheme and
managed to achieve a 600% increase!  This happened because she was
writing each bit out as a full character rather than packing the bits ...).
As such, keeping the original document around is (a) against the rules and
(b) not always possible (e.g. remote file transfer).  Guessing is not
generally desirable in reconstruction.

However, after talking with Stavros Macrakis (seismo!harvard!macrakis)
about this, my view of what Shannon was doing has changed a bit.  It is
effectively finding the probabilities of certain characters in certain
contexts (I claim these are basically the probabilities needed in
Huffman-like codings, although I don't think Stavros believes me).  What
his subjects were doing is clearly not feasible for a work-every-time
compression system.
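The context-probability idea can be sketched numerically: count how often each character follows each context character in a sample text, and compute the conditional entropy, which is the quantity a context-sensitive Huffman-like code would be built around. This is an illustrative sketch only (the sample text and order-1 context are my choices, not anything from the discussion), and it uses counted statistics where Shannon used human guessers.

```python
# Sketch: order-0 vs order-1 (previous-character context) entropy
# estimates, in bits per character. Illustrative only; Shannon's
# experiment used human guesses rather than bigram counts.
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of a frequency table."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def conditional_entropy(text):
    """H(next char | previous char), estimated from bigram counts."""
    pairs = Counter(zip(text, text[1:]))       # (prev, next) frequencies
    contexts = Counter(text[:-1])              # prev-character frequencies
    total = sum(pairs.values())
    h = 0.0
    for (prev, nxt), c in pairs.items():
        p_pair = c / total                     # P(prev, next)
        p_given = c / contexts[prev]           # P(next | prev)
        h -= p_pair * math.log2(p_given)
    return h

text = "the quick brown fox jumps over the lazy dog " * 50
print(entropy(Counter(text)))     # order-0 estimate, bits/char
print(conditional_entropy(text))  # order-1 estimate, well below order-0
```

Knowing the context never raises the entropy, so the order-1 figure is always at or below the order-0 one; longer contexts (as in Shannon's experiment) push it lower still.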
>>An obvious argument against one-bit-per-character goes something like
>>this: the average word has (say) five characters, which would imply
>>that its information content can be represented with 5 bits.  This in
>>turn would imply that there are around 2^5, or 32, valid words.  Rather
>>limited.  This is my interpretation of what one-bit-per-character
>>means; if I have missed something, please correct me.
>
>Yes, you have missed quite a bit.  First of all, you aren't justified in
>using the average word size.  A better estimate would be that most words
>are not more than about eight letters, so the total number is 2^9-1, or
>511.  Again, this is not the number of valid words, but the typical
>number of words needed to cover a majority of the possibilities in a
>particular context.  This seems quite plausible to me.
>
>All this doesn't prove that one bit per letter is the appropriate
>estimate, but it is much more plausible than your analysis would
>suggest.

I have to admit my analysis was a bit tongue-in-cheek.  Frank's estimate
is probably more reasonable, but I still wouldn't want to be restricted
to just 511 words.  Stavros mentioned a system operating on 2 bits per
character (heaps of words).  I find this plausible.
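The word counts being traded back and forth here are easy to check: at b bits per letter there are 2^b choices per position, so summing over word lengths 1 through 8 gives the capacity of each coding rate. A small sketch (my own arithmetic, not anything from the thread; Frank's 511 additionally counts the empty length-0 word):

```python
# Sketch: how many distinct words fit at a given bit rate per letter,
# for word lengths 1..8 (the lengths discussed in the article).
def word_capacity(bits_per_letter, max_len=8):
    symbols = 2 ** bits_per_letter          # choices per letter position
    return sum(symbols ** n for n in range(1, max_len + 1))

print(word_capacity(1))  # 510 -- essentially Frank's 2^9-1 = 511 figure
print(word_capacity(2))  # 87380 -- "heaps of words" at 2 bits/character
```

This makes the contrast concrete: one bit per letter caps the word list at a few hundred, while two bits per letter allows tens of thousands, which is why the 2-bits-per-character system sounds plausible.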