Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site mmintl.UUCP
Path: utzoo!linus!philabs!pwa-b!mmintl!franka
From: franka@mmintl.UUCP (Frank Adams)
Newsgroups: net.math
Subject: Re: Data compression and information theory
Message-ID: <589@mmintl.UUCP>
Date: Mon, 12-Aug-85 18:02:03 EDT
Article-I.D.: mmintl.589
Posted: Mon Aug 12 18:02:03 1985
Date-Received: Wed, 14-Aug-85 20:34:15 EDT
References: <417@lasspvax.UUCP> <1010@mtgzz.UUCP> <854@mulga.OZ> <570@mmintl.UUCP> <858@mulga.OZ>
Reply-To: franka@mmintl.UUCP (Frank Adams)
Organization: Multimate International, E. Hartford, CT
Lines: 60
Summary:

In article <858@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>In article <570@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>>>Are guesses really desirable in a compression system anyway ?  There is
>>>no-one to say whether the guess was right or wrong ...
>>
>>Sure there is; there is the actual document.  You have a "guessing"
>>algorithm, which always produces the same guesses for the same text.
>
>The idea of compression is usually to save space [...]
> As such, keeping the original document
>around is a) against the rules and b) not always possible (e.g. remote file
>transfer).  Guessing is not generally desirable in re-construction.

Let me clarify what a data compression algorithm based on "guessing" might
look like.  The basic encoding loop looks like this:

    for each character in source
        while (character != guess())
            output '1' bit
        output '0' bit

The guess function looks only at the document up to but not including the
current character.  It returns the most likely next character the first time
it is called for a position, the next most likely the second time, and so on.
The expansion algorithm is as follows:

    until end of file
        while (bit '1' read)
            guess()
        output guess()

(A concrete sketch of both loops appears at the end of this article.)

Note that *any* compression algorithm will sometimes make the result longer
than the input.

>However, after talking with Stavros Macrakis (seismo!harvard!macrakis) about
>this, my view of what Shannon was doing has changed a bit.  It is effectively
>finding probabilities of certain characters in certain contexts (I claim
>this is basically the probabilities needed in Huffman-like codings, although
>I don't think Stavros believes me).  What his subjects were doing is clearly
>not feasible for a work-every-time compression system.

Yes, these are basically the probabilities needed in Huffman-like codings.
My sample algorithm assumes the most likely guess has probability 1/2, the
next most likely 1/4, and so on.  (As such, it averages two bits per
character; the arithmetic is spelled out at the end of this article.  A
better result would require encoding multiple following characters with a
single bit in some instances.)  What his subjects were doing is not feasible
for a compression system; the point is that it suggests an optimal algorithm
might achieve an average of one bit per character.  How to guess what comes
next is the next question, to which there is as yet no answer.

> but I still wouldn't want to be restricted
>to just 511 words.

You are still missing a key point.  You aren't restricted to ~500 words.
It's just that, in any context (preceding portion of the document), there is
a list of about 500 words such that, most of the time, your next word will
be on the list.  You *can* use any word (or fhdias) that you want.
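
To make the guessing coder above concrete, here is a minimal sketch of both
loops.  It is written in Python purely for brevity; the guessing model
(ranking characters by how often they have appeared in the text so far) and
all of the names below are stand-ins of my own choosing, not any particular
system -- any deterministic guesser shared by the compressor and the
expander would do.  For simplicity the expanded length is passed explicitly
instead of relying on an end-of-file marker.

    # Unary "guessing" code: for each character, emit one '1' bit per wrong
    # guess and a '0' bit for the right one.  Assumes the text contains only
    # printable ASCII and newlines.
    from collections import Counter

    ALPHABET = [chr(c) for c in range(32, 127)] + ['\n']

    def ranked_guesses(prefix):
        # Most frequent character in the prefix first; ties broken by the
        # fixed alphabet order.  Deterministic, so both ends stay in step.
        counts = Counter(prefix)
        return sorted(ALPHABET, key=lambda ch: (-counts[ch], ALPHABET.index(ch)))

    def compress(text):
        bits = []
        for i, ch in enumerate(text):
            for g in ranked_guesses(text[:i]):
                if g == ch:
                    bits.append('0')          # right guess: stop
                    break
                bits.append('1')              # wrong guess: keep guessing
        return ''.join(bits)

    def expand(bits, length):
        out = []
        stream = iter(bits)
        for _ in range(length):
            guesses = iter(ranked_guesses(''.join(out)))
            g = next(guesses)
            while next(stream) == '1':        # each '1' bit discards a guess
                g = next(guesses)
            out.append(g)
        return ''.join(out)

    text = "the quick brown fox jumps over the lazy dog"
    coded = compress(text)
    assert expand(coded, len(text)) == text
    print(len(coded), "bits for", len(text), "characters")

With so naive a model the code will of course use far more than two bits per
character on English text; all the interesting work lies in making guess()
smarter about its context.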
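
As for the two-bits-per-character figure: the k-th guess costs k bits (k-1
'1's and a '0'), and it is assumed to be right with probability 1/2^k, so
the average cost per character is

    1*(1/2) + 2*(1/4) + 3*(1/8) + 4*(1/16) + ... = 2

A couple of lines in the same vein check the sum numerically (again only a
sketch, with names of my own choosing):

    # average bits per character when the k-th guess is right with prob. 2**-k
    expected_bits = sum(k * 2.0 ** -k for k in range(1, 64))
    print(expected_bits)    # 2.0, up to floating-point rounding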