Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site lasspvax.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!mhuxt!houxm!mtuxo!mtunh!mtung!mtunf!ariel!vax135!cornell!lasspvax!norman
From: norman@lasspvax.UUCP (Norman Ramsey)
Newsgroups: net.math
Subject: Re: Data compression and information theory
Message-ID: <459@lasspvax.UUCP>
Date: Mon, 12-Aug-85 18:16:18 EDT
Article-I.D.: lasspvax.459
Posted: Mon Aug 12 18:16:18 1985
Date-Received: Sat, 17-Aug-85 14:46:35 EDT
References: <417@lasspvax.UUCP> <1010@mtgzz.UUCP> <854@mulga.OZ>
Reply-To: norman@lasspvax.UUCP (Norman Ramsey)
Organization: LASSP, Cornell University
Lines: 37
Summary:

In article <854@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>In article <1010@mtgzz.UUCP> dmt@mtgzz.UUCP (d.m.tutelman) writes:
>>The one-bit-per-character assertion comes from an old classic paper.
>>(Don't have a reference handy, but I believe it's by Claude Shannon
>>himself, published in BSTJ in the 1950s or even '40s.)

What Shannon claims in that paper is that the *redundancy* of English,
as measured by a variety of methods (one of which was character
guessing), is roughly 50%. This means that the amount of information
carried in English text is roughly 50% of the maximum possible with the
same alphabet. Shannon's alphabet was 26 letters plus word space, so a
rough calculation says about 2.4 bits per character in English. If you
use six-letter words I think you'll find this gives you an adequate
number of words (thirty thousand or so).
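The rough calculation above can be checked in a few lines. This is only a sketch of the arithmetic, assuming a 27-symbol alphabet and exactly 50% redundancy as stated in the post; the names ALPHABET_SIZE, REDUNDANCY, and distinct_words are made up for illustration. The word count is quite sensitive to how the per-character figure is rounded, which is why the result ranges from roughly twenty to thirty thousand.

```python
import math

ALPHABET_SIZE = 27   # 26 letters plus word space, as in the post
REDUNDANCY = 0.5     # the rough 50% figure quoted in the post

# Maximum information per character if all 27 symbols were equally likely.
max_bits = math.log2(ALPHABET_SIZE)

# Information per character once the redundancy is taken out.
bits_per_char = (1 - REDUNDANCY) * max_bits


def distinct_words(rate, letters=6):
    """Number of equally likely words of this length at `rate` bits per character."""
    return 2 ** (rate * letters)


print(f"maximum:  {max_bits:.2f} bits/char")
print(f"at 50% redundancy: {bits_per_char:.2f} bits/char")

# Compare the unrounded rate with the rounded 2.4 and a slightly higher 2.5.
for rate in (bits_per_char, 2.4, 2.5):
    print(f"{rate:.2f} bits/char -> {distinct_words(rate):,.0f} six-letter words")
```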

As for the number of preceding characters we are allowed to see before
we guess, properly it's an infinite number, since the information
content is a property of a statistical ensemble of strings, and we hope
everything is ergodic so that instead of an infinite number of strings
we can think about a single string of infinite length. Of course this
just means a string whose length gets large, and I think in this
context one can do very well with eight or so characters. If you want
to do some quantitative measurements, I think I posted something about
this earlier; you could actually look at substrings of length n and see
how rapidly the information per character converges as n gets large.
-- 
Norman Ramsey

ARPA: norman@lasspvax  -- or --  norman%lasspvax@cu-arpa.cs.cornell.edu
UUCP: {ihnp4,allegra,...}!cornell!lasspvax!norman
BITNET: (in desperation only) ZSYJARTJ at CORNELLA
US Mail: Dept Physics, Clark Hall, Cornell University, Ithaca, New York 14853
Telephone: (607)-256-3944 (work)    (607)-272-7750 (home)

        Never eat anything with a shelf life of more than ten years
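
The substring measurement suggested in the post could be sketched roughly as follows. This is one possible estimator, not necessarily the one from the earlier posting: it takes the Shannon entropy of the observed length-n substrings and divides by n. The function names and the toy sample are assumptions for illustration. On a short sample the estimate falls too fast as n grows, simply because most length-n substrings are never seen, so a real measurement needs a large body of text.

```python
import math
from collections import Counter


def normalize(raw):
    """Map text onto 27 symbols: lowercase a-z plus single word spaces."""
    kept = "".join(ch if "a" <= ch <= "z" else " " for ch in raw.lower())
    return " ".join(kept.split())


def block_entropy(text, n):
    """Entropy in bits of the empirical distribution of length-n substrings."""
    blocks = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def bits_per_char(text, n):
    """Per-character estimate: entropy of the n-blocks divided by n."""
    return block_entropy(text, n) / n


# Toy sample only; far too short for the numbers to mean much.
sample = normalize(
    "What Shannon claims in that paper is that the redundancy of English, "
    "as measured by a variety of methods, is roughly 50%."
)

for n in range(1, 9):
    print(f"n={n}: {bits_per_char(sample, n):.3f} bits/char")
```

Running the same loop over a much longer text and watching where the per-character figure levels off is the experiment the post describes.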