Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.2 9/18/84; site lasspvax.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!mhuxr!mhuxt!houxm!mtuxo!mtunh!mtung!mtunf!ariel!vax135!cornell!lasspvax!norman
From: norman@lasspvax.UUCP (Norman Ramsey)
Newsgroups: net.math
Subject: Re: Data compression and information theory
Message-ID: <459@lasspvax.UUCP>
Date: Mon, 12-Aug-85 18:16:18 EDT
Article-I.D.: lasspvax.459
Posted: Mon Aug 12 18:16:18 1985
Date-Received: Sat, 17-Aug-85 14:46:35 EDT
References: <417@lasspvax.UUCP> <1010@mtgzz.UUCP> <854@mulga.OZ>
Reply-To: norman@lasspvax.UUCP (Norman Ramsey)
Organization: LASSP, Cornell University
Lines: 37
Summary: 

In article <854@mulga.OZ> bjpt@mulga.OZ (Benjamin Thompson) writes:
>In article <1010@mtgzz.UUCP> dmt@mtgzz.UUCP (d.m.tutelman) writes:
>>The one-bit-per-character assertion comes from an old classic paper.
>>(Don't have a reference handy, but I believe it's by Claude Shannon
>>himself, published in BSTJ in the 1950s or even '40s.)

What Shannon claims in that paper is that the *redundancy* of English, as
measured by a variety of methods (one of which was character guessing), is
roughly 50%. This means that the amount of information carried in English
text is roughly 50% of the maximum possible with the same alphabet.
Shannon's alphabet was 26 letters plus word space, so the maximum is
log2(27), or about 4.75 bits per character, and a rough calculation says
about 2.4 bits per character in English. If you use six-letter words I
think you'll find this gives you an adequate number of words (2^14.4, or
twenty thousand or so).
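
For anyone who wants to check the arithmetic, here is a small sketch in
Python. The 27-symbol alphabet and the 50% redundancy are the figures quoted
above; the function names are my own, and treating the number of "usable"
words as 2^(bits per word) is only the usual back-of-envelope estimate, not
anything taken from Shannon's paper.

```python
import math

ALPHABET_SIZE = 27   # 26 letters plus word space, as in the post
REDUNDANCY = 0.5     # the rough 50% figure quoted above


def bits_per_char(alphabet_size=ALPHABET_SIZE, redundancy=REDUNDANCY):
    """Information per character: (1 - redundancy) * log2(alphabet size)."""
    return (1.0 - redundancy) * math.log2(alphabet_size)


def distinguishable_words(length, alphabet_size=ALPHABET_SIZE,
                          redundancy=REDUNDANCY):
    """Back-of-envelope count of words of a given length: 2^(bits per word)."""
    return 2.0 ** (length * bits_per_char(alphabet_size, redundancy))


if __name__ == "__main__":
    print("max bits/char   :", round(math.log2(ALPHABET_SIZE), 3))
    print("bits/char at 50%:", round(bits_per_char(), 3))
    print("six-letter words:", round(distinguishable_words(6)))
```

With the unrounded figure of about 2.38 bits per character, the six-letter
count comes out near 2 x 10^4; rounding up to 2.4 bits moves it a little
higher, but the order of magnitude does not change.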

As for the number of preceding characters we are allowed to see before we
guess, properly it's an infinite number, since the information content is a
property of a statistical ensemble of strings, and we hope everything is
ergodic so that instead of an infinite number of strings we can think about
an infinite length string. Of course this just means a string whose length
gets large, and I think in this context one can do very well with eight or
so characters.

If you want to do some quantitative measurements (I think I posted something
about this earlier), you could actually look at substrings of length n, and
see how rapidly the information per character converges as n gets large.
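
A minimal sketch of that measurement, in Python. It counts every length-n
substring of a sample, computes the empirical entropy H_n of those counts,
and reports H_n / n. The function names are my own invention, and this is
only the naive plug-in estimate: on a finite sample it is biased low once n
gets large enough that most substrings occur only once, so you need a lot of
text before the numbers for n near eight mean anything.

```python
import math
from collections import Counter


def block_entropy(text, n):
    """Empirical Shannon entropy (bits) of the length-n substrings of text."""
    blocks = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(blocks)
    total = len(blocks)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def per_char_estimates(text, max_n):
    """Return [(n, H_n / n)] for n = 1..max_n."""
    return [(n, block_entropy(text, n) / n) for n in range(1, max_n + 1)]


if __name__ == "__main__":
    # Toy sample only; substitute a large body of English text,
    # folded to the 27-symbol alphabet, for a real measurement.
    sample = "the quick brown fox jumps over the lazy dog " * 20
    for n, h in per_char_estimates(sample, 8):
        print(n, round(h, 3))
```

Watching how quickly the second column levels off as n grows is the
convergence I have in mind.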
-- 
Norman Ramsey

ARPA: norman@lasspvax  -- or --  norman%lasspvax@cu-arpa.cs.cornell.edu
UUCP: {ihnp4,allegra,...}!cornell!lasspvax!norman
BITNET: (in desperation only) ZSYJARTJ at CORNELLA
US Mail: Dept Physics, Clark Hall, Cornell University, Ithaca, New York 14853
Telephone: (607)-256-3944 (work)    (607)-272-7750 (home)

        Never eat anything with a shelf life of more than ten years