Path: utzoo!utgpu!watmath!watdragon!rose!ccplumb
From: ccplumb@rose.waterloo.edu (Colin Plumb)
Newsgroups: comp.sys.amiga.tech
Subject: Re: huffman encoding
Message-ID: <16926@watdragon.waterloo.edu>
Date: 3 Oct 89 20:00:35 GMT
References: <467@crash.cts.com>
Sender: daemon@watdragon.waterloo.edu
Reply-To: ccplumb@rose.waterloo.edu (Colin Plumb)
Organization: U. of Waterloo, Ontario
Lines: 103

In article <467@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>So I should have asked for the huffman bit patterns and bit counts
>for the constant encoding tree that would be used on files that
>have 256 unique elements.

Er... one of us is confused.  Huffman encoding relies on knowing the
static distribution of however many values you're trying to represent.
Without these probabilities, you can't create a Huffman encoding.

Supposing you do have a set of probabilities, p_0, p_1, p_2,...,p_n,
to get an optimal Huffman tree, sort them, take the two least likely
elements of the list, and create a common parent for them.  The probability
of this parent is the sum of the probabilities of its children, and the
encodings of its children are the encodings of the parent followed by "0"
and "1".  Remove the two child nodes from the set of probabilities and add
the parent, preserving the sorting.  Repeat, removing the two least likely
elements and replacing them with a node whose probability is the sum, until
you have only one node left.  One choice is 0 bits of information, and thus
the encoding of this final node, the root, is 0 bits long.

If all the probabilities are equal and the number of symbols is a power of
two (256, say), you'll find this produces equal-length encodings.  Suppose
instead we have 4 symbols, {A, B, C, D}, with probabilities
{.4, .3, .2, .1}.

Take the two least likely symbols, C and D, and make a parent for them
(call it CD) with probability .1+.2=.3.  The encoding of C is CD0 and
the encoding of D is CD1.  The list is now {A, B, CD} (or it could be
{A, CD, B}) with probabilities {.4, .3, .3}.  Again merge the least likely
two elements, producing node BCD with probability .6 and making the encoding
of B = BCD0, CD = BCD1.  The list is now {BCD, A} with probabilities {.6, .4}
and the final merge gives a list {ABCD} with probability {1}.  BCD = ABCD0
and A = ABCD1.

The encodings work out to be:
A = 1
B = 00
C = 010
D = 011

Of course, the choice of which child got "0" and which got "1" was arbitrary,
so we could make it
A = 0
B = 10
C = 110
D = 111
if we like.
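
For anyone who'd rather see it as code, here is the same merge procedure as a
quick, throwaway C sketch (my own illustration, nothing standard; an O(n^2)
scan stands in for a proper priority queue).  The tie-breaking differs from
the hand trace above, so the exact bits come out differently, but the code
lengths are the same.

/* Quick sketch of the merge procedure described above.  NSYM and the
 * probability table are just the {A, B, C, D} example. */
#include <stdio.h>

#define NSYM 4

struct node {
    double p;      /* probability of this subtree           */
    int parent;    /* index of parent node, -1 if none yet  */
    int bit;       /* bit this node contributes to the code */
    int live;      /* not yet merged into a parent?         */
};

static struct node tree[2 * NSYM - 1];

/* index of the live node with the smallest probability */
static int least(int nnodes)
{
    int i, best = -1;

    for (i = 0; i < nnodes; i++)
        if (tree[i].live && (best < 0 || tree[i].p < tree[best].p))
            best = i;
    return best;
}

int main(void)
{
    static double prob[NSYM] = { 0.4, 0.3, 0.2, 0.1 };  /* A B C D */
    int i, n;

    for (i = 0; i < NSYM; i++) {
        tree[i].p = prob[i];
        tree[i].parent = -1;
        tree[i].live = 1;
    }

    /* each merge retires two nodes and creates one internal node */
    for (n = NSYM; n < 2 * NSYM - 1; n++) {
        int a, b;

        a = least(n);  tree[a].live = 0;
        b = least(n);  tree[b].live = 0;

        tree[n].p = tree[a].p + tree[b].p;
        tree[n].parent = -1;
        tree[n].live = 1;
        tree[a].parent = n;  tree[a].bit = 0;  /* which child gets 0 */
        tree[b].parent = n;  tree[b].bit = 1;  /* is arbitrary       */
    }

    /* collect bits leaf-to-root, print them root-to-leaf */
    for (i = 0; i < NSYM; i++) {
        char code[NSYM];   /* a code is at most NSYM-1 bits */
        int len = 0, j;

        for (j = i; tree[j].parent >= 0; j = tree[j].parent)
            code[len++] = (char)('0' + tree[j].bit);
        printf("%c = ", 'A' + i);
        for (j = len - 1; j >= 0; j--)
            putchar(code[j]);
        printf("  (%d bits)\n", len);
    }
    return 0;
}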

The average length of a symbol is 1*.4 + 2*.3 + 3*.2 + 3*.1 = .4+.6+.6+.3 = 1.9,
a slight improvement over the 2 bits it would take using a fixed-length code.
If we give the scheme more symbols to play with, we get better results.

Say we group the symbols into pairs, giving 16 symbols:

AA = 0.16
AB = 0.12
AC = 0.08
AD = 0.04
BA = 0.12
BB = 0.09
BC = 0.06
BD = 0.03
CA = 0.08
CB = 0.06
CC = 0.04
CD = 0.02
DA = 0.04
DB = 0.03
DC = 0.02
DD = 0.01
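
Each pair's probability above is just the product of its two members'
probabilities, since the symbols are assumed independent; for example,
AB = .4 * .3 = .12.  A throwaway loop (again my own sketch) regenerates
the whole table:

#include <stdio.h>

int main(void)
{
    double p[4] = { 0.4, 0.3, 0.2, 0.1 };   /* A, B, C, D */
    int i, j;

    /* probability of the pair XY = p(X) * p(Y) */
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            printf("%c%c = %.2f\n", 'A' + i, 'A' + j, p[i] * p[j]);
    return 0;
}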

Running the same procedure on these 16 symbols produces the following
encodings (the exact codes depend on the order in which you list equal
elements, but all the choices are equally efficient):

AA = 001	* .16 = .48
AB = 100	* .12 = .36
AC = 0001	* .08 = .32
AD = 1101	* .04 = .16
BA = 101	* .12 = .36
BB = 111	* .09 = .27
BC = 0110	* .06 = .24
BD = 01011	* .03 = .15
CA = 0100	* .08 = .32
CB = 0111	* .06 = .24
CC = 00001	* .04 = .20
CD = 11001	* .02 = .10
DA = 00000	* .04 = .20
DB = 11000	* .03 = .15
DC = 010100	* .02 = .12
DD = 010101	* .01 = .06
Total		 1.00  3.73
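
To double-check that total, here is the same weighted-sum arithmetic, code
length times pair probability summed over all 16 pairs, as another throwaway
C fragment, with the lengths read straight off the table above:

#include <stdio.h>

int main(void)
{
    double p[4] = { 0.4, 0.3, 0.2, 0.1 };   /* A, B, C, D             */
    int len[4][4] = {                       /* code lengths per pair, */
        { 3, 3, 4, 4 },                     /* row = first symbol,    */
        { 3, 3, 4, 5 },                     /* column = second symbol */
        { 4, 4, 5, 5 },
        { 5, 5, 6, 6 }
    };
    double total = 0.0;
    int i, j;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            total += len[i][j] * p[i] * p[j];
    printf("%.2f bits per pair, %.3f per symbol\n", total, total / 2);
    return 0;
}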

This figure of 3.73/2 = 1.865 bits per symbol is better.  Note that I picked
a difficult-to-compress probability distribution.  The Shannon theorem lower
bound on the number of bits per symbol, assuming all symbols are
independent, is 1.846439344... bits per symbol.  Note that English text has
much more skewed probabilities, and also has strong inter-symbol dependencies.
Perfect compression of English text would use between 1.5 and 2 bits per
character.
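
That 1.846... figure is just the entropy of the single-symbol distribution,
H = -(sum of p*log2(p) over the symbols).  Here is the arithmetic in C for
anyone who wants to check it (another throwaway fragment; link with the
math library):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double p[4] = { 0.4, 0.3, 0.2, 0.1 };
    double h = 0.0;
    int i;

    /* H = -sum p*log2(p); log2(x) written as log(x)/log(2) */
    for (i = 0; i < 4; i++)
        h -= p[i] * log(p[i]) / log(2.0);
    printf("H = %.9f bits per symbol\n", h);   /* about 1.8464393 */
    return 0;
}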

Anyway, now do you know what you want?  I'm still not quite sure, but have
tried to answer the common questions.
-- 
	-Colin