Path: utzoo!utgpu!watmath!watdragon!rose!ccplumb
From: ccplumb@rose.waterloo.edu (Colin Plumb)
Newsgroups: comp.sys.amiga.tech
Subject: Re: huffman encoding
Message-ID: <16926@watdragon.waterloo.edu>
Date: 3 Oct 89 20:00:35 GMT
References: <467@crash.cts.com>
Sender: daemon@watdragon.waterloo.edu
Reply-To: ccplumb@rose.waterloo.edu (Colin Plumb)
Organization: U. of Waterloo, Ontario
Lines: 103

In article <467@crash.cts.com> uzun@pnet01.cts.com (Roger Uzun) writes:
>So I should have asked for the huffman bit patterns and bit counts
>for the constant encoding tree that would be used on files that
>have 256 unique elements.

Er... one of us is confused.  Huffman encoding relies on knowing the
static distribution of however many values you're trying to represent.
Without these probabilities, you can't create a Huffman encoding.

Supposing you do have a set of probabilities, p_0, p_1, p_2, ..., p_n,
to get an optimal Huffman tree, sort them, take the two least likely
elements of the list, and create a common parent for them.  The
probability of this parent is the sum of the probabilities of its
children, and the encodings of its children are the encoding of the
parent followed by "0" and "1".  Remove the two child nodes from the
set of probabilities and add the parent, preserving the sorting.
Repeat, removing the two least likely elements and replacing them with
a node whose probability is the sum, until you have only one node
left.  One choice is 0 bits of information, so the encoding of this
final node is 0 bits long.  If all the probabilities are equal, you'll
find this produces equal-length encodings.

Suppose we have 4 symbols, {A, B, C, D}, with probabilities
{.4, .3, .2, .1}.  Take the two least likely symbols, C and D, and
make a parent for them (call it CD) with probability .1 + .2 = .3.
The encoding of C is CD0 and the encoding of D is CD1.  The list is
now {A, B, CD} (or it could be {A, CD, B}) with probabilities
{.4, .3, .3}.  Again merge the two least likely elements, producing
node BCD with probability .6 and making the encoding of B = BCD0 and
CD = BCD1.  The list is now {BCD, A} with probabilities {.6, .4}, and
the final merge gives a list {ABCD} with probability {1}, where
BCD = ABCD0 and A = ABCD1.  The encodings work out to be:

A = 1
B = 00
C = 010
D = 011

Of course, the choice of which child got "0" and which got "1" was
arbitrary, so we could make it

A = 0
B = 10
C = 110
D = 111

if we like.  The average length of a symbol is
1*.4 + 2*.3 + 3*.2 + 3*.1 = .4 + .6 + .6 + .3 = 1.9 bits, a slight
improvement over the 2 bits it would take using a fixed-length code.

If we give the scheme more symbols to play with, we get better
results.  Say we group the symbols into pairs, giving 16 symbols:

AA = 0.16   AB = 0.12   AC = 0.08   AD = 0.04
BA = 0.12   BB = 0.09   BC = 0.06   BD = 0.03
CA = 0.08   CB = 0.06   CC = 0.04   CD = 0.02
DA = 0.04   DB = 0.03   DC = 0.02   DD = 0.01

This produces the encodings (the exact tree depends on the order in
which you list equal elements, but they're all equally efficient):

AA = 001    * .16 = .48
AB = 100    * .12 = .36
AC = 0001   * .08 = .32
AD = 1101   * .04 = .16
BA = 101    * .12 = .36
BB = 111    * .09 = .27
BC = 0110   * .06 = .24
BD = 01011  * .03 = .15
CA = 0100   * .08 = .32
CB = 0111   * .06 = .24
CC = 00001  * .04 = .20
CD = 11001  * .02 = .10
DA = 00000  * .04 = .20
DB = 11000  * .03 = .15
DC = 010100 * .02 = .12
DD = 010101 * .01 = .06
              ----    ----
Total         1.00    3.73

This figure of 3.73/2 = 1.865 bits per symbol is better.  Note that I
picked a difficult-to-compress probability distribution.
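In case it helps, here's a small C sketch of the merge procedure just
described (the data structures and names are my own, not from any
particular library).  It builds the tree for the {A, B, C, D} =
{.4, .3, .2, .1} example and prints the code words (the 0/1 choices
are arbitrary, so the bit patterns may differ from the listing above,
though the lengths agree), the average length, and the entropy bound
mentioned below:

/* Sketch of the merge procedure above: build the Huffman tree for
 * the {A,B,C,D} = {.4,.3,.2,.1} example and print the code words,
 * the average code length, and the entropy lower bound.
 * (The structure and names here are my own, not any library's API.)
 */
#include <stdio.h>
#include <string.h>
#include <math.h>

#define NSYM  4                 /* number of leaf symbols           */
#define NNODE (2*NSYM - 1)      /* leaves plus internal nodes       */

struct node {
    double prob;                /* probability of this subtree      */
    int    left, right;         /* children, or -1 for a leaf       */
};

static struct node tree[NNODE];
static double avg_len;          /* sum of p(symbol) * code length   */

/* Walk the tree, appending '0' or '1' to the prefix as we descend. */
static void print_codes(int n, char *prefix, const char *names)
{
    size_t len;

    if (tree[n].left < 0) {     /* leaf: print its code word        */
        printf("%c = %s\n", names[n], prefix);
        avg_len += tree[n].prob * strlen(prefix);
        return;
    }
    len = strlen(prefix);
    prefix[len] = '0'; prefix[len + 1] = '\0';
    print_codes(tree[n].left, prefix, names);
    prefix[len] = '1';
    print_codes(tree[n].right, prefix, names);
    prefix[len] = '\0';
}

int main(void)
{
    static const double p[NSYM] = { 0.4, 0.3, 0.2, 0.1 };
    static const char names[NSYM] = { 'A', 'B', 'C', 'D' };
    int alive[NNODE];           /* 1 while a node is still unmerged */
    char prefix[NNODE];
    double entropy = 0.0;
    int i, nodes;

    for (i = 0; i < NSYM; i++) {
        tree[i].prob = p[i];
        tree[i].left = tree[i].right = -1;
        alive[i] = 1;
        entropy -= p[i] * log(p[i]) / log(2.0);
    }

    /* Repeatedly merge the two least likely live nodes, exactly as
     * described in the text, until only the root is left. */
    for (nodes = NSYM; nodes < NNODE; nodes++) {
        int lo1 = -1, lo2 = -1;
        for (i = 0; i < nodes; i++) {
            if (!alive[i])
                continue;
            if (lo1 < 0 || tree[i].prob < tree[lo1].prob) {
                lo2 = lo1;
                lo1 = i;
            } else if (lo2 < 0 || tree[i].prob < tree[lo2].prob) {
                lo2 = i;
            }
        }
        tree[nodes].prob  = tree[lo1].prob + tree[lo2].prob;
        tree[nodes].left  = lo1;
        tree[nodes].right = lo2;
        alive[lo1] = alive[lo2] = 0;
        alive[nodes] = 1;
    }

    prefix[0] = '\0';
    print_codes(NNODE - 1, prefix, names);
    printf("average length = %.3f bits/symbol\n", avg_len);
    printf("entropy bound  = %.6f bits/symbol\n", entropy);
    return 0;
}

It should print A = 0, B = 10, D = 110, C = 111 (one of the equally
good assignments), the 1.9-bit average, and the 1.846439... entropy
figure quoted below.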
The Shannon-theorem lower bound on the number of bits per symbol,
assuming all symbols are independent, is 1.846439344... bits.  Note
that English text has much more skewed probabilities, and also has
strong inter-symbol dependencies.  Perfect compression of English
text would use between 1.5 and 2 bits per character.

Anyway, now do you know what you want?  I'm still not quite sure, but
I have tried to answer the common questions.
-- 
-Colin