compression [message #373801]
Thu, 20 September 2018 15:52
Originally posted by: Anthony Adverse
Does anyone know, historically, how we ended up with NuFX/SHK's LZW-based compression rather than the Huffman-style compression that was so prevalent on the PeeCee?
A
Re: compression [message #373807 is a reply to message #373801]
Thu, 20 September 2018 19:04
Originally posted by: John Brooks
Huffman is bit-based and slow on the 6502 & 65816 compared to byte-based LZ (dictionary) compression algorithms.
Modern LZ4 is a hybrid codec which is byte-based, but packs two 4-bit nibbles per byte to improve efficiency. However, LZ4 is still a dictionary codec and not an entropy codec like Huffman, arithmetic, or ANS.
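To make the nibble-packing concrete, here is a minimal Python sketch of how an LZ4 sequence token splits into its two 4-bit fields (the extension bytes used when a field saturates at 15 are omitted; this is a simplification, not a full LZ4 parser):

```python
def parse_lz4_token(token: int):
    """Split an LZ4 sequence token byte into its two 4-bit fields.

    High nibble: literal run length (15 means extension bytes follow).
    Low nibble: match length minus the 4-byte minimum match.
    """
    literal_len = (token >> 4) & 0x0F
    match_len = (token & 0x0F) + 4  # minimum match in LZ4 is 4 bytes
    return literal_len, match_len

print(parse_lz4_token(0x52))  # (5, 6): 5 literals, then a 6-byte match
```

Because both fields live in one byte, the decoder never has to shift individual bits out of the stream, which is exactly what makes it attractive on a 6502/65816.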
-JB
@JBrooksBSI
Re: compression [message #373879 is a reply to message #373807]
Sat, 22 September 2018 01:15
Originally posted by: fadden
On Thursday, September 20, 2018 at 4:04:58 PM UTC-7, John Brooks wrote:
> Huffman is bit-based and slow on the 6502 & 65816 compared to byte-based LZ (dictionary) compression algorithms.
The LZW used in ShrinkIt uses a variable-width code (9-12 bits), so it's doing its share of shifting. The LZSS codec in HardPressed was slower when compressing than ShrinkIt because of the dictionary string matching, but much faster to decompress because it's byte-oriented. The lack of a barrel shift was a constant source of annoyance for me. :-)
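For illustration, here is a Python sketch of the kind of bit-packing a variable-width code stream forces on you. The MSB-first bit order is an arbitrary choice for the example; I have not checked ShrinkIt's actual bit order, and on a 6502 each `>>`/`<<` below would be a loop of single-bit shifts:

```python
class BitWriter:
    """Pack variable-width codes (e.g. 9-12 bit LZW codes) into bytes."""

    def __init__(self):
        self.out = bytearray()
        self.acc = 0    # bit accumulator
        self.nbits = 0  # number of valid bits currently in acc

    def put(self, code: int, width: int):
        self.acc = (self.acc << width) | code
        self.nbits += width
        while self.nbits >= 8:  # flush whole bytes as they fill
            self.nbits -= 8
            self.out.append((self.acc >> self.nbits) & 0xFF)

    def flush(self) -> bytes:
        if self.nbits:  # pad the final partial byte with zero bits
            self.out.append((self.acc << (8 - self.nbits)) & 0xFF)
            self.nbits = 0
        return bytes(self.out)

w = BitWriter()
for code in (0x41, 0x42, 0x43):  # three 9-bit codes = 27 bits
    w.put(code, 9)
print(len(w.flush()))  # 4 bytes (27 bits rounded up)
```

Every code that straddles a byte boundary costs shift-and-mask work, which is the "share of shifting" being described.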
The predecessors to ShrinkIt, notably BLU (Binary II Library Utility) and DDD, used a Huffman scheme. ShrinkIt's LZW compressed things more tightly, which was a significant advantage with the modem speeds of the day (9600bps? 14.4K?)
Perhaps relevant: https://fadden.com/apple2/hdc/index.html
Re: compression [message #373885 is a reply to message #373879]
Sat, 22 September 2018 03:24
Originally posted by: Anthony Adverse
I always found it a little odd... Huffman appears to manage higher compression than LZW, although as always it depends on the size and type of file being compressed. But for interest's sake, back in the dark ages, I tried feeding .shk files into ARJ on the PC. It could still compress them significantly, which always made me think that the LZW in use was fairly inefficient. Maybe it's ShrinkIt rather than LZW per se... I'm not sure.
Re: compression [message #373886 is a reply to message #373885]
Sat, 22 September 2018 04:23
Originally posted by: Tom Porter
Are there any binaries that are ready-made and compiled? All I have been finding is source code... the only real binaries I have found are hard-coded hi-res image compressors and decompressors, which are unsuitable for things like double lo-res or general data.
Re: compression [message #373889 is a reply to message #373879]
Sat, 22 September 2018 06:17
Originally posted by: John Brooks
On Friday, September 21, 2018 at 10:15:05 PM UTC-7, fadden wrote:
> On Thursday, September 20, 2018 at 4:04:58 PM UTC-7, John Brooks wrote:
>> Huffman is bit-based and slow on the 6502 & 65816 compared to byte-based LZ (dictionary) compression algorithms.
>
> The LZW used in ShrinkIt uses a variable-width code (9-12 bits), so it's doing its share of shifting. The LZSS codec in HardPressed was slower when compressing than ShrinkIt because of the dictionary string matching, but much faster to decompress because it's byte-oriented. The lack of a barrel shift was a constant source of annoyance for me. :-)
>
> The predecessors to ShrinkIt, notably BLU (Binary II Library Utility) and DDD, used a Huffman scheme. ShrinkIt's LZW compressed things more tightly, which was a significant advantage with the modem speeds of the day (9600bps? 14.4K?)
>
> Perhaps relevant: https://fadden.com/apple2/hdc/index.html
Nice writeup of all those compression formats Andy.
Here are a few other important codecs:
aPLib - a dictionary codec with bit-shifting. Used in qkumba's single-file games.
https://groups.google.com/forum/#!searchin/comp.sys.apple2/aplib%7Csort:date/comp.sys.apple2/h8GcbRm1bic/M5d9NsCW9FkJ
http://pferrie.host22.com/misc/appleii.htm
LZ4 - faster than aPLib but less compression.
https://www.brutaldeluxe.fr/products/crossdevtools/lz4/index.html
https://github.com/lz4/lz4
http://pferrie.host22.com/misc/appleii.htm
ANS/FSE - byte-based entropy coder which largely replaces Huffman & arithmetic coding. Used in zstd & Apple's lzfse.
https://github.com/facebook/zstd
https://github.com/lzfse/lzfse
Other good codecs which are not that practical for AppleII platforms:
Deflate (zip)
LZMA
Brotli
PAQ
-JB
@JBrooksBSI
Re: compression [message #373910 is a reply to message #373885]
Sat, 22 September 2018 14:22
Originally posted by: fadden
On Saturday, September 22, 2018 at 12:24:35 AM UTC-7, Anthony Adverse wrote:
> I always found it a little odd... Huffman appears to manage higher compression than LZW although as always it depends on the size and type of file being compressed. But for interests sake back in the dark ages, I tried feeding .shk files into ARJ on the pc. It could still manage to compress them significantly. Which always made me think, that the LZW in use was fairly inefficient. Maybe its shrinkit rather than just LZW per se... I'm not sure..
It depends on what you're compressing. Adaptive Huffman does extremely well on base64-encoded files, but string-matching algorithms will beat it easily on English text. For Apple II disk images -- the thing ShrinkIt was originally designed to operate on -- LZW wins consistently. (CiderPress can create DDD images if you want to experiment.)
ShrinkIt has a bunch of stuff in the headers, and 12-bit LZW restarts its dictionary fairly often. It doesn't really get near the entropy of the source. I tried an experiment with an old spam mailbox file just for fun:
Original: 49012729
SHK: 37585096
SHK+ARJ: 32558597
ARJ: 25596612
bzip2: 23475441
bzip2+ARJ: nope
ncompress, which uses 16-bit LZW, only reduced the file to 37280057. ARJ took that down to 31047576. So I'd blame LZW.
(LZ78 algorithms like LZW will theoretically reach the entropy of the input once you've fed enough input, assuming you have infinite memory and a great deal of patience. LZ77 algorithms like LZH don't reach the same theoretic limit, but get close a whole lot faster.)
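For anyone curious what "restarts its dictionary fairly often" looks like, here is a toy 12-bit LZW encoder in Python. The reserved codes, widening rule, and reset timing all differ between ShrinkIt, GIF, and Unix compress; this only shows the general shape, not ShrinkIt's exact format:

```python
def lzw_codes(data: bytes, max_width: int = 12):
    """Toy LZW: emit (code, width) pairs, widening from 9 bits and
    restarting the dictionary when the 12-bit code space fills up."""
    def fresh():
        return {bytes([i]): i for i in range(256)}

    table, next_code, width = fresh(), 257, 9  # code 256 left reserved
    out, w = [], b""
    for b in data:
        wb = w + bytes([b])
        if wb in table:
            w = wb  # keep extending the current phrase
            continue
        out.append((table[w], width))
        table[wb] = next_code
        next_code += 1
        if next_code >= (1 << max_width):      # table full: start over
            table, next_code, width = fresh(), 257, 9
        elif next_code >= (1 << width):        # need one more bit
            width += 1
        w = bytes([b])
    if w:
        out.append((table[w], width))
    return out

codes = lzw_codes(b"ab" * 200)
print(len(codes))  # far fewer than the 400 input bytes
```

Each reset throws away everything the dictionary has learned, which is part of why a second-pass compressor can still find slack in the output.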
Re: compression [message #373911 is a reply to message #373910]
Sat, 22 September 2018 14:50
Originally posted by: John Brooks
On Saturday, September 22, 2018 at 11:22:40 AM UTC-7, fadden wrote:
> On Saturday, September 22, 2018 at 12:24:35 AM UTC-7, Anthony Adverse wrote:
>> I always found it a little odd... Huffman appears to manage higher compression than LZW although as always it depends on the size and type of file being compressed. But for interests sake back in the dark ages, I tried feeding .shk files into ARJ on the pc. It could still manage to compress them significantly. Which always made me think, that the LZW in use was fairly inefficient. Maybe its shrinkit rather than just LZW per se... I'm not sure.
>
>
> It depends on what you're compressing. Adaptive Huffman does extremely well on base64-encoded files, but string-matching algorithms will beat it easily on English text. For Apple II disk images -- the thing ShrinkIt was originally designed to operate on -- LZW wins consistently. (CiderPress can create DDD images if you want to experiment.)
>
> ShrinkIt has a bunch of stuff in the headers, and 12-bit LZW restarts its dictionary fairly often. It doesn't really get near the entropy of the source. I tried an experiment with an old spam mailbox file just for fun:
>
> Original: 49012729
> SHK: 37585096
> SHK+ARJ: 32558597
> ARJ: 25596612
> bzip2: 23475441
> bzip2+ARJ: nope
>
> ncompress, which uses 16-bit LZW, only reduced the file to 37280057. ARJ took that down to 31047576. So I'd blame LZW.
>
> (LZ78 algorithms like LZW will theoretically reach the entropy of the input once you've fed enough input, assuming you have infinite memory and a great deal of patience. LZ77 algorithms like LZH don't reach the same theoretic limit, but get close a whole lot faster.)
Good info/comparison of leading codecs:
Pareto frontier:
http://richg42.blogspot.com/2015/11/the-lossless-decompression-pareto.html
Hutter prize:
http://mattmahoney.net/dc/text.html
-JB
@JBrooksBSI
Re: compression [message #373934 is a reply to message #373885]
Sun, 23 September 2018 00:58
Originally posted by: kegs
In article <fedec7f2-739b-4454-b149-88ba689df162@googlegroups.com>,
Anthony Adverse <the.ertceps@gmail.com> wrote:
> I always found it a little odd... Huffman appears to manage higher
> compression than LZW although as always it depends on the size and type
> of file being compressed. But for interests sake back in the dark ages,
> I tried feeding .shk files into ARJ on the pc. It could still manage to
> compress them significantly. Which always made me think, that the LZW in
> use was fairly inefficient. Maybe its shrinkit rather than just LZW per
> se... I'm not sure.
>
If ARJ uses Huffman, it's as a minor additional step. I believe ARJ is
in the LZ77 family. This is much more advanced than a practical Apple II
compression program could use. ARJ is not open source, and reverse-
engineering these things is very tedious, so I'm not exactly sure.
The Apple II can only practically run algorithms where the program plus the data structures it uses fits in less than 32KB. That's just not much memory; PCs (and the IIgs) can run much more complex algorithms. I think 12-bit LZW is a pretty good fit.
The reason LZW-compressed data can be compressed even further comes down to the patterns in the upper bits. The first 256 symbols are encoded as 9 bits, but most of those extra bits are 0 (assuming not much compression at first). Looking at the output as bytes, 9 output bytes will contain exactly 8 symbols, so there's a pattern to be extracted at the byte level. Even if LZW is compressing the data in an OK way, there are patterns in these upper bits (and lower bits) of the symbols. The next 512 output symbols are 10 bits, and again, most of those upper bits are not very "random" looking, and now the pattern repeats every 5 bytes. So an algorithm taking more time and memory can find this redundant info and squeeze a little more out of it.
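The "9 output bytes contain exactly 8 symbols" effect is easy to see by packing eight 9-bit codes, each with a zero top bit, and watching the zero bit march through the output bytes (a quick Python sketch; MSB-first packing chosen for the example):

```python
def pack9(codes):
    """MSB-first packing of 9-bit codes into bytes."""
    acc, nbits, out = 0, 0, bytearray()
    for c in codes:
        acc = (acc << 9) | c
        nbits += 9
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    return bytes(out)

# Eight 9-bit codes = 72 bits = exactly 9 bytes. Every code is 0xFF,
# i.e. 0_11111111 in 9 bits, so the single zero bit shifts one position
# to the right in each successive output byte.
out = pack9([0xFF] * 8)
print(len(out), out.hex())  # 9 7fbfdfeff7fbfdfeff
```

Those regularly spaced zero bits are precisely the byte-level redundancy a second compressor can exploit.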
My guess is better compression algorithms can make up 10-20% of the difference
when operating on LZW data. So, if LZW compressed 100KB to 50KB, and gzip/etc.
can compress it to 25KB directly, then gzip on LZW should give 10-20% of
25KB of saving off the 50KB, which is a final size of around 40-45KB.
The compression algorithm in ShrinkIt was designed to meet the following goals:
1. Be able to compress the cracked version of Hitchhiker's Guide to the Galaxy
so the compressed file could fit on a regular ProDOS 5.25" disk.
2. Be fast enough that on compressing, the disk drive would stay on
(so it had to compress at more than 4K bytes/second).
3. Run on any Apple II.
For #2, it didn't meet the goal on a normal Apple II, but does on a IIgs
on Fast (2.8MHz) mode, running the 6502 code.
Straight 12-bit LZW didn't achieve #1, so RLE was used as a pre-pass to get a little extra. Let's look at a 4KB track which is all 0's. LZW's maximum compression is roughly n symbols to encode n*(n+1)/2 bytes, so n=90 symbols for 4KB. At 9 bits for each symbol, that's 101 bytes. RLE can compress 4096 bytes to 16*3=48 bytes. And then LZW on top of RLE gets a little more: about 18 symbols of 9 bits each = 21 bytes total. And some tracks of HGTTG didn't compress at all, since that data was already compressed and so not a good fit for LZW; it actually got longer after LZW. So the algorithm just passes that data through raw and uncompressed. In the end, HGTTG compressed fits on a normal ProDOS 5.25" disk, but it needed all of these tricks to get there.
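The 16*3 = 48 arithmetic for an all-zero track can be reproduced with a simple greedy RLE. This is not ShrinkIt's actual on-disk RLE format (which I haven't checked); it just illustrates the sizing, assuming each run is stored as roughly 3 bytes (marker/value plus a run length capped at 256):

```python
def rle_runs(data: bytes, max_run: int = 256):
    """Greedy RLE: return (value, run_length) pairs, runs capped at max_run."""
    runs, i = [], 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < max_run:
            j += 1
        runs.append((data[i], j - i))
        i = j
    return runs

track = bytes(4096)              # one 4KB track of zeros
runs = rle_runs(track)
print(len(runs), len(runs) * 3)  # 16 48 -> 16 runs, 48 bytes at 3 bytes/run
```

And since those 48 bytes are themselves 16 repeats of the same 3-byte unit, LZW on top of the RLE output has an easy time squeezing out the rest.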
The previous common compressor used for disks was Dalton's Disk Disintegrator,
which used a clever Huffman trick. This is off the top of my head, and it's
30 years old, so the details are likely slightly wrong. DDD had a fixed
Huffman tree for 32 bytes, (or maybe 16), such that it could use 2-8 bits to
encode those. Then all other bytes would take 9 bits each. It had to figure
out which were the most common bytes on a track, and then use its fixed-tree
to encode those to be smaller, and all other bytes get a little longer.
Basically, the most common byte would get the 2-bit encoding, the next most common byte would get a 3-bit encoding, etc. Huffman's big problem is sending the tree to the decompressor, and DDD got it down to just these 32 bytes. But its compression wasn't that great.
Kent
Re: compression [message #373953 is a reply to message #373934]
Sun, 23 September 2018 11:49
Originally posted by: fadden
On Saturday, September 22, 2018 at 9:58:45 PM UTC-7, Kent Dickey wrote:
[...]
Kent!
(For those who may have forgotten, Kent Dickey developed the LZW compression code used in ShrinkIt. He also provided the C implementation used in NuLib.)
> The previous common compressor used for disks was Dalton's Disk Disintegrator,
> which used a clever Huffman trick. [...] DDD had a fixed
> Huffman tree for 32 bytes, (or maybe 16), such that it could use 2-8 bits to
> encode those.
Oddly enough, it's 20 bytes. There's a fairly literal translation of the assembly code here:
https://github.com/fadden/ciderpress/blob/master/diskimg/DDD.cpp#L399
> Huffman's big problem is in
> sending the tree to the decompressor, and DDD got it down to just these 32
> bytes. But it's compression wasn't that great.
IIRC, gzip uses a clever scheme where they output the characters in order, followed by the width of the tree at each level. The alternative is adaptive Huffman, but the overhead of maintaining the model slows everything down.
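That trick works because of canonical Huffman codes: if both sides agree on a rule for assigning code values from code lengths alone, only the lengths (or the tree width per level) need to be transmitted. A Python sketch of the standard assignment rule, which is the same idea Deflate uses:

```python
def canonical_codes(lengths: dict):
    """Build canonical Huffman codes from {symbol: bit_length}.

    Symbols are sorted by (length, symbol); code values are assigned in
    increasing order, left-shifting whenever the length grows. Only the
    lengths need to be stored or sent for the decoder to rebuild this.
    """
    syms = sorted(lengths.items(), key=lambda kv: (kv[1], kv[0]))
    codes, code, prev_len = {}, 0, 0
    for sym, length in syms:
        code <<= (length - prev_len)  # extend code when width increases
        codes[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return codes

print(canonical_codes({"a": 1, "b": 2, "c": 2}))
# {'a': '0', 'b': '10', 'c': '11'}
```

Adaptive Huffman avoids sending any header at all, but as noted, the per-symbol model updates make it much slower.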
DDD Deluxe did a slightly better job of compressing things, but I never did dig into that one.
Re: compression [message #373965 is a reply to message #373953]
Sun, 23 September 2018 15:28
Michael J. Mahon
fadden <thefadden@gmail.com> wrote:
> On Saturday, September 22, 2018 at 9:58:45 PM UTC-7, Kent Dickey wrote:
> [...]
>
> Kent!
>
> (For those who may have forgotten, Kent Dickey developed the LZW
> compression code used in ShrinkIt. He also provided the C implementation used in NuLib.)
Not to mention KEGS (oops, I mentioned it ;-).
--
-michael - NadaNet 3.1 and AppleCrate II: http://michaeljmahon.com
Re: compression [message #373970 is a reply to message #373965]
Sun, 23 September 2018 19:24
David Schmidt
On 9/23/2018 3:28 PM, Michael J. Mahon wrote:
> fadden <thefadden@gmail.com> wrote:
>> On Saturday, September 22, 2018 at 9:58:45 PM UTC-7, Kent Dickey wrote:
>> [...]
>>
>> Kent!
>>
>> (For those who may have forgotten, Kent Dickey developed the LZW
>> compression code used in ShrinkIt. He also provided the C implementation used in NuLib.)
>
> Not to mention KEGS (oops, I mentioned it ;-).
Which forms the foundation and just about all of the guts for pretty
much every GS emulator out there besides Bernie. Thanks again, Kent.
Re: compression [message #373972 is a reply to message #373970]
Sun, 23 September 2018 22:26
Originally posted by: kegs
In article <po97ae$9e2$1@dont-email.me>,
David Schmidt <schmidtd@my-deja.com> wrote:
> On 9/23/2018 3:28 PM, Michael J. Mahon wrote:
>> fadden <thefadden@gmail.com> wrote:
>>> On Saturday, September 22, 2018 at 9:58:45 PM UTC-7, Kent Dickey wrote:
>>> [...]
>>>
>>> Kent!
>>>
>>> (For those who may have forgotten, Kent Dickey developed the LZW
>>> compression code used in ShrinkIt. He also provided the C
> implementation used in NuLib.)
>>
>> Not to mention KEGS (oops, I mentioned it ;-).
>
> Which forms the foundation and just about all of the guts for pretty
> much every GS emulator out there besides Bernie. Thanks again, Kent.
I'm looking into fixing up KEGS so it builds on Mac OS X again (that is, not a
Carbon app any more). I work pretty slowly, so don't expect anything soon.
My first step is delete a lot of useless code--there's a lot in KEGS to make
it fast on 1995 era machines which is all pointless now, and it complicates
my new simpler interface between *driver.c and video.c. And Linux has gone
and removed /dev/dsp, so I have to figure that out, too.
Kent
Re: compression [message #373981 is a reply to message #373972]
Mon, 24 September 2018 09:46
David Schmenk
On Sunday, 23 September 2018 19:29:36 UTC-7, Kent Dickey wrote:
>
> I'm looking into fixing up KEGS so it builds on Mac OS X again (that is, not a
> Carbon app any more). I work pretty slowly, so don't expect anything soon.
> My first step is delete a lot of useless code--there's a lot in KEGS to make
> it fast on 1995 era machines which is all pointless now, and it complicates
> my new simpler interface between *driver.c and video.c. And Linux has gone
> and removed /dev/dsp, so I have to figure that out, too.
>
> Kent
Don't worry, whatever you figure out will get changed in 6 months anyway.
Thanks for all the work you've done for our community!
Dave...
Re: compression [message #374014 is a reply to message #373934]
Mon, 24 September 2018 20:50
Originally posted by: kegs
In article <T5mdnaK_L8TigDrGnZ2dnUU7-I3NnZ2d@giganews.com>,
Kent Dickey <kegs@provalid.com> wrote:
> The compression algorithm in ShrinkIt was designed to meet the following goals:
>
> 1. Be able to compress the cracked version of Hitchhiker's Guide to the Galaxy
> so the compressed file could fit on a regular ProDOS 5.25" disk.
I know this totally doesn't matter, but it's bugging me that I said the
wrong game--the game was Bureaucracy.
Kent
Re: compression [message #374015 is a reply to message #374014]
Mon, 24 September 2018 21:17
Steve Nickolas
On Mon, 24 Sep 2018, Kent Dickey wrote:
> In article <T5mdnaK_L8TigDrGnZ2dnUU7-I3NnZ2d@giganews.com>,
> Kent Dickey <kegs@provalid.com> wrote:
>> The compression algorithm in ShrinkIt was designed to meet the following goals:
>>
>> 1. Be able to compress the cracked version of Hitchhiker's Guide to the Galaxy
>> so the compressed file could fit on a regular ProDOS 5.25" disk.
>
> I know this totally doesn't matter, but it's bugging me that I said the
> wrong game--the game was Bureaucracy.
>
> Kent
Right author, wrong game, wrong version of the Z-machine... xD
-uso.