Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site ames.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!mhuxl!houxm!houxz!vax135!cornell!uw-beaver!tektronix!hplabs!ames!jaw
From: j...@ames.UUCP (James A. Woods)
Newsgroups: net.unix-wizards,net.research
Subject: Lempel-Ziv Compression Made Swift (method of J. A. Woods), part 1
Message-ID: <425@ames.UUCP>
Date: Thu, 26-Jul-84 00:39:02 EDT
Article-I.D.: ames.425
Posted: Thu Jul 26 00:39:02 1984
Date-Received: Sat, 21-Jul-84 03:39:50 EDT
Organization: NASA-Ames Research Center, Mtn. View, CA
Lines: 109

#  "It's fresh squozen daily."

        -- official slogan of the Ames fast file finder
           (yet another data compression technique)

     Herein we report on a simple and fleet implementation of the LZW
algorithm now making the rounds on UNIX.  For equivalent compression rates
(2:1 for unix-wizards, 3:1 and more for numerical data), it runs at least
1.6 times faster for large files than the code offered by Spencer Thomas.
This is a C language comparison only; a greater speed differential would be
expected if both versions were written in machine language.  This method may
be impractical for output codes of length greater than 13 bits, though,
fortunately, the basic algorithm offers only diminishing returns beyond this.

     Like Mr. Thomas, I adapted the Welch implementation to a different data
structure, believing the hashing method reported in the June IEEE Computer
to be "too slow".  Essentially, the code below is a direct-access lookup of
the alternate string table pictured in the article.  There are 2 ** BITS
codes which serve as prefixes to be associated with any of 256 ASCII
characters.  For the typical BITS == 12, this looks unwieldy, but it is
actually workable on a 32-bit processor with the enhancements I describe.
A first pass took the form:

        short table [(1 << BITS) * 256];        /* 2 megabytes for BITS = 12 */

        squash ( )
        {
                register int K, i, p, prefix;
                register long code;

                bzero ( (char *) table, sizeof table ); /* clear 2 MB, but see below */
                for ( i = 0; i < 256; i++ )
                        table [i] = i;
                prefix = getchar ( );

                while ( (K = getchar ( )) != EOF ) {
                        code = (long) ((K << BITS) + prefix);
                        /* test for code in "string" table */
                        if ( (p = table [code]) != 0 )
                                prefix = p;
                        else {                          /* put code in table */
                                output ( prefix );
                                prefix = K;
                                if ( i < (1 << BITS) )  /* unless we ran out of codes */
                                        table [code] = (short) i++;
                        }
                }
                output ( prefix );                      /* last one */
        }

     The table initialization time (is bzero() the fastest way to do this in
C?) is disconcerting -- some 6 seconds of system time (2 MB) on our 11/750.
This is unacceptable for small files.  It turns out that, using a variant of
the sparse array initialization trick discussed by J. Bentley (CACM, 8/83),
we can associate a bit vector with the large code table, and save time by
testing a bit in this map before each read access to the table, and by
setting the bit at each write.  The macros provided in M. D. McIlroy's now
defunct V7 "spell" code can be used for these bit tests:

        #define SHIFT   5
        #define get(h)  (bits[h>>SHIFT] & (1<<((long)h&((1<<SHIFT)-1))))
        #define set(h)  (bits[h>>SHIFT] |= 1<<((long)h&((1<<SHIFT)-1)))

How's that for spiff?  It compiles into just a few VAX instructions and
beats the mod operator hands down.  If you give up, look at p. 47 in K & R.
Though probabilistic Bloom filtering for spelling error detection fades
away, the bit macros live on!  Integrating this into the fragment above is
omitted here for clarity, but a sketch follows.
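     For concreteness, here is one way the bit map might be folded into the
fragment above.  This is only a sketch, not the code timed below: the main()
wrapper, the printf() stand-in for Thomas's output() routine, and the exact
placement of the set() calls are assumptions of the illustration.

        #include <stdio.h>
        #include <string.h>             /* memset(); 4.2 BSD spells it bzero() */

        #define BITS    12
        #define SHIFT   5
        #define get(h)  (bits[(h)>>SHIFT] &  (1L<<((long)(h) & ((1<<SHIFT)-1))))
        #define set(h)  (bits[(h)>>SHIFT] |= (1L<<((long)(h) & ((1<<SHIFT)-1))))

        short table [(1 << BITS) * 256];        /* 2 MB for BITS = 12, never cleared */
        long bits [((1 << BITS) * 256) >> SHIFT]; /* one validity bit per entry: 128 KB */

        static void output (int code)           /* stand-in for Thomas's code emitter */
        {
                printf ("%d\n", code);
        }

        static void squash (void)
        {
                register int K, i, p, prefix;
                register long code;

                memset (bits, 0, sizeof bits);  /* clear 128 KB of bits, not 2 MB of table */
                for (i = 0; i < 256; i++) {     /* codes 0..255 are the single characters */
                        table [i] = i;
                        set (i);
                }
                prefix = getchar ();

                while ((K = getchar ()) != EOF) {
                        code = (long) ((K << BITS) + prefix);
                        /* an entry counts only if its validity bit is on */
                        if (get (code) && (p = table [code]) != 0)
                                prefix = p;
                        else {                  /* new string: emit prefix, maybe assign a code */
                                output (prefix);
                                prefix = K;
                                if (i < (1 << BITS)) {
                                        table [code] = (short) i++;
                                        set (code);
                                }
                        }
                }
                output (prefix);                /* last one */
        }

        int main (void)
        {
                squash ();
                return 0;
        }

The bit map is 1/16 the size of the table, so even clearing it is cheap
compared with the 6 seconds quoted above.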
     Another subtle point:  the indexing

        code = (long) ((K << BITS) + prefix);

was originally, in keeping with the spirit of Welch's alternate string table
exposition,

        code = (long) ((prefix << BITS) + K);

But we discovered that (on a virtual memory machine, at least) the first form
easily halves (!) system and user time (and space) for text files, since, in
the absence of parity-marked characters, the array accesses cluster neatly in
the top half (one meg here) of the array.  Just for a lark, I also wired in
special-case code to keep prefix / character 040 (space) pairs in their own
fast storage, since net English [sic] (and also numerical tables) runs about
15% space; a sketch of what that might look like follows the timings below.
With these modifications, and using Thomas's exact output() routine but with
a doubled fwrite() buffer size, the following obtains:

        Input:    200 KB unix-wizards
        Machine:  VAX 11/750, BSD 4.2

        program                    user     system     %
                                   time     time       compression

        pack                       15.2       4.4      34.0    Sys V.1 port
        compress (16 bit)          45.2      13.2      59.6
        compact                   221.2       4.9      34.0
        compress (12 bit)          35.0       1.4      43.1    assembler output()
        compress (13 bit)          35.7       1.8      48.3        function
        squash (12 bit)            18.4       2.9      43.1           "
        squash (13 bit)            19.1       4.5      48.3           "

     Comments, largely of a philosophical nature, are continued in part 2 ...
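     (Before moving on: the special-case storage for space pairs was not
shown above, so the fragment below is only a guess at its shape.  The
space_table[] name and its placement at the top of the while loop in the
earlier squash() sketch are assumptions; only the 15%-space observation is
the author's.)

        #define SPACE   040                     /* ASCII space */

        short space_table [1 << BITS];  /* 8 KB: one slot per prefix code; clear it
                                           alongside the bit map, cheaply */

        /* inside the while loop, before the general 2 MB table lookup: */

                if (K == SPACE) {
                        if ((p = space_table [prefix]) != 0)
                                prefix = p;     /* (prefix, ' ') already has a code */
                        else {
                                output (prefix);
                                if (i < (1 << BITS))
                                        space_table [prefix] = (short) i++;  /* index by the OLD prefix */
                                prefix = SPACE;
                        }
                        continue;               /* skip the big table entirely */
                }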
Relay-Version: version B 2.10 5/3/83; site utzoo.UUCP
Posting-Version: version B 2.10.1 6/24/83; site ames.UUCP
Path: utzoo!watmath!clyde!burl!ulysses!mhuxl!houxm!houxz!vax135!cornell!uw-beaver!tektronix!hplabs!ames!jaw
From: j...@ames.UUCP (James A. Woods)
Newsgroups: net.unix-wizards,net.research
Subject: Lempel-Ziv Compression Made Swift, part 2
Message-ID: <426@ames.UUCP>
Date: Thu, 26-Jul-84 00:40:42 EDT
Article-I.D.: ames.426
Posted: Thu Jul 26 00:40:42 1984
Date-Received: Sat, 21-Jul-84 03:40:34 EDT
Organization: NASA-Ames Research Center, Mtn. View, CA
Lines: 83

#  "They squeeze the orange and throw away the skin."

        -- Voltaire, 1751 (letter referring to a quarrel
           with Frederick the Great)

     Last episode, I remarked on an alternative implementation of LZ coding,
proposing a rather memory-intensive structure that is inherently faster than
hashing or linked lists.  The tests, by the way, were done on a medium-loaded
(10 users) 4 megabyte 11/750 running 4.2 BSD.  For all of the program runs,
there were only a few (< 7) page faults across all compressor types.  Lesson:
don't let the high I/O usage stats reported by the "time" command scare
you -- the bottom line is real time.

     What is the use of FAST compression, you say?  Phone lines only run at
9600 baud or so.  We've got a > 50% method in LZW, so why mess with it?
Good questions.  But there are applications beyond squeezing blood out of
net.flame over modems.  Even here, Thomas's code won't work on small address
space machines for N > 12.  This brings up the point that maybe Lempel-Ziv
shouldn't be wired down into a utility until it works for the PDP 11 (or
IBM PC/XT!) uucp sites at reasonable rates.  Maybe someone should look into
hashing with open addressing instead of memory-consuming chaining -- there
are lots of interesting time/space tradeoffs to investigate (a rough sketch
of the idea is appended after part 2).  I rejected hashing initially because
of a slow malloc(), and because computing good hash functions in software
generally requires more time than scanning characters or links (cf. the
discussion of hashing vs. tries [for that nifty "on-line" spelling checker]
in Aho/Ullman's Data Structures and Algorithms).  But "tries" have their own
problems ...

     Anyway, I think there's a niche for fast code, and, at a minimum, I
intend to #ifdef my speedups into Spencer's code along with mcvax!jim's
portability mods (oops, I hope he used masks rather than assuming the
existence of a barrel shifter), so a VAX (or 68K/32032/RISC) can communicate
with the dinosaurs at BITS = 12.  While we're at it, why not use the Berkeley
"compact" command line interface (filename extensions for friendly in-place
file compression in addition to pipe capability), and its name (we all agree
that the compact algorithm is a turkey)?  That way, we don't need more manual
pages for compress/squeeze/squish/squash -- just do surgery on compact's guts
and give it back to UCB.  Also, if we're going to let users play with
compression rates, that magic number had better carry the BITS parameter, for
idiot-proof decompression.  As for benchmark files, try the 200K
/usr/dict/words, or the System V version with the extra "carburetor" but no
"tofu".

     Some other applications:

     (1)  Macintosh screen compression.  Run-length coding accessed from the
Toolbox gets pictures down to 10K or so, but maybe some form of LZ can
postprocess this stuff even further for those tiny diskettes.

     (2)  Backup tapes.  Welch reports 2-3X savings system-wide.  Let's see:
"tar/dump" goes about 3 meg / minute = 50K bytes / second at 6250 density.
Hmm, we don't have a time payoff unless we're running a mainframe mismatched
with slow tapes.  But it'd still be fun to wire an assembly-coded version
(no memory contention running single-user, after all) into dump/restore.
One tape is still better than two, albeit sluggish.  I predict the LZ method
will be put (invisibly) into tape controllers by the Hardy boys.

     (3)  Scientific data.  Lempel-Ziv buys 3:1 for ANOVA columnar
statistics, and certain random-access data bases (containing mostly zeroes)
run at 8:1!  We never did want those removable drives the division is
pushing, anyway.

     (4)  Slow networks.  What does 4.2 TCP/IP over Ethernet do -- 50K bytes
/ second?  Or have that Cray/Denelcor UNIX ship compressed data to its
frontend, or to its slow disks.

--  James A. Woods   {dual,hplabs,hao}!ames!jaw   (j...@riacs.ARPA)

P.S.  Data compression has obsessed me since my days as a kid in grad school,
where I did music coding (adaptive transform coding on an 11/40 with no
floating point, running UNIX Version (not System) 5).  If you're going for
optimality, try Chaitin/Kolmogorov universal machine complexity theory
(except it's worse than NP-complete, being undecidable).  Some variants of
linear LZ are indeed NP-complete (Storer/Szymanski [author of pack?!?] at
Bell Labs, are you there?).  But for practicality, hats off to Lempel and
Ziv.  For the first time in 20-odd years, we have something to beat the
pants off of Huffman codes (perhaps the most famous master's thesis of 1952).
In a way, it's like besting Bob Beamon's long jump.  Tres recherche!
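     As a rough illustration of the open-addressing idea floated in part 2
(an editorial sketch only: the 5003-entry table, the hash, and the printf()
stand-in for output() are invented for the example, and nothing here was
benchmarked above), the (prefix, character) pairs can be probed linearly in
a table small enough for a 16-bit data space:

        #include <stdio.h>

        #define BITS      12
        #define MAXCODE   (1 << BITS)
        #define HASHSIZE  5003          /* a prime comfortably above MAXCODE */
        #define EMPTY     (-1L)

        static long  keys  [HASHSIZE];  /* packed (prefix << 8 | character), or EMPTY */
        static short codes [HASHSIZE];  /* code assigned to that pair */

        /* linear probing: return the slot holding key, or the empty slot where it belongs */
        static int probe (long key)
        {
                int h = (int) (key % HASHSIZE);

                while (keys[h] != EMPTY && keys[h] != key)
                        if (++h == HASHSIZE)
                                h = 0;
                return h;
        }

        int main (void)
        {
                int K, prefix, next = 256, slot, i;
                long key;

                for (i = 0; i < HASHSIZE; i++)
                        keys[i] = EMPTY;        /* ~20 KB to clear, not 2 MB */

                if ((prefix = getchar ()) == EOF)
                        return 0;
                while ((K = getchar ()) != EOF) {
                        key = ((long) prefix << 8) | K;
                        slot = probe (key);
                        if (keys[slot] != EMPTY)
                                prefix = codes[slot];   /* string already has a code */
                        else {
                                printf ("%d\n", prefix);        /* stand-in for output() */
                                if (next < MAXCODE) {           /* table never fills HASHSIZE */
                                        keys[slot] = key;
                                        codes[slot] = (short) next++;
                                }
                                prefix = K;
                        }
                }
                printf ("%d\n", prefix);
                return 0;
        }

Whether this beats the 2 MB direct lookup on a VAX is exactly the sort of
time/space tradeoff part 2 suggests investigating.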