12.1: Techniques for Compressing Text or Storing Records

Many important techniques are available for the compression of text. Text compression is important when dealing with large collections of information, since internal memory is scarce and access to external memory is slow. Huffman coding is an elegant method that can be used profitably when characters do not appear with equal frequency in a text. This is the case with the letters of the alphabet in English text.

Binary trees can be used as the basis for encoding text and for the creation of Huffman codes. Binary search trees can also be used to generate more quickly codes of comparable efficiency.

The application of "simple" and balanced binary search trees for record storage and retrieval by key was demonstrated in Chapter 9. As shown there, "simple" binary search trees provide excellent average search times with random input when keys are accessed equally often. Also, AVL trees guarantee excellent worst-case search time. We shall see in this chapter that optimal binary search trees are useful when the keys have known, even unequal, frequencies, and when the trees, once grown, remain unchanged. Words in English text occur with known and unequal frequencies. An optimal binary search tree could be applied, for example, as a dictionary storing the 500 most frequently used English words, along with their definitions and their common misspellings. A search of the tree would then yield the definitions and misspellings. Once built, this tree would remain unchanged.

This chapter shows that the problems of finding Huffman codes and of finding optimal or nearly optimal binary search trees are related. The solutions illustrate the use of trees, heaps, and list-structures in the construction of good algorithms. The chapter also introduces the "greedy" heuristic. This method leads to an optimal solution for finding Huffman codes and to a nearly optimal solution for finding binary search trees. In both cases, the proper selection of data structures plays a significant role.