In essence, file-compression programs simply get rid of the redundancy. Instead of listing a piece of information over and over again, a file-compression program lists that information once and then refers back to it whenever it appears in the original program.


Most compression programs use a variation of the LZ adaptive dictionary-based algorithm to shrink files. "LZ" refers to Lempel and Ziv, the algorithm's creators, and "dictionary" refers to the method of cataloging pieces of data.


For the following sentence:

    "Ask not what your country can do for you -- ask what you can do for your country.",

the dictionary will be:

    1 ask

    2 what

    3 your

    4 country

    5 can

    6 do

    7 for

    8 you

and the compressed sentence will be:

    "1 not 2 3 4 5 6 7 8 -- 1 2 8 5 6 7 3 4"


While the example above conveys the raw idea of how compression is done, in reality, the compression programs are not splitting source data into words, but rather look for patterns.  To achieve good compression ration, not all patterns will become dictionary items.  Instead of the dictionary above, it's could be:

    1 ask__

    2 what__

    3 you

    4 r__country

    5 __can__do__for__you

and the compressed sentence spelled as:



Now you can see why text files and computer program's source code compresses well (high rate of redundancy), and graphics and MP3 files do not (include a lot of unique information).


Some programs are particularly suited to picking up patterns in certain types of files.  For example, XML compresses very well with gzip.


If the compression/decompression algorithm lets you recreate the original file exactly, it's called lossless compression.  Lossy compression eliminates "unnecessary" bits of information.  This type of compression is used a lot for reducing the file size of bitmap pictures.  While large parts of the picture may look the same -- the whole sky is blue, for example -- most of the individual pixels are a little bit different.