System.IO.Compression Capabilities [Kim Hamilton]

We often get asked about the capabilities of the .NET compression classes in System.IO.Compression. I'd like to clarify what they currently support and mention some partial workarounds for formats that aren't supported.

At their core, the .NET compression libraries support only one compression format: Deflate. The Deflate format is specified in RFC 1951, and our DeflateStream class is a straightforward implementation of it.

Other formats, such as zlib, gzip, and zip, can use deflate as their compression method, but may also use other methods. When they use deflate, you can think of these formats as wrappers around it: they take the bytes generated by deflate compression and add header info and checksums.

Our GZipStream class does exactly that – it uses DeflateStream and then adds header info and checksums specific to the gzip format. The gzip format is specified in RFC 1952.
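To make the wrapper idea concrete, here is a minimal sketch. It is written in Python rather than against the .NET classes, because Python's zlib module can emit all three framings from the same deflate engine. The helper name compress_with and the sample payload are made up for illustration.

```python
import zlib

def compress_with(wbits: int, data: bytes) -> bytes:
    """Compress data with deflate; the wbits value selects the wrapper."""
    c = zlib.compressobj(6, zlib.DEFLATED, wbits)
    return c.compress(data) + c.flush()

payload = b"hello, deflate! " * 64

raw = compress_with(-15, payload)  # bare RFC 1951 deflate bytes, no wrapper
zl = compress_with(15, payload)    # RFC 1950 zlib wrapper
gz = compress_with(31, payload)    # RFC 1952 gzip wrapper (minimal header)

# zlib adds a 2-byte header and a 4-byte Adler-32 trailer.
# gzip adds a 10-byte header and an 8-byte trailer (CRC-32 plus length).
print("raw deflate bytes:", len(raw))
print("zlib overhead:", len(zl) - len(raw))
print("gzip overhead:", len(gz) - len(raw))
```

Stripping the framing from either wrapped stream should leave the same deflate bytes, which is the sense in which gzip and zlib are "wrappers".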

So, out of the box, we support deflate and gzip formats.

Until we provide support for the other formats, which we plan to do soon, there are partial workarounds that may help you out in some situations, but they're definitely not a complete solution.

Working with zlib

The zlib format is specified by RFC 1950. Zlib also uses deflate, plus 2 or 6 header bytes and a 4-byte checksum at the end. The first 2 bytes (CMF and FLG) indicate the compression method and flags. If the dictionary flag (FDICT) is set, a 4-byte dictionary ID follows, which is why the header is either 2 or 6 bytes. Note that in the wild, preset dictionaries aren't very common (and our classes don't support them).

This diagram from RFC 1950 shows the zlib structure:

           0   1
         +---+---+
         |CMF|FLG|   (more-->)
         +---+---+


      (if FLG.FDICT set)

           0   1   2   3
         +---+---+---+---+
         |     DICTID    |   (more-->)
         +---+---+---+---+

         +=====================+---+---+---+---+
         |...compressed data...|    ADLER32    |
         +=====================+---+---+---+---+

This means that to read a zlib file using only the .NET libraries, you can often just chop off the first 2 bytes and the last 4 bytes and use DeflateStream on the rest of the stream as normal. (It's better to check the dictionary bit first and not attempt to read the stream if it's set: in that case the header is 6 bytes long and the data needs a preset dictionary, which our classes don't support.)
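The chopping technique can be sketched as follows. This is an illustration in Python, not .NET code: zlib.decompress with a window-bits value of -15 plays the role of DeflateStream (a raw deflate decoder), and the function name inflate_zlib_by_hand is hypothetical. The sketch also verifies the Adler-32 trailer, which the plain "chop and inflate" approach skips.

```python
import struct
import zlib

def inflate_zlib_by_hand(stream: bytes) -> bytes:
    """Strip the zlib framing and inflate the raw deflate bytes inside."""
    cmf, flg = stream[0], stream[1]
    if cmf & 0x0F != 8:
        raise ValueError("compression method is not deflate")
    if (cmf * 256 + flg) % 31 != 0:
        raise ValueError("zlib header check failed")
    if flg & 0x20:
        # FDICT is set: a 4-byte DICTID follows and a preset dictionary is needed.
        raise ValueError("preset dictionaries are not supported")

    body = stream[2:-4]                # drop the 2 header bytes and 4 checksum bytes
    data = zlib.decompress(body, -15)  # negative window bits = raw deflate

    (expected,) = struct.unpack(">I", stream[-4:])  # Adler-32 is stored big-endian
    if zlib.adler32(data) != expected:
        raise ValueError("Adler-32 mismatch")
    return data

original = b"abc" * 100
print(inflate_zlib_by_hand(zlib.compress(original)) == original)
```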

Going in the opposite direction isn't as trivial, so I'm not really suggesting that you generate zlib files this way. However, a couple of people have asked in the past, so I'll sketch an overview.

To start, you need to know which bytes to add at the beginning. With our deflate implementation, those bytes are 0x58 and 0x85. If you're curious about how this is derived from RFC 1950, see section 2.2 "Data format" and note that we use a window size of 8K and the value of FLEVEL should be 2 (default algorithm).
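For the curious, the derivation can be checked with a few lines of arithmetic. The sketch below is in Python and follows the field layout in RFC 1950 section 2.2; the function name zlib_header is made up for illustration.

```python
def zlib_header(window_size: int, flevel: int, fdict: bool = False) -> bytes:
    """Build the 2-byte zlib header (CMF, FLG) per RFC 1950 section 2.2."""
    cm = 8                                    # compression method 8 = deflate
    cinfo = window_size.bit_length() - 1 - 8  # log2(window size) - 8
    cmf = (cinfo << 4) | cm
    flg = (flevel << 6) | (int(fdict) << 5)
    # FCHECK (low 5 bits of FLG) makes CMF*256 + FLG a multiple of 31.
    fcheck = (31 - (cmf * 256 + flg) % 31) % 31
    return bytes([cmf, flg | fcheck])

print(zlib_header(8 * 1024, 2).hex())  # 5885
```

With an 8K window, CINFO is 13 - 8 = 5, so CMF is 0x58. FLEVEL 2 sets the top two bits of FLG to give 0x80, and FCHECK has to be 5 to make the 16-bit value divisible by 31, giving 0x85.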

After that, you need to add the Adler-32 checksum at the end. The checksum is computed over the uncompressed payload, so you need to calculate it programmatically as the data is written. Because of this, the easiest way to generate the checksum is to subclass DeflateStream and override the Write/BeginWrite methods to update the checksum. Stephen Toub's NamedGZipStream article (mentioned at the end) shows an example of creating such a subclass for generating named gzip files.
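The same idea, keeping a running checksum inside the write path, can be sketched in Python. This is not a DeflateStream subclass; the ZlibWriter class is a hypothetical stand-in that wraps a raw deflate compressor, and the 8K window (window bits -13) is chosen only so the output matches the 0x58 0x85 header quoted above.

```python
import struct
import zlib

class ZlibWriter:
    """Hypothetical helper: wraps raw deflate output in a zlib envelope."""

    HEADER = bytes([0x58, 0x85])  # 8K window, FLEVEL 2, no preset dictionary

    def __init__(self) -> None:
        # Raw deflate with an 8K window, to agree with the header above.
        self._deflater = zlib.compressobj(6, zlib.DEFLATED, -13)
        self._adler = 1  # Adler-32 starts at 1
        self._chunks = [self.HEADER]

    def write(self, chunk: bytes) -> None:
        # Update the checksum over the *uncompressed* bytes, then compress.
        self._adler = zlib.adler32(chunk, self._adler)
        self._chunks.append(self._deflater.compress(chunk))

    def finish(self) -> bytes:
        self._chunks.append(self._deflater.flush())
        self._chunks.append(struct.pack(">I", self._adler))  # big-endian trailer
        return b"".join(self._chunks)

writer = ZlibWriter()
writer.write(b"a" * 1000)
writer.write(b"b" * 1000)
stream = writer.finish()
print(zlib.decompress(stream) == b"a" * 1000 + b"b" * 1000)
```

A standard zlib decoder accepting the result is a reasonable sanity check that the header and trailer were assembled correctly.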

Working with other compression formats

The big format you're probably thinking about is zip. Currently the .NET libraries don't support zip but the J# class libraries do. The following article describes using these libraries with a C# app.

http://msdn.microsoft.com/msdnmag/issues/03/06/ZipCompression/default.aspx

But if you don't want to rely on the J# class libraries, we'll need to provide a better solution.

Now that you're familiar with some compression specifications, let's focus on zip a little more. A zip specification is here:

http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Notice that zip also allows deflate. Again the same principle applies – there are deflate bytes packaged in a header and footer. This may tempt you into writing a zip reader/writer based on DeflateStream (as described above for zlib), but there are two key differences that make zip more complicated.

First, the zip header contains a lot more information than the zlib header. To read a zip file, you'd definitely have to parse the header to figure out how many bytes to skip over because the header contains variable length items such as a file name.
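As an illustration of what parsing involves, here is a minimal Python sketch that reads one local file header using the field layout in the zip specification linked above. The function name read_local_header and the returned dictionary keys are made up for this example, and it ignores zip64 and encrypted entries.

```python
import io
import struct
import zipfile

# Fixed part of a local file header: 30 bytes, little-endian.
LOCAL_HEADER = struct.Struct("<IHHHHHIIIHH")

def read_local_header(archive: bytes, offset: int = 0) -> dict:
    """Parse the local file header at offset and locate the entry's data."""
    (sig, _version, flags, method, _time, _date, _crc32,
     comp_size, _uncomp_size, name_len, extra_len) = LOCAL_HEADER.unpack_from(archive, offset)
    if sig != 0x04034B50:
        raise ValueError("not a local file header")
    name_start = offset + LOCAL_HEADER.size
    # Names are CP437 unless general-purpose flag bit 11 (UTF-8) is set.
    encoding = "utf-8" if flags & 0x800 else "cp437"
    name = archive[name_start:name_start + name_len].decode(encoding)
    return {
        "name": name,
        "method": method,  # 0 = stored, 8 = deflate
        "compressed_size": comp_size,
        # The data starts after the variable-length name and extra field.
        "data_offset": name_start + name_len + extra_len,
    }

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("hello.txt", "hello " * 100)
info = read_local_header(buf.getvalue())
print(info["name"], info["method"], info["data_offset"])
```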

Second, zip tools actively use different compression methods. For example, use the Windows compression tool on a very small text file (with just a few words in it) and then on a bigger file, say around 20 KB. Chances are it used no compression (yes, that's an option) for the small file and deflate for the 20 KB file.

Because different compression methods are used, an extension of the zlib technique described above may not help you much if you want to use the .NET libraries to read zip files. You'd definitely have to read the compression method to determine how to proceed. If it's deflate, then chop off the header and proceed as above. If it's no compression, chop off the header and read the bytes as a normal stream of bytes. If it's something else, then the .NET libraries have no built-in support for it.
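The decision described above can be sketched like this. Again this is Python standing in for the .NET classes, with zlib.decompress at window bits -15 acting as the raw deflate decoder. The names read_first_entry and make_zip are hypothetical, and the sketch only handles the first entry of an archive whose sizes are stored in the local header.

```python
import io
import struct
import zipfile
import zlib

def read_first_entry(archive: bytes) -> bytes:
    """Return the uncompressed bytes of the first entry in a zip archive."""
    (sig, _ver, flags, method, _time, _date, _crc, comp_size, _size,
     name_len, extra_len) = struct.unpack_from("<IHHHHHIIIHH", archive, 0)
    if sig != 0x04034B50:
        raise ValueError("not a local file header")
    if flags & 0x08:
        # Sizes live in a trailing data descriptor; not handled in this sketch.
        raise ValueError("data descriptor entries are not supported")

    start = 30 + name_len + extra_len  # skip fixed header, name, extra field
    body = archive[start:start + comp_size]

    if method == 0:                    # stored: the bytes are the file
        return body
    if method == 8:                    # deflate: inflate the raw stream
        return zlib.decompress(body, -15)
    raise NotImplementedError(f"compression method {method} is not supported")

def make_zip(method: int, text: str) -> bytes:
    """Build a one-entry archive in memory for the demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", method) as zf:
        zf.writestr("a.txt", text)
    return buf.getvalue()

print(read_first_entry(make_zip(zipfile.ZIP_STORED, "hi")))
print(len(read_first_entry(make_zip(zipfile.ZIP_DEFLATED, "x" * 20000))))
```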

Additional Note: Using WinZip with our GZipStream

Stephen Toub observed in an MSDN article that WinZip can't handle the output of our GZipStream because it requires filename info in the gzip header. He's created a NamedGZipStream implementation that generates files readable by WinZip:
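For reference, the filename lives in an optional header field defined by RFC 1952. The Python sketch below builds a gzip stream by hand with that field set. It is not the NamedGZipStream code from the article, and the function name named_gzip is made up for illustration.

```python
import gzip
import struct
import zlib

def named_gzip(name: str, payload: bytes) -> bytes:
    """Build a gzip stream whose header carries the original file name."""
    FNAME = 0x08  # FLG bit 3: a zero-terminated file name follows the header
    # ID1, ID2, CM=8 (deflate), FLG, MTIME=0 (none), XFL=0, OS=255 (unknown)
    header = struct.pack("<BBBBIBB", 0x1F, 0x8B, 8, FNAME, 0, 0, 255)

    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate body
    body = deflater.compress(payload) + deflater.flush()

    # Trailer: CRC-32 of the uncompressed data, then its length modulo 2^32.
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + name.encode("latin-1") + b"\x00" + body + trailer

blob = named_gzip("report.txt", b"data" * 50)
print(gzip.decompress(blob) == b"data" * 50)
```

Note that the length field in the trailer is only 32 bits wide, so it stores the size modulo 2^32.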

http://msdn.microsoft.com/msdnmag/issues/05/10/NETMatters/

Our Future Compression Plans

We'd like to address the shortcomings of our compression library in future releases. The following items are our highest priority compression requests:

  • Support for more formats, such as ones described above
  • Better compression ratio
  • Better compression speed

Are there any others you'd like us to address?

  • Support the new format ZIP files that allow >4GB (Both the new WinZip & PKWare formats) and AES Encryption.

    Support GZIP files >4GB (This would be a simple bug fix).  There should be no limit on how big a gzip file can be.

  • Other formats :

    Bzip2 format - patent free, better compression than zip & gzip.

    RAR Format

  • Please support LZMA which is the algorithm used in the 7z or 7-Zip format.  It's faster than zip with higher compression ratios.

  • I second the 4GB limit problem.  I bugged it at https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=94784 over a year ago...

    Stronger GZip compression would be nice, too...

    If you can do both of these, I won't have to use the open-source SharpZipLib anymore :)

    BZip2 support would be welcome.

  • Can't believe you didn't mention the horrible 4GB limitation!  This really needs fixing because at present solutions are getting developed in .NET that explode without warning in production when files get too big (I know from experience...)

  • I would love to see the RAR format implemented.

  • Fixing the limit and RAR support would be brilliant.

  • Are the team going to ensure that classes in this namespace can be used under partial trust ASP.NET?

    Kev

  • Compressing a directory is an extremely common use case.  I would love to be able to do this in a line or two of code.  Even if it is not appropriate for all compression formats I believe the tradeoff is worth it.

  • What about the self-extracting feature?

  • Why point people to Java when they can use #ZipLib?

  • Thanks everyone, this is great feedback. Some notes, and first a very important clarification:

    Adi - are you trying to get my blogging license revoked? :) I can see it on slashdot now: "CLR developer encourages users to switch to Java."

    Very important to clarify -- I'm encouraging people to check out the technique shown in the MSDN article using the J# libraries, which are a Microsoft product :), and can be used from C# apps...also available courtesy of Microsoft (and Anders et al)...which hopefully they're editing in Visual Studio... You get the point.

    About 4GB support: yes, this has definitely been in our plans too -- sorry I left it off the list. But it's great to know how high this is in relative priority!

    Kev - just to back up a second, the compression classes are transparent; security issues are pushed to file open and creation. E.g. if you can open a file then you can compress or decompress, because the compression classes only deal with the file as a stream. But the file open/creation part is the part that could block an ASP.NET app. We haven't planned any changes to this security model so far, but if there are any particular scenarios you're interested in, let us know.

    Brent - yes, we'd include the ability to compress directories along with zip support.

    Formats - While we can't say for certain exactly which formats we'll include in the near timeframe, these replies have given us an excellent sense of which ones people want to work with.

  • Fastest compression on most types of data:  LZO/NRV (oberhumer.com).  QuickLZ is maybe even faster, but it's not as "proven" as LZO/NRV.

    These are open source, but there are commercial licenses available, and I'm sure the authors wouldn't mind their compression algorithms being included in the .NET BCL. :)

  • The 4GB issue is definitely top priority.

  • I'd like the support for using Stream to read/seek over ISO9660/UDF (cd images), RAR. Writing wouldn't matter so much because:

    Currently you need to install all kinds of drivers and stuff to deal with filesystem images and as seen from Month of Apple Bugs that's one area with a lot of potential for escalation exploits. With ability to easily deal with the images through .net and powershell in fully managed way you would have both security and ability to easily do processing over images in remote servers.
