Working with Zip Files in .NET [Richard Lee]

Before getting started, I’ll introduce myself. My name is Richard Lee, and I’m a developer intern on the BCL for the summer. I’ve only been here for a few weeks, but it’s been great working here. The people, the environment, and my project are all great. Speaking of which, my project is to add general-purpose .NET APIs for reading and writing Zip files, which we’re considering adding to the next version of the .NET Framework.

The most common Zip tasks are extracting to a directory and archiving a directory. For these mainline scenarios, we have static convenience methods. The following code takes all of the files in the Zip file, photos.zip, and extracts them to a folder on the file system:

ZipArchive.ExtractToDirectory("photos.zip", @"photos\summer2010");

This code does the reverse, putting all of the files in the folder into the Zip file:

ZipArchive.CreateFromDirectory(@"docs\attach", "attachment.zip");

For more sophisticated manipulations of Zip archives, there are two main classes. ZipArchive represents a zip archive, which is a collection of entries, and ZipArchiveEntry represents an archived file entry. The following code extracts only text files from the given archive.

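using (var archive = new ZipArchive("data.zip"))
{
    // 'directory' is the destination folder; it is assumed to be defined elsewhere.
    foreach (var entry in archive.Entries)
    {
        // Match the extension case-insensitively so "NOTES.TXT" is also extracted.
        if (entry.FullName.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
        {
            entry.ExtractToFile(Path.Combine(directory, entry.FullName));
        }
    }
}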

Zip archives can also be created on the fly. This example creates a new archive containing two entries: a readme whose contents are written directly into the archive, with no corresponding file on disk, and a file copied from the file system.

using (var archive = new ZipArchive("new.zip", ZipArchiveMode.Create))
{
    var readmeEntry = archive.CreateEntry("Readme.txt");
    using (var writer = new StreamWriter(readmeEntry.Open()))
    {
        writer.WriteLine("Included files: ");
        writer.WriteLine("data.dat");
    }

    archive.CreateEntryFromFile("data.dat", "data.dat");
}
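
The same pattern should also work without touching the disk at all, by using the Stream-based constructor from the API listing further down. The following is only a sketch, and it rests on an assumption: it is written against System.IO.Compression as it later shipped, which has the ZipArchive(Stream, ZipArchiveMode, Boolean) constructor, CreateEntry, GetEntry and Open with the shapes shown in this post. The class and method names (InMemoryZipSketch, BuildArchive, ReadEntryText) are made up for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class InMemoryZipSketch
{
    // Builds a one-entry archive entirely in memory and returns the raw Zip bytes.
    public static byte[] BuildArchive()
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true keeps 'buffer' usable after the archive is disposed.
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            {
                var readmeEntry = archive.CreateEntry("Readme.txt");
                using (var writer = new StreamWriter(readmeEntry.Open()))
                {
                    writer.Write("Included files: data.dat");
                }
            }

            // The archive is only complete once it has been disposed,
            // so the bytes are copied out after the inner using block.
            return buffer.ToArray();
        }
    }

    // Opens the bytes in Read mode and returns the text of one entry.
    public static string ReadEntryText(byte[] zipBytes, string entryName)
    {
        using (var archive = new ZipArchive(new MemoryStream(zipBytes), ZipArchiveMode.Read))
        {
            var entry = archive.GetEntry(entryName);
            using (var reader = new StreamReader(entry.Open()))
            {
                return reader.ReadToEnd();
            }
        }
    }

    public static void Main()
    {
        byte[] zipBytes = BuildArchive();
        Console.WriteLine(ReadEntryText(zipBytes, "Readme.txt"));
    }
}
```

The byte array returned by BuildArchive could just as well be written to a database column or a network stream, since nothing in the sketch depends on a file path.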

The ZipArchive class supports three modes:

  1. In Read mode, data is read from the file on demand, using only a small buffer.
  2. In Create mode, data is written directly to the file using only a small buffer. Only one entry may be held open for writing at a time.
  3. In Update mode, it is possible to read from and write to existing archives, as well as rename or delete entries. This mode requires loading the entire archive into memory, so we recommend using it only with small archives, and only when this functionality is needed.
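
To make the three modes concrete, here is a rough sketch that walks one in-memory archive through Create, Update and Read in turn. It assumes the behavior of System.IO.Compression as it later shipped, where Update mode needs a stream that can be read, written and seeked; the design described in this post may differ in its details. ZipModesSketch and RunAllModes are hypothetical names.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

public static class ZipModesSketch
{
    // Returns the entry names left in the archive after all three modes have run.
    public static string[] RunAllModes()
    {
        using (var buffer = new MemoryStream())
        {
            // Create mode: entries are written one after another,
            // and each entry stream is closed before the next one is opened.
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            {
                foreach (var name in new[] { "a.txt", "b.txt" })
                {
                    using (var writer = new StreamWriter(archive.CreateEntry(name).Open()))
                    {
                        writer.Write("contents of " + name);
                    }
                }
            }

            // Update mode: remove one existing entry and add a new one.
            // The changes are written back to 'buffer' when the archive is disposed.
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Update, true))
            {
                archive.GetEntry("a.txt").Delete();
                using (var writer = new StreamWriter(archive.CreateEntry("c.txt").Open()))
                {
                    writer.Write("added during update");
                }
            }

            // Read mode: list what is left, sorted so the result does not
            // depend on the order in which entries are stored.
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Read, true))
            {
                return archive.Entries
                    .Select(e => e.FullName)
                    .OrderBy(n => n, StringComparer.Ordinal)
                    .ToArray();
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", RunAllModes()));
    }
}
```

Passing true for leaveOpen in every step is what lets the same MemoryStream be reopened in the next mode.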

Below is our current thinking on what the public API listing will look like (note that this hasn’t been finalized yet).

namespace System.IO.Compression
{
    public enum ZipArchiveMode { Read, Create, Update }

    public class ZipArchive : IDisposable {
        // Constructors
        public ZipArchive(String path);
        public ZipArchive(String path, ZipArchiveMode mode); 
        public ZipArchive(Stream stream);
        public ZipArchive(Stream stream, ZipArchiveMode mode);
        public ZipArchive(Stream stream, ZipArchiveMode mode, Boolean leaveOpen);

        // Properties
        public ReadOnlyCollection<ZipArchiveEntry> Entries { get; }
        public ZipArchiveMode Mode { get; }
        
        // Instance methods
        public ZipArchiveEntry GetEntry(String entryName);
        public ZipArchiveEntry CreateEntry(String entryName);

        public void Dispose();
        protected virtual void Dispose(Boolean disposing);

        public override String ToString();

        // Instance convenience methods
        public ZipArchiveEntry CreateEntryFromFile(String sourceFileName, String entryName);

        public void ExtractToDirectory(String destinationDirectoryName);

        // Static convenience methods
        public static void CreateFromDirectory(String sourceDirectoryName, String destinationArchive);
        public static void CreateFromDirectory(String sourceDirectoryName, String destinationArchive, Boolean includeBaseDirectory);

        public static void ExtractToDirectory(String sourceArchive, String destinationDirectoryName);
    }

    public class ZipArchiveEntry {
        // Properties
        public DateTimeOffset LastWriteTime { get; set; }
        public String FullName { get; }
        public String Name { get; }
        public Int64 Length { get; }
        public Int64 CompressedLength { get; }
        public ZipArchive Archive { get; }

        // Methods
        public Stream Open();
        public void Delete();
        public void MoveTo(String destinationEntryName);

        // Convenience methods
        public void ExtractToFile(String destinationFileName);
        public void ExtractToFile(String destinationFileName, Boolean overwrite);

        public override String ToString();
    }
}
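
As a small illustration of the ZipArchiveEntry properties in the listing, the sketch below writes one entry under a folder-style name and reads its metadata back. It assumes the shipped System.IO.Compression behavior, where Name is the part of FullName after the last path separator and Length is the uncompressed size in bytes; the proposal above is not final, so treat this as an approximation. EntryInfoSketch and Describe are hypothetical names.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class EntryInfoSketch
{
    // Creates a single entry in memory, then reports its Name, FullName and Length.
    public static string Describe()
    {
        using (var buffer = new MemoryStream())
        {
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, true))
            {
                var entry = archive.CreateEntry("docs/readme.txt");
                using (var writer = new StreamWriter(entry.Open()))
                {
                    writer.Write("hello zip"); // 9 ASCII characters
                }
            }

            // Sizes are read back in Read mode, after the archive has been completed.
            using (var archive = new ZipArchive(buffer, ZipArchiveMode.Read, true))
            {
                var entry = archive.GetEntry("docs/readme.txt");
                return entry.Name + "|" + entry.FullName + "|" + entry.Length;
            }
        }
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```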

We would love to hear what you think of the APIs so far, and how you plan on using them.

  • I would make the API based on Stream objects instead of working with files directly. There are a lot of cases where you need to zip something in memory, e.g. for writing to a database or to a network socket.

  • It's unclear from the API whether you'd be able to add an empty folder to the zip file, which could be useful. I second being able to feed a stream into an entry in addition to a file.

  • The convenience methods would be much more convenient if they took a standard file mask as the third argument (*.png, *.jpg, etc.).

    Also, not to poo-poo a nice addition, but perhaps support for other compression formats?  At least in the form of creating an abstract base class "Archive" that ZipArchive implements for Zip.  That way we could derive and create for example SevenZipArchive and RarArchive.

  • It's not clear to me what ZipArchiveEntry::MoveTo() is used for, is it used to reuse the object, but point it at a different file in the archive?   That seems messy to me, and problematic from a usage standpoint.

  • I definitely +1 the base class or interface suggestion. Zip archives are great and widely supported but it'd be nice to have the flexibility to code implementations for other formats and have a common base.

    Also, I think ZipArchiveEntry.Open should take some kind of FileMode flag to specify what kind of operations are to be supported. Other than that, great! It'll be a nice addition to the framework.

  • I also agree with the stream comments.  Also, will the ZipArchive class support multiple zip compression methods, or will this be a “Compressed Folder”?  It would be nice if it supported multiple methods and the ability to specify them, including the support of zip encryption methods.  I work in health care, and this would be a great feature of the API to help comply with security regulations.

  • SharpZipLib is pretty good, but it'd be great to have structured ZIP support as part of the framework (obviously stream based as above comments; file stuff could just be convenience/extension methods).

  • For the Update mode, I'd prefer to see it based around a streaming approach: that is, having some way of having the archive buffer delta operations, then apply the deltas en masse as a transaction, reading from the source zip and writing to an output zip, with fixed memory usage. But such a lazy / delayed architecture may need a parallel / shadow API to distinguish it from the immediate mode of normal reading operations. You could also consider a log-peeking approach; that is, in your Update mode, you can still read e.g. a file that you just added, but internally it would read the data from the log of deltas, not from the underlying zip file (which wouldn't have been updated yet).

    I am presuming that you already have Stream overloads in mind, as restricting operations to files on disk would be ridiculously limiting. Both the ZIP itself and the entries should be readable and writable through Stream. For robustness, I'd consider writing these streams out to temporary files on disk to avoid them hogging memory when Updating or Adding - an appending Add mode ought not require as much work of rewriting.

    On support for other archive formats, I would advise not being too aggressive with base classes or interfaces, as the risks of over-abstraction, overengineering, interfaceitis, etc. are high. Actually, I would make your zip classes sealed with no general base class or interface at all *until* you have at least two alternate implementations, and the best approach for interface parity is clear.

    Finally, it would be nice to have a .ZIP format with at least rudimentary support for rewriting the zip index, the bit at the end of the zip file, by scanning through the zip entries.

  • Please, please don't forget testability. The base class / interface suggestion would be ideal - if you're hard set on static convenience methods just don't lock the rest of us out that unit test our code heavily and use mocking frameworks. It would be nice, for once, to be able to use a BCL offering without having to wrap it in something that allows me to test my code without resorting to things like TypeMock - example: DateTime.UtcNow.

    Also, thanks for being transparent about the design process - good to see!

  • sevenziplib.codeplex.com

    Although I haven't looked into it, their examples seem to have the right "feel" to them. I do like the fact that LINQ can be used.

    I second (third, fourth, whatever) an open architecture so that concepts like finding the internal directory, encoding and decoding the contents of an element, and so on, are exposed. IOW, have a CompressedArchive abstract class, of which traditional .zip files are processed by one set of child classes, but other formats (.cab, .rar, etc.) could also be implemented (presumably by third parties) and then processed with the same set of APIs.

    Oh, and make sure that long (32K) path names inside the files are supported.

    Also, you don't say what exceptions you might throw. For example, what if the CRC (which you don't have an API to retrieve) doesn't match? And how does CRC mismatch fit into the suggestion for an in-memory Stream retrieval of the data? As much as I like the Stream idea, I don't like the idea of getting to the end of the stream (having updated files/databases/whatever), then finding out that the CRC didn't match and perhaps all the data up to then was flawed.

  • The ExtractToFile 'convenience' methods on ZipArchiveEntry seem out of place and overly specific to me.

    Personally I'd rather see more generalized utility methods on the System.IO.File static class, e.g.

    File.WriteAllBytes(string path, Stream stream)  // reads from 'stream' and writes to the file at 'path'

    File.WriteAllBytes(string path, Stream stream, FileMode mode)

    Which could then be used to replace zipArchiveEntry.ExtractToFile(filename) as follows:

    using (Stream s = zipArchiveEntry.Open()) {
       File.WriteAllBytes(filename, s);
    }

  • This approach is too specific because there are plenty of other formats:

    en.wikipedia.org/.../List_of_archive_formats

    It would make more sense if ZipArchive is a subclass of Archive.

  • I agree with @Dominik -> having an IArchive (or even Archive abstract base class) that ZipArchive, 7zArchive, TarBallArchive, RarArchive (and others) can all implement would be great. Also, something like a CompressedStream might be needed for archives that only have a single stream.

  • I strongly concur about the suggestions for:

    * making it more Stream-based rather than File-based;

    * deriving from an abstract Archive class (you can choose a better name).

    Since zip files store file names as an array of bytes whereas .Net uses utf-16 for strings, I believe you should specify how you will handle encodings and provide a way to override the defaults (e.g. force encoding of file names to utf-8, or windows-1252).

    Please check your compression ratio compared to other zip implementations.  The current DeflateStream class might need some improvements.  I'm not asking that you match 7zip's deflate or kzip, but you really should compress as well as e.g. Info-zip (with -9).

    Convenience APIs for creating a Package from a ZipArchive (or, better, from an abstract Archive class) would be nice.

    A set of PowerShell cmdlets would be nice, too.

  • Congrats on the new job Richard.

    I do quite a lot of Silverlight development and I often send data from the server side to the client side and this data compresses very well so it's ideal to gain that benefit. Problem is that the System.IO namespace for Silverlight does not include functionality to uncompress data that was compressed by .NET server side.

    So, I'd very much like to see some form of compression + decompression support for the Silverlight BCL.

    Currently I use 3rd party assemblies for both .NET and Silverlight which is annoying because I've run into a very frustrating scenario where the current compression functionality in .NET is incompatible with the 3rd party SharpZipLib. Not sure who's to blame... don't care... but these things should be more standardised.

    Hope that helps
