More on Zip in .NET [Richard Lee]

First, I’d like to thank everybody for their comments on the Zip APIs. It’s great to know that I’m working on something that a lot of people will hopefully find useful. I’ll try to address the themes that came up in the comments.

Streams

A lot of the comments mentioned support for streams. The API does support creating a ZipArchive with a stream as the backing store, and input from streams. The following code sample takes the contents of instream and writes a Zip archive to outstream containing just that one file:

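// Create mode writes the archive straight to the backing stream.
using (ZipArchive archive = new ZipArchive(outstream, ZipArchiveMode.Create))
{
	// Add a single entry and stream the input into it.
	ZipArchiveEntry entry = archive.CreateEntry("data.dat");
	using (Stream entryStream = entry.Open())
	{
		instream.CopyTo(entryStream);
	}
}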

This will write out the Zip archive directly to outstream, without buffering the entire contents of the archive in memory or writing to a temporary file. We really think of the methods used above as the core APIs. Methods like CreateEntryFromFile and ExtractToDirectory are purely convenience methods – their main purpose is to make some of the more common scenarios with files easier.
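The reading direction can be sketched the same way. This is only an illustration: it assumes the API shape that later shipped in System.IO.Compression (a ZipArchiveMode.Read mode, a GetEntry method, and a leaveOpen constructor argument), none of which is spelled out in this post, and ZipStreamReader and CopyEntryTo are hypothetical names.

```csharp
using System.IO;
using System.IO.Compression;

static class ZipStreamReader
{
    // Hypothetical helper: copies one entry's decompressed contents from a
    // Zip archive held in `instream` to `outstream`.
    // Returns false if the archive has no entry with that name.
    public static bool CopyEntryTo(Stream instream, string entryName, Stream outstream)
    {
        // leaveOpen: true (assumed overload) keeps instream usable by the caller.
        using (var archive = new ZipArchive(instream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry(entryName);
            if (entry == null)
            {
                return false;
            }

            using (Stream entryStream = entry.Open())
            {
                entryStream.CopyTo(outstream);
            }
            return true;
        }
    }
}
```

As with the writing sample, nothing here touches the file system; any readable stream can sit behind the archive.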

Compression and Encryption

Another common theme was custom encryption and compression algorithms. The vast majority of Zip archives that are meant to be interoperable with the widest range of libraries, tools, and applications use the Deflate compression algorithm without encryption. Our main goal for this API is to be able to read and write such archives. As such, we’re currently planning to support writing Zip archives with Deflate, reading Zip archives that use Deflate or no compression, and to not support encryption. Not only will this enable reading/writing interoperable Zip archives, it also scopes the work to something reasonable that can be delivered during my internship.

If we’re not providing built-in support for additional compression or encryption algorithms, an obvious question is why we don’t provide extensibility hooks so that custom compression or encryption algorithms can be plugged in. We explored doing this, and it turns out to be more complex than you might initially think.

A lot of people mentioned the CryptoStream and ICryptoTransform model as a powerful way to allow for extensibility. This works well because the ICryptoTransform only needs to do one thing – transform a series of bytes into another series of bytes. Unfortunately, encrypting Zip files is a much more complex operation. Fields and flags need to be set to appropriate values and headers need to be encrypted depending on which algorithm is used. Implementing either of the two secure encryption methods mentioned in the Zip specification would require access to essentially all of the metadata in the Zip file. The resulting interface for such an extension would be enormously complex.

Compression is a bit simpler, as only one entry is compressed at a time. However, there are still fields and flags that need to be set in the headers, depending on the algorithm used. Furthermore, for both encryption and compression, only a few compression/encryption methods are specified in the Zip specification. We don’t want to give the impression that any compression method can be used to produce a Zip file, when that isn’t the case. If you want to use a compression or encryption algorithm that isn’t specified in the Zip spec, you might as well just compress or encrypt the stream yourself.
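To make "encrypt the stream yourself" concrete, here is a rough sketch that wraps the whole archive in a CryptoStream. It is an assumption-laden illustration, not a recommendation: AES with its default settings is just one possible choice, EncryptedZip and its methods are hypothetical names, key and IV management is left entirely to the caller, and the output is not a Zip file that other tools can read until it has been decrypted.

```csharp
using System.IO;
using System.IO.Compression;
using System.Security.Cryptography;

static class EncryptedZip
{
    // Writes a one-entry Zip archive through an AES CryptoStream.
    // The returned bytes are the encrypted archive, not a readable Zip file.
    public static byte[] Write(byte[] key, byte[] iv, string entryName, byte[] data)
    {
        var output = new MemoryStream();
        using (Aes aes = Aes.Create())
        {
            aes.Key = key;
            aes.IV = iv;
            using (ICryptoTransform encryptor = aes.CreateEncryptor())
            using (var crypto = new CryptoStream(output, encryptor, CryptoStreamMode.Write))
            using (var archive = new ZipArchive(crypto, ZipArchiveMode.Create, leaveOpen: true))
            {
                using (Stream entryStream = archive.CreateEntry(entryName).Open())
                {
                    entryStream.Write(data, 0, data.Length);
                }
            }
        }
        // MemoryStream.ToArray still works after the stream has been closed.
        return output.ToArray();
    }

    // Decrypts the blob back into a plain Zip archive in memory, then reads one entry.
    public static byte[] Read(byte[] key, byte[] iv, string entryName, byte[] blob)
    {
        var plainZip = new MemoryStream();
        using (Aes aes = Aes.Create())
        {
            aes.Key = key;
            aes.IV = iv;
            using (ICryptoTransform decryptor = aes.CreateDecryptor())
            using (var crypto = new CryptoStream(new MemoryStream(blob), decryptor, CryptoStreamMode.Read))
            {
                crypto.CopyTo(plainZip);
            }
        }

        plainZip.Position = 0;
        using (var archive = new ZipArchive(plainZip, ZipArchiveMode.Read))
        using (Stream entryStream = archive.GetEntry(entryName).Open())
        using (var result = new MemoryStream())
        {
            entryStream.CopyTo(result);
            return result.ToArray();
        }
    }
}
```

Because the encryption sits outside the archive, none of the Zip header fields or flags need to know about it, which is exactly why this route avoids the complexity described above.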

So an extensibility interface would be substantially more complex than something like ICryptoTransform, and would add significant complexity to the ZipArchive and ZipArchiveEntry classes. We don’t think providing extensibility hooks in this way adds enough value to the API to justify the complexity they bring. However, we are open to adding built-in support for encryption or other compression algorithms in the future, based on customer demand. We’ve specifically designed the API in a way that would allow us to do this. If this is something you’re interested in, we’d love to better understand your needs.

Abstract Base Class

Another common question was about providing a base class for archives. We explored this and have decided not to add such an abstraction at this time. The main reason we are holding off is that right now we’re only planning to support Zip archives and have no plans to support other archive formats (such as CAB, RAR, etc.). As a design principle, we try to avoid adding abstractions when there is only one implementation, because otherwise we risk getting the abstraction wrong. Then we’re either stuck with the bad abstraction or forced to add a second one when we add another implementation. All of this adds complexity that we’d like to avoid.

The concerns about test-driven development are certainly valid, but there are ways around this. For example, because archives can be made with streams as the backing store, a FileStream could be mocked and put behind an archive. That may not be totally ideal, but these concerns don’t seem compelling enough to justify adding an abstract base class at this time.
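One way this could look in a test is to build a small archive in a MemoryStream and hand it to the code under test in place of a file. The sketch below assumes the API as it later shipped in System.IO.Compression (the Entries collection, the Length property, and the leaveOpen constructor argument); ArchiveReport and ZipFixture are hypothetical names invented for the example.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

static class ArchiveReport
{
    // Hypothetical code under test: sums the uncompressed size of every entry.
    // It only asks for a Stream, so tests never need a real file.
    public static long TotalUncompressedSize(Stream archiveStream)
    {
        using (var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            return archive.Entries.Sum(e => e.Length);
        }
    }
}

static class ZipFixture
{
    // Test helper: builds an archive entirely in memory from name/content pairs
    // and returns the stream rewound to the start.
    public static MemoryStream Build(IDictionary<string, byte[]> files)
    {
        var stream = new MemoryStream();
        using (var archive = new ZipArchive(stream, ZipArchiveMode.Create, leaveOpen: true))
        {
            foreach (KeyValuePair<string, byte[]> file in files)
            {
                using (Stream entryStream = archive.CreateEntry(file.Key).Open())
                {
                    entryStream.Write(file.Value, 0, file.Value.Length);
                }
            }
        }
        stream.Position = 0;
        return stream;
    }
}
```

This does not replace a mockable archive type, but it keeps tests off the disk as long as the code being tested accepts a Stream.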

Miscellaneous

There were a couple of comments about treating the Zip archive like a file system and supporting searches for certain kinds of files. We made the decision to treat the archive as a flat container of files because that is how the entries are actually stored. As a library rather than an application, we thought it made more sense to represent the archive as it exists on disk. Also, using LINQ on the entries means that getting all of the files in a certain subdirectory that end in .txt is a relatively simple operation.
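A LINQ query along those lines might look like the sketch below. It assumes the Entries collection and the FullName property from the API as it later shipped in System.IO.Compression, and it assumes entry names use forward slashes as path separators; ZipQueries and TextFilesIn are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.IO.Compression;
using System.Linq;

static class ZipQueries
{
    // Returns the full names of all .txt entries stored under `directory`,
    // including those in nested subdirectories.
    public static List<string> TextFilesIn(ZipArchive archive, string directory)
    {
        // Normalize "docs" and "docs/" to the same prefix.
        string prefix = directory.TrimEnd('/') + "/";

        return archive.Entries
            .Where(e => e.FullName.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
                     && e.FullName.EndsWith(".txt", StringComparison.OrdinalIgnoreCase))
            .Select(e => e.FullName)
            .ToList();
    }
}
```

Because the archive is a flat list of names, "subdirectory" here is just a prefix match on the entry name.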

Another interesting comment was that MoveTo was confusing. MoveTo was intended to act like Rename, for renaming entries but keeping everything else about them the same. The method was named MoveTo to mimic the naming for the method on FileInfo. However, because it is confusing and probably has very few compelling usage scenarios, we’re thinking of cutting it from the API.

I hope I addressed some of your comments. We’d love to keep hearing from you, especially if you have extremely compelling use cases for some of the features that we’re not including in this version of the API. Being the designer of the API, I think it would be really cool, too, to support custom compression and encryption, or some of these other features. But at the risk of sounding like a broken record, I’m trying to build a simple, usable API, and these decisions are made with that in mind.

  • Something extremely important to us that has not been mentioned yet: please make the API available to both .NET and Silverlight

  • "The concerns about test-driven development are certainly valid, but [...] these concerns don’t seem compelling enough to justify adding an abstract base class at this time."

    :(

  • Hello Richard,

    Would be nice if this API provides the critical operations in async fashion.

  • No abstract base class   :(

    Could you please tell us how you plan to handle encodings (character sets/code pages) for the file names?  AFAIK, this is not really specified in Zip (it's a byte string), and a working Zip implementation should be able to handle Zip files created in other countries.

  • While I appreciate that you have to do this in a short period of time for your internship, I feel that this post is a bit of a blow off.  While you provide very good detail in addressing the concerns of the community, you didn't throw us any bones; you asked for comment, the community provided its thoughts, then you essentially said "thanks for your comments, but we're not actually going to change anything based on your feedback".  Then why even ask for the feedback in the first place?

    My biggest need for this class is to support encryption.  You stated that the most common type of compression is using Deflate without encryption, and overall, I don’t doubt that.  However, working in Healthcare, this is definitely not the case.  If this class can’t support encryption, then I can honestly say that I can’t use it for many scenarios (true, I could encrypt files prior to compression, but what about the files we get from other people that use true compression encryption?).  I think you should go have a chat with your Health Solutions Group about their needs for security in this type of a class.  They should be able to provide you with some great use case scenarios.

    AFAIK, if you don’t have the time to do it right (some of the comments/concerns were brushed off as “I don’t have time because I’m in an internship”), then don’t do it at all.  This is the base class library and we need robust solutions that we as developers can build from.  Keeping the class ‘simple’, as you say, feels a bit like a shield because it seems like simplicity for simplicity’s sake.  While this project has no doubt been a good learning opportunity for you, I think it needs to be worked further by someone with more time to dedicate, prior to possible inclusion in the libraries.

    Critique aside, I think that you have a very good start to a base class, and your writing style is very well thought out.  I look forward to using your class some day (with encryption :P ).  Great job!  Best of luck with your career.

  • Perhaps this "class" shouldn't be included in the BCL, but just a CodePlex project?  I agree with the other posters "we only have the time of my internship" sounds like a HORRIBLE excuse for something that is going to be added to the BCL.

  • I wholeheartedly welcome your decision regarding the base class issue. Better have a simpler API with something that works than to have abstract base classes all over the place that don't add much benefit, but make the whole BCL harder to understand.

    Regarding compression and encryption, an un-encrypted ZIP implementation which uses Deflate is fine for me, so thumbs up! :-)

    Best regards and all the very best for your internship!

    Ooh

  • Richard, while still on streams, I think it would be nice to be able to access the items in the archive using an index. The index could be an unsigned int or the filename that represents the file. The item would then be accessed as a stream that you can copy to, replace, or rename. This would make the API easier to work with.

  • >>> "we only have the time of my internship"

    It sounds like this isn’t coming across the way it was intended.  Let me try to clarify.  In a perfect world, we’d love to be able to provide all the functionality that all customers would like in an API.  Unfortunately, we can’t do everything because even we have resource and time constraints.  This is true regardless of whether this is initially developed as part of an internship or any other internal development milestone.  We scope the feature to something that will satisfy the end-to-end needs of the majority of customers in order to deliver a high-quality, well-designed API.  If we try to bite off too much, we risk shipping a lower-quality API, to the detriment of everyone.  We’ve tried to design the API in a way that would allow us to add additional functionality—such as encryption—in the future, based on customer demand.

    One important thing that is worth calling out is that we will continue to gather feedback and make appropriate changes even after the completion of Richard’s internship.  We’re still in the early stages of our next release and one of the main reasons for discussing the design of this feature on the blog this early is specifically to get early feedback from you, so that we can make changes before it ships in the framework.  We are also planning to release the Zip APIs on our CodePlex site, allowing you to actually make use of the APIs and provide additional feedback.

    If there is overwhelming customer demand for certain functionality or we missed an important scenario, we’ll do our best to address the feedback.  This could involve trading off other planned feature work in other areas in order to spend the time adding the functionality to the Zip APIs.  This may also involve making tweaks to the Zip API if we discover an issue that would prevent us from adding the functionality down the road, or make doing so difficult.

    I hope this helps.

    Regards,

    Justin Van Patten

    Program Manager

    Base Class Libraries

  • Why are you working on this? Is Xceed Zip for .NET not a great solution already? I just don't get it. Where is the pressure coming from to put this in BCL?

  • Or for streams, Xceed Real-Time Zip for .NET and for Silverlight... We've gotten great kudos on a clean, efficient API. This problem has been solved already...
