Last week there was a bit of a controversy surrounding somebody’s bad experiences with .NET Serialization. Getting intimate with Serialization is a bit like any other relationship – sometimes good, sometimes bad, every now and then just incomprehensible. And you better not forget to buy flowers (or define attributes) every now and then. More specifically:
One of my recent projects encapsulates nearly all the serialization issues discussed:
Background: a certain subsystem was responsible for processing mainframe data files and generating printable output. The output could be saved in a variety of formats including SVG and PDF. The in-memory representation of the output data was a collection of objects, each representing a graphic primitive such as a string, a line, a rectangle, or an image.
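To make the rest of this concrete, here is a rough sketch of that kind of object model. The type names and fields here are purely illustrative – the post doesn't show the project's actual classes:

```csharp
using System;
using System.Collections.Generic;

// Illustrative graphic primitives (hypothetical names, not the real project's types).
[Serializable]
abstract class GraphicPrimitive { }

[Serializable]
class TextPrimitive : GraphicPrimitive
{
    public float X, Y;
    public string Text;
}

[Serializable]
class LinePrimitive : GraphicPrimitive
{
    public float X1, Y1, X2, Y2;
}

// A page of printable output is just a collection of primitives;
// a large input file could produce close to 100,000 of them.
[Serializable]
class Page
{
    public List<GraphicPrimitive> Primitives = new List<GraphicPrimitive>();
}
```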
The problem: With very large input files (>10 megs), processing time became unacceptably long. After profiling the core engine we determined that for a 10MB file, the in-memory object graph would be very close to 100,000 objects and the serializing/deserializing of that graph could take over 30 seconds. Why would we want to serialize it? Well, one reason was to cache the output to disk or database. Another reason was that the application included several AppDomains, and each domain boundary crossing involved serializing on one side and deserializing on the other.
Attempt #1: The first attempt at a solution was to implement ISerializable in the graphic primitive objects and let each object serialize itself. This added roughly 100 simple lines of code to the application. The good news was that it resulted in a major performance improvement – serialization time was cut in half. The not-so-good news was that the serialized stream was almost twice as big as before. Why? ISerializable.GetObjectData works by persisting data into an instance of a SerializationInfo object. In order to allow reading the data back out, SerializationInfo is implemented as a key-value pair collection, and each key is persisted along with its value. The end result was that I was not only saving the properties of 100,000 graphic primitives (the values in the SerializationInfo collection), I was also saving an identical number of keys.
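A minimal sketch of what this looked like, using a hypothetical line primitive (the field and key names are mine, not the project's). Note how every AddValue call stores its string key in the stream alongside the value – multiplied across 100,000 objects, that's where the size blowup came from:

```csharp
using System;
using System.Runtime.Serialization;

[Serializable]
class LinePrimitive : ISerializable
{
    public float X1, Y1, X2, Y2;

    public LinePrimitive() { }

    // Called by the formatter during serialization.
    // Each key string ("x1", "y1", ...) is persisted along with its value.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        info.AddValue("x1", X1);
        info.AddValue("y1", Y1);
        info.AddValue("x2", X2);
        info.AddValue("y2", Y2);
    }

    // Deserialization constructor, called by the formatter
    // (public here for illustration; conventionally protected).
    public LinePrimitive(SerializationInfo info, StreamingContext context)
    {
        X1 = info.GetSingle("x1");
        Y1 = info.GetSingle("y1");
        X2 = info.GetSingle("x2");
        Y2 = info.GetSingle("y2");
    }
}
```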
Attempt #2: I decided to stick with custom serialization and the speed improvement it brought, but to try to reduce the stream size. So I modified ISerializable.GetObjectData to pack each object into a byte array (using a BinaryWriter) and save that byte array into the SerializationInfo object. That worked – the serialized data was comparable in size to default serialization's, and the process was twice as fast. However, twice as fast still was not fast enough for us, and the code had become a bit cumbersome.
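A sketch of this second approach, again with hypothetical names: the fields are written into a single byte array by hand, so the formatter only ever stores one key ("blob") per object instead of one key per field.

```csharp
using System;
using System.IO;
using System.Runtime.Serialization;

[Serializable]
class LinePrimitive : ISerializable
{
    public float X1, Y1, X2, Y2;

    public LinePrimitive() { }

    // Pack all fields into one byte array; only the single "blob" key
    // is persisted alongside it.
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        using (var ms = new MemoryStream())
        using (var writer = new BinaryWriter(ms))
        {
            writer.Write(X1);
            writer.Write(Y1);
            writer.Write(X2);
            writer.Write(Y2);
            writer.Flush();
            info.AddValue("blob", ms.ToArray());
        }
    }

    // Deserialization constructor (public here for illustration).
    public LinePrimitive(SerializationInfo info, StreamingContext context)
    {
        var blob = (byte[])info.GetValue("blob", typeof(byte[]));
        using (var reader = new BinaryReader(new MemoryStream(blob)))
        {
            X1 = reader.ReadSingle();
            Y1 = reader.ReadSingle();
            X2 = reader.ReadSingle();
            Y2 = reader.ReadSingle();
        }
    }
}
```

The fields must be read back in exactly the order they were written, which is part of why this style of code gets cumbersome as the types grow.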
Attempt #3: Additional profiling indicated that the serialization framework was spending a lot of “dark” time – time spent before, after, and in between calls to our GetObjectData code. Iterating through the object graph was quick (<1 sec) when we did it in code, yet the same iteration took the BinaryFormatter literally tens of seconds, excluding the time spent in our custom GetObjectData serialization code. We also experimented with partial deserialization, in which we tried to deserialize only the first n objects in the stream. But we ran into a similar limitation – 99% of the deserialization time was spent before reaching the first call to one of our objects. What was the framework doing there? We didn’t know, and we really didn’t have any resources left to investigate further. So we moved on to:
Attempt #4: We refactored the graphic primitives object structure in such a way as to create a single parent which would be responsible for serialization and deserialization of the whole object hierarchy. So basically that corner of the app does its own “serialization thing” independently of .NET serialization, with one parent object acting as a gateway between the two. This allowed us to create a custom algorithm which could persist the object graph to a stream in under a second.
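A sketch of the gateway idea, under the same hypothetical type names as above: one parent object implements ISerializable and streams the entire child collection itself, so the formatter only ever visits a single object and a single byte array, no matter how many primitives are in the graph.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.Serialization;

[Serializable]
class LinePrimitive
{
    public float X1, Y1, X2, Y2;
}

// The single gateway object between .NET serialization and our own format.
[Serializable]
class PrimitiveGraph : ISerializable
{
    public List<LinePrimitive> Lines = new List<LinePrimitive>();

    public PrimitiveGraph() { }

    // The parent walks the whole hierarchy itself with a BinaryWriter;
    // the formatter sees only one object and one byte[].
    public void GetObjectData(SerializationInfo info, StreamingContext context)
    {
        using (var ms = new MemoryStream())
        using (var w = new BinaryWriter(ms))
        {
            w.Write(Lines.Count);
            foreach (var line in Lines)
            {
                w.Write(line.X1); w.Write(line.Y1);
                w.Write(line.X2); w.Write(line.Y2);
            }
            w.Flush();
            info.AddValue("graph", ms.ToArray());
        }
    }

    // Deserialization constructor (public here for illustration).
    public PrimitiveGraph(SerializationInfo info, StreamingContext context)
    {
        var blob = (byte[])info.GetValue("graph", typeof(byte[]));
        using (var r = new BinaryReader(new MemoryStream(blob)))
        {
            int count = r.ReadInt32();
            Lines = new List<LinePrimitive>(count);
            for (int i = 0; i < count; i++)
                Lines.Add(new LinePrimitive
                {
                    X1 = r.ReadSingle(), Y1 = r.ReadSingle(),
                    X2 = r.ReadSingle(), Y2 = r.ReadSingle()
                });
        }
    }
}
```

Because the framework never enumerates the 100,000 children, all of its per-object overhead disappears; the trade-off is that the parent is now fully responsible for versioning and for handling any cross-references within the graph.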
That probably wasn't the best way to solve our problem, but given our time constraints we took the first solution which presented itself.
By the end of the optimization effort, we had reduced a 2-minute-45-second process to 8 seconds, and about 30% of the improvement was due to the serialization changes.