Users have frequently asked for the ability to synchronize relational nodes remotely, in a peer-to-peer fashion, using SyncServices for ADO.NET. We have a sample that demonstrates remote synchronization using Windows Communication Foundation (WCF), and I want to use this post to show an easy way to optimize the performance of that WCF-based solution.
DbSyncProvider enumerates all changes into a DataSet, which is then applied on the destination provider. The moment the two providers move from a local 2-tier model to a remote n-tier model, this DataSet has to be serialized and transmitted over the network. DataSets are very efficient when the amount of data involved is small. Once your data exceeds a few hundred rows, however, the amount of memory required to serialize and deserialize the DataSet becomes very large. The serialized size of the DataSet on disk is quite big as well.
A DataSet object is serialized in XML format by default, and serializing/deserializing this XML data creates a lot of transient objects, resulting in a spike in memory usage. When the DataSet holds enough data, as when synchronizing a large number of database rows, your app can run out of memory while deserializing it. In fact, users of SyncServices for ADO.NET V2 are well aware of the OutOfMemoryException that occurs when synchronizing a large number of records in the 2-tier model (PS: solving this is our highest priority for the next release). Using the WCF solution increases the likelihood of this error, even for mid-sized data that otherwise fits fine in memory.
There is an easy way to address this problem, and obviously it requires moving away from the default XML-based DataSet serialization. Since DataSet doesn’t support any other format, the only way is to use a surrogate to serialize it. The Microsoft Knowledge Base has a wonderful article detailing this surrogate, and it can be downloaded from http://support.microsoft.com/kb/829740.
Download it and use it in your WCF-based synchronization apps, and you should see a vast improvement in memory usage and performance.
I wrote a quick sample that checks the process's peak memory usage during serialization/deserialization of a DataSet. Here are the comparison numbers for default and surrogate serialization.
The DataSet contains one DataTable with a single string column. Each row holds a 5 MB string. The size reported is the peak working set.
Point proven :). As you can see, for each 5 MB of data added in memory, the default serialization takes anywhere from 6 to 8 times more memory. For 50 rows (approximately 250 MB of data in memory), the serialization step itself fails with an OutOfMemoryException.
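A measurement of this kind can be sketched roughly as follows. This is not the original sample, and it does not use the surrogate class from the Knowledge Base article. The class name `SerializationMemoryProbe`, the `xml`/`binary` command-line modes, and the exact string length are assumptions made for illustration. The XML path uses `DataSet.WriteXml`/`ReadXml` as a stand-in for the default serialization, and the binary path is a hand-rolled `BinaryWriter` format that only handles a single non-null string column. Because the peak working set is a process-wide high-water mark, each mode should be run in its own process.

```csharp
using System;
using System.Data;
using System.Diagnostics;
using System.IO;

static class SerializationMemoryProbe
{
    // Builds a DataSet with one DataTable and one string column,
    // mirroring the shape of the test data described above.
    public static DataSet BuildDataSet(int rowCount, int charsPerRow)
    {
        var table = new DataTable("Payload");
        table.Columns.Add("Value", typeof(string));
        for (int i = 0; i < rowCount; i++)
        {
            table.Rows.Add(new string('x', charsPerRow));
        }
        var dataSet = new DataSet("Probe");
        dataSet.Tables.Add(table);
        return dataSet;
    }

    // XML round trip: stand-in for the default XML-based serialization.
    public static DataSet RoundTripXml(DataSet source)
    {
        using (var stream = new MemoryStream())
        {
            source.WriteXml(stream, XmlWriteMode.WriteSchema);
            stream.Position = 0;
            var copy = new DataSet();
            copy.ReadXml(stream, XmlReadMode.ReadSchema);
            return copy;
        }
    }

    // Compact binary round trip: a simplified, hypothetical stand-in for a
    // surrogate. Assumes one table with one non-null string column.
    public static DataSet RoundTripBinary(DataSet source)
    {
        DataTable table = source.Tables[0];
        using (var stream = new MemoryStream())
        {
            var writer = new BinaryWriter(stream);
            writer.Write(table.TableName);
            writer.Write(table.Columns[0].ColumnName);
            writer.Write(table.Rows.Count);
            foreach (DataRow row in table.Rows)
            {
                writer.Write((string)row[0]);
            }
            writer.Flush();

            stream.Position = 0;
            var reader = new BinaryReader(stream);
            var copyTable = new DataTable(reader.ReadString());
            copyTable.Columns.Add(reader.ReadString(), typeof(string));
            int rowCount = reader.ReadInt32();
            for (int i = 0; i < rowCount; i++)
            {
                copyTable.Rows.Add(reader.ReadString());
            }
            var copy = new DataSet(source.DataSetName);
            copy.Tables.Add(copyTable);
            return copy;
        }
    }

    static void Main(string[] args)
    {
        // Usage (assumed): probe.exe xml 10   or   probe.exe binary 10
        string mode = args.Length > 0 ? args[0] : "xml";
        int rowCount = args.Length > 1 ? int.Parse(args[1]) : 5;
        const int charsPerRow = 5 * 1024 * 1024; // assumed size of one string

        DataSet source = BuildDataSet(rowCount, charsPerRow);
        DataSet copy = mode == "binary" ? RoundTripBinary(source) : RoundTripXml(source);

        using (Process current = Process.GetCurrentProcess())
        {
            Console.WriteLine("{0}: {1} rows copied, peak working set = {2} bytes",
                mode, copy.Tables[0].Rows.Count, current.PeakWorkingSet64);
        }
    }
}
```

Running the program once per mode with increasing row counts gives one peak working set figure per configuration, which is the kind of comparison described above. The absolute numbers will depend on the runtime, the machine, and the serializer actually used by your WCF binding.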