Merge replication and complex filters
I'm currently doing a design review for a SQL CE + SQL 2000 mobile system developed by Thomas Granath and Anders Lundstedt that uses merge replication as the basis for sending data to the device. The exercise was prompted by slow perf seen during replication. It’s a work order processing solution for field engineers dealing with routine maintenance and adhoc breakdown of factory equipment across several European countries. I wanted to share this because I learned a lot of very useful information about the workings of merge replication and specific mobile considerations.
Just to give you some idea of the size: The main DB is around ~20GB and has ~30 tables; the biggest is around 400k rows. The structure is not particularly complex with maybe 5 main fact tables including things like Technician, Assignment, Customer. The publication includes most tables and is horizontally and vertically partitioned to reduce the data set for each device. The filtering is very interesting: it uses a dynamic filter on the technician table from HOST_NAME() and uses join filters to pretty much every other table. Using these dynamic filters provides a very powerful way of restricting data size synchronized to a subscriber to exactly what is needed. It results in a typical DB size of around 10MB on the device. When using a dynamic filter there is no initial snapshot used to sync with the device as each user receives different data, so each device sync must pull the initial and subsequent merge data through a set of views that are created by the publication wizard. Further views are created for each table join used in the publication – more later.
The customer is seeing the server becoming CPU bound with only a few active replications taking place - the server is a quad proc, 4 gb proliant with next to nothing else running. Average replication time is 4 minutes with around 200 – 400 rows updates being sent to the subscriber, and 10 or so updates inbound from the subscriber. The load appears to be independent of db size (mostly), but is exponentially (its maybe more of a non linear growth rather than true exponential) affected by the number of active subscriptions, which is == to the number of technicians registered. Even when no data is being returned to the client, the merge takes the same length of time. An average technician connects maybe 3 or 4 times a day.
The tables are using GUID’s (UniqueIdentifiers columns) rather than identity columns. This does have an impact on perf as a guid requires 16 bytes and a bit more processing compared to a 4 byte integer, but benefits from significantly simpler setup for data inserts – identity column replication requires partitioning of the int range to ensure no clash, but guids are unique for free. Additionally I could see one or two badly placed clustered indexes but nothing to fully account for the drop in perf per technician.
I initially suspected the culprit for bad perf to be the complex filters used on the publication, cascading from the technician table and including quite a few joins. With a joined filter the generated view uses a FROM <previousview> clause to join the two views. The replication views that are generated end up looking pretty horrendous. In this case the worst join ended up being 4 views deep.
Looking a little deeper at the operation of the join filters in the merge these proved not to be the source of perf issues. We could deduce this because when an initial replication takes place its only using 5 to 10 seconds to send around 40k rows of data! Compare this with a merge replication taking 4 to 6 minutes and updates maybe 300 rows and its quite apparent the views in isolation cant be causing all that delay. What caused some confusion is these long merges only occur as the number of subscriptions rises to around 200 users – still with only ~4 active sync’s taking place at a time. With fewer users the individual merge times drop dramatically.
There is a ‘magic’ option in SQL 2000 called Optimize Subscription designed to minimize data transferred to a client - ideal when the client is a Windows Mobile device using a 9.6k link. This option creates 'before image' tables to assist tracking data moving between partitions (e.g. updating a row causes that data to become visible / removed for different subscribers - for example if a customer moved house and data was partitioned on geography). When an article has a change that has the potential to cause a partition change, the merge agent must send a delete row request to every relevant subscriber in order to ensure the rows are correctly removed. Before image tables allow the merge agent to track where data was prior to a partition change and so filter the deletes and remove much of the redundancy prior to sending them to a client. Before image tables are organized by ‘generation’. A generation is a sort of checkpoint in the data that is used to identify where the merge agent needs to start looking for changes for each of the synchronizations. They work like this:
- lets start with generation 10. OLTP changes are added to the before image tables with that generation
- when a subscriber replicates data it sends its generation to the merge agent that combines all changes for each subsequent generation since then.
- When the replication is complete, a new generation is used. Any updates from the merge subscriber are added with the new generation.
This is really the only feature that can be affected directly by the addition of new subscriptions and so has to be the cause of merge slow down – books online has a specific warning about using Optimized Subscription setting for fast changing data for this very reason.
When a SQL CE device replicates it makes a single call the SQL CE ISAPI extension running under IIS. That code opens a connection with the database and eventually calls a procedure sp_MSsetupBelongs with a list of generation numbers to process. This is repeated until all generations have been processed. The sp_MSsetupBelongs procedure is moving over the before image tables generating a change table (temporary) that eventually contains all the data to be sent to the client. Obviously the performance of this iterative process is directly affected by the number of generations. It’s also effected by the number of changes per generation, but less so. The change table is updates repeatedly for each generation since the last merge, with rows being added and removed as appropriate.
Each new subscriber to the system will cause 4 or 5 merge processes per day. Each merge causes a new generation to be created and increases the load on every subscriber that subsequently merges thus increasing the delay in a non linear fashion.
So the obvious option is to turn off the Optimized Subscription option and stop all this before image table iteration and improve performance. But that takes us back to the original problem that the replication agent cant efficiently identify partition changes and so has to send lots of deletes to every client! The delete command is only 16 byte GUID key (+padding) but when there are ~4000 deletes being generated that’s ~64k per merge, over a 9.6k link gives about a minute just to send the delete commands to the client. The customer is going to investigate if saving in merge time makes this option viable in the short term.
The challenge here is they are too slow during merge if Optimization is on and too slow in data transfer if it’s off!
So what’s the answer?
We talked around several data relation changes to their DB and application to simplify the merge process, specifically looking at reducing the number of partition changes caused by normal processing. For example when a job is complete, don’t physically delete it from the device but instead mark it deleted and leave it on the client. Although the number of partition changes could be significantly improved there are two effects: increased database size and the need for some sort of batch update process.
The good news is that SQL 2005 and SQL Mobile bring a few improvements specifically for these problems:
- Much improved filter performance on SQL 2005 – I haven’t found exact details yet. I will post my finding when I get some
- Data compression from the IIS agent and the device, so the 64k of delete commands should be significantly improved.
- Use ‘download only’ tables. Using download only will give them the option of removing several fast changing tables from the merge process and in combination with a delete flag and web service call has the potential to massively reduce partition changes and therefore delete request.
- SQL Mobile supports multiple db connections giving the option to remove the sync process from user interaction and run it automatically in the background – this will remove (minimize) the user perception of delay which is nearly as good as making it really go fasterJ
It’s worth noting that the system design they currently have to solve their specific business problem is about right with maybe one or two minor modifications I could suggest. Using merge replication is an extremely powerful solution and radically simplifies the amount of custom code needed to build a solution of this complexity. It might be possible to achieve better merge perf using a different approach to the database schema by reducing partition changes but that would come at a significant dev cost for changing back office systems and the device application, and in the end can only become a compromise.
The one factor that stands out from the overall design is just how visible the synchronization process is to the user. The current system requires the user to manually initiate and wait for sync to complete – during the sync process the whole application is effectively locked from use. The outcome is that delays in sync due to network, device or server performance are fully visible to the user and form a big part of negative feedback. Ideally a mobile application should ‘just work’, hiding as far as possible the internal workings of network connection, data transmission and retrieval. The current design used in this system is partly historic but also SQL CE 2.0 limitation can impede the developers desire to hide sync operations from the user because of the single db connection limitation. Its still possible to protect the user from much of the network plumbing but requires careful thought and potential design alterations to fully succeed. SQL Mobile brings the welcome improvement of multi connection’s and associated data locking strategies.
So all that left for me to do is produce the design report doc for the customer… that means the real fun is over, I hate writing formal docs L
Marcus