Merge Replication Capacity Planning

Chris Skorlinski
Microsoft SQL Server Escalation Services

I was recently asked for recommendations for Merge Replication capacity planning project.  While I spend most of my time tuning and troubleshooting Replication environments, I decided to post some thoughts on about stress testing and capacity planning.  I hope this posting helps as you are designing your Merge Replication environment. 

Background Reading

Merge Replication Performance Tuning and Optimization (bit out dated, but still a good read)

Diagnosing and Troubleshooting Slow Partitioned Merge Processes (better article)

Merge Replication Performance Improvements in SQL Server 2005

Merge Replication Performance and Scalability Cheat Sheet  (great list of best practices)

Collection of Merge Replication Best Practices (my posting)

Okay, now we got some of the background reading out of the way.  Let me say I’ve see a wide range of Merge topologies.  Everything from 15,000 users load balanced across multiple servers to 1 subscriber as a disaster recovery site.  I seen simple designs with little or no filtering supporting 1000 of users to very complex filtering/join design barely able to hand 10 subscribers.  Thankfully Merge Replication allows you to scale out through alternate Publishers (sync partners) as business needs grow. While this doesn’t correct overly complex designs, it does provide relief.

Merge Replication fits a very specific need, that is updates being performance across multiple SQL Servers then “blended” together into one comprehensive view of the data.  Key point is Merge Replication uses queries to pull the changed data and update the SQL servers.  If these queries are overly complex, or moving millions of rows of data, performance will suffer.  Merge can only work as fast as SQL can query the changed data.  The complexity comes when the static or dynamic filtering logic is so complex Merge is having to analyze 5-level deep parent-child relationships just to determine who gets what data. This is why “filtering” is mentioned in very “best practices” posting.

So, great, how does this help me with capacity planning?  Keep filter logic as simple as possible to reduce logic overhead.  Then develop a test strategy based on realistic data volumes and simulate simultaneous users synchronizing their changes.  If your Publisher can’t keep up, look to limit the number of concurrent users.  (see below).  This may cause some users to wait for a brief time, but you’ll get a realistic synchronization times.

Merge Processes (
Limit concurrent processes

Determines whether to limit the number of Merge Agents that can run at the same time. This is typically used if a publication has a large number of push subscriptions that might be synchronizing simultaneously.

Maximum concurrent processes

The maximum number of Merge Agents that can run at the same time (requires Limit concurrent processes). If the number of agents synchronizing exceeds the maximum, agents are put in a queue until the number drops below the maximum.

You can also query the Merge Replication history tracking tables in the Distribution database to get a quantitative measurement of how long each synchronization took to complete and how much data was moved (Ins/Upd/Del) for each published tables.  Again, push the simulation until you see a bottleneck.  My last customer has 900 users synchronizing against 3 publishers moving about 5000 data changes per user synchronizing twice a day.  The multiple publishers provided both load balancing and redundancy should one publisher be offline for maintenance.

Tracking Merge Agent Performance using Distribution database system tables

Can you get 300 users synchronizing 5000 data changes every hour on 1 publisher?  Sorry, if you read this far expecting some magic formula to plug in # users, # tables, # updates, # CPUs, amount of RAM, etc, because I don’t have any. So much depends on your specific environment such as CPU, Disk, Memory, Network (all the classic bottlenecks), in addition to filter logic and data volumes.  You’ll only know your particular environment works by developing a test plan then push your Replication topology until bottleneck occurs.