SQL Server Merge Replication Best Practices

“REPL TALK”

Chris Skorlinski
Microsoft SQL Server Escalation Services

I’m posting a collection of SQL Server 2008 Merge Replication Best Practices.  The material below includes text from SQL Books Online, other white papers, KB articles, web postings, and little from experience I gained working with SQL Replication for almost 15 years.  The “best practices” below are in no particular order.  My objective was to just get these all up on one posting. 

Other postings to follow will include:

  • Merge – Performance Tuning
  • Merge – Filters
  • Merge – Conflict Evaluation and Resolution
  • Merge – Troubleshooting Non-Convergence
 
Key Concepts for Merge Replication

Merge is design with a central Publisher containing a master copy of the data and remote Subscribers with a copy of the Publisher data.  The Subscribers can have a complete copy of a subset of the Publisher data.

SQL Server Merge.exe will “blend” data between 2 different SQL Servers.  This is called “synchronizing”.  When Synchronization is complete and all changes have been upload to central Publisher and download to the Subscriber the servers are considered “in synch”. 

Not all subscribers need to get the same data.  Filters are used to determine which subscriber gets which data.  For example, a route driver with a Pocket PC only needs to see stop on their route and not stops for all routes.

SUBSET filters are queries defined with SELECT WHERE clause used to limit rows for a subscriber such as SELECT * from Customers where SalesPerson = Suser_Sname().

JOIN filters are queries defined using JOIN between two or more tables such as SELECT * FROM OrderHeader JOIN OrderDetails.

By default Merge will use these filters when data is INS/UPD/DEL to predetermine which subscriber will get the data.  This is called “precomputed partitions”.  This offloads the Filter logic to the updates on Publisher. All the subscriber needs to do is query for its already preallocated data. The SQL 2000 method instead forces each subscriber to execute these same Filters against ALL the Publisher data in order to find that subscriber rows.  This SQL 2000 option is still supported for backward compatibility.  I think “backward” being the key word.

Optimizing Parameterized Filter Performance with Precomputed Partitions
http://msdn.microsoft.com/en-us/library/ms151737.aspx

Merge uses Replication created triggers on user table to flag what data needs to be moved to which subscribers.  This information is stored in systems tables. Most important is msmerge_contents which contains one row for each modified record.

Merge processes includes logic to determine which SQL “win” in the event of a conflict.  A conflict is caused when same row/column is changed on 2 different SQL Servers.

Distribution Database is only used to track Merge Replication historical data.  Unlike Transactional Replication it does not contain data from user tables.  This allows the database to be restored with little or no loss of user data.

Subscribers can synchronize over direct LAN/VPN SQL <-> SQL or over WAN network or SQL <-> IIS <-> SQL (called Web Sync).

Subscribers can be full SQL Server or SQL Compact Edition subscribers.

Tools to Monitor Performance

When analyzing Merge Performance use standard DBA tools to analyze key Merge query performance.  If you can improve the Merge queries you’ll improve overall system performance for synchronizing data. Identify high IO or high CPU SPs then explore how to tune.

Use Replication Monitor to get overall view of Replication performance.

Add –OUTPUT to Merge Agent to identify which step (stored procedure) in the Merge processes is failing or taking longest time.

http://support.microsoft.com/kb/312292
--or--
http://blogs.msdn.com/chrissk/archive/2010/01/05/walk-through-of-kb-article-312292-how-to-enable-replication-agents-for-logging-to-output-files-in-sql-server.aspx

ReplMerge.log

Merge Replication automatically generates ReplMerge.log file which is ON by default. The minimal default logging may provide enough detail to resolve issue without needing to collect verbose log or Profiler trace.

C:\Program Files\Microsoft SQL Server\90\COM\replmerge.log

sys.dm_exec_query_stats

This DMV can be used to examine query plans for the Merge Replication created SPs used to evaluate Publisher updates and to move data throughout the Merge topology.  Looks for high CPU and high IO queries.  Try updating STATS or REINDEX on key Merge system tables then see if a better query plan (lower IO and CPU) is generated by Query Optimzer

--Find TOP Top LogicalReaders for Merge Replication

--Examine Query Plans for SCANS

--Update STATS and/or REINDEX to see if better query plan is generated

SELECT TOP 25

      st.text, qp.query_plan,

    (qs.total_logical_reads/qs.execution_count) as avg_logical_reads,

    (qs.total_logical_writes/qs.execution_count) as avg_logical_writes,

    (qs.total_physical_reads/qs.execution_count) as avg_phys_reads,

      qs.*

      FROM sys.dm_exec_query_stats as qs

            CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as st

            CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) as qp

      WHERE st.text like '%MSmerge%'

--  WHERE st.text like '%sp_mergemetadataretentioncleanup%'

--  WHERE st.text like '%sp_MSenumgenerations90%'    

--  WHERE st.text like '%sp_MSmakegeneration%' 

      ORDER BY qs.total_logical_reads DESC

 

If you really want to use SQLDIAG or SQL Profile, capture RPC:COMPLETED events.  These are the top-level Replication SP calls.  The Completed event will have final IO and CPU data.  Search and tune highest CPU/IO SPs.

 

Replication System Tables

Often unexpected Synchronization performance can be contributed to unexpected high data volume.  After all Merge is just queries against SQL table in order to determine which data to upload/download.  Examine Merge system tables for counts 10 million (for example).

SELECT COUNT(*) FROM MSmerge_contents (nolock)

SELECT COUNT(*) FROM MSmerge_tombstone (nolock)

SELECT COUNT(*) FROM MSmerge_genhistory (nolock)

SELECT COUNT(*) FROM MSmerge_partition_groups (nolock)

SELECT COUNT(*) FROM MSmerge_current_partition_mappings (nolock)

SELECT COUNT(*) FROM MSmerge_past_partition_mappings (nolock)

Replication History Tables

High sync duration could be result of unexpected high data volume.  Use query below to see which date had mass amount of changes.

SELECT COUNT(*) 'generation count', sum(changecount) 'change count',

datepart(dd,coldate) 'day of month', datepart(hh,coldate) 'hour of day'

FROM MSmerge_genhistory (nolock)

group by datepart(dd,coldate), datepart(hh,coldate)

order by 3,4

 

 

--Distribution database tables of interest

select * from distribution.dbo.MSmerge_agents

select * from distribution.dbo.MSmerge_history

select * from distribution.dbo.MSmerge_sessions

select * from distribution.dbo.MSmerge_articlehistory

 

--Replication errors table

select * from distribution.dbo.MSrepl_errors

Setting which affect Merge Performance

Each of these Merge Replication options will impact overall Merge Replication performance.  Search SQL BOL and MSDN Forums for detailed explanation of each feature.  Every user’s environment is different.  Make small changes in a test environment to learn impact of each change.  I’ll write more information in a separate “Merge Performance Tuning” post.

  • -DownloadGenerationsPerBatch
  • –ParallelUploadDownload
  • -SrcThreads
  • -DestThreads
  • -MakeGenerationInterval
  • -DownloadGenerationsPerBatch
  • -UploadGenerationsPerBatch
Non-Convergence.

When a synchronization fail to Merge data you may notice rows being removed form the Subscriber or rows showing up on a different subscriber.  The subscriber is considered “non-converged”.  Most common cause is overly complex Subset/Join filter logic resulting in unexpected data movement.  I’ll write another post on “Merge Non-Convergence”.  Some tools to help with non-convergence include:

  • sp_mergedummyupdate  (really cool way to trigger a Merge on an individual row,  see SQL BOL)
  • tablediff.exe (really cool way to get all TSQL needed to manually bring Pub/Sub in sync)
General Merge Replication Best Practices

Plan for when it break.  Try to experiment with different failure to learn how in your environment steps you’ll need to get it back up and running.  Nothing quite like having to fix it before its really broken.

Keep Article filter logic as simple as possible.  Keep Article filter logic as simple as possible.  Yes important enough to say it twice.  I’ve seem some very complex logic to try and determine what data goes to which subscriber.  De-normalize the data if needed.  This is important enough that I’ll post a separate topic just on “Merge Filters”.

Look at Publication SUBSET and JOIN filters and verify all columns referenced have indexes defined.  I spent many hours trying to “tune” Merge Replication and track down the problem to a columns used in row filter or join filter that was missing an index forcing the Query Optimizer to Table Scan.

While on subject of Merge Filters, verify you’re using “precomputed partitions”.  To confirm, script out the publication and check for use_partition_groups = ‘true’.  When ‘false’ usually indicates the Filter logic was overly complex and Merge was unable to determine how to preallocated data into buckets for each subscriber.  Examine the Filter logic and look for ways to simplify via denormalization of the tables if needed.

sp_addmergepublication . . . @use_partition_groups = ‘true’

http://msdn.microsoft.com/en-us/library/ms151737.aspx

Minimize conflicts through publication design and application behavior. Check the _CONFLICT tables and analyze why the conflicts are occurring.  If you can eliminate root cause for these conflicts you’ll greatly improve the synchronization performance.  I’ll post a separate topic on “Merge Conflicts”.

Limit the use of Large Object (LOB) data types (varchar(max), nvarchar(max), varbinary(max)).  Yes, you can replicate these, but see if you can find another way to move the LOB between servers use Merge to move data in other columns.  For example, split the LOB data into its own table with a 1-1 relationship to main data table.  This way if the non LOB data changes, like “date last modified” you’re not re-replicating the LOB data.  You can even use SSIS to keep servers updated.

While on the subject of data types, avoid IDENTITIES.  Yes, I know, they are supported in the product, but if look at all the logic to ensure each user has a subset of personal Identity values and what is needed “refresh” when subscriber runs out of ranges you’ll wonder how it ever worked.  Most developers/DBAs are instead using GUIDs.  If you must have Identity values, use BIGINT with “manual range management” then RESEED each subscriber to a private range.  We had one customer calculate BIGINT with high SEED value would allow subscribers 30 years of new data.  That takes them all the ways to SQL 2040 without having to RESEED.

Publish only the data required.  Because replication is easy to set up, there is a tendency to publish more data than is actually required. This can consume additional resources within the distribution databases and snapshot files, and can lower the throughput for required data. Avoid publishing unnecessary tables and consider updating publications less frequently.

Set user-defined triggers and check constraints to “not for replication”.  These don’t need to fire when Replication is moving data as the data should have already been checked and verified using the initial application UPD/INS/DEL.

If large number of Merge subscribers, use PULL and stagger the synchronization schedule.  You can even limit the number of simultaneous Merge Agents to reduce system overhead.

Watch that RETENTION PERIOD.  Nice to keep 2 weeks of Merge Tracking data just in case a subscriber has been offline for 2 weeks, but at what cost in terms of Replication overhead.  Keep RP to minimum without getting you into trouble.

Perform regular index maintenance, update stats, reindex, on Replication system tables just like you do for user tables.

  • MSmerge_contents
  • MSmerge_genhistory
  • MSmerge_tombstone
  • MSmerge_current_partition_mappings
  • MSmerge_past_partition_mappings

A lot of DELETEs or “out of partition changes” can slow down performance.  Track how often these occur and reason why.  For example, is application doing a DELETE followed by INSERT instead of an UPDATE?  This will force Merge to track both the DELETE as well as the INSERT instead of just one UPDATE.  Deletes are tracked in these Merge system tables.

  • MSmerge_tombstone
  • MSmerge_past_partition_mappings

Great thing about Merge Replication is that as you add more subscribers you can always "scale out" by adding another "re-publisher".

Merge may not be best solution.  Consider a mix of technologies and topologies to accomplish the business goal such as SSIS, Log Shipping, Database Mirroring, Peer-2-Peer, Change Tracking, Snapshot, Merge, and Transactional Replication.

Backup Restore Best Practices

Plan for when it break.  Try to experiment with different failure to learn how in your environment steps you’ll need to get it back up and running.  Nothing quite like having to fix it before its really broken.

http://msdn.microsoft.com/en-us/library/ms152497.aspx

  • What happens if I restore a Publisher to earlier time?
  • What if I restore Distributor?
  • What if I restore Subscriber?
  • What if the last backup I have is older then the Publication Retention Period?
Knowledge Base Articles and Web Links
  • KB 310308 INF: Understanding Merge Replication Metadata
  • KB 307356 INF: Understanding Merge Replication Article Processing Order