I recently came across an interesting situation in merge replication and thought I should write about it:
Description: > There is a high volume replication setup with thousands of subscriber and a very volatile(lot of transactions per second) publisher database> Started observing huge blocking when multiple merge agents starts synchronizing> Even if merge agents are stopped, the time taken for a single merge agent is huge> There are millions of rows in merge metadata tables (msmerge_genhistory, msmerge_contents and msmerge_tombstone) and they are not getting cleaned up> Running update stats every hour does not help
Cause: > The huge backlog in the merge metadata tables was causing the merge queries to take long times. These long times were resulting in blocking when multiple merge agents were sync'ing. > This is because the msmerge_genhistory metadata table is referred everywhere during merge sync. Any query that joins this table (and there are a lot of them) had to get blocked since some other queries had taken locks on the same tables rows. > The root cause of this problem was the millions of rows those metadata tables were not just getting cleaned up.
What I found out: > Once I found that the root cause of the issue is the backlog in the metadata tables, ran sp_mergemetadataretentioncleanup manually to force a cleanup and it returned immediately with 0 rows cleaned. > Ran it repeatedly and that did not help. > So, its obvious that the merge agents (Note: merge agents run retention based metadata cleanup as the first step during regular sync) were also not cleaning anything from those tables. > Further looking into the code of this proc (by running sp_helptext sp_mergemetadataretentioncleanup) found that it checks if any other agent is running sp_mergemetadataretentioncleanup (of it this SP is being manually on the server). If yes, it just skips execusting any code and returns.
<snip>-- if somebody else is already cleaning up in this database, we simply return set @applockname= 'MS_sp_mergemetadataretentioncleanup' + convert(nvarchar(11), db_id()) exec @retcode= sp_getapplock @Resource= @applockname, @LockMode= 'Exclusive', @LockOwner= 'Session', @LockTimeout= 0, @DbPrincipal = @DbPrincipal if @@error <> 0 or @retcode < 0 return (0)</snip>
> From DBCC OPENTRAN, found that there was one merge agent which was running sp_mergemetadataretentioncleanup and taking long time but the user might have hit cancel and it was hung while rolling back. > This was causing other executions of sp_mergemetadataretentioncleanup to not work and also causing the metadata to keep getting stale causing further performance (and blocking issues) > Killed that spid, and stopped all the merge syncs and ran sp_mergemetadataretentioncleanup manually. > This took long time as it always cleans the rows from metadata tables in a batch of 5000 rows. Also it checks expired subscriptions (using sp_MSmark_expired_subscriptions) and this can take some time when you have thousands of subscribers. > The complete execution took long time but it cleaned millions of rows from metadata tables. There were only a few hundred rows left in the metadata tables after this. > After this, the merge queries ran instantly and there was a almost no blocking.
If you come across a situation where you find sp_mergemetadataretentioncleanup is not cleaning up rows from metadata tables, it will be worth to: stop all merge agents, confirm that no spids that might be executing any merge metadata commands are lying around.After this, run sp_mergemetadataretentioncleanup manually and wait till it finishes completely. Once it finishes completely, check the table count. After this, start the merge agents again.