Initialization of Merge Replication subscriber fails with primary key violation

Troy Moen
Microsoft SQL Server Escalation Services

We encountered an unusual timing problem while pushing a Snapshot down to a merge subscriber.  We don’t expect many to encounter this problem, after all we’ve only seen it once in 15 years.  We wanted to share this as a learning opportunity as it illustrates a classic application\thread synchronization problem.

Scenario

Customer has Merge Replication running and need to reinitialize 1 subscriber.  Updates occur on both the Publisher and Subscriber.  At the time, we believed all subscriber updates were being redirected to the publisher, but as you’ll see, we were wrong.

During the re-initialization the Merge Agent failed with the following message.

Error messages:

· The merge process was unable to deliver the snapshot to the Subscriber. If using Web synchronization, the merge process may have been unable to create or write to the message file. When troubleshooting, restart the synchronization with verbose history logging and specify an output file to which to write. (Source: MSSQL_REPL, Error number: MSSQL_REPL-2147201001)
Get help: http://help/MSSQL_REPL-2147201001

· The process could not bulk copy into table '"dbo"."Table_1"'. (Source: MSSQL_REPL, Error number: MSSQL_REPL20037)
Get help: http://help/MSSQL_REPL20037

· Batch send failed (Source: MSSQLServer, Error number: 0)
Get help: http://help/0

Violation of PRIMARY KEY constraint 'PK_Table_1'. Cannot insert duplicate key in object 'dbo.Table_1'. (Source: MSSQLServer, Error number: 2627)
Get help: http://help/2627

If we started with an empty table and loaded rows from the Publisher how could we possibly get a “duplicate key”?  Did the duplicate data already exists in the saved BCP data from the Publisher?  We checked and it did NOT!!

We looked at the subscriber table and notice it contained 1 rows with Primary Key value 1, but this was a different row then 1 from the BCP data we’re trying to load.  From where did this data originate?  Looking closer at the data we determined it was inserted only moments ago directly on the Subscriber from Subscriber’s application, which yes, we thought was stopped, but was still trying to insert data.  The application log shows it failing with “table not found” right up to the point where the table got created by the Merge Agent, then pop, it gets inserted.

Consider the following:

  1. A publication exists with one table “Table_1” containing 100 rows.
  2. A user marks the subscription for re-initialization.
  3. A snapshot is generated of table, which contains max primary key value of 100.
  4. The Merge Agent begins application of the snapshot.
    1. Table is dropped from Subscriber
    2. New empty table structure, with default starting identity value 1 is created
  5. Split millisecond before Merge Agent can execute Bulk Insert from the Publisher rows, the subscriber client inserts row into the subscriber.  This is assigned identity value 1.
  6. Merge Agent start bulk inserts from previously saved Publisher data containing identity values 1 – 100, it fails on the 1st row as identity 1 already exists.

Under the right conditions, the above example can result in a primary key violation.

Solution:

Disable clients from entering data into the subscriber database during that time. In most cases the “window” in which duplicate data could be inserted is very small, which varies depending on subscriber performance at the time. In my testing, only a few seconds maximum had passed before the trigger was created, and thus stopped any chance of duplicate data being added. 

Another creative solution was to create a TRIGGER as part of the table creation to abort any user activity until after the Merge Agent bulk loads existing data, then remove the trigger to if you will, open the spigot and allow data changes to flow.  While creative, it takes a while to get all the detail in place for this to happen.  The first option is much easier.  Let’s take a look at this creative solution from a high level. The Merge Agent will follow these steps when applying the snapshot after a subscription has been marked for re-initialization:

1. Drop the replicated table on the subscriber, provided “Action if name is in use” is set to “Drop existing” in the article properties.

clip_image002

2. Create the replicated table on the subscriber.

3. Apply a trigger disabling DML operations on the table in question:

create trigger [dbo].[MSmerge_disabledml_9F571C3241264DA6972D846C0A738A71] on [dbo].[Table_1] for update, insert, delete

not for replication

as

set nocount on

if @@trancount > 0 rollback tran

raiserror (20092, 16, -1, '[dbo].[Table_1]')

return

GO

4. BCP the snapshot data into the table.

5. Remove the disable DML trigger.