An overview of the inner working of a replication monitor

 

We might see large number of Pending transactions on one of the Publisher in P2P replication. But when we check the Replication Monitor for second server (Subscriber) it might show a small value (which would be the actual value). You might see that the latency on both sides are which is very less.

 

All of these point that there might a wrong notification from the Replication monitor.

 

First we can try the manual way to see how many transactions are actually undistributed.

 

select * from msreplication_subscription (on the subscriber database)

 

·         We would get the value under transaction_timestamp column.

 

·         Remove the trailing 0's and try the following

 

select count(*) from msrepl_transactions where xact_seqno > <seq_no> (on distribution database)

 

·         It would return the number of rows which we see in Replication monitor on Subscriber

 

 

Then we can try to use TracerToken to check the latency

 

STEPS

====

sp_posttracertoken @publication='<Publication Name>'

sp_helptracertokens @publication='<Publication Name>' ---> This would give the Tracer_ID

 

sp_helptracertokenhistory @publication='<Publication Name>',@tracer_id='<number>'

 

sp_deletetracertokenhistory @publication='<Publication Name>',@tracer_id='<number>' --> This is to clear the traced token

 

=> We can try this few times and confirm a low latency. A low latency would mean that replication is working fine.

 

 

Above steps would help identify that the replication itself is working fine but there is some problem in replication monitor when we see values

 

-----------------

 

The Replication Monitor runs the following stored procedure to show number of undistributed transactions.

 

sys.sp_replmonitorsubscriptionpendingcmds

 

If you look at the text of the stored procedure. We would see that it is basically only looking at the database MSDistribution_history

 

Here I have prepared a reduced form of this Stored procedure which would do the same thing that the replication monitor does (in terms of showing Undistributed transaction)

 

<Reduced Form of sys.sp_replmonitorsubscriptionpendingcmds>

declare @avg_rate int

declare @retcode int

DECLARE @mstestcountab TABLE ( pendingcmdcount int )

select @avg_rate = isnull(avg(<delivery_rate>),0.0)

        from dbo.MSdistribution_history

        where agent_id = 3

 

    insert into @mstestcountab (pendingcmdcount)

        exec @retcode = sys.sp_MSget_repl_commands

                                    @agent_id = 3

                                    ,@last_xact_seqno = <xact_seqno>

                                    ,@get_count = 2

                                    ,@compatibility_level = 9000000

 

select pendingcmdcount from @mstestcountab

 

    select

        pendingcmdcount 

        ,N'estimatedprocesstime' = case when (@avg_rate != 0.0)

                                then cast((cast(pendingcmdcount as float) / @avg_rate) as int)

                                else pendingcmdcount end

    from @mstestcountab

 

<Reduced Form of sys.sp_replmonitorsubscriptionpendingcmds>

 

=> The Delivery Rate and the XACT_Seqno in the above script is got from the MSDistribution_History table.

 

=> You can used the following to get that information

 

select xact_seqno, delivery_rate from dbo.MSdistribution_history

where agent_id = <give the agent ID number) and runstatus in (2,3,4)

 

=> When you feed that result to the script, you would see that the number returned is same that of the Replication Monitor.

=> So now we would know that the MSdistribution_history is not updated properly and the Xact_Seqno is a very old transaction and it should be updated with the latest replicated transaction.

 

=> The fact is that we have something called  sp_MSadd_distribution_history which is also executed by sp_MSagent_stethoscope which is in turn executed by sp_replication_agent_checkup.

=> So we should have  a job called Replication Agents Checkup.' which runs every 10 secs and this should be doing the job of updating the proper xact_seqno

 

=> In most cases where we see this issue, that Job is not running or is running but with the following message.

 

<Message>

The replication agent has not logged a progress message in 10 minutes. This might indicate an unresponsive agent or high system activity. Verify that records are being replicated to the destination and that connections to the Subscriber, Publisher, and Distributor are still active. [SQLSTATE 01000] (Message 20554).  The step succeeded.

<\Message>

 

Steps we could take from here to get correct values into the MSDistribution_History

 

a) Restart the Agents to see if that helps

b) Collect a detailed profiler trace when the "Replication Agents Checkup" job is running and we have to see if the SP_MSAdd_Distribution_history is run or not.

c) We can try and delete the rows manually. Then the appropriate job should insert  and then we see if its updated subsequently.

                <STEPS>

                1) Take a backup of the Distribution Database.

                2) Create a test table (temp_MSDistribution_history) with same schema as the original 

                    MSDistribution_history table.

                3) Select all rows from the MSDistribution_History table and insert into the newly created 

                    table.

                4) Manually delete the rows in the Distribution history for the specific agent ID that we figured 

                    out the last time.

                5) Run the Replication Agents Checkup and see if a new row for that agent ID is inserted

                    with the upto date  xact_seqno.

 

                Deleting rows from the MSDistribution history would not cause any problem on the replication.

                But to ensure proper safety, I suggest to backup the Distribution and if deleting the rows 

                does not help, then we can always re-insert the deleted rows from the 

                temp_MSDistribution_history to MSDistribution_history.

 

               <\STEPS>

 

Note: In most cases, a restart of the agents resolves the issue.

Regards,

Gourav Das

Support Engineer, Microsoft SQL Server.

 

Reviewed by,

Ouseph Devis T & Akbar Farishta

TL, Microsoft SQL Server support

Gaurav Mathur

SEE, Microsoft SQL Server support