Transactional Replication Conversations

When troubleshooting Transactional Replication performance break down the flow of data into 4 simultaneous conversation.  Examine performance of each conversation.  This will help verify where the bottleneck investigation should begin.

 

 

 

1)      LogReader Reader thread is reading the Transaction Log via stored procedure sp_replcmds, a wrapper for xp_replcmds.  It scans the transaction log for transactions marked for replication, skipping not replicated transactions.

 

 

2)      LogReader Writer thread writes queued transaction from the Reader thread into the Distribution database using sp_MSadd_replcmds.

 

3)      Distribution Reader thread executes sp_MSget_repl_commands query to retrieve pending commands from the Distribution database and storing in an internal queue.

 

4)      Distribution Writer thread writing queue commands to the Subscriber via parameterized stored procedures prefixed with sp_MSupd..., sp_MSins..., sp_MSdel... to apply individual row changes to each article at the subscriber.

 

5)      Log Reader and Distributor are also executing a “history” thread writing summary data to MSlogreader_history and MSdistribution_history system tables of the distribution database.

image

 

Tools to assist with troubleshooting performance

 

SQL Profiler: There are multiple methods to examine in detail the Transaction Replication conversations.   As with all application connecting to SQL Server, the individual commands being executed by Replication components can be captured using SQL Profiler tool.  The Profile trace can be sorted or queried to find long running statements. 

 

Examples of which ApplicationNames you will find:

·         REPLICATION MONITOR -- GUI from Management Studio

·         REPL-LOGREADER-#-DatabaseName-# -- Writer thread from Logreader Agent to log changes into distribution tables.

·         ServerName\Instance-DatabaseName-PubName-Subscriber -- Reader thread from Distribution Agent to find out which commands/transaction to replication to the Subscriber

·         REPLICATION LOGREAD HISTORY -- writer thread from Log Reader agent to log history

·         REPLICATION DISTRIBUTION HISTORY -- writer thread from Distribution agent to log history

 

Example of Replication Stored Procedure calls:

  • Distribution Activity:
    • SP_MSGET_REPL_COMMANDS
      SP_MSHELP_DISTRIBUTION_AGENTID
      SP_MSGET_SUBSCRIPTION_GUID
  • LogReader Activity:
    • SP_MSADD_REPLCMDS
      SP_MSADD_LOGREADER_HISTORY
      SP_REPLCMDS
      SP_MSPUB_ADJUST_IDENTITY
      EXEC SP_MSGET_LAST_TRANSACTION @PUBLISHER_ID = {##}, @PUBLISHER_DB = {STRN}, @FOR_TRUNCATE = {BS}
      SP_REPLDONE
      EXEC SP_REPLCOUNTERS {STRN}
  • SubScriber Activity:
    • SP_MSins%
    • SP_MSupd%
    • SP_MSdel%

 

Activity Monitor: SQL Management Studio Activity Monitor, Performance Dashboard for SQL 2005, Performance Data Warehouse for SQL 2008, or system stored procedures help find blocking, disk bottlenecks, and statements causing high IO and high CPU.  To help identity Replication components look for the LogReader or Distribution Agent Job Names as the Program_Name connecting to SQL Server.  To see a list of the Replication Jobs, run the query below

 

SELECT name, description, enabled from MSDB..sysjobs

WHERE category_id>10 and category_id<20

 

Tracer Tokens: To gain a quick high-level overview of Replication performance use the Tracer Token feature in Replication Monitor. The features provides “Publisher to Distributor” and “Distributor to Subscriber” latency values.  To use this feature, launch Replication Monitor. In the My Publishers list, drill in and click on the publication you wish to test. In the right pane, choose the Tracer Token tab and click the “Insert Tracer Token” button. There will be a list of subscriptions below. Monitor the “Publisher to Distributor” and “Distributor to Subscriber” values.

 

This method is only helpful in where times are somewhat real-time. If the topology is behind 3 days, then you will never see the tracer token make it to the subscriber in time, so the only thing you can see is if the Log Reader is keeping up or getting behind. That is still valuable information, but more trivial than if this is a problem where the subscriber is only 5 minutes behind and you can see the full flight of the token from start to finish.

 

If you’re seeing long time for “Publisher to Distributor” to populate, focus on the troubleshooting the LogReader performance.  If LogReader performance is fast while the “Distributor to Subscriber” values are high, transactions are moving quickly from the Published database transaction log to the Distribution database but slow from the Distribution database to the Subscribers.  The Distribution Agent should be the focus of the bottleneck investigation.

 

The following links provide sample code to measure latency over time.

 

 

Performance Monitor: You can also use Windows Performance Monitor's to track replication LogReader 'LogReader:Delivery Latency' counter or the replication Distribution 'Dist:Delivery Latency' counter. However, Windows Performance Monitor's Replication counter values are only updated after the Log Reader or Distribution Agent completes a processing phase and records performance stats.  The counter values will report 0 commands/sec and transactions/sec while a large transaction consisting of millions of rows is being processed.

Log Reader Reader-Thread Latency

 

General Troubleshooting:

Adding the “-OUTPUT” parameter to the Log Reader breaks down the time for both Reader (Fetch time), and the Writer (Write time)conversations. In the sample output below the Reader-Thread took 55 minutes while the Writer-Thread needed only 3 minutes to processes the transaction indicating Log Reader Reader-Thread is the bottleneck.

 

Replication Log Reader Agent

http://msdn.microsoft.com/en-us/library/ms146878.aspx

 

******************** STATISTICS SINCE AGENT STARTED ***************************
Execution time (ms): 3401930
Work time (ms): 3401470
Distribute Repl Cmds Time(ms): 3354520
Fetch time(ms): 3343920   <-- read time 55 minutes
Repldone time(ms): 5520
Write time(ms): 140070    <-- write time 3 minutes
   Num Trans: 8  Num Trans/Sec: 0.002352
   Num Cmds: 477219  Num Cmds/Sec: 140.297871
*******************************************************************************

 

Starting with SQL Server builds below the Log Reader Agent stats are displayed ever 5 minutes and written to the MSlogreader_history table in the Distribution database.

 

Cumulative Update 12 for SQL Server 2005 Service Pack 2

Cumulative Update 5 for SQL Server 2008

Cumulative Update 2 for SQL Server 2008 Service Pack 1

 

 

Possible Cause: large batch of replicated transaction

How large is the transaction Log for this published database?  Has it grown to an unexpected size causing longer read times by the Log Reader thread. 

 

DBCC SQLPERF(LOGSPACE)

 

The reader thread of the Log Reader executes sp_replcmds to pull commands from the transaction log.  The stored procedures below can be executed in SQL Server Management Studio to track time required to pull next pending replicated transaction.  How long do these command take to execute?  How many commands were returned?  Scanning a large transaction log for a single transaction containing a million+ row modifications  may take 20+ minutes to complete.  The execution time for this query approximates the startup time for the Log Reader to replicate this same transaction. 

 

Display 1st pending transaction metadata such as publication_id, article id, xactid, and command_type.  Is this command part of normal database operation or an unexpected large batch of commands?

 

sp_replcmds [ @maxtrans = ] maxtrans

http://msdn.microsoft.com/en-us/library/ms186983.aspx

 

example:

--Return all commands from the next pending transaction

sp_replcmds @maxtrans = 1

GO

 

--Return all commands for the next 500 transactions

sp_replcmds @maxtrans = 500

GO

sp_replflush

GO

 

Display 1st transactions metadata along with text of commands yet to replicate.  Commands type = 1073741871 are pending Tracer Tokens.

 

sp_replshowcmds [ @maxtrans = ] maxtrans

http://msdn.microsoft.com/en-us/library/ms175114.aspx

 

example:

--Return all commands from the next pending transaction with CMD text

sp_replshowcmds @maxtrans = 1

GO

 

--Return all commands for the next 500 transactions with CMD text

sp_replshowcmds @maxtrans = 500

GO

sp_replflush

GO

 

Another technique to see summary of a batch of pending transactions/commands is to query sys.dm_repl_traninfo view. This displays summary data on replicated transactions in the log last read by sp_replcmds or sp_replshowcmds, including the “cmds_in_tran.” A high value may indicate a large transaction is being replicated. 

 

sys.dm_repl_traninfo

http://msdn.microsoft.com/en-us/library/ms180036.aspx

 

sp_replcmds @maxtrans = 500  -- retrieve 500 pending transactions

GO

SELECT dbid, begin_lsn, commit_lsn, rows, cmds_in_tran

FROM sys.dm_repl_traninfo

GO

 

dbid   begin_lsn                        commit_lsn                       rows        cmds_in_tran

------ -------------------------------- -------------------------------- ----------- ------------

10     0000081A:0000013A:0001           0000081A:0000013A:0004           1           1

10     0000081A:00000140:0001           0000081A:00000140:0004           1           1

10     0000081A:00021385:0001           0000081E:00004EA2:0009           6874        6874

 

 

The following commands can be used to determine the number of non-replicated vs. replicated transactions.  A transaction log with a high percentage of non-replicated transaction will cause latency as the Log Reader scans over transaction to be ignored.  These can be database maintenance transactions such as an online reinidex or data modifications to table not being replicated.  A 25gig transaction log may take up to 15 minutes to scan depending on the IO subsystem.

 

Use <published database>

GO

-- Total records in the log

SELECT count(*) FROM ::fn_dblog(NULL, NULL)

GO

-- Records marked for REPLICATION

SELECT count(*) FROM ::fn_dblog(NULL, NULL) WHERE Description='REPLICATE'

GO

 

If you catch the large transaction in the act of being processed by sp_replcmds, you should also see the long run time of that event in SQL Profiler trace data. Using a captured Profile trace run command below to query the results:

 

SELECT duration, starttime, endtime, textdata

FROM ::fn_trace_gettable('C:\PUBLISHERDB_sp_trace.trc', DEFAULT)

WHERE TEXTDATA LIKE '%SP_REPLCMDS%' AND EVENTCLASS=10

 

        Duration    StartTime                  EndTime                     TextData
        388         2008-10-21 15:00:06.010    2008-10-21 15:00:06.010     exec sp_replcmds 500,0,-1,0x,5000,0,500000
        393         2008-10-21 15:00:11.010    2008-10-21 15:00:11.010     exec sp_replcmds 500,0,-1,0x,5000,0,500000
        397         2008-10-21 15:00:16.010    2008-10-21 15:00:16.010     exec sp_replcmds 500,0,-1,0x,5000,0,500000
        66302488    2008-10-21 15:00:21.010    2008-10-21 15:01:27.290     exec sp_replcmds 500,0,-1,0x,5000,0,500000
        413         2008-10-21 15:01:27.353    2008-10-21 15:01:27.353     exec sp_replcmds 500,0,-1,0x,66344,0,500000

 

Shown above, processing the large transaction causes sp_replcmds took 66 seconds to execute.

 

 

Resolution:

If Reader Latency is caused by large number of pending commands, waiting for the Log Reader to catch up may be the best short-term solution.  Long-term options include replicating batches during non-peak time or as stored procedure executions.

 

Publishing Stored Procedure Execution in Transactional Replication

http://msdn.microsoft.com/en-us/library/ms152754(SQL.90).aspx

 

Another option to consider is configuring the Log Reader to break up the one large transaction into multiple batches.  By committing large transaction in smaller batches we may be able to reduce the Log Reader overhead.  However, performance improvement would be most noticeable when Distributing data to subscribers in multiple parallel STREAMS.

 

Replication Log Reader Agent

http://msdn.microsoft.com/en-us/library/ms146878.aspx

 

Depending on estimated time to read pending transactions and transfer them to the subscribers, it may be faster to mark all transactions as “replicated”, then reinitialize the subscribers with a new snapshot or via backup/restore.  This is step should only be taken if the time to generated a new Replication Shapshot and deliver to subscriber is faster than waiting for individual pending commands to be replicated.

 

sp_repldone (Transact-SQL)

http://msdn.microsoft.com/en-us/library/ms173775.aspx

 

EXEC sp_repldone @xactid = NULL, @xact_segno = NULL, @numtrans = 0, @time = 0, @reset = 1

 

 

Possible Cause: large number of non-replication transactions:

This can be more challenging to identify. It helps to have a good knowledge of which articles are being replicated and what batch changes have been recently performed.  The BINARY column in the Profile Trace represents the number of rows processed by sp_replcmds.  Multiple executions with few rows usually indicates LogReader spent time skipping over non-replicated transactions such as large table performing an online reindex.

 

The output below from the LogReader log clearly shows many rows are being read while only a few are being replicated.

 

 

2008-11-11 23:41:35.527 The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately 500000 log records have been scanned in pass # 3, 142 of which were marked for replication, elapsed time 22535010 (ms).

 

Resolution:

This is unavoidable when large number of transaction for non-replicating table.  If problem is caused by index maintenance log records,  consider performing offline reindexing.  Another option may be stopping and starting the online reindexing to create smaller batch of index update transactions.

 

 

Possible Cause: High number of VLFs:

A large number of Virtual Log Files (VLFs) can contributed to long running read times for the LogReader.  To determine the number of VLFs, execute the following command and note the number of segments (rows) returned.  Segment counts in 100K+ may be contributing to Log Reader Reader-Thread performance problems.

 

USE <published DB>

GO

DBCC LOGINFO

 

Resolution:

Shrink the transaction log to move extra VLFs.  Set the database “growth” to value which allows for growing without creating high number of VLFs

 

907511 How to use the DBCC SHRINKFILE statement to shrink the transaction log file in SQL Server 2005

http://support.microsoft.com/default.aspx?scid=kb;EN-US;907511

 

 

Possible Cause: Slow Read IO:

Slow disk subsystem may be contribution to the long read times.  Good overall measure of disk read performance is Windows Performance Monitor counter Avg Disk Sec/Read.

 

Resolution:

Read times in >15ms (.0015) may indicate an IO read bottleneck and should be investigated by the storage subsystem vendor.

 

Possible Cause: Slow Network IO:

For remote Distributors using a PULL Log Reader, the logread.exe is executing on the Distributor but Pulling rows from the Publishers transaction log.  Unexpected network IO problem will contribute to the Reader-Thread latency. 

 

Resolution:

Using Windows Performance monitor both the Publication and Distribution servers, check the Network Interface 'Output Queue Length' counter for the active NIC - the counter should consistently be under 2.

 

For more information please see the following MSDN topic titled 'Network Interface Object.'

http://msdn.microsoft.com/en-us/library/ms803962.aspx

 

Log Reader Writer-Thread Latency

 

Possible Cause: Blocking: High values in the Log Reader Agent -OUTPUT log for “Write time(ms):” indicate bottleneck writing commands to the Distribution database.  Blocking may be caused by another SQL Replication Agent such as a the Distribution Cleanup Agent.  Investigate blocking by the Log Reader Writer-Thread using SQL Server built in tools such as SQL Server Management Studio Activity Monitor, Performance Dashboard, Performance Database Warehouse, or stored procedures like sp_who.

 

For SQL 2005 download and install the Performance Dashboard.  This provide a high-level look at the Distributor performance.  Blocking, long running queries, high IO wait times are displayed in a graphical format.  For SQL 2008, these reports are included in the Performance Data Warehouse.

 

SQL Server 2005 Performance Dashboard Reports

http://www.microsoft.com/downloads/details.aspx?familyid=1d3a4a0d-7e0c-4730-8204-e419218c1efc&displaylang=en

 

 

Resolution:

If the “head blocker” is the Distribution cleanup, consider stopping this Agent, allowing data to get replicated, then restart cleanup during off-peak hours.  Blocking may indicate IO taking longer than expected to complete.  See “High IO” for additional troubleshooting steps.

 

 

Possible Cause: High IO: Check the Profile trace collected on the Distributor for the Duration and CPU values for the executions of sp_MSadd_replcmds. 

A high IO may indicate poor query plan.  Using a captured Profile trace read the CPU and Duration for the completed events using the command below. 

 

SELECT duration, starttime, endtime, textdata

FROM ::fn_trace_gettable('C:\DISTRIBUTOR_sp_trace.trc', DEFAULT)

WHERE TEXTDATA LIKE '%SP_MSadd_replcmds%' AND EVENTCLASS=10

 

Durations of 10x over the CPU times indicate resource contention. Running SQL Server Management Studio reports on the Distributor look for blocking of the Log Reader agent.

 

With SQL 2005 and 2008 system DMVs can also be used to retrieve the duration and CPU values for the Log Reader Writer-Thread. Low Duration and CPU but high Logical Reads may indicate a poor query plan caused by out of data table statistics.  The command below will retrieve the query plan along with the execution statistics.

 

--dm_exec_query_stats for Log Reader Writer-Thread sp_MSadd_replcmds

-- by top total_worker_time

SELECT TOP 25

  st.text, qp.query_plan,

    (qs.total_logical_reads/qs.execution_count) as avg_logical_reads,

    (qs.total_logical_writes/qs.execution_count) as avg_logical_writes,

    (qs.total_physical_reads/qs.execution_count) as avg_phys_reads,

  qs.*

FROM sys.dm_exec_query_stats as qs

         CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as st

         CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) as qp

WHERE st.text like 'CREATE PROCEDURE sp_MSadd_replcmds%'

  ORDER BY qs.total_worker_time DESC

Go

 

 

Possible Cause: Slow WRITE IO:

Slow disk subsystem may be contribution to the long write times.  Good overall measure of disk read performance is Windows Performance Monitor counter Avg Disk Sec/Write.

 

Resolution:

Write times in >20ms (.0020) may indicate an IO write bottleneck and should be investigated by the storage subsystem vendor.

 

Possible Cause: Network IO

Similar to situation above, you may want to investigate Network interface queue length should be confirmed. 

 

Distribution Agent Reader Latency

Like the Log Reader, the Distribution Agent is 2 simultaneous conversation going.  Delay in moving rows from the Distributor to the Subscriber may occur in either or both of these conversation.  The Reader-Thread conversation is responsible for querying the Distribution database for the list of transactions to apply at the subscriber.  It first queries the subscriber metadata table to find the last successfully replicated transactions.  Next the Reader-Thread executes sp_MSget_repl_commands to begin reading rows from the Distribution database and populates this rows into an in-memory queue.  As the rows get queued, the Distribution Agent Writer-Thread conversation begins writing these rows to the Subscriber.  Like with the Log Reader, adding the Distribution Agent -OUPUT parameter will provide good overview of the Distribution Agent performance.  As noted above, recent builds of SQL Server now automatically records this information in the Distribution Agent history table MSdistribution_history.

 

In this output the READS completes in 92ms at a rate of 565 cmds/sec while the WRITES are taking 14,000ms( 14 seconds ) and only processing 3.69 cmds/sec.  This example clearly shows the WRITE are slower then reads.

 

 

************************ STATISTICS SINCE AGENT STARTED ***********************

Total Run Time (ms) : 18828  Total Work Time  : 14110

Total Num Trans     : 52  Num Trans/Sec    : 3.69

Total Num Cmds      : 52  Num Cmds/Sec     : 3.69

Writer Thread Stats

  Time Spent on Exec        : 12063

  Time Spent on Commits (ms): 422  Commits/Sec         : 0.14

  Time to Apply Cmds (ms)   : 14110  Cmds/Sec            : 3.69   <-- Write Thread

  Time Cmd Queue Empty (ms) : 671  Empty Q Waits > 10ms: 2

  Total Time Request Blk(ms): 671

Reader Thread Stats

  Calls to Retrieve Cmds    : 2

  Time to Retrieve Cmds (ms): 92  Cmds/Sec            : 565.22   <-- Reader Thread

  Time Cmd Queue Full (ms)  : 5486  Full Q Waits > 10ms : 3

*******************************************************************************

 

 

Possible Cause: Large Batch of Transactions

Latency problem are often caused when series of transactions are trying to move a large batch of commands to the Subscribers.  The queries below show overall rows counts and index statistics for commands stored in the Distribution database.

 

--ROW COUNTS

--Look for high row counts (>1millions) indicate cleanup not running

--    or large pending transactions.

SELECT name, rowcnt, STATS_DATE (id, indid) as 'Last Update Stats'

FROM distribution.dbo.sysindexes

WHERE name IN('ucMSrepl_transactions', 'ucMSrepl_commands')

 

Are the row counts expected or do they now contain millions of rows?  High rows counts (>1 million) may indicate a large transaction is being processes or cleanup procedure is not running.

 

When performance troubleshooting latency a review of pending commands by day by # commands may uncover helpful pattern.  A breakdown of the commands being stored in the Distribution database can be retrieve by running the following quries.

 

-- Check the Time associated with those Transaction Counts into temp table

select t.publisher_database_id, t.xact_seqno,

      max(t.entry_time) as EntryTime, count(c.xact_seqno) as CommandCount

into #results

FROM MSrepl_commands c with (nolock)

LEFT JOIN  msrepl_transactions t with (nolock)

      on t.publisher_database_id = c.publisher_database_id

      and t.xact_seqno = c.xact_seqno

GROUP BY t.publisher_database_id, t.xact_seqno

 

 

-- Show each hour and number of commands per Day:

SELECT publisher_database_id

      ,datepart(year, EntryTime) as Year

      ,datepart(month, EntryTime) as Month

      ,datepart(day, EntryTime) as Day

      ,datepart(hh, EntryTime) as Hour

      --,datepart(mi, EntryTime) as Minute

      ,sum(CommandCount) as CommandCountPerTimeUnit

FROM #results

GROUP BY publisher_database_id

      ,datepart(year, EntryTime)

      ,datepart(month, EntryTime)

      ,datepart(day, EntryTime)

      ,datepart(hh, EntryTime)

      --,datepart(mi, EntryTime)

--order by publisher_database_id, sum(CommandCount) Desc

ORDER BY publisher_database_id, Month, Day, Hour

 

In the sample output below, a large batch of transactions were being replicated as result of table updates causing slowdown in the Distribution Agent.

 

publisher_database_id Year        Month       Day         Hour        CommandCountPerTimeUnit

--------------------- ----------- ----------- ----------- ----------- -----------------------

2                     2009        5           14          10          132

2                     2009        5           14          11          656

2                     2009        5           14          12          880

2                     2009        5           14          13          4379

2                     2009        5           14          14          152

2                     2009        5           14          15          1478

2                     2009        5           14          20          161

2                     2009        5           14          21          145

2                     2009        5           15          6           1700

2                     2009        5           15          7           3672

2                     2009        5           15          8           6266

2                     2009        5           15          9           329

2                     2009        5           15          10          5678715

2                     2009        5           15          11          5637959

2                     2009        5           15          12          5281732

2                     2009        5           15          13          5020950

2                     2009        5           15          14          1252

2                     2009        5           16          11          732

2                     2009        5           16          12          178

2                     2009        5           16          13          725

2                     2009        5           16          14          186

2                     2009        5           16          16          72

 

 

Resolution:

Large batch of transactions require heavy IO requirements by the Distribution Agent Reader-Thread on the distribution database.  Fast disk subsystem with Transaction log and database on separate drives/LUNs may help improve IO performance. If this will be an ongoing pattern consider replicating the stored procedure EXECUTION instead of the RESULTS.

 

Publishing Stored Procedure Execution in Transactional Replication

http://msdn.microsoft.com/en-us/library/ms152754(SQL.90).aspx

 

 

Troubleshooting Reader Thread: Distribution Agent latency in the Reader Thread occurs when retrieving rows from the DistributionDB. Obtain Agent history or -OUTPUT and examine the Agent stats which are generated every 5 minutes.

 

************************ STATISTICS SINCE AGENT STARTED ***********************

03-22-2009 09:55:19

 

Total Run Time (ms) : 59189906       Total Work Time  : 492741

Total Num Trans     : 5     Num Trans/Sec    : 0.01

Total Num Cmds      : 5     Num Cmds/Sec     : 0.01

Total Idle Time     : 58660470

 

Writer Thread Stats

  Total Number of Retries   : 0

  Time Spent on Exec        : 0

  Time Spent on Commits (ms): 16     Commits/Sec         : 0.05

  Time to Apply Cmds (ms)   : 0      Cmds/Sec            : 0.00

  Time Cmd Queue Empty (ms) : 528717         Empty Q Waits > 10ms: 5

  Total Time Request Blk(ms): 59189187

  P2P Work Time (ms)        : 0      P2P Cmds Skipped    : 0

 

Reader Thread Stats

  Calls to Retrieve Cmds    : 11738

  Time to Retrieve Cmds (ms): 492741         Cmds/Sec            : 0.01

  Time Cmd Queue Full (ms)  : 0      Full Q Waits > 10ms : 0

*******************************************************************************

 

This -OUTPUT taken from the Distribution Agent log shows 100% of the duration was spent read rows from the Distribution database before the Agent was stopped.

 

The Agent Log also records time when events completed.  In the example below, 4 minutes elapsed between time the Distribution Agent queries the Subscriber for the starting transaction to replicate and time it retrieved a batch of transactions from the Distribution database. The latency is caused by Reader Threads are waiting for CMDs to be returned from the Distribution DB. 

 

 

--Example from Distribution Agent -OUTPUT showing delay in READER thread.

--Read for Subscription metadata completes

2009-05-16 01:26:49.229 OLE DB Distributor 'SQL380': {call sp_MSget_subscription_guid(11)}

--Four minutes later read of replication cmds completes.

2009-05-16 01:29:31.401 sp_MSget_repl_commands timestamp  value is: 0x0x0000081a00000045000400000000

--Reader threads are waiting for CMDs to be returned from the Distribution DB. 

 

To test the READER execute times, run sp_MSget_repl_commands from Query Window, look for Blocking, table scans, timeouts, etc.

 

1. From the -OUTPUT get the LSN for the long runing sp_MSget_repl_commands

 

         2009-05-16 01:29:31.401 sp_MSget_repl_commands timestamp  value is: 0x0x0000081a00000045000400000000

 

2. Get the Agent_ID from the same log

 

3. Execute sp_MSget_repl_commands query against Distribution DB

 

     sp_MSget_repl_commands @agent_id=11, @last_xact_seqno=0x0000081a00000045000400000000

 

 

How long did this command take to execute?  Was it blocked?  Were they are large number of CMDs to be returned?

 

You can also retrieve the last successful transaction from the subscriber’s database using query below. The transaction_timestamp value contains the Log Sequence Number (LSN) used for the sp_MSget_repl_commands stored procedure.

 

USE <subscriber db>

--Retrieve last successful Transaction

--    Multiple rows per publication indicate parallel distribution streams

SELECT publisher,publication, distribution_agent,transaction_timestamp

FROM dbo.MSreplication_subscriptions

 

 

For even more details, examine the execution plan using query below.

 

USE distribution

GO

 

DBCC FREEPROCCACHE

DBCC DROPCLEANBUFFERS

 

SET STATISTICS PROFILE ON

SET STATISTICS IO ON

SET STATISTICS TIME ON

 

sp_MSget_repl_commands @agent_id=11, @last_xact_seqno=0x0000081a00000045000400000000

 

SET STATISTICS PROFILE OFF

SET STATISTICS IO OFF

SET STATISTICS TIME OFF

 

In SQL 2005/2008 we can pull the CACHED execution plans from the DMVs.

 

--dm_exec_query_stats for sp_MSget_repl_commands

-- by top total_worker_time

SELECT TOP 25

  st.text, qp.query_plan,

    (qs.total_logical_reads/qs.execution_count) as avg_logical_reads,

    (qs.total_logical_writes/qs.execution_count) as avg_logical_writes,

    (qs.total_physical_reads/qs.execution_count) as avg_phys_reads,

  qs.*

FROM sys.dm_exec_query_stats as qs

         CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as st

         CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) as qp

WHERE st.text like 'CREATE PROCEDURE sys.sp_MSget_repl_commands%'

  ORDER BY qs.total_worker_time DESC

Go

 

Possible Cause: Table Statistics

Long running Reader-Thread query when transactions containing many small batches may also indicate Replication system tables need to have statistics updates.  Table statistics can become out-of-date if AutoUpdateStatistics has been disable for the Distribution database. 

 

--Updated Replication table statistics 'Updated' date

DBCC SHOW_STATISTICS(MSrepl_commands,ucMSrepl_commands)

DBCC SHOW_STATISTICS(MSrepl_transactions,ucMSrepl_transactions)

 

--Retrieve Distribution database settings

--    Look for IsAutoUpdateStatistics

sp_helpdb distribution

 

--or (returns 1)

select DATABASEPROPERTYEX('distribution','IsAutoUpdateStatistics')

 

 

Resolution:

 

--Updated Replication table statistics

UPDATE STATISTICS MSrepl_transactions

   WITH FULLSCAN

 

UPDATE STATISTICS MSrepl_commands

   WITH FULLSCAN

 

Resolution:

Using built-in SQL tools, verify Distribution Agent Reader-Thread is not being blocked at the Distribuiton database.

 

Distribution Agent Writer Latency

The Writer-Thread calls Replication created stored procedures to apply changes to the subscriber.  These sp’s rely on a user table’s unique index (usually primary key) to locate records to be updated or deleted.  Latency in the Writer thread is usually see in long-running execution time for these stored procedures.

 

sp_MSIns_articlename

sp_MSUpd_articlename

sp_MSDel_articlename

 

 

Possible Cause: Long execution times in the INS/UPD/DEL stores procedures used to apply transactions to the subscriber.

 

In the sample output below, the Distribution Agent latency is occurring in the WRITER thread of the Distribution Agent Latency

 

--dm_exec_query_stats for sp_MSupd (Subscriber Updates)

-- by top total_worker_time

SELECT TOP 250

  st.text, qp.query_plan,

    (qs.total_logical_reads/qs.execution_count) as avg_logical_reads,

    (qs.total_logical_writes/qs.execution_count) as avg_logical_writes,

    (qs.total_physical_reads/qs.execution_count) as avg_phys_reads,

  qs.*

FROM sys.dm_exec_query_stats as qs

         CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) as st

         CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) as qp

WHERE st.text like '%sp_MSupd_dboemployee%'  -- sp_MSupd_dbo<table name>%'

  ORDER BY qs.total_worker_time DESC

 

Go

 

If the stored procedures have already been flushed from cache or SQL Server has been restarted, a Profile trace on the Subscriber taken during the problem will provide the summary execution data.  Once loaded by SQL Nexus or fn_ReadTrace the stored procedure below will summaries the stored procedures called the Distribution Agent.

 

-- measure ALL spids for performance

select hashid, count(*) as Num, sum(duration) as sumdur, sum(cpu) as sumcpu, sum(reads) as sumreads, sum(writes) as sumwrites, avg(duration) as avgdur, avg(cpu) as avgcpu, avg(reads) as avgreads, avg(writes) as avgwrites

INTO ReadTrace.topCPUBatch

FROM ReadTrace.tblBatches

group by hashid

order by sumcpu desc

GO

 

-- Get the text for the top CPU batches

select b.*,a.NormText

FROM ReadTrace.tblUniqueBatches a

JOIN ReadTrace.topCPUBatch b

ON a.hashid=b.hashid

where NormText like '%SP_MSGET_REPL_COMMANDS%'

--OR NormText like '%SP_MSHELP_DISTRIBUTION_AGENTID%'

--OR NormText like '%SP_MSGET_SUBSCRIPTION_GUID%'

OR NormText like '%SP_MSins%'

OR NormText like '%SP_MSupd%'

OR NormText like '%SP_MSdel%'

order by sumcpu desc

 

Example Output


hashid               Num sumdur  sumcpu      sumreads    sumwrites   avgdur  avgcpu avgreads    avgwrites   NormText
-------------------- --- ------- ----------- ----------- ----------- ------- ------ ----------- ----------- -------------------------------
-5684440255897011116 69  966058  389         4071        0           14000   5      59          0           SP_MSGET_REPL_COMMANDS
-2908468862153496396 69  1466258 204         5175        0           21250   2      75          0           SP_MSHELP_DISTRIBUTION_AGENTID
-7220247869142477571 69  539588  31          1152        0           7820    0      16          0           SP_MSGET_SUBSCRIPTION_GUID

 

 

 

Possible Cause:

In the sample output below, the Distribution Agent latency is occurring in the WRITER thread of the Distribution Agent Latency

 

************************ STATISTICS SINCE AGENT STARTED ***********************

Total Run Time (ms) : 18828  Total Work Time  : 14110

Total Num Trans     : 52  Num Trans/Sec    : 3.69

Total Num Cmds      : 52  Num Cmds/Sec     : 3.69

Writer Thread Stats

  Time Spent on Exec        : 12063

  Time Spent on Commits (ms): 422  Commits/Sec         : 0.14

  Time to Apply Cmds (ms)   : 14110  Cmds/Sec            : 3.69

  Time Cmd Queue Empty (ms) : 671  Empty Q Waits > 10ms: 2

  Total Time Request Blk(ms): 671

Reader Thread Stats

  Calls to Retrieve Cmds    : 2

  Time to Retrieve Cmds (ms): 92  Cmds/Sec            : 565.22

  Time Cmd Queue Full (ms)  : 5486  Full Q Waits > 10ms : 3

*******************************************************************************

 

In this output the READS completes in 92ms at a rate of 565 cmds/sec while the WRITES are taking 14,000ms( 14 seconds ) and only processing 3.69 cmds/sec.  This example clearly shows the WRITE are slower then reads.

 

Possible Cause: User Defined Triggers: High values in the Log Reader Agent -OUTPUT log for “Write time(ms):” indicate bottleneck writing commands to the Distribution database.  Once possible cause are user defined triggers adding theFirst investigate blocking caused by other Replication Agent. 

 

 

 

Possible Cause: Not Replicating SQL statements as “Parameters” 

 

Obtain the create publication script and check ALL articles to see if the SQL statements are being sent as batch of 'parameters' statements.  The status = 8 indicates 'parameters' batching is not enabled.

 

    exec sp_addarticle @publication = ... @status = 24

or

  select name, status from sysarticles

 

Article status can be updated on-demand using the following Replication stored procedure.  This should be executed on all articles.

 

EXEC sp_changearticle @publication = N'<pub name>', @article = N'<article name>',

                      @property = 'status',  @value = 'parameters'

 

Capture Profile RPC:COMPLETE events on the Subscriber.  Look for the time difference between StartTime and EndTime and/or Duration.  Statements should be completing around 100ms.  Longer execution times for single row update/inserts may indicate blocking on the subscriber or user-defined triggers on the subscriber tables contributing to the high durations.

 

Look at the time difference between when on RPC:COMPLETES and the next PRC:STARTING.  This is the "round trip" time back to the Distribution Agent to say that command completed and next commands is received.  Times difference larger then 500ms may indicate network latency. 

 

 

Also examine time from when READER thread finished reading a batch of commands until these commands were applied at the Subscriber. Long times between these 2 events also indicate long WRITER thread. 

 

--reader thread completed -OUTPUT entry

2009-05-14 01:07:50.795 sp_MSget_repl_commands timestamp value is: 0x0x0000081a00000033000400000000

--writeer thread commit -OUTPUT entry

2009-05-14 01:13:50.011 Parameterized values for above command(s): {{N'ZZZZZZZZ', 1}, {N'XXXXXXXX', 2}, {N'YYYYYYYY', 3}}

 

When troubleshooting long Writer Thread look focus on the time to execute the UPDATE/INSERT/DELETEs on the Subscriber.  Execution times can be found in either PSSDIAG/Profiler trace or from execution DMVs.

 

         Execution time (ms): 3401930
               Work time (ms): 3401470
Distribute Repl Cmds Time(ms): 3354520
               Fetch time(ms): 3343920
Repldone time(ms): 5520
               Write time(ms):  140070
   Num Trans: 8  Num Trans/Sec: 0.002352
   Num Cmds: 477219  Num Cmds/Sec: 140.297871