Connectivity troubleshooting in SQL Server 2008 with the Connectivity Ring Buffer

Connectivity troubleshooting in SQL Server 2008 with the Connectivity Ring Buffer

Rate This
  • Comments 6

SQL Server 2008 contains a new feature designed to help troubleshoot particularly tricky connectivity problems. This new feature, the Connectivity Ring Buffer, captures a record for each server-initiated connection closure, including each kill of a session or Login Failed event. To facilitate efficient troubleshooting, the ring buffer attempts to provide information to correlate between the client failure and the server’s closing action. Up to 1k records in the ring buffer are persisted for as long as the server is online, and after 1000 records, the buffer wraps around and begins replacing the oldest records. The Connectivity Ring Buffer’s records are exposed by a DMV which can be queried:

SELECT CAST(record AS XML) FROM sys.dm_os_ring_buffers
WHERE ring_buffer_type = 'RING_BUFFER_CONNECTIVITY'

This will select all records as XML types; in Management Studio, you can click the records to get a more readable version. If you want to use SQL queries on the XML records to locate your particular problem, you can also use SQL Server’s XML support to turn this into a temp table and query against the records.

A basic ring buffer entry: killed SPID

An easy way to cause a server-initiated connection closure is to open two connections to a SQL Server, find the SPID of one of the connections, and then kill that spid from the other connection:

C:\>osql -E
1> SELECT @@spid
2> go
 ------
     51
(1 row affected)

C:\>osql -E
1> kill 51
2> go
1>

If you try this and then query the ring buffer, you will get a result which will look like this:

<Record id="2" type="RING_BUFFER_CONNECTIVITY" time="110448275">

  <ConnectivityTraceRecord>

    <RecordType>ConnectionClose</RecordType>

    <RecordSource>Tds</RecordSource>

    <Spid>55</Spid>

    <SniConnectionId>B7882F3C-3BA9-45A7-8D23-3C5C05F9BDF9</SniConnectionId>

    <SniProvider>4</SniProvider>

    <RemoteHost>&lt;local machine&gt;</RemoteHost>

    <RemotePort>0</RemotePort>

    <LocalHost />

    <LocalPort>0</LocalPort>

    <RecordTime>5/6/2008 22:47:35.880</RecordTime>

    <TdsBuffersInformation>

      <TdsInputBufferError>0</TdsInputBufferError>

      <TdsOutputBufferError>0</TdsOutputBufferError>

      <TdsInputBufferBytes>60</TdsInputBufferBytes>

    </TdsBuffersInformation>

    <TdsDisconnectFlags>

      <PhysicalConnectionIsKilled>0</PhysicalConnectionIsKilled>

      <DisconnectDueToReadError>0</DisconnectDueToReadError>

      <NetworkErrorFoundInInputStream>0</NetworkErrorFoundInInputStream>

      <ErrorFoundBeforeLogin>0</ErrorFoundBeforeLogin>

      <SessionIsKilled>1</SessionIsKilled>

      <NormalDisconnect>0</NormalDisconnect>

      <NormalLogout>0</NormalLogout>

    </TdsDisconnectFlags>

  </ConnectivityTraceRecord>

  <Stack>

    <frame id="0">0X01CA0B00</frame>

    <frame id="1">0X01CA0DB1</frame>

    <frame id="2">0X01DF6162</frame>

    <frame id="3">0X02E53C98</frame>

    <frame id="4">0X02E54845</frame>

    <frame id="5">0X02E57BE9</frame>

    <frame id="6">0X02E38F57</frame>

    <frame id="7">0X02E3B2C0</frame>

    <frame id="8">0X02E3C832</frame>

    <frame id="9">0X02E3D55E</frame>

    <frame id="10">0X781329BB</frame>

    <frame id="11">0X78132A47</frame>

  </Stack>

</Record>

Different record types contain some different information. The three record types exposed by the Connectivity Ring Buffer are: ConnectionClose, Error, and LoginTimers. This record type is a ConnectionClose since it is not a connection close during Login, either caused by timeout or some other login failure scenario:
    <RecordType>ConnectionClose</RecordType>

We can see that it was the connection with SPID 55 that was closed:
    <Spid>55</Spid>
We can see that the connection was local (the text in this tag is an XML-encoded string reading “<local machine>”, which indicates a local, shared memory connection):
    <RemoteHost>&lt;local machine&gt;</RemoteHost>
More useful correlation information is available when the TCP protocol is used for connecting to SQL Server – then the local IP address and port, and remote IP address and port will be included in the record, allowing you to uniquely identify the client machine and application to which the record applies. Additionally, the ring buffer contains the timestamp and whenever possible the SPID of the client process to make the correlation seamless (since, as you probably know, SPIDs may be reused over time by different connections).

 

We can also see exactly how many bytes were in the TDS packet sent by the client, and whether there was any error found in the TDS:
      <TdsInputBufferError>0</TdsInputBufferError>
      <
TdsOutputBufferError>0</TdsOutputBufferError>
      <
TdsInputBufferBytes>60</TdsInputBufferBytes>
The most relevant, easy-to-analyze piece of information for a ConnectionClose record is in the TdsDisconnectFlags, where there are a number of values telling the state of the closed connection. Here we see that no error was found, but that it was also not a normal disconnect or a normal logout for the session. What we do see is a flag indicating that the session was killed:
      <SessionIsKilled>1</SessionIsKilled>

A more interesting example: DC connectivity troubles

Tracking killed SPIDs is fine, but the goal is to help out on tricky problems, before resorting to network traces. Here’s an example of a Login Timers record from the Connectivity Ring Buffer, for a scenario that has up until now been very hard to pinpoint without the costly process of reproducing the problem and taking and analyzing network captures:

<Record id="3" type="RING_BUFFER_CONNECTIVITY" time="112254962">

  <ConnectivityTraceRecord>

    <RecordType>LoginTimers</RecordType>

    <Spid>0</Spid>

    <SniConnectionId>B401B045-3C82-4AAC-A459-DB0520925431</SniConnectionId>

    <SniConsumerError>17830</SniConsumerError>

    <SniProvider>4</SniProvider>

    <State>102</State>

    <RemoteHost>&lt;local machine&gt;</RemoteHost>

    <RemotePort>0</RemotePort>

    <LocalHost />

    <LocalPort>0</LocalPort>

    <RecordTime>5/6/2008 23:17:42.556</RecordTime>

    <TdsBuffersInformation>

      <TdsInputBufferError>0</TdsInputBufferError>

      <TdsOutputBufferError>232</TdsOutputBufferError>

      <TdsInputBufferBytes>198</TdsInputBufferBytes>

    </TdsBuffersInformation>

    <LoginTimers>

      <TotalLoginTimeInMilliseconds>21837</TotalLoginTimeInMilliseconds>

      <LoginTaskEnqueuedInMilliseconds>0</LoginTaskEnqueuedInMilliseconds>

      <NetworkWritesInMilliseconds>2</NetworkWritesInMilliseconds>

      <NetworkReadsInMilliseconds>77</NetworkReadsInMilliseconds>

      <SslProcessingInMilliseconds>3</SslProcessingInMilliseconds>

      <SspiProcessingInMilliseconds>21756</SspiProcessingInMilliseconds>

      <LoginTriggerAndResourceGovernorProcessingInMilliseconds>0</LoginTriggerAndResourceGovernorProcessingInMilliseconds>

    </LoginTimers>

  </ConnectivityTraceRecord>

  <Stack>

    <frame id="0">0X01CA0B00</frame>

   

    <frame id="15">0X02E3C832</frame>

  </Stack>

</Record>

For this situation, on the client we see:
[SQL Server Native Client 10.0]Shared Memory Provider: Timeout error [258].
[SQL Server Native Client 10.0]Login timeout expired
[SQL Server Native Client 10.0]Unable to complete login process due to delay in login response

Getting the OS error message included doesn’t tell us anything:
C:\>net helpmsg 258
The wait operation timed out.

And in the server errorlogs, there is nothing at all. The record from the ring buffer, however, is very interesting. LoginTimers records show the overall processing time:
      <TotalLoginTimeInMilliseconds>21837</TotalLoginTimeInMilliseconds>

This time is then broken into components based on what work was happening at the time (the numbers won’t always add up because of rounding; here they’re off by 1). The interesting component, of course, is the one or ones that took a long time. In this case, the SspiProcessingInMilliseconds looks interesting, at almost 22 seconds:
      <SspiProcessingInMilliseconds>21756</SspiProcessingInMilliseconds>

SSPI, the Security Support Provider Interface, is the interface used by SQL Server for Windows Authentication. When the Windows login used is a domain account, SQL Server uses SSPI to interact with the Domain Controller to authenticate the user. The indication in this record of a very long time spent in SSPI processing indicates that there are latency issues talking to the Domain Controller, likely caused by some faulty network hardware between the SQL Server and DC, or perhaps software problems with the DC. Without even taking network traces, and without having to reproduce the problem (since the Connectivity Ring Buffer is turned on by default), the problem is narrowed down to SQL Server’s interaction with the Domain Controller.

Trace Flags

The Connectivity Ring Buffer is on by default, and by default traces all server-initiated connection closures. If you experience an error on the client but see no entry in the ring buffer, this means that the server has seen a connection closure (reset) similar to normal connection closure from the client, or an erratic one coming from outside the server; for instance, from a malfunctioning piece of networking hardware. If this is the case, you should focus on potential networking issues. If you see an entry in the ring buffer, then it will contain information about why the server decided to close this connection, and it will hopefully be enough to facilitate further troubleshooting. For example, if you see that a connection is closed due to invalid information in the TDS packet, you can investigate for faulty network cards, routers and switches which might have corrupted the data. As you will see below, by using a trace flag you can enable logging of all connection closures and thus observe the exact client-initiated closure conditions and potential errors.

There are two trace flags to change the behavior of the Connectivity Ring Buffer. 

To completely disable the Connectivity Ring Buffer, globally enable trace flag 7826:

DBCC TRACEON (7826, -1)

Client-initiated connection closures are not traced by default because they are a normal usage pattern, rather than always an error case; when a client is done with its session, it disconnects. It is recommended not to trace client-initiated connection closures since they may cause you to miss a real error case when it occurs, either by eventually writing over the error record when the buffer rolls around (which will happen faster if you have many correctly-behaving clients opening and closing connections), or just by hiding the error in a long list of normal connection closures. However, the option to trace them, in addition to the above events, is available by globally enabling trace flag 7827:

DBCC TRACEON (7827, -1)

What are all those <frame> tags?

The sys.dm_os_ring_buffers DMV gives access to a variety of internal information, beyond just the Connectivity Ring Buffer, and as part of the infrastructure of this DMV, most of these ring buffers provide the stack trace where the event occurred. The </Stack> tag indicates the beginning of a stack trace, and each <frame> gives the hexadecimal address of a function. These can be resolved into function names by taking a dump of the sqlservr.exe process, opening the dump in Windbg, and using the lm command on the function address.

Dan Benediktson & Jivko Dobrev
SQL Server Protocols

Disclaimer: This posting is provided "AS IS" with no warranties, and confers no rights

Leave a Comment
  • Please add 5 and 5 and type the answer here:
  • Post
  • What are the debug options for 2005?

    I have exactly that 258 error, and need to find out the root cause.

  • Sent: Monday, March 09, 2009 6:20 PM Subject: Ring Buffers in SQL2008 &#160; Hi, do you have any documentation/white

  • Q:对于复杂的连接问题,我如何获得更多的信息以用于纠错?

    A:

    SQLServer2008包含了一个用于帮助特别困难的连接问题纠错的新特性-连接环缓冲区(ConnectivityRing...

  • Great entry, thanks a lot for putting your time into this!

  • We are seeing lots of cases where most of the delay is due to LoginTriggerAndResourceGovernorProcessingInMilliseconds. Can you direct us to where we can find more info on what could cause this? We are using SQL 2008 R2 on Win 2008. Thank You

  • Hi,

    I am trying to troubleshoot some intermittent connectivity errors to my Sql server 2008 R2 RTM version and used the sys.dm_os_ring_buffers for this task. However, in the output the record time displays wrong time even though the client and server machines have same system time and are in the same domain. It is 21st June 8:30 am EST here but the record time shows the following values:

    2012-06-21 10:21:59.110

    2012-06-21 10:21:59.110

    This is weird. My client uses a legacy odbc connection (System DSN). Any ideas why the time recorded is wrong? Is my SQL machine receiving wrong packets?

    Please suggest some solution.

    Thanks

    Chandan

Page 1 of 1 (6 items)