CSS SQL Server Engineers

This is the official team Web Log for Microsoft Customer Service and Support (CSS) SQL Support. Posts are provided by the CSS SQL Escalation Services

How It Works: Failover Cluster/Availability Group XEL Logging Frequency

How It Works: Failover Cluster/Availability Group XEL Logging Frequency

  • Comments 2

I had a great question from my post on sp_server_diagnostics (http://blogs.msdn.com/b/psssql/archive/2012/03/08/sql-server-2012-true-black-box-recorder.aspx).  As 'luck' would have it, Bob Ward, was working on some training and had a similar question.  As I investigated I uncovered a few details that can help us all.

Connection Scope

As soon as the Failover Cluster Instance (FCI) or Availability Group is signaled by the cluster manager to come 'ONLINE', a connection is established to SQL Server.  The connection is persistent, just reading the next results set, one after the other, from the sp_server_diagnostics output stream.

Result Set Interval Response

The sp_server_diagnostics interval parameter is controlled by the HealthCheckTimeout value.

image

Formula:  Interval = HealthCheckTimeout / 3;                           //  1/3 of the Health check timeout for the result set response interval

XEL Logging Interval

Here is what confused me the first time I looked at the .\LOG*SQLDIAG*.xel trace for the component_health_result events.

If I establish a performance monitor trace, capturing every 10 seconds, and found a gap of 100 seconds I would troubleshoot the issue as an overall system, responsiveness problem.   This is not the case for gaps in this XEL data.

The messages don't always show up at the 1/3 interval boundaries.   Take the example shown above, the results are returning on 10 second intervals.  Failure to return the result set, on the intervals, could trigger the Server Hang failover level and you may see additional INFO messages logged.

The logic is such that it tries to conserve space in the XEL file.  For example, if the state is CLEAN and the last state was CLEAN the logging may choose to skip output of the duplicate event data.   In the case of our example, the specific code I am studying today could skip the output ~10 times (100 sec) before recording the CLEAN state in the XEL file again.  All it means is the state didn't change, it was CLEAN the entire time.   If any state changes the events are always output.

Don't make the assumption that a gap in the events directly indicates a level of system instability.

Changing Interval

When you change the properties the resource dll (hadrres.dll) establishes a second connection and starts executing sp_server_diagnostics as the new interval.   Once the connection has been successfully established and the query started the previous connection is disconnected.  This insures we never have a monitoring gap.  

If the second connection can't be established the original connection continues monitoring with the prior settings.  This makes sure your server is always being monitored for failover conditions.

Bob Dorr - Principal SQL Server Escalation Engineer

 

 

Leave a Comment
  • Please add 8 and 3 and type the answer here:
  • Post
  • I cant find any BOL reference to the LeaseTimeout property. please explain what is that and where to find the info.

  • I will try to put a blog post for LeaseTimeout.

    LeaseTimeout is a named set of events used between the resource DLL and SQL Server to maintain a lease.  The events are signaled, in a handshake exchange, 1/3 of the lease timeout setting.

    If the handshake does not occur before the lease timeout the lease is expired and the resource dll no longer considers the state 'online' and will initiate corrective actions.

    For example, if the SQL Server instance detects the lease has not been updated by the cluster manager side of the handshake it will disable COMMIT capabilities to prevent the possibility of split brain scenarios.

Page 1 of 1 (2 items)