In this blog I will discuss a connection timeout issue that one of my customers was having after failing over the mirror database. For better understanding of the issue let me discuss some key terms/concepts of database mirroring before I get into the issue description. Database mirroring is first introduced in SQL Server 2005 for increasing database availability. It is software solution and does not require any hardware like clustering. Database mirroring involves redoing every insert, update, and delete operation that occurs on the principal database onto the mirror database as quickly as possible. Redoing is accomplished by sending a stream of active transaction log records to the mirror server, which applies log records to the mirror database, in sequence, as quickly as possible.

If a SQL server database is configured with a mirrored database and the principal database becomes unavailable for some reason then we can manually or automatically failover to the mirrored database.

To support automatic failover, a database mirroring session must be configured in high-safety mode and also possess a third server instance, known as the witness. To know more about how SQL server mirroring works please read the following MS links.

Database Mirroring in SQL Server 2005

http://technet.microsoft.com/en-us/library/cc917680.aspx

How to: Configure a Database Mirroring Session (SQL Server Management Studio)

http://msdn.microsoft.com/en-us/library/ms188712.aspx

Now let’s get into the detail of the issue I want to discuss today. Customer was observing timeout error from the application after shutting down the principal server. They had setup the database mirroring as “high safety with automatic failover mirror” using a witness and were also specifying the Failover Partner and a connection timeout of 10 seconds in the connection string. During testing, when they turned off the SQL Server service on the principal server the auto failover worked like a charm, but when they took the principal server offline (by shutting down the server or killing the network card) failover happened as expected (i.e. the mirror becomes the new principal) but the application that was trying to connect to the SQL server mirrored database was getting a timeout error as below.

Server Error in '/' Application.

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The wait operation timed out.)

Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.

Exception Details: System.Data.SqlClient.SqlException: A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - The wait operation timed out.)

So once again the problem is if they stop only the SQL server service mirroring works as expected but when they shut down the SQL Server (principal) the connections were timing out and the application was not able to connect to the mirrored database as expected.

Without digging into this further we requested the customer to increase the connection time out to 60 seconds and test. Customer increased the connections time and now even after shutting down the SQL server the application was able to connection to the mirrored database successfully.

So we were able to find a work around for the issue pretty quick but we still needed to explain what was causing the issue and why increasing the connection timeout fixed it. To find out the reason behind these different behaviors we captured network traces in all three scenarios.

Scenario 1: We did a failover from the principle to the mirror by only stopping the SQL server service. In this scenario the application was able to connect to the mirror database successfully.

1. To initiate a connection the client sends SYN  to the principal and the principle server sends ACK, RST right away

clip_image002

2. As the client gets the response from the principle almost immediately it tries to connect to the mirror and the connection becomes successful.

clip_image004

Scenario 2: We did a failover by shutting down the principal server and the connection timeout was set to 10 sec. In this scenario the application was failing to connect to the SQL server and displaying a timeout error.

1. To initiate a connection client sends SYN to the principal and it gets nothing back from the principle as the principle is not up.

2. So the client retransmits the SYN to the principle based on the value in TcpMaxConnectRetransmissions registry key. TcpMaxConnectRetransmissions specifies how many times TCP retransmits an unanswered request for a new connection. TCP retransmits new connection requests until they are answered or until this value expires. The default value of TcpMaxConnectRetransmissions is 2.

TCP/IP adjusts the frequency of retransmissions over time. The delay between the original transmission and the first retransmission for each interface is determined by the value of the TcpInitialRTT entry. By default, it is three seconds. This delay doubles after each attempt. After the final attempt, TCP/IP waits for an interval equal to double the last delay, and then it abandons the connection request.

Please keep in mind that TcpMaxConnectRetransmissions is different from TcpMaxDataRetransmissions which specifies how many times TCP retransmits an unacknowledged data segment on an existing connection.

In the network trace we see the application server retransmitted SYN after 3 seconds. TcpMaxConnectRetransmissions was set to two so the next retransmit should be after 6 more seconds and 3+6= 9 sec from the beginning of the connection and after that we should wait another 6*2=12 seconds before trying to connect to the mirror database. Customer set the connection timeout as 10 seconds. For some reason I did not see the 2nd retransmit packet in the trace but regardless the connection was timed out. We did not see any attempt in the trace to make a connation to the mirror database.

clip_image006

Scenario 3: We did a failover by shutting down the principal database but the connection timeout in the connection string was increased from 10 seconds to 60 seconds. In this scenario the application was able to connect to the mirror database successfully.

1. To initiate a connection client sends SYN to principal and it gets nothing back from the principle as the principle is not up.

2. So client retransmits the SYN to the principle after 3 seconds. The default value for TcpMaxConnectRetransmissions is set to 2. It should retransmits the SYN after 6 more seconds. Then it should wait for another 6*2=12 sec and then try to connect to the mirror database.

3. For some reason we are not seeing the 2nd retransmits the SYN from the app server to the principal in the trace. However, we as expected exactly after 6+12=18 sec the app server sent SYN to the mirror server to initiate the connection. It got response right away and the connection was successful. From the beginning it took total 3+6+12=21 secs to attempt to connect to the mirror database which is less than the new connection timeout value 60 seconds and this why we did not see any connection timeout this time.

clip_image008

To recap in the first scenario we did not see a timeout because even though the SQL server service was stopped the server was up and sent a RESET to the client right away and the application server was able to establish a connection with the mirror database before connection timeout 10 seconds. In the second scenario we were getting the timeout because the principal server was shutdown and therefore the application server was not getting any response at all from the SQL server and had to do 2 retransmits based on the default TcpMaxConnectRetransmissions value before it could try to connect to the mirror database. The connection timeout value was too and small (10 sec) and it timed out while doing those retransmits to the principal server and never had a chance to try to connect to the mirror database.

In the 3rd scenario as we increased the connection timeout the application server was able to try to connect to the mirror database after done retransmitting the SYN packets to the principal server and the connection was successful.

References:

How to: Configure a Database Mirroring Session (SQL Server Management Studio)

http://msdn.microsoft.com/en-us/library/ms188712.aspx

Making the Initial Connection to a Database Mirroring Session

http://msdn.microsoft.com/en-us/library/ms366348.aspx

Connection Retry Algorithm (for TCP/IP Connections)

http://msdn.microsoft.com/en-us/library/ms365783.aspx

TcpMaxConnectRetransmissions

http://technet.microsoft.com/en-us/library/cc758896(WS.10).aspx

TcpMaxDataRetransmissions

http://technet.microsoft.com/en-us/library/cc780586(WS.10).aspx

 

Author: MFarooq [MSFT]