Intermittent connectivity issues from a client machine when connecting to a remote database server can be often attributed to the network layer. By collecting the network traces and reviewing them we can tell why the connections failed to succeed or if the existing connections were getting closed. Recently I worked on a case where the customer was seeing intermittent GNE from an application that was trying to connect to the SQL Server remotely. Here is the error message they were getting.
OleDB Error Microsoft OLE DB Provider for SQL Server [0x80004005] : [DBNETLIB][ConnectionWrite (send()).]General network error. Check your network documentation.
After doing some initial troubleshooting like checking if the TCP chimney was disabled, if the application server was running out of max client port number etc. we collected simultaneous Network traces from both the application server and the SQL server. When you collect network traces always make sure to collect the following two as well.
1. ipconfig /all from all the servers involved.
2. Exact error message that shows the timestamp. If an error message with the timestamp is not available then request to record the exact time of the issue and send to you.
Below I will show step by step how we reviewed the collected network traces. You can use the same techniques to analyze the traces collected for the intermittent connectivity issue to the SQL server.
You can use Netmon or wireshark to capture and review the traces. Personally I prefer to use Netmon. However, in this case we used wireshark since traces sent by the customer could not be opened using the Netmon tool (due to a different file extension). You can download the Netmon tool or wireshark from the web.
IP addresses and exact time stamp of the issue:
Let’s check the IPconfig’s first to find out the IP addresses.
IP Address. . . . . . . . . . . . : 10.10.100.131
IP Address. . . . . . . . . . . . : 10.10.100.59
Now find out when exactly the issue happened from the error message.
02/24/2010 09:28:08 DataBase Warning OleDB Error Microsoft OLE DB Provider for SQL Server [0x80004005] : [DBNETLIB][ConnectionWrite (send()).]General network error. Check your network documentation.
So the issue happened at 09:28:08 on 02/24/2010
Often time you will get significant number of trace files from each server when collecting network traces for intermittent issues because you may have to capture the trace for a long period of time and this can generate quite a number of trace files when configured to save the trace files as chain (for this particular case the customer sent us more than 60 trace files each of size 25MB for the app sever and 6 such files for the SQL Server J). The exact time will help you to find out which trace file you need to start with. So getting the exact time of the issue is very important when you have too much data to review.
Analyzing Application Server Traces:
Let’s start with the trace file from the application server. From the time stamp we narrowed down what trace file to review first. For intermittent connectivity issues the first thing we are interested to check in a network trace is if there are any RESETs. From SQL server error log we found out that the SQL server is listening to port 1433. So we started with the following filter.
“tcp.port eq 1433 && tcp.flags.reset==1”
This filter should list all the RESETs to the SQL server (we are assuming that the application server is connecting only to the problem SQL server and no other SQL server that is also listening to port 1433). With this filter we found about 20 RESETs.
Now let’s check the full conversation for any of the RESETs . To get the conversation of a RESET frame do the following.
Select the frame with RESET--> right click --> Conversation Filter --> TCP
It will show you all the frames in that conversation. We found only two frames.
Now if you have collected traces in chain and have more trace files collected before and after this conversation then you should check those trace files to see if there are other frames from this conversation. We need to find out what was happening before the RESET was sent and that may tell us what caused the RESET. To do this, copy the filter for this conversation from the current trace file.
(ip.addr eq 10.10.100.59 and ip.addr eq 10.10.100.131) and (tcp.port eq 1194 and tcp.port eq 1433)
We opened the trace file collected in the application server just before the current trace file and used the above filter to check the frames that are part of the same conversation.
If you see no issues and normal traffic (like keep-alive packets) then open the trace file previous to that and use the same filter. Keep doing this until you see some sort of issue or reach at the beginning of the conversation (three way handshake to establish connection).
In our case we saw a lot of keep-alive traffic. SQL Server (10.10.100.131) and application server (10.10.100.59) were sending [TCP Keep-Alive] and [TCP Keep-Alive ACK] packets back and forth. But at the end we saw application server (10.10.100.59) sent five [TCP Keep-Alive] packets to the SQL server but did not get any response from the SQL server as shown below.
We reviewed the trace file before the current trace and it showed SQL Server (10.10.100.131) and application server (10.10.100.59) were sending [TCP Keep-Alive] and [TCP Keep-Alive ACK] packets back and forth. Then we checked the trace file after the first trace file we reviewed (where we found the RESET) there were no frames for this conversation. So this conversation ended after this RESET.
Analyzing SQL Server Traces:
Then we checked the trace file that we collected from the SQL server. Once again using the error time stamp we selected the trace file that we need to start with.
We started with the same filter to look for frames from the same conversation we are investigating in the application server trace.
It showed a lot of keep-alive traffic and the same pattern as we saw in the application server trace. SQL Server (10.10.100.131) and application server (10.10.100.59) were sending [TCP Keep-Alive] and [TCP Keep-Alive ACK] packets back and forth.
Then we checked the next file using the same filter for the same conversation. Initially it showed the same pattern as in the previous file; SQL Server (10.10.100.131) and application server (10.10.100.59) were sending [TCP Keep-Alive] and [TCP Keep-Alive ACK] packets back and forth. But at the end we saw application server (10.10.100.59) sent a RESET to the SQL server (10.10.100.131).
The time stamp for the RESET frame is different but it is the same frame as we found in the application server trace, because we are tracking the same conversation. So the keep-alive packets sent from the application server were getting dropped and did not reach to the SQL server. The application server sent 5 keep-alive packages (which is the default setting) and then sent RESET as it did not receive an acknowledgement from the SQL server.
If an open connection remains idle for some time TCP attempts to verify if the connection is still intact by sending a keep alive packet to its peer. If the remote server is still reachable and functioning, an acknowledge packet is sent back. Otherwise, the local TCP will keep sending “Keep Alive” packet at an interval of KeepAliveInterval for TcpMaxDataRetransmissions times. If no acknowledge packet is received during this period, the local TCP will reset the connection. For every SQL Server TCP connections, Keep Alive is 30,000 millisecond by default and KeepAliveInterval is hard-coded 1,000 millisecond. TcpMaxDataRetransmissions is 5 by default. It is configurable for entire machine through HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters. As an example, under default configuration,
The following blog post has more information on this.
So in our case application server is RESETing the connection as it did not get a response for 5 keep-alive packets it sent to the SQL server. So the next step would be to find out where exactly in the network these [TCP Keep-Alive] packets are getting dropped. To do this you need to collect precise network architecture details, including any intermediate hubs, firewalls, routers, switches between client and SQL Server. You need to have a clear picture of all the networking hardware between the client and the SQL Server, for example: Web Server: Network Card(s)->Local Hub->Cisco Firewall->Network Card(s)->SQL Server
Then we need to capture simultaneous network traces from all intermediate server(s)/hardware(s) along with the client and SQL server. By reviewing these traces we can further narrow down where the packets are getting dropped.
For this particular case both the SQL Server and Application server were hosted on the same physical machine using 3rd party visualization software. So ideally there shouldn’t be anything in between the SQL server and the application server. So our conclusion was, somehow the virtualization software was causing the issue and we suggested that the customer either to reproduce the issue outside virtual environment or involve virtualization software provider to find why the packets are getting dropped.
Customer decided to involve the virtualization software provider. However, as the customer was waiting to get a resolution from the virtualization software provider we were able to implement a work around by increasing the “TcpMaxDataRetransmissions” setting to a higher value.
Author : Mohammad(MSFT) SQL Developer Engineer, Microsoft
Reviewed by : Enamul(MSFT), SQL Developer Technical Lead , Microsoft
Has this issue been resolved? I have involved Microsoft Australia, so any
help would be appreciated.
+61 417 526 018
This was a generic issue that was discussed in this blog. It was not about any bug or annomaly in our product that could be resolved.
The only way to figure out what is causing a intermittent connection failure is to take a simultaneous netmon. Once the source of the issue is determined, we need to follow the appropiate step to fix it.
Can you tell the proper way to set the "KeepAlive" and "TcpMaxDataRetransmissions"?
We tried changing the Value of "KeepAlive" to 3600000 by adding new DWORD entry for "KeepAlive" in "HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters" , but still there is transfer of data for every 30 sec(default value)(transfer of data checked using Network monitor)
Also changed the "KeepAlive" value to 360000 in MS SQL configuration manager but no changes in data transfer time
I can't think of any obvious reason why it would not make any effect.
Did you restart the system afer making those changes?
To make sure you have created the right Dword, can you delete it and set it from SQL configuration manager again. Once you do that it should create the properties automatically.
After the setting system was restarted
Registry entry for "KeepAlive" parameter of SQL configuration manager is not in the same path as the one where "KeepAlive" DWORD was created, also all the "KeepAlive" entry in SQL configuration manager has been changed to 3600000
Still why there is no effect?
Not sure why is it acting that way. I am from SQL background and this needs to be looked at, from network prospective.
I think MSDN forum would be a better place for you to seek help.