DNS Resolution Bottleneck in Windows 2000 Server, Windows Server 2003, and Windows XP
Summary
DNS name resolution was failing on a Windows Server 2003 machine when we tried to access the SYSVOL shares on our domain controllers. It only seemed to happen when we referenced the share by fully qualified domain name (FQDN). The failure was easy to reproduce with the following command:
C:\>dir \\DC-01.somedomain.com\SYSVOL
The network path was not found.
I'm using simulated server and domain names here, but you get the idea. Referencing the share by short name never caused a problem: the command dir \\DC-01\SYSVOL worked just fine.
Digging Deeper...
I captured the network traffic and found that there were no DNS queries on the wire. That told me the problem was in the client resolver, not on the DNS server. I set up a test environment where I was able to reproduce the problem. Run the following script in two or three different CMD windows at the same time, and you should see errors:
@echo off
:loop
call :CheckServer DC-01
call :CheckServer DC-02
call :CheckServer DC-03
call :CheckServer DC-04
call :CheckServer DC-05
call :CheckServer DC-06
call :CheckServer DC-07
call :CheckServer DC-08
call :CheckServer DC-09
call :CheckServer DC-10
call :CheckServer DC-11
call :CheckServer DC-12
sleep.exe 1
goto :loop
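:CheckServer
rem The FQDN path works: nothing to report.
if exist "\\%~1.somedomain.com\SYSVOL" goto :EOF
rem The short name fails too: the server is unreachable, so skip it.
if not exist "\\%~1\SYSVOL" goto :EOF
rem Retry the FQDN once before logging a failure.
if exist "\\%~1.somedomain.com\SYSVOL" goto :EOF
echo %DATE% %TIME% - %1
goto :EOF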
This test script requires the sleep command (sleep.exe) from the Windows Server 2003 Resource Kit Tools.
Once I had reproduced the problem in the lab, I started isolating it. My first step was to add all of the server names by FQDN to the hosts file, to see whether this was a problem with BIND or with some other system component. As it turned out, the problem persisted even with all of the names hard-coded in the hosts file.
Next, I enabled ETW tracing of NetBT to see what was going on. The test script prints a timestamp when a failure occurs, so I looked at the ETW log entries for the same timestamp. I saw a lot of STATUS_TIMEOUT errors: the resolution requests were timing out. It appeared we had a resource issue with name resolution.
The Problem
NetBT processes name resolution requests serially, and if you queue enough of them, they start to time out. The official explanation is given in the Cause section of KB article 875441, but I'll summarize here:
"When the name resolution request for FQDN is queued inside NetBT, the request times out, the redirector closes the connection after about eight seconds, and the FQDN name is not resolved. The issue occurs because of contention for the NetBT user mode DNS resolver. This resolver can only resolve names serially."
The Solution
That KB article was not an exact match for the problem we were dealing with, but it was close enough. I tried workaround #1 and raised LmhostsTimeout to 20 seconds. That improved things, but I was still able to reproduce the problem. As it turned out, workaround #2 solved the problem once and for all: when you install IPv6, the system uses the resolver in smb.sys instead of NetBT, which eliminates the resource contention. You can run the IPv4 and IPv6 stacks side by side, so you don't have to change your infrastructure to apply the fix. NetBT is legacy, so Vista and Windows Server 2008 use smb.sys by default.
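To see why a longer timeout helped without curing the problem, here is a toy model of a serial resolver queue in Python. Treat it as a sketch under loud assumptions, not a description of NetBT internals: the function names, the 1.5-second service time per request, and the idea that every request is queued at the same instant are all made up for illustration. Only the serial processing and the 8- and 20-second figures echo the text above.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    name: str
    finished_at: float  # seconds after the request was queued
    resolved: bool


def simulate_serial_resolver(names, service_time, timeout):
    """Toy model: every request is queued at t=0 and answered one at a time.

    A request counts as resolved only if its answer arrives within
    `timeout` seconds; otherwise the caller has already given up.
    """
    outcomes = []
    clock = 0.0
    for name in names:
        clock += service_time  # the resolver works on exactly one name at a time
        outcomes.append(Outcome(name, clock, clock <= timeout))
    return outcomes


def count_resolved(outcomes):
    """Number of requests that were answered before their timeout."""
    return sum(1 for o in outcomes if o.resolved)


if __name__ == "__main__":
    # Hypothetical load: three CMD windows asking for the same twelve servers.
    names = [f"DC-{n:02d}.somedomain.com" for n in range(1, 13)] * 3
    for timeout in (8.0, 20.0):
        outcomes = simulate_serial_resolver(names, service_time=1.5, timeout=timeout)
        print(f"timeout={timeout}s: {count_resolved(outcomes)} of {len(outcomes)} resolved")
```

With these made-up numbers, 5 of the 36 requests finish inside an 8-second window and 13 finish inside a 20-second window. A longer timeout lets more requests through, but anything queued behind them still fails, which is consistent with what I saw after trying workaround #1.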
Links
http://support.microsoft.com/default.aspx?scid=KB;EN-US;875441