Ahh yes.. the conch shell .  I run into weird problems all the time..this was on of those weird problems.

The high level statement was this:

"Clients in remote sites are accessing mapped drives  via explorer and the UI hangs for 10-15 minutes, they lose access to the mapped drive and on occasion the client bugchecks during this period."

See when you get done reading this you may not think it's so weird after all. But, think of it from a system admin level -  they had looked at everything from net traces to perf data to all the switches and hubs etc between the servers.

Here is a simple rundown of how it works..

 

 dfs

 

Inbound requests hit the file server which needs to process the request. In order to process dfs data, SRV.SYS hands off to DFS.SYS.This is all under the system process, but at times,  the dfs driver will need to have the data processed by the DFS service (DFSSVC.EXE.)

When it does need to do this, it will use what called the "user mode reflection" set of API's...basically a way to transition data to and from the usermode portion of dfssvc.exe

The dfssvc.exe process has a number of threads which will service the UMRx calls and then return the data to DFS.SYS and on and on we go.

Sidenote:

One of the things I wish we could fix - is the need to do live debugs or collect  full kernel mode dumps which take down a server  (I'm sure that there are big brainers working on this )  - however for now, it is still a necessary evil at times.

 

So, we got a kernel dump of the Server.

Based on the data we see in the SRV.SYS process – we can see that the SRV driver is passing data to DFS.SYS. See below for details

 

Here we use the !stacks  command  in order  to find all stacks with SRV in them…. Since we know that the SRV.sys  driver ( server ) handles inbound connections and we are having trouble with  inbound connections… we start here:

 

0: kd> !stacks 2 srv

Proc.Thread  .Thread  Ticks   ThreadState Blocker

                            [89b9a7e0 System]

 

   4.000690  897e4db0 0008171 Blocked    nt!KiSwapContext+0x26

                                        nt!KiSwapThread+0x284

                                        nt!KeWaitForSingleObject+0x346

                                        Dfs!UMRxEngineSubmitRequest+0x1b3

                                        Dfs!UMRxGetReplicasContinuation+0x1c

                                        Dfs!UMRxEngineInitiateRequest+0x44

                                        Dfs!DfsGetReplicaInformation+0x6a

                                        Dfs!DfsFsctrlGetReferrals+0x102

                                        Dfs!DfsHandlePrivateFsControl+0x12a

                                        Dfs!DfsFastIoDeviceControl+0x37

                                        srv!DfsGetReferrals+0xe4

                                        srv!SrvSmbGetDfsReferral+0xe1

                                        srv!ExecuteTransaction+0x29e

                                        srv!SrvSmbTransaction+0x7ac

                                        srv!SrvProcessSmb+0xb7

                                        srv!WorkerThread+0x138

                                        nt!PspSystemThreadStartup+0x2e

                                        nt!KiThreadStartup+0x16

   4.000694  8939cdb0 00075a5 Blocked    nt!KiSwapContext+0x26

                                        nt!KiSwapThread+0x284

                                        nt!KeWaitForSingleObject+0x346

                                        Dfs!UMRxEngineSubmitRequest+0x1b3

                                        Dfs!UMRxGetReplicasContinuation+0x1c

                                        Dfs!UMRxEngineInitiateRequest+0x44

                                        Dfs!DfsGetReplicaInformation+0x6a

                                        Dfs!DfsFsctrlGetReferrals+0x102

                                        Dfs!DfsHandlePrivateFsControl+0x12a

                                        Dfs!DfsFastIoDeviceControl+0x37

                                        srv!DfsGetReferrals+0xe4

                                        srv!SrvSmbGetDfsReferral+0xe1

                                        srv!ExecuteTransaction+0x29e

                                        srv!SrvSmbTransaction+0x7ac

                                        srv!SrvProcessSmb+0xb7

                                        srv!WorkerThread+0x138

                                        nt!PspSystemThreadStartup+0x2e

                                        nt!KiThreadStartup+0x16

 

 

You can see that the calls DFS is making are like DFS!UMRx****

Basically all of the SRV threads were in this same state so, they obviously are waiting to talk to DFS ( or get a response from it )  - let's look at dfssvc.exe

 

 

 First, we find out what the process info is for dfssvc.exe

 

 

0: kd> !process 0 0 dfssvc.exe

PROCESS 88e0fd88  SessionId: 0  Cid: 1614    Peb: 7ffd5000  ParentCid: 0248

    DirBase: 2fa75000  ObjectTable: e291dc88  HandleCount: 154.

    Image: dfssvc.exe

 

Next, we place ourselves into the proper process context.

 

0: kd> .process /p /r 88e0fd88 

Implicit process is now 88e0fd88

Loading User Symbols

 

 

 

Now we list detailed info about this process.. including all threads and parameters.

 

 

0: kd> !process 88e0fd88 7

PROCESS 88e0fd88  SessionId: 0  Cid: 1614    Peb: 7ffd5000  ParentCid: 0248

    DirBase: 2fa75000  ObjectTable: e291dc88  HandleCount: 154.

    Image: dfssvc.exe

    VadRoot 8903dcb8 Vads 107 Clone 0 Private 823. Modified 0. Locked 0.

    DeviceMap e1003940

    Token                             e2fd3030

    ElapsedTime                       00:09:29.406

    UserTime                          00:00:00.000

    KernelTime                        00:00:00.000

    QuotaPoolUsage[PagedPool]         33160

    QuotaPoolUsage[NonPagedPool]      5808

    Working Set Sizes (now,min,max)  (1735, 50, 345) (6940KB, 200KB, 1380KB)

    PeakWorkingSetSize                1943

    VirtualSize                       39 Mb

    PeakVirtualSize                   40 Mb

    PageFaultCount                    7525

    MemoryPriority                    BACKGROUND

    BasePriority                      8

    CommitCharge                      988

 

 

 

<snip>

 

         THREAD 88e837a8  Cid 1614.0524  Teb: 7ffd4000 Win32Thread: 00000000 WAIT: (Unknown) UserMode Non-Alertable

            88eae2a0  SynchronizationEvent

        Not impersonating

        DeviceMap                 e1003940

        Owning Process            88e0fd88       Image:         dfssvc.exe

        Wait Start TickCount      127215         Ticks: 34822 (0:00:09:04.093)

        Context Switch Count      273            

        UserTime                  00:00:00.078

        KernelTime                00:00:00.046

        Win32 Start Address Dfssvc!DfsReflectionThread (0x010262c7)

        Start Address kernel32!BaseThreadStartThunk (0x77e6b5f3)

        Stack Init b9fc2000 Current b9fc1c60 Base b9fc2000 Limit b9fbf000 Call 0

        Priority 10 BasePriority 8 PriorityDecrement 0

        Kernel stack not resident.

        ChildEBP RetAddr  Args to Child             

        b9fc1c78 8083e6a2 88e83820 88e837a8 88e83850 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])

        b9fc1ca4 8083f164 00000000 00000000 00000000 nt!KiSwapThread+0x284 (FPO: [Non-Fpo])

        b9fc1cec 8092db70 88eae2a0 00000006 b9fc1d01 nt!KeWaitForSingleObject+0x346 (FPO: [Non-Fpo])

        b9fc1d50 80834d3f 000002a0 00000000 00000000 nt!NtWaitForSingleObject+0x9a (FPO: [Non-Fpo])

        b9fc1d50 7c82ed54 000002a0 00000000 00000000 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b9fc1

        00aaf4f4 7c822124 7c83970f 000002a0 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

        00aaf4f8 7c83970f 000002a0 00000000 00000000 ntdll!NtWaitForSingleObject+0xc (FPO: [3,0,0])

        00aaf534 7c839620 00000000 00000004 00000000 ntdll!RtlpWaitOnCriticalSection+0x19c (FPO: [Non-Fpo])

        00aaf554 0100b964 0025c2fc 00000000 0025c2f0 ntdll!RtlEnterCriticalSection+0xa8 (FPO: [Non-Fpo])

        00aaf574 0101aced 00aaf5a4 00aaf5ac 00aaf5f4 Dfssvc!DfsStore::LookupRoot+0x54 (FPO: [Non-Fpo])

        00aaf58c 0101aec5 00aaf5a4 00aaf5ac 00aaf5e0 Dfssvc!DfsGetRootFolder+0x23 (FPO: [Non-Fpo])

        00aaf5b4 0101af25 00aaf67c 00aaf5e0 00aaf5f4 Dfssvc!DfsGetRootFolder+0x43 (FPO: [Non-Fpo])

        00aaf5f8 0101b0cf 00aaf67c 00aaf64c 00aaf624 Dfssvc!DfsLookupFolder+0x57 (FPO: [Non-Fpo])

        00aaf618 0101b95d 00aaf67c 00aaf64c 00aaf6a4 Dfssvc!DfsGetReferralData+0x46 (FPO: [Non-Fpo])

        00aaf6b0 0101be32 0009b2fc 0083c120 000003e8 Dfssvc!DfsGenerateNormalReferral+0x5df (FPO: [Non-Fpo])

        00aaf6d4 0101e768 0009b2fc 0083c120 000003e8 Dfssvc!DfsGenerateReferral+0x41 (FPO: [Non-Fpo])

        00aaff58 0101e99e 0009b2d0 00aaff78 0009b2a8 Dfssvc!DfsGenerateReferralFromReplicaRequest+0x209 (FPO: [Non-Fpo])

        00aaff7c 010262a9 00000170 0009b2a8 00004000 Dfssvc!DfsProcessGetReplicaData+0x22 (FPO: [Non-Fpo])

        00aaff90 0102630d 00000170 0009b2a8 00000000 Dfssvc!DfsProcessWorkItemData+0x22 (FPO: [Non-Fpo])

        00aaffb8 77e6608b 0009b2a8 00000000 00000000 Dfssvc!DfsReflectionThread+0x46 (FPO: [Non-Fpo])

        00aaffec 00000000 010262c7 0009b2a8 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

 

 

 

There were a number of ‘reflection’ threads as noted by the highlighted part of the stack above.  Each one of them was in the same state – that is, waiting to enter a critical section. And all the threads were waiting on the same critical section ( think of it like  the conch shell.) If you are looking at the picture at the beginning of this post - the red thread is the fellow holding the shell.

 

 

 

 

 

So we need to find out who or what is going on with  that crazy conch shell...who is holding it now?

 

 

 

Let us see whats going on with that object: - note that the first paramter to EnterCriticalSection is the critical section object.. We pass that address to the !cs command.

 

0: kd> !cs 0025c2fc

-----------------------------------------

Critical section   = 0x0025c2fc (+0x25C2FC)

DebugInfo          = 0x0008e0b0

LOCKED

LockCount          = 0x4

WaiterWoken        = No

OwningThread       = 0x00001604

RecursionCount     = 0x1

LockSemaphore      = 0x2A0

SpinCount          = 0x00000fa0

 

The thread owner is  ID 1604 - this guy is holding the conch shell, what is he doing? Apparently trying to speak to someone else and that person is not responding - in fact, he has a whole slew of people he wants to talk to and apparently no one told him that many of them passed away at some point.

 

 

0: kd> !thread 8968ec50 

THREAD 8968ec50  Cid 1614.1604  Teb: 7ffd9000 Win32Thread: 00000000 WAIT: (Unknown) UserMode Non-Alertable

    8968ee3c  Semaphore Limit 0x1

Waiting for reply to LPC MessageId 0000f1c7:

Current LPC port e39ee3a0

Not impersonating

DeviceMap                 e1003940

Owning Process            88e0fd88       Image:         dfssvc.exe

Wait Start TickCount      162037         Ticks: 0

Context Switch Count      1929            

UserTime                  00:00:00.140

KernelTime                00:00:00.375

Win32 Start Address Dfssvc!DfsWorkerThread (0x0100811f)

Start Address kernel32!BaseThreadStartThunk (0x77e6b5f3)

Stack Init ba00a000 Current ba009c08 Base ba00a000 Limit ba007000 Call 0

Priority 10 BasePriority 8 PriorityDecrement 0

ChildEBP RetAddr  Args to Child              

ba009c20 8083e6a2 8968ecc8 8968ec50 8968ecf8 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])

ba009c4c 8083f164 8968ee3c 8968ee10 8968ec50 nt!KiSwapThread+0x284 (FPO: [Non-Fpo])

ba009c94 8093983f 8968ee3c 00000011 8092da01 nt!KeWaitForSingleObject+0x346 (FPO: [Non-Fpo])

ba009d50 80834d3f 00000200 0008cb68 0008cb68 nt!NtRequestWaitReplyPort+0x776 (FPO: [Non-Fpo])

ba009d50 7c82ed54 00000200 0008cb68 0008cb68 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ ba009d64)

009af330 7c821c94 77c72700 00000200 0008cb68 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

009af334 77c72700 00000200 0008cb68 0008cb68 ntdll!NtRequestWaitReplyPort+0xc (FPO: [3,0,0])

009af380 77c713ba 009af3bc 009af3a0 77c72c7f RPCRT4!LRPC_CCALL::SendReceive+0x230 (FPO: [Non-Fpo])

009af38c 77c72c7f 009af3bc 76ed2e60 009af7a8 RPCRT4!I_RpcSendReceive+0x24 (FPO: [Non-Fpo])

009af3a0 77ce219b 009af3e8 0008cbcc 00000000 RPCRT4!NdrSendReceive+0x2b (FPO: [Non-Fpo])

009af788 76ed35d5 76ed2e60 76ed2d2c 009af7a8 RPCRT4!NdrClientCall2+0x22e (FPO: [Non-Fpo])

009af7a0 76ed355f 00000000 005b1fcc 00000001 DNSAPI!R_ResolverQuery+0x1c (FPO: [Non-Fpo])

009af7fc 76ef612a 005b1fcc 00000001 14000000 DNSAPI!Query_PrivateExW+0x187 (FPO: [Non-Fpo])

009af820 71b2555d 005b1fcc 00000001 14000000 DNSAPI!dnsrslvr_bhandle <PERF> (DNSAPI+0x2612a)

009af850 71b254d0 005b1fcc 00000001 14000000 mswsock!SaBlob_Query+0x2d (FPO: [Non-Fpo])

009af894 71b25390 00000000 005775e8 005775d0 mswsock!Rnr_DoDnsLookup+0xf0 (FPO: [Non-Fpo])

009afb2c 71c02ad9 005b1f68 00000000 009afbe0 mswsock!Dns_NSPLookupServiceNext+0x24b (FPO: [Non-Fpo])

009afb44 71c02ab9 005767f8 005b1f68 00000000 WS2_32!NSPROVIDER::NSPLookupServiceNext+0x17 (FPO: [Non-Fpo])

009afb60 71c02a02 00576930 00000000 009afbe0 WS2_32!NSPROVIDERSTATE::LookupServiceNext+0x1c (FPO: [Non-Fpo])

009afb8c 71c0299c 005775e8 00000000 009afbe0 WS2_32!NSQUERY::LookupServiceNext+0xae (FPO: [Non-Fpo])

009afbac 71c03c41 005775d0 00000000 009afbe0 WS2_32!WSALookupServiceNextW+0x78 (FPO: [Non-Fpo])

009afbd0 71c03b9c 005775d0 00000000 0000013c WS2_32!WSALookupServiceNextA+0x63 (FPO: [Non-Fpo])

009afbfc 71c04250 009afc2c 0000013c 000c6030 WS2_32!getxyDataEnt+0xa1 (FPO: [Non-Fpo])

009afe38 01007b56 000c6030 00090008 000c6030 WS2_32!gethostbyname+0xb4 (FPO: [Non-Fpo])

009afe4c 01007bb9 00836860 0084b018 0025a140 Dfssvc!DfsServerSiteInfo::DfsGetSiteForServer+0x33 (FPO: [Non-Fpo])

009afe64 01007bf9 00836860 0025a140 009afee8 Dfssvc!DfsServerSiteInfo::DfsInitializeServerSiteInfo+0x29 (FPO: [Non-Fpo])

009afe78 01007c9b 009afe90 00836860 0025a140 Dfssvc!DfsServerSiteInfo::DfsCreateServerSiteInfo+0x33 (FPO: [Non-Fpo])

009afe98 01007d69 00836860 009afee8 7c82260b Dfssvc!DfsSiteSupport::AddServerSiteInfo+0x1f (FPO: [Non-Fpo])

009afeb0 010195b2 00836860 009afee8 009afefb Dfssvc!DfsSiteSupport::GetServerSiteInfo+0x53 (FPO: [Non-Fpo])

009afefc 01019b4a 0025aec0 01246828 00773954 Dfssvc!DfsRootFolder::LoadCachedServerSiteData+0x141 (FPO: [Non-Fpo])

009aff18 01019bb0 0092aaf0 009aff40 01025217 Dfssvc!DfsRootFolder::PrefetchReplicaData+0x2b (FPO: [Non-Fpo])

009aff24 01025217 0092aaf0 0025c7d0 0025c7d0 Dfssvc!PrefetchCallBack+0x1f (FPO: [Non-Fpo])

009aff40 01019be0 00773398 01019b91 0025c7d0 Dfssvc!DfsEnumeratePrefixTableLocked+0x2a (FPO: [Non-Fpo])

009aff5c 0100d452 00000004 0025c2f0 00000000 Dfssvc!DfsRootFolder::PreloadServerSiteData+0x27 (FPO: [0,0,0])

009aff6c 010073bb 00000000 010081a2 00000000 Dfssvc!DfsStore::LoadServerSiteDataPerRoot+0x1f (FPO: [0,0,0])

009aff74 010081a2 00000000 00000000 00000000 Dfssvc!StartPreloadingServerSiteData+0x12 (FPO: [0,0,0])

009affb8 77e6608b 00000000 00000000 00000000 Dfssvc!DfsWorkerThread+0x83 (FPO: [Non-Fpo])

009affec 00000000 0100811f 00000000 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

 

 

 

Per MSDN - http://msdn2.microsoft.com/en-us/library/ms738524.aspx

The first param passed to gethostbyname() is the name of the server we want the info of.

 

0: kd> da 000c6030

000c6030  "MyBadServer"

 

 

Interesting. We gathered another dump during another outage and it showed the same pattern – where we are processing on the wire data gethostbyname for non-existant servers. A whole lot of them - around 200 or so out of ~700 links.

 

 

 

 

So in the end - the moral of the story is to clean up your DFS infrastructure - stay on top of it and don't letsomeone hold that conch shell for too long wasting time waiting on a response he will never get.

 

Oh, and the client bugchecks? Fixed via a later mup.sys fix on the client (XP) - sorry I cant recall the fix number OTOH right now. 

 

I try to make these posts as  easy to follow as possible - no need for src code in this one and no fancy debugger commands. You'd be suprised how far a basic understanding of how things work will take you. Conch shells  are a good thing to understand.

 

Maybe I’ll try a dbg video for a post sometime… might be easier to follow.

 

spatdsg