• Ntdebugging Blog

    Debug Sleuth at Work: Hung Server…..Mystery of the unprocessed SMB work item.

    • 5 Comments

    Hello folks, Pushkar here. I recently worked on a case where the server was hung at “Applying User Settings” during the logon phase. You might ask what’s going to be new in this post, NTDebugging has bunch of posts covering such debug scenarios J. In my defense, this case was particularly interesting because it defied the general logic that I had about processing of network connection request. I am still a newbie in this area getting my feet wet in the arena of debugging. I had to seek help from one of the GES guru Aman Jain. He is an expert in the Windows Networking Stack. He helped me in confirming my suspicion.

     

    As we were stuck at “Applying User Settings”, the obvious path was to look at the activity of Winlogon to understand why is it stuck? I could see that there were seven logon requests which haven’t been processed yet.

     

    1: kd> !session

    Sessions on machine: 7

    Valid Sessions: 0 1 2 3 4 5 6

     

    1: kd> !process 0 0 winlogon.exe

    PROCESS fffffa806e132c10

        SessionId: 1  Cid: 0384    Peb: 7fffffdf000  ParentCid: 0298

        DirBase: db59d000  ObjectTable: fffff88007fd0430  HandleCount: 139.

        Image: winlogon.exe

    <snip>

     

    The above command returns 7 instances of Winlogon.exe, one instance per session. I picked one of the sessions as all the 7 sessions were in the same state.

     

    As this was a Windows 2008 Server, winlogon would open an LPC Channel with Logonui.exe to process the user’s logon and then it will call into Local Session Manager (LSM.EXE) to spawn the users’ session. I could see that winlogon is waiting on message reply from Logonui.exe.

     

    1: kd> !thread fffffa80`73d3d770

    THREAD fffffa8073d3d770  Cid 0384.1e2c  Teb: 000007fffffd7000 Win32Thread: 0000000000000000 WAIT: (WrLpcReply) UserMode Non-Alertable

        fffffa8073d3db00  Semaphore Limit 0x1

    Waiting for reply to ALPC Message fffff8801ffd9cf0 : queued at port fffffa804e9d9370 : owned by process fffffa806e134c10

    Not impersonating

    DeviceMap                 fffff880000073a0

    Owning Process            fffffa806e132c10       Image:         winlogon.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      4192487        Ticks: 1882 (0:00:00:29.359)

    Context Switch Count      2            

    UserTime                  00:00:00.000

    KernelTime                00:00:00.000

    Win32 Start Address 0x0000000076f67cb0

    Stack Init fffffa60110a9db0 Current fffffa60110a9670

    Base fffffa60110aa000 Limit fffffa60110a4000 Call 0

    Priority 13 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`110a96b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`110a97f0 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`110a9860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb

    fffffa60`110a98f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92

    fffffa60`110a9980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44

    fffffa60`110a99e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f

    fffffa60`110a9b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f

    fffffa60`110a9bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13

    00000000`01f8f098 00000000`00000000 0x76f9756a

     

    1: kd> !alpc /m fffff8801ffd9cf0

     

    Message @ fffff8801ffd9cf0

      MessageID             : 0x0464 (1124)

      CallbackID            : 0x2844F0 (2639088)

      SequenceNumber        : 0x0000000B (11)

      Type                  : LPC_REQUEST

      DataLength            : 0x0064 (100)

      TotalLength           : 0x008C (140)

      Canceled              : No

      Release               : No

      ReplyWaitReply        : No

      Continuation          : Yes

      OwnerPort             : fffffa80481ba9b0 [ALPC_CLIENT_COMMUNICATION_PORT]

      WaitingThread         : fffffa8073d3d770

      QueueType             : ALPC_MSGQUEUE_PENDING

      QueuePort             : fffffa804e9d9370 [ALPC_CONNECTION_PORT]

      QueuePortOwnerProcess : fffffa806e134c10 (LogonUI.exe)

      ServerThread          : fffffa804f96d5a0                  <----------- LPC Message Server Thread

      QuotaCharged          : No

      CancelQueuePort       : 0000000000000000

      CancelSequencePort    : 0000000000000000

      CancelSequenceNumber  : 0x00000000 (0)

      ClientContext         : 0000000000243d00

      ServerContext         : 0000000000000000

      PortContext           : 0000000001c27600

      CancelPortContext     : 0000000000000000

      SecurityData          : 0000000000000000

      View                  : 0000000000000000

     

    1: kd> !thread fffffa804f96d5a0

    THREAD fffffa804f96d5a0  Cid 01ac.09fc  Teb: 000007fffffd7000 Win32Thread: fffff900c067dd50 WAIT: (UserRequest) UserMode Non-Alertable

        fffffa80516be920  NotificationEvent

    Not impersonating

    DeviceMap                 fffff880000073a0

    Owning Process            fffffa806e134c10       Image:         LogonUI.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      4192487        Ticks: 1882 (0:00:00:29.359)

    Context Switch Count      13                 LargeStack

    UserTime                  00:00:00.000

    KernelTime                00:00:00.000

    Win32 Start Address 0x000007fefed47780

    Stack Init fffffa60114e4db0 Current fffffa60114e4940

    Base fffffa60114e5000 Limit fffffa60114df000 Call 0

    Priority 14 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`114e4980 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`114e4ac0 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`114e4b30 fffff800`02119608 nt!KeWaitForSingleObject+0x2cb

    fffffa60`114e4bc0 fffff800`01eb8ef3 nt!NtWaitForSingleObject+0x98

    fffffa60`114e4c20 00000000`76f96d5a nt!KiSystemServiceCopyEnd+0x13

    00000000`0444ef18 00000000`00000000 0x76f96d5a

     

    This server thread is processing a message, but we can’t tell what it is waiting for. If this thread was finished processing a message we would have seen “WAIT: (WrLpcReceive)” for the thread wait reason. Often this type of wait is for a critical section or other synchronization object that is waiting for another thread in the process. This begged an investigation of Logonui.exe, to see what the other threads were doing.

     

    I noticed that another Logonui.exe thread is waiting on LPC Message processing by a thread in LSM.EXE.

     

    1: kd> !thread fffffa80`4e9e5060

    THREAD fffffa804e9e5060  Cid 01ac.0438  Teb: 000007fffffd9000 Win32Thread: fffff900c06db2e0 WAIT: (WrLpcReply) UserMode Non-Alertable

        fffffa804e9e53f0  Semaphore Limit 0x1

    Waiting for reply to ALPC Message fffff8801ffaacf0 : queued at port fffffa804e9b6060 : owned by process fffffa806e130c10

    Not impersonating

    DeviceMap                 fffff880000073a0

    Owning Process            fffffa806e134c10       Image:         LogonUI.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      4192488        Ticks: 1881 (0:00:00:29.343)

    Context Switch Count      165                 LargeStack

    UserTime                  00:00:00.000

    KernelTime                00:00:00.000

    Win32 Start Address 0x000007fefbaa151c

    Stack Init fffffa6008d26db0 Current fffffa6008d26670

    Base fffffa6008d27000 Limit fffffa6008d20000 Call 0

    Priority 15 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`08d266b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`08d267f0 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`08d26860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb

    fffffa60`08d268f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92

    fffffa60`08d26980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44

    fffffa60`08d269e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f

    fffffa60`08d26b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f

    fffffa60`08d26bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13

    00000000`0272ea78 00000000`00000000 0x76f9756a

     

    At this point I decided to follow the LPC message chain trail to see where it ends and who is ultimately blocking it.

     

    1: kd> !alpc /m fffff8801ffaacf0

     

    Message @ fffff8801ffaacf0

      MessageID             : 0x0474 (1140)

      CallbackID            : 0x2844F6 (2639094)

      SequenceNumber        : 0x0000000D (13)

      Type                  : LPC_REQUEST

      DataLength            : 0x0058 (88)

      TotalLength           : 0x0080 (128)

      Canceled              : No

      Release               : No

      ReplyWaitReply        : No

      Continuation          : Yes

      OwnerPort             : fffffa804ea03c80 [ALPC_CLIENT_COMMUNICATION_PORT]

      WaitingThread         : fffffa804e9e5060

      QueueType             : ALPC_MSGQUEUE_PENDING

      QueuePort             : fffffa804e9b6060 [ALPC_CONNECTION_PORT]

      QueuePortOwnerProcess : fffffa806e130c10 (lsm.exe)

      ServerThread          : fffffa805180c700            <---- LPC Server thread for this message in LSM.EXE

      QuotaCharged          : No

      CancelQueuePort       : 0000000000000000

      CancelSequencePort    : 0000000000000000

      CancelSequenceNumber  : 0x00000000 (0)

      ClientContext         : 0000000001cf0e90

      ServerContext         : 0000000000000000

      PortContext           : 0000000001dc9df0

      CancelPortContext     : 0000000000000000

      SecurityData          : 0000000000000000

      View                  : 0000000000000000

     

    I noticed that the server thread which is supposed to process the LPC message for above thread is itself waiting on a message reply from a thread in Svchost.exe. LSM had bunch of other threads which were waiting on LPC message processing from server threads in Svchost.exe. I dumped out all the threads and their LPC message structure to locate the server threads in Svchost.exe which seems to be doing some activity and found that only for the thread below the server thread had some interesting information as other LPC Server threads in svchost.exe were Notification Timers.

     

    1: kd> !thread fffffa80`50910060

    THREAD fffffa8050910060  Cid 02e0.1a70  Teb: 000007fffffda000 Win32Thread: 0000000000000000 WAIT: (WrLpcReply) UserMode Non-Alertable

        fffffa80509103f0  Semaphore Limit 0x1

    Waiting for reply to ALPC Message fffff8801fe00cf0 : queued at port fffffa804e876b60 : owned by process fffffa804ea4fc10

    Not impersonating

    DeviceMap                 fffff880000073a0

    Owning Process            fffffa806e130c10       Image:         lsm.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      3187894        Ticks: 1006475 (0:04:21:41.110)

    Context Switch Count      97            

    UserTime                  00:00:00.000

    KernelTime                00:00:00.000

    Win32 Start Address 0x000007fefed47780

    Stack Init fffffa60125b1db0 Current fffffa60125b1670

    Base fffffa60125b2000 Limit fffffa60125ac000 Call 0

    Priority 9 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Kernel stack not resident.

    Child-SP          RetAddr           Call Site

    fffffa60`125b16b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`125b17f0 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`125b1860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb

    fffffa60`125b18f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92

    fffffa60`125b1980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44

    fffffa60`125b19e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f

    fffffa60`125b1b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f

    fffffa60`125b1bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13

    00000000`0205e538 00000000`00000000 0x76f9756a

     

    I looked at the server thread for this LPC message and that was also waiting on a notification event doing nothing. It again begged an investigation of the state of Svchost.exe to find out why the server thread hasn’t replied yet?

     

    1: kd> !alpc /m fffff8801fe00cf0

     

    Message @ fffff8801fe00cf0

      MessageID             : 0x03C4 (964)

      CallbackID            : 0x2140C2 (2179266)

      SequenceNumber        : 0x00000021 (33)

      Type                  : LPC_REQUEST

      DataLength            : 0x00C0 (192)

      TotalLength           : 0x00E8 (232)

      Canceled              : No

      Release               : No

      ReplyWaitReply        : No

      Continuation          : Yes

      OwnerPort             : fffffa804ef15d30 [ALPC_CLIENT_COMMUNICATION_PORT]

      WaitingThread         : fffffa8050910060

      QueueType             : ALPC_MSGQUEUE_PENDING

      QueuePort             : fffffa804e876b60 [ALPC_CONNECTION_PORT]

      QueuePortOwnerProcess : fffffa804ea4fc10 (svchost.exe)

      ServerThread          : fffffa8051435940

      QuotaCharged          : No

      CancelQueuePort       : 0000000000000000

      CancelSequencePort    : 0000000000000000

      CancelSequenceNumber  : 0x00000000 (0)

      ClientContext         : 00000000021a0f60

      ServerContext         : 0000000000000000

      PortContext           : 00000000033834d0

      CancelPortContext     : 0000000000000000

      SecurityData          : 0000000000000000

      View                  : 0000000000000000

     

    1: kd> !thread fffffa8051435940

    THREAD fffffa8051435940  Cid 0574.1ea0  Teb: 000007fffffd3000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Alertable

        fffffa601086d3e8  SynchronizationEvent

    IRP List:

        fffffa80514558b0: (0006,03a0) Flags: 00000884  Mdl: 00000000

    Impersonation token:  fffff8801fc4b050 (Level Impersonation)

    DeviceMap                 fffff8800e7c50e0

    Owning Process            fffffa804ea4fc10       Image:         svchost.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      3187932        Ticks: 1006437 (0:04:21:40.517)

    Context Switch Count      111            

    UserTime                  00:00:00.015

    KernelTime                00:00:00.000

    Win32 Start Address 0x000007fefed47780

    Stack Init fffffa601086ddb0 Current fffffa601086d040

    Base fffffa601086e000 Limit fffffa6010868000 Call 0

    Priority 10 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`1086d080 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`1086d1c0 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`1086d230 fffff800`020d2a2e nt!KeWaitForSingleObject+0x2cb

    fffffa60`1086d2c0 fffff800`020d2a87 nt!FsRtlCancellableWaitForMultipleObjects+0x62

    fffffa60`1086d320 fffffa60`015da75a nt!FsRtlCancellableWaitForSingleObject+0x27

    fffffa60`1086d360 fffffa60`015d8588 mup!MupLocateUncProviderForPath+0x40a

    fffffa60`1086d4a0 fffffa60`015d8130 mup!MupiCallUncProvider+0xe8

    fffffa60`1086d510 fffffa60`015d90ef mup!MupStateMachine+0x120

    fffffa60`1086d560 fffffa60`00f31e91 mup!MupCreate+0x2c3

    fffffa60`1086d5e0 fffffa60`00f4e26c fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x211

    fffffa60`1086d650 fffff800`021267d9 fltmgr!FltpCreate+0x25d

    fffffa60`1086d700 fffff800`0211ecb2 nt!IopParseDevice+0x5f9

    fffffa60`1086d8a0 fffff800`0211f842 nt!ObpLookupObjectName+0x593

    fffffa60`1086d9b0 fffff800`02124e47 nt!ObOpenObjectByName+0x2f2

    fffffa60`1086da80 fffff800`0212ecc8 nt!IopCreateFile+0x287

    fffffa60`1086db20 fffff800`01eb8ef3 nt!NtCreateFile+0x78

    fffffa60`1086dbb0 00000000`76f9726a nt!KiSystemServiceCopyEnd+0x13

    00000000`2674d738 00000000`00000000 0x76f9726a

     

    Bingo! I found two threads in Svchost.exe which showed some activity and were blocked for over 4 hours on a UNC Path access. From the IRP which was issued by the thread above I found out that it was attempting to contact one of the DCs.

     

    1: kd> !irp 0xfffffa80`514558b0

    Irp is active with 4 stacks 3 is current (= 0xfffffa8051455a10)

    No Mdl: No System Buffer: Thread fffffa8051435940:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

    [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

    [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

    >[  0, 0]   0 e0 fffffa8031d1a840 fffffa805145aec0 fffffa6000f332b0-fffffa8047b6c930 Success Error Cancel

                 \FileSystem\Mup  fltmgr!FltpSynchronizedOperationCompletion

                      Args: fffffa601086d848 01000040 00030000 00000000

    [  0, 0]   0  0 fffffa8031d1b880 fffffa805145aec0 00000000-00000000   

                 \FileSystem\FltMgr

                      Args: fffffa601086d848 01000040 00030000 00000000

     

    1: kd> dt nt!_FILE_OBJECT fffffa805145aec0 FileName

       +0x058 FileName : _UNICODE_STRING "\DC005.contoso.com\PIPE\lsarpc"

     

    For the details on MUP you can refer to Support for UNC Naming and MUP.

     

    At first my thought was that something is wrong with the network stack because we are waiting on domain controller to respond. The thought was that either we had a networking issue or the Domain Controller wasn’t responding, but redirector would have timed out as it won’t wait for over 4 hours for an unresponsive connection. I started digging into the source code and found that we build a query context in MupLocateUncProviderForPath () function and issue another IRP which is sent down the I/O Stack. I searched for other IRPs which was issued by the above thread and found two IRPs.

     

    1: kd> !irpfind 0 0 thread fffffa8051435940

    Looking for IRPs with thread == fffffa8051435940

     

    Scanning large pool allocation table for Tag: Irp? (fffffa804ea83000 : fffffa804ed83000)

     

     

    Searching NonPaged pool (fffffa8030605000 : fffffa8c27e00000) for Tag: Irp?

     

      Irp    [ Thread ] irpStack: (Mj,Mn)   DevObj  [Driver]         MDL Process

    fffffa80514558b0 [fffffa8051435940] irpStack: ( 0, 0)  fffffa8031d1a840 [ \FileSystem\Mup]

    fffffa8073bc65e0 [fffffa8051435940] irpStack: ( e, 0)  fffffa804e842610 [ \FileSystem\mrxsmb]

      

    1: kd> !irp fffffa8073bc65e0

    Irp is active with 2 stacks 1 is current (= 0xfffffa8073bc66b0)

     No Mdl: No System Buffer: Thread fffffa8051435940:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

    >[  e, 0]   0 e1 fffffa804e842610 fffffa804e843f20 fffffa60015d4024-fffffa601086d3b0 Success Error Cancel pending

                 \FileSystem\mrxsmb     mup!MupiQueryPathCompletionRoutine

                      Args: 00000004 00000050 00140193 fffffa601086d408

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

     

    This is the IRP that MUP has issued an IOCTL IOCTL_REDIR_QUERY_PATH to the redirector.  If the status returned by the driver is anything other than STATUS_PENDING then it will wait by calling FsRtlCancellableWaitForSingleObject.

     

    The 0xfffffa8051435940 thread doesn’t have any redirector function in its stack, which made me suspicious that there could be some other thread which is already working on similar operation and redirector is currently busy processing that request. I started looking for a worker thread which would be calling the redirector.

     

    1: kd> !stacks 0 mrxsmb

    Proc.Thread  .Thread  Ticks   ThreadState Blocker

                                [fffffa804e9c7c10 svchost.exe]

     248.000e84  fffffa804ee72bb0 ffe0d0ad Blocked    nt!FsRtlCancellableWaitForMultipleObjects+0x62

                                [fffffa804e9e12e0 svchost.exe]

     404.000f18  fffffa804edad4f0 ffe03e7e Blocked    nt!FsRtlCancellableWaitForMultipleObjects+0x62

     

    The above command will parse through all the kernel stacks for all the process and dump out one line summary info for the threads where it finds reference to that module in its stack. I have snipped the output to keep things in context.

     

    One of the sample threads where redirector is in action is shown below.

     

    1: kd> !thread fffffa804ee72bb0

    THREAD fffffa804ee72bb0  Cid 0248.0e84  Teb: 000007fffffa8000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Alertable

        fffffa8048f6c6b8  SynchronizationEvent

    IRP List:

        fffffa8073ac3980: (0006,03a0) Flags: 00000884  Mdl: 00000000

        fffffa807257fc60: (0006,03a0) Flags: 00000884  Mdl: 00000000

        fffffa80511627b0: (0006,03a0) Flags: 00000884  Mdl: 00000000

    Impersonation token:  fffff8800d515c40 (Level Impersonation)

    Owning Process            fffffa804e9c7c10       Image:         svchost.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      2043731        Ticks: 2150638 (0:09:19:10.167)

    Context Switch Count      893            

    UserTime                  00:00:00.031

    KernelTime                00:00:00.109

    Win32 Start Address 0x000007fefb96c534

    Stack Init fffffa60125a1db0 Current fffffa60125a0ab0

    Base fffffa60125a2000 Limit fffffa601259c000 Call 0

    Priority 10 BasePriority 1 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`125a0af0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`125a0c30 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`125a0ca0 fffff800`020d2a2e nt!KeWaitForSingleObject+0x2cb

    fffffa60`125a0d30 fffff800`020d2a87 nt!FsRtlCancellableWaitForMultipleObjects+0x62

    fffffa60`125a0d90 fffffa60`0f625c14 nt!FsRtlCancellableWaitForSingleObject+0x27

    fffffa60`125a0dd0 fffffa60`0f634e9e rdbss!RxWaitForStableCondition+0x11c

    fffffa60`125a0e10 fffffa60`0f63599f rdbss!RxFindOrCreateConnections+0x572

    fffffa60`125a0ef0 fffffa60`0f627318 rdbss!RxConstructVirtualNetRoot+0xb7

    fffffa60`125a0f90 fffffa60`0f63815a rdbss!RxFindOrConstructVirtualNetRoot+0x594

    fffffa60`125a1100 fffffa60`0f627eee rdbss!RxCreateTreeConnect+0x13e

    fffffa60`125a1190 fffffa60`0f6041f6 rdbss!RxCommonCreate+0x20a

    fffffa60`125a1250 fffffa60`0f62568a rdbss!RxFsdCommonDispatch+0x786

    fffffa60`125a1340 fffffa60`1031f2cf rdbss!RxFsdDispatch+0x21a

    fffffa60`125a13b0 fffffa60`015d85f9 mrxsmb!MRxSmbFsdDispatch+0xbf

    fffffa60`125a13f0 fffffa60`015d8130 mup!MupiCallUncProvider+0x159

    fffffa60`125a1460 fffffa60`015d90ef mup!MupStateMachine+0x120

    fffffa60`125a14b0 fffffa60`00f4e342 mup!MupCreate+0x2c3

    fffffa60`125a1530 fffff800`021267d9 fltmgr!FltpCreate+0x333

    fffffa60`125a15e0 fffff800`0211ecb2 nt!IopParseDevice+0x5f9

    fffffa60`125a1780 fffff800`0211f842 nt!ObpLookupObjectName+0x593

    fffffa60`125a1890 fffff800`02124e47 nt!ObOpenObjectByName+0x2f2

    fffffa60`125a1960 fffff800`0212ecc8 nt!IopCreateFile+0x287

    fffffa60`125a1a00 fffff800`01eb8ef3 nt!NtCreateFile+0x78

    fffffa60`125a1a90 fffff800`01eb9400 nt!KiSystemServiceCopyEnd+0x13

    fffffa60`125a1c98 fffffa60`0f664a5e nt!KiServiceLinkage

    <snip>

     

    The above thread is waiting for the condition of the remote connection to become stable and that’s the reason it called the function RxWaitForStableCondition(), which accepts RX_CONTEXT as its first argument and the reason for why we entered into the wait as second argument. As we can see that the thread above was waiting for over 9 hours and 19 minutes.

     

    For additional reading you can go through RX_CONTEXT and IRP Management.

     

    The unassembled code for the function below is to find out where we store the pointer to the RX_CONTEXT structure. You may remember about the challenges of x64 debugging from this post.

     

    1: kd> .frame /r 05

    05 fffffa60`11e42dd0 fffffa60`0f634e9e rdbss!RxWaitForStableCondition+0x11c

    rax=0000000000000000 rbx=fffffa804f4de010 rcx=0000000000000000

    rdx=0000000000000000 rsi=0000000000000001 rdi=0000000000000000

    rip=fffffa600f625c14 rsp=fffffa6011e42dd0 rbp=fffffa805113a490

    r8=0000000000000000  r9=0000000000000000 r10=0000000000000000

    r11=0000000000000000 r12=fffffa805113a4a0 r13=fffffa6011e42fd8

    r14=fffffa804f4de010 r15=0000000000000000

    iopl=0         nv up di pl nz na pe nc

    cs=0000  ss=0000  ds=0000  es=0000  fs=0000  gs=0000             efl=00000000

    rdbss!RxWaitForStableCondition+0x11c:

    fffffa60`0f625c14 85c0            test    eax,eax

     

    1: kd> u rdbss!RxWaitForStableCondition

    rdbss!RxWaitForStableCondition:

    fffffa60`0f625af8 488bc4          mov     rax,rsp

    fffffa60`0f625afb 48895808        mov     qword ptr [rax+8],rbx

    fffffa60`0f625aff 48896810        mov     qword ptr [rax+10h],rbp

    fffffa60`0f625b03 48897018        mov     qword ptr [rax+18h],rsi

    fffffa60`0f625b07 48897820        mov     qword ptr [rax+20h],rdi

    fffffa60`0f625b0b 4154            push    r12

    fffffa60`0f625b0d 4883ec30        sub     rsp,30h

    fffffa60`0f625b11 33ff            xor     edi,edi

    1: kd> u

    rdbss!RxWaitForStableCondition+0x1b:

    fffffa60`0f625b13 833a03          cmp     dword ptr [rdx],3

    fffffa60`0f625b16 418af1          mov     sil,r9b

    fffffa60`0f625b19 4d8be0          mov     r12,r8

    fffffa60`0f625b1c 488bea          mov     rbp,rdx

    fffffa60`0f625b1f 488bd9          mov     rbx,rcx     <--------- RX_CONTEXT is in RBX register

    fffffa60`0f625b22 7c07            jl      rdbss!RxWaitForStableCondition+0x33 (fffffa60`0f625b2b)

    fffffa60`0f625b24 33c0            xor     eax,eax

    fffffa60`0f625b26 e968010000      jmp     rdbss!RxWaitForStableCondition+0x19b (fffffa60`0f625c93)

     

    1: kd> !pool fffffa804f4de010 2           <---- [Argument 1] Here is pointer to RX_CONTEXT Structure

    Pool page fffffa804f4de010 region is Nonpaged pool

    *fffffa804f4de000 size:  670 previous size:    0  (Allocated) *RxIr

                Pooltag RxIr : RDBSS RxContext, Binary : rdbss.sys

     

    1: kd> !pool fffffa805113a490 2           <---- [Argument 2] Here is the Pointer to the Reason for Wait

    Pool page fffffa805113a490 region is Nonpaged pool

    *fffffa805113a3b0 size:  280 previous size:   10  (Allocated) *RxSc

                Pooltag RxSc : RDBSS SrvCall, Binary : rdbss.sys

     

    1: kd> dc fffffa805113a490 L1

    fffffa80`5113a490  00000001               <---------------- Transport is in Transition state

     

    We have bunch of RX_CONTEXT describing the IRPs pending on redirector for the transport of the remote connection to stabilize.

     

    Next obvious thought that came to my mind was to investigate the list where we store the IRPs which have been issued by the redirector for the transport driver to process and understand the state of those IRPs and understand why we are not progressing further? RxCeSend is the API which submits the IRPs to the Transport driver which is maintained in the list below. For more information about Redirector related APIs you can refer Connection Engine Management.

     

    1: kd> x rdbss!RxIrpsList

    fffffa60`0f620110 rdbss!RxIrpsList = <no type information>

     

    The rdbss!RxIrpsList is a pointer to the structure of type LIST_ENTRY.

     

    1: kd> dq fffffa60`0f620110 L2

    fffffa60`0f620110  fffffa80`510fd0d0 fffffa80`4edb5590

     

    1: kd> dq fffffa80`510fd0d0 L2            <---------------- LIST_ENTRY.FLINK 

    fffffa80`510fd0d0  fffffa80`50d2e300 fffffa60`0f620110

     

    1: kd> dq fffffa80`4edb5590 L2            <---------------- LIST_ENTRY.BLINK

    fffffa80`4edb5590  fffffa60`0f620110 fffffa80`4ff450d0

     

    This being a x64 windows each entry in this list is 0x30 bytes, this can change with OS version or driver updates, so don’t rely on this list or any of the offset that I referenced here in your driver code. Entry in the first column of this list is a pointer to the list entry. At offset 10 we store the pointer to the IRP. Add 0x10 bytes to get the IRP.

     

    1: kd> dl 0xfffffa80`510fd0d0

    fffffa80`510fd0d0  fffffa80`50d2e300 fffffa60`0f620110

    fffffa80`510fd0e0  fffffa80`7274e5a0 00000000`00000000

    fffffa80`50d2e300  fffffa80`50ee4490 fffffa80`510fd0d0

    fffffa80`50d2e310  fffffa80`50f16160 00000000`00000000

    fffffa80`50ee4490  fffffa80`727f1140 fffffa80`50d2e300

    fffffa80`50ee44a0  fffffa80`51491760 00000000`00000000

    fffffa80`727f1140  fffffa80`73a95860 fffffa80`50ee4490

    fffffa80`727f1150  fffffa80`511b9c60 00000000`00000000

    fffffa80`73a95860  fffffa80`73a8e3e0 fffffa80`727f1140

    fffffa80`73a95870  fffffa80`73a77c60 00000000`00000000

    fffffa80`73a8e3e0  fffffa80`73a9c290 fffffa80`73a95860

    fffffa80`73a8e3f0  fffffa80`73a46a30 00000000`00000000

    fffffa80`73a9c290  fffffa80`725ca490 fffffa80`73a8e3e0

    <Snip>

     

    1: kd> !irp poi(fffffa80`510fd0d0+10)

    Irp is active with 2 stacks 2 is current (= 0xfffffa807274e6b8)

     No Mdl: No System Buffer: Thread 00000000:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

    >[  f, 3]   0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa601030df70-fffffa805119fda0 Success Error Cancel pending

                 \Driver\Smb      mrxsmb!RxTdiAsynchronousConnectCompletion

                      Args: 00000000 fffffa8050d127e0 00000000 fffffa601031b6f0

     

    1: kd> !irp poi(fffffa80`50d2e300+10)

    Irp is active with 2 stacks 2 is current (= 0xfffffa8050f16278)

     No Mdl: System buffer=fffffa8050d59c88: Thread 00000000:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

    >[  e, 0]   0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa60103063a4-fffffa8050d59c60 Success Error Cancel pending

                 \Driver\Smb      mrxsmb!RxTdiAsynchronousRequestCompletion

                      Args: 00000000 00000030 0021c0e8 00000000

     

    1: kd> !irp poi(fffffa80`50ee4490+10)

    Irp is active with 2 stacks 2 is current (= 0xfffffa8051491878)

     No Mdl: System buffer=fffffa805115dcb8: Thread 00000000:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

    >[  e, 0]   0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa60103063a4-fffffa805115dc90 Success Error Cancel pending

                 \Driver\Smb      mrxsmb!RxTdiAsynchronousRequestCompletion

                      Args: 00000000 00000030 0021c0e8 00000000

     

    All these IRPs are waiting on SMB Transport Driver (responsible for handling TCP/IP port 445 related traffic) for further processing. The obvious feeling was to find out if there are threads which are making call to SMB function to see what they are they doing?

     

    Note: In Windows 2008 R2 and Windows 7 onwards redirector doesn’t use SMB as the transport driver directly.

     

    I searched through the dump but I couldn’t find a single thread which was calling any of the SMB function. This is the point where it gets more interesting. Why we don’t have any thread which is calling SMB function when we have bunch of IRPs pending on that driver?

     

    This was the time for me to collaborate with the Network Stack Guru Aman Jain to get some insight to figure out the answer to my question.

     

    It turned out that SMB driver will create a work item and it will use the System Delayed Worker Queue to queue the work item that it created and will populate the routine in the work Item structure which is supposed to process the work item once the worker thread picks up the work item.

     

    1: kd> x nt!ExWorkerQueue

    fffff800`01ff08a0 nt!ExWorkerQueue = <no type information>

     

    Above pointer is the list head for the 3 queues that system has namely the “Critical WorkQueue” , “Delayed WorkQueue” and the “HyperCritical WorkQueue” as see in the output of !exqueue command in the debugger. Each entry in this list is 0x58 bytes long and is of type _KQUEUE.

     

    Structure referenced below decides how a various work queues are enumerated. This structure is documented in this MSDN article.

     

    typedef enum _WORK_QUEUE_TYPE {

      CriticalWorkQueue        = 0,

      DelayedWorkQueue         = 1,

      HyperCriticalWorkQueue   = 2,

      MaximumWorkQueue         = 3

    } WORK_QUEUE_TYPE;

     

    Critical WorkQueue

    1: kd> dt nt!_KQUEUE fffff800`01ff08a0   

       +0x000 Header           : _DISPATCHER_HEADER

       +0x018 EntryListHead    : _LIST_ENTRY [ 0xfffff800`01ff08b8 - 0xfffff800`01ff08b8 ]

       +0x028 CurrentCount     : 0

       +0x02c MaximumCount     : 0x10

       +0x030 ThreadListHead   : _LIST_ENTRY [ 0xfffffa80`30a4da08 - 0xfffffa80`30a39d68 ]

     

    Delayed WorkQueue

    1: kd> dt nt!_KQUEUE fffff800`01ff08a0+58

       +0x000 Header           : _DISPATCHER_HEADER

       +0x018 EntryListHead    : _LIST_ENTRY [ 0xfffffa80`50226f00 - 0xfffffa80`5183f9f0 ]

       +0x028 CurrentCount     : 0

       +0x02c MaximumCount     : 0x10

       +0x030 ThreadListHead   : _LIST_ENTRY [ 0xfffffa80`30a398d8 - 0xfffffa80`51157d68 ]

     

    HyperCritical WorkQueue

    1: kd> dt nt!_KQUEUE fffff800`01ff08a0+58+58

       +0x000 Header           : _DISPATCHER_HEADER

       +0x018 EntryListHead    : _LIST_ENTRY [ 0xfffff800`01ff0968 - 0xfffff800`01ff0968 ]

       +0x028 CurrentCount     : 0

       +0x02c MaximumCount     : 0x10

       +0x030 ThreadListHead   : _LIST_ENTRY [ 0xfffffa80`30a4e1f8 - 0xfffffa80`30a4e1f8 ]

     

    1: kd> dd fffff80001ff08a0+58+40 L1

    fffff800`01ff0938  00000010 <----------- We created 0x10 (16) dynamic threads

     

    We create additional dynamic threads as in the event when the default number of threads in a particular work queue is unable to handle the Work Items being queued. In this case we created 16 additional threads, which is the maximum number of dynamic thread which can be created.

     

    1: kd> dd fffff80001ff08a0+58+4c L1

    fffff800`01ff0944  00004667

     

    1: kd> ? 00004667

    Evaluate expression: 18023 = 00000000`00004667  <-------- 18,023 work Items pending for processing

     

    I could see that there are work items pending in the Delayed WorkQueue because of the following relation which is not present in any other WorkQueue that we dumped.

     

    nt!_KQUEUE->EntryListHead.Flink != nt!_QUEUE->EntryListHead.Blink

     

    For details about Work Item Processing and Work Items you can refer following docs:

    System Worker Threads

    IO_WORKITEM

    IoSizeofWorkItem

     

    The next obvious steps involved dumping all work items pending in the list to see if I have work items related to SMB driver? And if the Work Items are present for SMB then why they haven’t been picked up by the worker thread?

     

    1: kd> !list "-t nt!_LIST_ENTRY.Flink -e -x\"dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1\" fffffa80`50226f00"

     

    <Snip>

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`4b7c7e30  fffffa80`50a720e0

    fffffa80`4b7c7e50  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`72544e00  fffffa80`31aeb580

    fffffa80`72544e20  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`72759d50  fffffa80`51184690

    fffffa80`72759d70  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`73b416e0  fffffa80`5143c9c0

    fffffa80`73b41700  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

     

    <snip>

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`518392e0  fffffa80`31fafa60

    fffffa80`51839300  fffffa60`00becb58 mpio!MPIOPathVerify

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`31fafa60  fffffa80`31fafab0

    fffffa80`31fafa80  fffffa60`00becb58 mpio!MPIOPathVerify

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`31fafab0  fffffa80`31fafc30

    fffffa80`31fafad0  fffffa60`00becb58 mpio!MPIOPathVerify

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`31fafc30  fffffa80`5183f9a0

    fffffa80`31fafc50  fffffa60`00becb58 mpio!MPIOPathVerify

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`5183f9a0  fffffa80`4ea092e0

    fffffa80`5183f9c0  fffffa60`00becb58 mpio!MPIOPathVerify

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`4ea092e0  fffffa80`5183f9f0

    fffffa80`4ea09300  fffffa60`00becb58 mpio!MPIOPathVerify

     

    Voila! Here are my SMB work items in the Delayed WorkQueue waiting to be processed. Another interesting observation was that this queue had around 18,023 work items in this queue waiting to be picked up by the worker threads. Most of the work items were related to MPIO. Investigation of the worker threads in the Delayed WorkQueue revealed that the worker threads were blocked processing MPIO work item on the DSM driver for the storage.

     

    I picked one of the worker entry from the _KQUEUE->ThreadListHead for the Delayed WorkQueue and validated that it is a thread pool allocation with !pool. As the size of _ETHREAD structure is 0x450 bytes and the pool allocation was for 0x490 bytes, this indicates that the pointer to the actual thread object would be 0x40 bytes into that pool allocation from the beginning of that pool block.

     

    1: kd> !pool 0xfffffa80`30a398d8  2

    Pool page fffffa8030a398d8 region is Nonpaged pool

    *fffffa8030a396e0 size:  490 previous size:   90  (Allocated) *Thre (Protected)

                Pooltag Thre : Thread objects, Binary : nt!ps

     

    1: kd> ?? sizeof(nt!_ETHREAD)

    unsigned int64 0x450

     

    1: kd> !thread fffffa8030a396e0+40

    THREAD fffffa8030a39720  Cid 0004.0044  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable

        fffffa8031e8fe58  SynchronizationEvent

    Not impersonating

    DeviceMap                 fffff880000073a0

    Owning Process            fffffa80309ca110       Image:         System

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      2015907        Ticks: 2178462 (0:09:26:24.225)

    Context Switch Count      42461            

    UserTime                  00:00:00.000

    KernelTime                00:00:14.976

    Win32 Start Address nt!ExpWorkerThread (0xfffff80001ec07c8)

    Stack Init fffffa600232bdb0 Current fffffa600232b800

    Base fffffa600232c000 Limit fffffa6002326000 Call 0

    Priority 15 BasePriority 12 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`0232b840 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`0232b980 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`0232b9f0 fffffa60`00c9c698 nt!KeWaitForSingleObject+0x2cb

    fffffa60`0232ba80 fffffa60`00c97a06 MyBadDSM+0xb698

    fffffa60`0232bb10 fffffa60`00becca3 MyBadDSM+0x6a06

    fffffa60`0232bb60 fffff800`02119693 mpio!MPIOPathVerify+0x14b

    fffffa60`0232bcc0 fffff800`01ec08c3 nt!IopProcessWorkItem+0x27

    fffffa60`0232bcf0 fffff800`020c3ee7 nt!ExpWorkerThread+0xfb

    fffffa60`0232bd50 fffff800`01ef6616 nt!PspSystemThreadStartup+0x57

    fffffa60`0232bd80 00000000`00000000 nt!KxStartSystemThread+0x16

     

    At this point it was evident that something has gone wrong with either the storage or the HBA adapter, Inspection of the MyBadDSM.sys code at this point could reveal if it was the problem with the driver or something has gone wrong with the HBA or the Fiber connection to the storage.

     

    This brings me to the end of this post. I hope you enjoyed reading this post. Goodbye. Until next time!

Page 1 of 1 (1 items)