Debug Sleuth at Work: Hung Server…..Mystery of the unprocessed SMB work item.

Debug Sleuth at Work: Hung Server…..Mystery of the unprocessed SMB work item.

  • Comments 5

Hello folks, Pushkar here. I recently worked on a case where the server was hung at “Applying User Settings” during the logon phase. You might ask what’s going to be new in this post, NTDebugging has bunch of posts covering such debug scenarios J. In my defense, this case was particularly interesting because it defied the general logic that I had about processing of network connection request. I am still a newbie in this area getting my feet wet in the arena of debugging. I had to seek help from one of the GES guru Aman Jain. He is an expert in the Windows Networking Stack. He helped me in confirming my suspicion.

 

As we were stuck at “Applying User Settings”, the obvious path was to look at the activity of Winlogon to understand why is it stuck? I could see that there were seven logon requests which haven’t been processed yet.

 

1: kd> !session

Sessions on machine: 7

Valid Sessions: 0 1 2 3 4 5 6

 

1: kd> !process 0 0 winlogon.exe

PROCESS fffffa806e132c10

    SessionId: 1  Cid: 0384    Peb: 7fffffdf000  ParentCid: 0298

    DirBase: db59d000  ObjectTable: fffff88007fd0430  HandleCount: 139.

    Image: winlogon.exe

<snip>

 

The above command returns 7 instances of Winlogon.exe, one instance per session. I picked one of the sessions as all the 7 sessions were in the same state.

 

As this was a Windows 2008 Server, winlogon would open an LPC Channel with Logonui.exe to process the user’s logon and then it will call into Local Session Manager (LSM.EXE) to spawn the users’ session. I could see that winlogon is waiting on message reply from Logonui.exe.

 

1: kd> !thread fffffa80`73d3d770

THREAD fffffa8073d3d770  Cid 0384.1e2c  Teb: 000007fffffd7000 Win32Thread: 0000000000000000 WAIT: (WrLpcReply) UserMode Non-Alertable

    fffffa8073d3db00  Semaphore Limit 0x1

Waiting for reply to ALPC Message fffff8801ffd9cf0 : queued at port fffffa804e9d9370 : owned by process fffffa806e134c10

Not impersonating

DeviceMap                 fffff880000073a0

Owning Process            fffffa806e132c10       Image:         winlogon.exe

Attached Process          N/A            Image:         N/A

Wait Start TickCount      4192487        Ticks: 1882 (0:00:00:29.359)

Context Switch Count      2            

UserTime                  00:00:00.000

KernelTime                00:00:00.000

Win32 Start Address 0x0000000076f67cb0

Stack Init fffffa60110a9db0 Current fffffa60110a9670

Base fffffa60110aa000 Limit fffffa60110a4000 Call 0

Priority 13 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

Child-SP          RetAddr           Call Site

fffffa60`110a96b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

fffffa60`110a97f0 fffff800`01eb035b nt!KiSwapThread+0x13a

fffffa60`110a9860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb

fffffa60`110a98f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92

fffffa60`110a9980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44

fffffa60`110a99e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f

fffffa60`110a9b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f

fffffa60`110a9bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13

00000000`01f8f098 00000000`00000000 0x76f9756a

 

1: kd> !alpc /m fffff8801ffd9cf0

 

Message @ fffff8801ffd9cf0

  MessageID             : 0x0464 (1124)

  CallbackID            : 0x2844F0 (2639088)

  SequenceNumber        : 0x0000000B (11)

  Type                  : LPC_REQUEST

  DataLength            : 0x0064 (100)

  TotalLength           : 0x008C (140)

  Canceled              : No

  Release               : No

  ReplyWaitReply        : No

  Continuation          : Yes

  OwnerPort             : fffffa80481ba9b0 [ALPC_CLIENT_COMMUNICATION_PORT]

  WaitingThread         : fffffa8073d3d770

  QueueType             : ALPC_MSGQUEUE_PENDING

  QueuePort             : fffffa804e9d9370 [ALPC_CONNECTION_PORT]

  QueuePortOwnerProcess : fffffa806e134c10 (LogonUI.exe)

  ServerThread          : fffffa804f96d5a0                  <----------- LPC Message Server Thread

  QuotaCharged          : No

  CancelQueuePort       : 0000000000000000

  CancelSequencePort    : 0000000000000000

  CancelSequenceNumber  : 0x00000000 (0)

  ClientContext         : 0000000000243d00

  ServerContext         : 0000000000000000

  PortContext           : 0000000001c27600

  CancelPortContext     : 0000000000000000

  SecurityData          : 0000000000000000

  View                  : 0000000000000000

 

1: kd> !thread fffffa804f96d5a0

THREAD fffffa804f96d5a0  Cid 01ac.09fc  Teb: 000007fffffd7000 Win32Thread: fffff900c067dd50 WAIT: (UserRequest) UserMode Non-Alertable

    fffffa80516be920  NotificationEvent

Not impersonating

DeviceMap                 fffff880000073a0

Owning Process            fffffa806e134c10       Image:         LogonUI.exe

Attached Process          N/A            Image:         N/A

Wait Start TickCount      4192487        Ticks: 1882 (0:00:00:29.359)

Context Switch Count      13                 LargeStack

UserTime                  00:00:00.000

KernelTime                00:00:00.000

Win32 Start Address 0x000007fefed47780

Stack Init fffffa60114e4db0 Current fffffa60114e4940

Base fffffa60114e5000 Limit fffffa60114df000 Call 0

Priority 14 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

Child-SP          RetAddr           Call Site

fffffa60`114e4980 fffff800`01ebb6fa nt!KiSwapContext+0x7f

fffffa60`114e4ac0 fffff800`01eb035b nt!KiSwapThread+0x13a

fffffa60`114e4b30 fffff800`02119608 nt!KeWaitForSingleObject+0x2cb

fffffa60`114e4bc0 fffff800`01eb8ef3 nt!NtWaitForSingleObject+0x98

fffffa60`114e4c20 00000000`76f96d5a nt!KiSystemServiceCopyEnd+0x13

00000000`0444ef18 00000000`00000000 0x76f96d5a

 

This server thread is processing a message, but we can’t tell what it is waiting for. If this thread was finished processing a message we would have seen “WAIT: (WrLpcReceive)” for the thread wait reason. Often this type of wait is for a critical section or other synchronization object that is waiting for another thread in the process. This begged an investigation of Logonui.exe, to see what the other threads were doing.

 

I noticed that another Logonui.exe thread is waiting on LPC Message processing by a thread in LSM.EXE.

 

1: kd> !thread fffffa80`4e9e5060

THREAD fffffa804e9e5060  Cid 01ac.0438  Teb: 000007fffffd9000 Win32Thread: fffff900c06db2e0 WAIT: (WrLpcReply) UserMode Non-Alertable

    fffffa804e9e53f0  Semaphore Limit 0x1

Waiting for reply to ALPC Message fffff8801ffaacf0 : queued at port fffffa804e9b6060 : owned by process fffffa806e130c10

Not impersonating

DeviceMap                 fffff880000073a0

Owning Process            fffffa806e134c10       Image:         LogonUI.exe

Attached Process          N/A            Image:         N/A

Wait Start TickCount      4192488        Ticks: 1881 (0:00:00:29.343)

Context Switch Count      165                 LargeStack

UserTime                  00:00:00.000

KernelTime                00:00:00.000

Win32 Start Address 0x000007fefbaa151c

Stack Init fffffa6008d26db0 Current fffffa6008d26670

Base fffffa6008d27000 Limit fffffa6008d20000 Call 0

Priority 15 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

Child-SP          RetAddr           Call Site

fffffa60`08d266b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

fffffa60`08d267f0 fffff800`01eb035b nt!KiSwapThread+0x13a

fffffa60`08d26860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb

fffffa60`08d268f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92

fffffa60`08d26980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44

fffffa60`08d269e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f

fffffa60`08d26b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f

fffffa60`08d26bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13

00000000`0272ea78 00000000`00000000 0x76f9756a

 

At this point I decided to follow the LPC message chain trail to see where it ends and who is ultimately blocking it.

 

1: kd> !alpc /m fffff8801ffaacf0

 

Message @ fffff8801ffaacf0

  MessageID             : 0x0474 (1140)

  CallbackID            : 0x2844F6 (2639094)

  SequenceNumber        : 0x0000000D (13)

  Type                  : LPC_REQUEST

  DataLength            : 0x0058 (88)

  TotalLength           : 0x0080 (128)

  Canceled              : No

  Release               : No

  ReplyWaitReply        : No

  Continuation          : Yes

  OwnerPort             : fffffa804ea03c80 [ALPC_CLIENT_COMMUNICATION_PORT]

  WaitingThread         : fffffa804e9e5060

  QueueType             : ALPC_MSGQUEUE_PENDING

  QueuePort             : fffffa804e9b6060 [ALPC_CONNECTION_PORT]

  QueuePortOwnerProcess : fffffa806e130c10 (lsm.exe)

  ServerThread          : fffffa805180c700            <---- LPC Server thread for this message in LSM.EXE

  QuotaCharged          : No

  CancelQueuePort       : 0000000000000000

  CancelSequencePort    : 0000000000000000

  CancelSequenceNumber  : 0x00000000 (0)

  ClientContext         : 0000000001cf0e90

  ServerContext         : 0000000000000000

  PortContext           : 0000000001dc9df0

  CancelPortContext     : 0000000000000000

  SecurityData          : 0000000000000000

  View                  : 0000000000000000

 

I noticed that the server thread which is supposed to process the LPC message for above thread is itself waiting on a message reply from a thread in Svchost.exe. LSM had bunch of other threads which were waiting on LPC message processing from server threads in Svchost.exe. I dumped out all the threads and their LPC message structure to locate the server threads in Svchost.exe which seems to be doing some activity and found that only for the thread below the server thread had some interesting information as other LPC Server threads in svchost.exe were Notification Timers.

 

1: kd> !thread fffffa80`50910060

THREAD fffffa8050910060  Cid 02e0.1a70  Teb: 000007fffffda000 Win32Thread: 0000000000000000 WAIT: (WrLpcReply) UserMode Non-Alertable

    fffffa80509103f0  Semaphore Limit 0x1

Waiting for reply to ALPC Message fffff8801fe00cf0 : queued at port fffffa804e876b60 : owned by process fffffa804ea4fc10

Not impersonating

DeviceMap                 fffff880000073a0

Owning Process            fffffa806e130c10       Image:         lsm.exe

Attached Process          N/A            Image:         N/A

Wait Start TickCount      3187894        Ticks: 1006475 (0:04:21:41.110)

Context Switch Count      97            

UserTime                  00:00:00.000

KernelTime                00:00:00.000

Win32 Start Address 0x000007fefed47780

Stack Init fffffa60125b1db0 Current fffffa60125b1670

Base fffffa60125b2000 Limit fffffa60125ac000 Call 0

Priority 9 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5

Kernel stack not resident.

Child-SP          RetAddr           Call Site

fffffa60`125b16b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

fffffa60`125b17f0 fffff800`01eb035b nt!KiSwapThread+0x13a

fffffa60`125b1860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb

fffffa60`125b18f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92

fffffa60`125b1980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44

fffffa60`125b19e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f

fffffa60`125b1b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f

fffffa60`125b1bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13

00000000`0205e538 00000000`00000000 0x76f9756a

 

I looked at the server thread for this LPC message and that was also waiting on a notification event doing nothing. It again begged an investigation of the state of Svchost.exe to find out why the server thread hasn’t replied yet?

 

1: kd> !alpc /m fffff8801fe00cf0

 

Message @ fffff8801fe00cf0

  MessageID             : 0x03C4 (964)

  CallbackID            : 0x2140C2 (2179266)

  SequenceNumber        : 0x00000021 (33)

  Type                  : LPC_REQUEST

  DataLength            : 0x00C0 (192)

  TotalLength           : 0x00E8 (232)

  Canceled              : No

  Release               : No

  ReplyWaitReply        : No

  Continuation          : Yes

  OwnerPort             : fffffa804ef15d30 [ALPC_CLIENT_COMMUNICATION_PORT]

  WaitingThread         : fffffa8050910060

  QueueType             : ALPC_MSGQUEUE_PENDING

  QueuePort             : fffffa804e876b60 [ALPC_CONNECTION_PORT]

  QueuePortOwnerProcess : fffffa804ea4fc10 (svchost.exe)

  ServerThread          : fffffa8051435940

  QuotaCharged          : No

  CancelQueuePort       : 0000000000000000

  CancelSequencePort    : 0000000000000000

  CancelSequenceNumber  : 0x00000000 (0)

  ClientContext         : 00000000021a0f60

  ServerContext         : 0000000000000000

  PortContext           : 00000000033834d0

  CancelPortContext     : 0000000000000000

  SecurityData          : 0000000000000000

  View                  : 0000000000000000

 

1: kd> !thread fffffa8051435940

THREAD fffffa8051435940  Cid 0574.1ea0  Teb: 000007fffffd3000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Alertable

    fffffa601086d3e8  SynchronizationEvent

IRP List:

    fffffa80514558b0: (0006,03a0) Flags: 00000884  Mdl: 00000000

Impersonation token:  fffff8801fc4b050 (Level Impersonation)

DeviceMap                 fffff8800e7c50e0

Owning Process            fffffa804ea4fc10       Image:         svchost.exe

Attached Process          N/A            Image:         N/A

Wait Start TickCount      3187932        Ticks: 1006437 (0:04:21:40.517)

Context Switch Count      111            

UserTime                  00:00:00.015

KernelTime                00:00:00.000

Win32 Start Address 0x000007fefed47780

Stack Init fffffa601086ddb0 Current fffffa601086d040

Base fffffa601086e000 Limit fffffa6010868000 Call 0

Priority 10 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5

Child-SP          RetAddr           Call Site

fffffa60`1086d080 fffff800`01ebb6fa nt!KiSwapContext+0x7f

fffffa60`1086d1c0 fffff800`01eb035b nt!KiSwapThread+0x13a

fffffa60`1086d230 fffff800`020d2a2e nt!KeWaitForSingleObject+0x2cb

fffffa60`1086d2c0 fffff800`020d2a87 nt!FsRtlCancellableWaitForMultipleObjects+0x62

fffffa60`1086d320 fffffa60`015da75a nt!FsRtlCancellableWaitForSingleObject+0x27

fffffa60`1086d360 fffffa60`015d8588 mup!MupLocateUncProviderForPath+0x40a

fffffa60`1086d4a0 fffffa60`015d8130 mup!MupiCallUncProvider+0xe8

fffffa60`1086d510 fffffa60`015d90ef mup!MupStateMachine+0x120

fffffa60`1086d560 fffffa60`00f31e91 mup!MupCreate+0x2c3

fffffa60`1086d5e0 fffffa60`00f4e26c fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x211

fffffa60`1086d650 fffff800`021267d9 fltmgr!FltpCreate+0x25d

fffffa60`1086d700 fffff800`0211ecb2 nt!IopParseDevice+0x5f9

fffffa60`1086d8a0 fffff800`0211f842 nt!ObpLookupObjectName+0x593

fffffa60`1086d9b0 fffff800`02124e47 nt!ObOpenObjectByName+0x2f2

fffffa60`1086da80 fffff800`0212ecc8 nt!IopCreateFile+0x287

fffffa60`1086db20 fffff800`01eb8ef3 nt!NtCreateFile+0x78

fffffa60`1086dbb0 00000000`76f9726a nt!KiSystemServiceCopyEnd+0x13

00000000`2674d738 00000000`00000000 0x76f9726a

 

Bingo! I found two threads in Svchost.exe which showed some activity and were blocked for over 4 hours on a UNC Path access. From the IRP which was issued by the thread above I found out that it was attempting to contact one of the DCs.

 

1: kd> !irp 0xfffffa80`514558b0

Irp is active with 4 stacks 3 is current (= 0xfffffa8051455a10)

No Mdl: No System Buffer: Thread fffffa8051435940:  Irp stack trace. 

     cmd  flg cl Device   File     Completion-Context

[  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                  Args: 00000000 00000000 00000000 00000000

[  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                  Args: 00000000 00000000 00000000 00000000

>[  0, 0]   0 e0 fffffa8031d1a840 fffffa805145aec0 fffffa6000f332b0-fffffa8047b6c930 Success Error Cancel

             \FileSystem\Mup  fltmgr!FltpSynchronizedOperationCompletion

                  Args: fffffa601086d848 01000040 00030000 00000000

[  0, 0]   0  0 fffffa8031d1b880 fffffa805145aec0 00000000-00000000   

             \FileSystem\FltMgr

                  Args: fffffa601086d848 01000040 00030000 00000000

 

1: kd> dt nt!_FILE_OBJECT fffffa805145aec0 FileName

   +0x058 FileName : _UNICODE_STRING "\DC005.contoso.com\PIPE\lsarpc"

 

For the details on MUP you can refer to Support for UNC Naming and MUP.

 

At first my thought was that something is wrong with the network stack because we are waiting on domain controller to respond. The thought was that either we had a networking issue or the Domain Controller wasn’t responding, but redirector would have timed out as it won’t wait for over 4 hours for an unresponsive connection. I started digging into the source code and found that we build a query context in MupLocateUncProviderForPath () function and issue another IRP which is sent down the I/O Stack. I searched for other IRPs which was issued by the above thread and found two IRPs.

 

1: kd> !irpfind 0 0 thread fffffa8051435940

Looking for IRPs with thread == fffffa8051435940

 

Scanning large pool allocation table for Tag: Irp? (fffffa804ea83000 : fffffa804ed83000)

 

 

Searching NonPaged pool (fffffa8030605000 : fffffa8c27e00000) for Tag: Irp?

 

  Irp    [ Thread ] irpStack: (Mj,Mn)   DevObj  [Driver]         MDL Process

fffffa80514558b0 [fffffa8051435940] irpStack: ( 0, 0)  fffffa8031d1a840 [ \FileSystem\Mup]

fffffa8073bc65e0 [fffffa8051435940] irpStack: ( e, 0)  fffffa804e842610 [ \FileSystem\mrxsmb]

  

1: kd> !irp fffffa8073bc65e0

Irp is active with 2 stacks 1 is current (= 0xfffffa8073bc66b0)

 No Mdl: No System Buffer: Thread fffffa8051435940:  Irp stack trace. 

     cmd  flg cl Device   File     Completion-Context

>[  e, 0]   0 e1 fffffa804e842610 fffffa804e843f20 fffffa60015d4024-fffffa601086d3b0 Success Error Cancel pending

             \FileSystem\mrxsmb     mup!MupiQueryPathCompletionRoutine

                  Args: 00000004 00000050 00140193 fffffa601086d408

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                  Args: 00000000 00000000 00000000 00000000

 

This is the IRP that MUP has issued an IOCTL IOCTL_REDIR_QUERY_PATH to the redirector.  If the status returned by the driver is anything other than STATUS_PENDING then it will wait by calling FsRtlCancellableWaitForSingleObject.

 

The 0xfffffa8051435940 thread doesn’t have any redirector function in its stack, which made me suspicious that there could be some other thread which is already working on similar operation and redirector is currently busy processing that request. I started looking for a worker thread which would be calling the redirector.

 

1: kd> !stacks 0 mrxsmb

Proc.Thread  .Thread  Ticks   ThreadState Blocker

                            [fffffa804e9c7c10 svchost.exe]

 248.000e84  fffffa804ee72bb0 ffe0d0ad Blocked    nt!FsRtlCancellableWaitForMultipleObjects+0x62

                            [fffffa804e9e12e0 svchost.exe]

 404.000f18  fffffa804edad4f0 ffe03e7e Blocked    nt!FsRtlCancellableWaitForMultipleObjects+0x62

 

The above command will parse through all the kernel stacks for all the process and dump out one line summary info for the threads where it finds reference to that module in its stack. I have snipped the output to keep things in context.

 

One of the sample threads where redirector is in action is shown below.

 

1: kd> !thread fffffa804ee72bb0

THREAD fffffa804ee72bb0  Cid 0248.0e84  Teb: 000007fffffa8000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Alertable

    fffffa8048f6c6b8  SynchronizationEvent

IRP List:

    fffffa8073ac3980: (0006,03a0) Flags: 00000884  Mdl: 00000000

    fffffa807257fc60: (0006,03a0) Flags: 00000884  Mdl: 00000000

    fffffa80511627b0: (0006,03a0) Flags: 00000884  Mdl: 00000000

Impersonation token:  fffff8800d515c40 (Level Impersonation)

Owning Process            fffffa804e9c7c10       Image:         svchost.exe

Attached Process          N/A            Image:         N/A

Wait Start TickCount      2043731        Ticks: 2150638 (0:09:19:10.167)

Context Switch Count      893            

UserTime                  00:00:00.031

KernelTime                00:00:00.109

Win32 Start Address 0x000007fefb96c534

Stack Init fffffa60125a1db0 Current fffffa60125a0ab0

Base fffffa60125a2000 Limit fffffa601259c000 Call 0

Priority 10 BasePriority 1 PriorityDecrement 0 IoPriority 2 PagePriority 5

Child-SP          RetAddr           Call Site

fffffa60`125a0af0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

fffffa60`125a0c30 fffff800`01eb035b nt!KiSwapThread+0x13a

fffffa60`125a0ca0 fffff800`020d2a2e nt!KeWaitForSingleObject+0x2cb

fffffa60`125a0d30 fffff800`020d2a87 nt!FsRtlCancellableWaitForMultipleObjects+0x62

fffffa60`125a0d90 fffffa60`0f625c14 nt!FsRtlCancellableWaitForSingleObject+0x27

fffffa60`125a0dd0 fffffa60`0f634e9e rdbss!RxWaitForStableCondition+0x11c

fffffa60`125a0e10 fffffa60`0f63599f rdbss!RxFindOrCreateConnections+0x572

fffffa60`125a0ef0 fffffa60`0f627318 rdbss!RxConstructVirtualNetRoot+0xb7

fffffa60`125a0f90 fffffa60`0f63815a rdbss!RxFindOrConstructVirtualNetRoot+0x594

fffffa60`125a1100 fffffa60`0f627eee rdbss!RxCreateTreeConnect+0x13e

fffffa60`125a1190 fffffa60`0f6041f6 rdbss!RxCommonCreate+0x20a

fffffa60`125a1250 fffffa60`0f62568a rdbss!RxFsdCommonDispatch+0x786

fffffa60`125a1340 fffffa60`1031f2cf rdbss!RxFsdDispatch+0x21a

fffffa60`125a13b0 fffffa60`015d85f9 mrxsmb!MRxSmbFsdDispatch+0xbf

fffffa60`125a13f0 fffffa60`015d8130 mup!MupiCallUncProvider+0x159

fffffa60`125a1460 fffffa60`015d90ef mup!MupStateMachine+0x120

fffffa60`125a14b0 fffffa60`00f4e342 mup!MupCreate+0x2c3

fffffa60`125a1530 fffff800`021267d9 fltmgr!FltpCreate+0x333

fffffa60`125a15e0 fffff800`0211ecb2 nt!IopParseDevice+0x5f9

fffffa60`125a1780 fffff800`0211f842 nt!ObpLookupObjectName+0x593

fffffa60`125a1890 fffff800`02124e47 nt!ObOpenObjectByName+0x2f2

fffffa60`125a1960 fffff800`0212ecc8 nt!IopCreateFile+0x287

fffffa60`125a1a00 fffff800`01eb8ef3 nt!NtCreateFile+0x78

fffffa60`125a1a90 fffff800`01eb9400 nt!KiSystemServiceCopyEnd+0x13

fffffa60`125a1c98 fffffa60`0f664a5e nt!KiServiceLinkage

<snip>

 

The above thread is waiting for the condition of the remote connection to become stable and that’s the reason it called the function RxWaitForStableCondition(), which accepts RX_CONTEXT as its first argument and the reason for why we entered into the wait as second argument. As we can see that the thread above was waiting for over 9 hours and 19 minutes.

 

For additional reading you can go through RX_CONTEXT and IRP Management.

 

The unassembled code for the function below is to find out where we store the pointer to the RX_CONTEXT structure. You may remember about the challenges of x64 debugging from this post.

 

1: kd> .frame /r 05

05 fffffa60`11e42dd0 fffffa60`0f634e9e rdbss!RxWaitForStableCondition+0x11c

rax=0000000000000000 rbx=fffffa804f4de010 rcx=0000000000000000

rdx=0000000000000000 rsi=0000000000000001 rdi=0000000000000000

rip=fffffa600f625c14 rsp=fffffa6011e42dd0 rbp=fffffa805113a490

r8=0000000000000000  r9=0000000000000000 r10=0000000000000000

r11=0000000000000000 r12=fffffa805113a4a0 r13=fffffa6011e42fd8

r14=fffffa804f4de010 r15=0000000000000000

iopl=0         nv up di pl nz na pe nc

cs=0000  ss=0000  ds=0000  es=0000  fs=0000  gs=0000             efl=00000000

rdbss!RxWaitForStableCondition+0x11c:

fffffa60`0f625c14 85c0            test    eax,eax

 

1: kd> u rdbss!RxWaitForStableCondition

rdbss!RxWaitForStableCondition:

fffffa60`0f625af8 488bc4          mov     rax,rsp

fffffa60`0f625afb 48895808        mov     qword ptr [rax+8],rbx

fffffa60`0f625aff 48896810        mov     qword ptr [rax+10h],rbp

fffffa60`0f625b03 48897018        mov     qword ptr [rax+18h],rsi

fffffa60`0f625b07 48897820        mov     qword ptr [rax+20h],rdi

fffffa60`0f625b0b 4154            push    r12

fffffa60`0f625b0d 4883ec30        sub     rsp,30h

fffffa60`0f625b11 33ff            xor     edi,edi

1: kd> u

rdbss!RxWaitForStableCondition+0x1b:

fffffa60`0f625b13 833a03          cmp     dword ptr [rdx],3

fffffa60`0f625b16 418af1          mov     sil,r9b

fffffa60`0f625b19 4d8be0          mov     r12,r8

fffffa60`0f625b1c 488bea          mov     rbp,rdx

fffffa60`0f625b1f 488bd9          mov     rbx,rcx     <--------- RX_CONTEXT is in RBX register

fffffa60`0f625b22 7c07            jl      rdbss!RxWaitForStableCondition+0x33 (fffffa60`0f625b2b)

fffffa60`0f625b24 33c0            xor     eax,eax

fffffa60`0f625b26 e968010000      jmp     rdbss!RxWaitForStableCondition+0x19b (fffffa60`0f625c93)

 

1: kd> !pool fffffa804f4de010 2           <---- [Argument 1] Here is pointer to RX_CONTEXT Structure

Pool page fffffa804f4de010 region is Nonpaged pool

*fffffa804f4de000 size:  670 previous size:    0  (Allocated) *RxIr

            Pooltag RxIr : RDBSS RxContext, Binary : rdbss.sys

 

1: kd> !pool fffffa805113a490 2           <---- [Argument 2] Here is the Pointer to the Reason for Wait

Pool page fffffa805113a490 region is Nonpaged pool

*fffffa805113a3b0 size:  280 previous size:   10  (Allocated) *RxSc

            Pooltag RxSc : RDBSS SrvCall, Binary : rdbss.sys

 

1: kd> dc fffffa805113a490 L1

fffffa80`5113a490  00000001               <---------------- Transport is in Transition state

 

We have bunch of RX_CONTEXT describing the IRPs pending on redirector for the transport of the remote connection to stabilize.

 

Next obvious thought that came to my mind was to investigate the list where we store the IRPs which have been issued by the redirector for the transport driver to process and understand the state of those IRPs and understand why we are not progressing further? RxCeSend is the API which submits the IRPs to the Transport driver which is maintained in the list below. For more information about Redirector related APIs you can refer Connection Engine Management.

 

1: kd> x rdbss!RxIrpsList

fffffa60`0f620110 rdbss!RxIrpsList = <no type information>

 

The rdbss!RxIrpsList is a pointer to the structure of type LIST_ENTRY.

 

1: kd> dq fffffa60`0f620110 L2

fffffa60`0f620110  fffffa80`510fd0d0 fffffa80`4edb5590

 

1: kd> dq fffffa80`510fd0d0 L2            <---------------- LIST_ENTRY.FLINK 

fffffa80`510fd0d0  fffffa80`50d2e300 fffffa60`0f620110

 

1: kd> dq fffffa80`4edb5590 L2            <---------------- LIST_ENTRY.BLINK

fffffa80`4edb5590  fffffa60`0f620110 fffffa80`4ff450d0

 

This being a x64 windows each entry in this list is 0x30 bytes, this can change with OS version or driver updates, so don’t rely on this list or any of the offset that I referenced here in your driver code. Entry in the first column of this list is a pointer to the list entry. At offset 10 we store the pointer to the IRP. Add 0x10 bytes to get the IRP.

 

1: kd> dl 0xfffffa80`510fd0d0

fffffa80`510fd0d0  fffffa80`50d2e300 fffffa60`0f620110

fffffa80`510fd0e0  fffffa80`7274e5a0 00000000`00000000

fffffa80`50d2e300  fffffa80`50ee4490 fffffa80`510fd0d0

fffffa80`50d2e310  fffffa80`50f16160 00000000`00000000

fffffa80`50ee4490  fffffa80`727f1140 fffffa80`50d2e300

fffffa80`50ee44a0  fffffa80`51491760 00000000`00000000

fffffa80`727f1140  fffffa80`73a95860 fffffa80`50ee4490

fffffa80`727f1150  fffffa80`511b9c60 00000000`00000000

fffffa80`73a95860  fffffa80`73a8e3e0 fffffa80`727f1140

fffffa80`73a95870  fffffa80`73a77c60 00000000`00000000

fffffa80`73a8e3e0  fffffa80`73a9c290 fffffa80`73a95860

fffffa80`73a8e3f0  fffffa80`73a46a30 00000000`00000000

fffffa80`73a9c290  fffffa80`725ca490 fffffa80`73a8e3e0

<Snip>

 

1: kd> !irp poi(fffffa80`510fd0d0+10)

Irp is active with 2 stacks 2 is current (= 0xfffffa807274e6b8)

 No Mdl: No System Buffer: Thread 00000000:  Irp stack trace. 

     cmd  flg cl Device   File     Completion-Context

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                  Args: 00000000 00000000 00000000 00000000

>[  f, 3]   0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa601030df70-fffffa805119fda0 Success Error Cancel pending

             \Driver\Smb      mrxsmb!RxTdiAsynchronousConnectCompletion

                  Args: 00000000 fffffa8050d127e0 00000000 fffffa601031b6f0

 

1: kd> !irp poi(fffffa80`50d2e300+10)

Irp is active with 2 stacks 2 is current (= 0xfffffa8050f16278)

 No Mdl: System buffer=fffffa8050d59c88: Thread 00000000:  Irp stack trace. 

     cmd  flg cl Device   File     Completion-Context

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                  Args: 00000000 00000000 00000000 00000000

>[  e, 0]   0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa60103063a4-fffffa8050d59c60 Success Error Cancel pending

             \Driver\Smb      mrxsmb!RxTdiAsynchronousRequestCompletion

                  Args: 00000000 00000030 0021c0e8 00000000

 

1: kd> !irp poi(fffffa80`50ee4490+10)

Irp is active with 2 stacks 2 is current (= 0xfffffa8051491878)

 No Mdl: System buffer=fffffa805115dcb8: Thread 00000000:  Irp stack trace. 

     cmd  flg cl Device   File     Completion-Context

 [  0, 0]   0  0 00000000 00000000 00000000-00000000   

 

                  Args: 00000000 00000000 00000000 00000000

>[  e, 0]   0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa60103063a4-fffffa805115dc90 Success Error Cancel pending

             \Driver\Smb      mrxsmb!RxTdiAsynchronousRequestCompletion

                  Args: 00000000 00000030 0021c0e8 00000000

 

All these IRPs are waiting on SMB Transport Driver (responsible for handling TCP/IP port 445 related traffic) for further processing. The obvious feeling was to find out if there are threads which are making call to SMB function to see what they are they doing?

 

Note: In Windows 2008 R2 and Windows 7 onwards redirector doesn’t use SMB as the transport driver directly.

 

I searched through the dump but I couldn’t find a single thread which was calling any of the SMB function. This is the point where it gets more interesting. Why we don’t have any thread which is calling SMB function when we have bunch of IRPs pending on that driver?

 

This was the time for me to collaborate with the Network Stack Guru Aman Jain to get some insight to figure out the answer to my question.

 

It turned out that SMB driver will create a work item and it will use the System Delayed Worker Queue to queue the work item that it created and will populate the routine in the work Item structure which is supposed to process the work item once the worker thread picks up the work item.

 

1: kd> x nt!ExWorkerQueue

fffff800`01ff08a0 nt!ExWorkerQueue = <no type information>

 

Above pointer is the list head for the 3 queues that system has namely the “Critical WorkQueue” , “Delayed WorkQueue” and the “HyperCritical WorkQueue” as see in the output of !exqueue command in the debugger. Each entry in this list is 0x58 bytes long and is of type _KQUEUE.

 

Structure referenced below decides how a various work queues are enumerated. This structure is documented in this MSDN article.

 

typedef enum _WORK_QUEUE_TYPE {

  CriticalWorkQueue        = 0,

  DelayedWorkQueue         = 1,

  HyperCriticalWorkQueue   = 2,

  MaximumWorkQueue         = 3

} WORK_QUEUE_TYPE;

 

Critical WorkQueue

1: kd> dt nt!_KQUEUE fffff800`01ff08a0   

   +0x000 Header           : _DISPATCHER_HEADER

   +0x018 EntryListHead    : _LIST_ENTRY [ 0xfffff800`01ff08b8 - 0xfffff800`01ff08b8 ]

   +0x028 CurrentCount     : 0

   +0x02c MaximumCount     : 0x10

   +0x030 ThreadListHead   : _LIST_ENTRY [ 0xfffffa80`30a4da08 - 0xfffffa80`30a39d68 ]

 

Delayed WorkQueue

1: kd> dt nt!_KQUEUE fffff800`01ff08a0+58

   +0x000 Header           : _DISPATCHER_HEADER

   +0x018 EntryListHead    : _LIST_ENTRY [ 0xfffffa80`50226f00 - 0xfffffa80`5183f9f0 ]

   +0x028 CurrentCount     : 0

   +0x02c MaximumCount     : 0x10

   +0x030 ThreadListHead   : _LIST_ENTRY [ 0xfffffa80`30a398d8 - 0xfffffa80`51157d68 ]

 

HyperCritical WorkQueue

1: kd> dt nt!_KQUEUE fffff800`01ff08a0+58+58

   +0x000 Header           : _DISPATCHER_HEADER

   +0x018 EntryListHead    : _LIST_ENTRY [ 0xfffff800`01ff0968 - 0xfffff800`01ff0968 ]

   +0x028 CurrentCount     : 0

   +0x02c MaximumCount     : 0x10

   +0x030 ThreadListHead   : _LIST_ENTRY [ 0xfffffa80`30a4e1f8 - 0xfffffa80`30a4e1f8 ]

 

1: kd> dd fffff80001ff08a0+58+40 L1

fffff800`01ff0938  00000010 <----------- We created 0x10 (16) dynamic threads

 

We create additional dynamic threads as in the event when the default number of threads in a particular work queue is unable to handle the Work Items being queued. In this case we created 16 additional threads, which is the maximum number of dynamic thread which can be created.

 

1: kd> dd fffff80001ff08a0+58+4c L1

fffff800`01ff0944  00004667

 

1: kd> ? 00004667

Evaluate expression: 18023 = 00000000`00004667  <-------- 18,023 work Items pending for processing

 

I could see that there are work items pending in the Delayed WorkQueue because of the following relation which is not present in any other WorkQueue that we dumped.

 

nt!_KQUEUE->EntryListHead.Flink != nt!_QUEUE->EntryListHead.Blink

 

For details about Work Item Processing and Work Items you can refer following docs:

System Worker Threads

IO_WORKITEM

IoSizeofWorkItem

 

The next obvious steps involved dumping all work items pending in the list to see if I have work items related to SMB driver? And if the Work Items are present for SMB then why they haven’t been picked up by the worker thread?

 

1: kd> !list "-t nt!_LIST_ENTRY.Flink -e -x\"dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1\" fffffa80`50226f00"

 

<Snip>

 

dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

fffffa80`4b7c7e30  fffffa80`50a720e0

fffffa80`4b7c7e50  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

 

dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

fffffa80`72544e00  fffffa80`31aeb580

fffffa80`72544e20  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

 

dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

fffffa80`72759d50  fffffa80`51184690

fffffa80`72759d70  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

 

dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

fffffa80`73b416e0  fffffa80`5143c9c0

fffffa80`73b41700  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

 

<snip>

 

dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

fffffa80`518392e0  fffffa80`31fafa60

fffffa80`51839300  fffffa60`00becb58 mpio!MPIOPathVerify

 

dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

fffffa80`31fafa60  fffffa80`31fafab0

fffffa80`31fafa80  fffffa60`00becb58 mpio!MPIOPathVerify

 

dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

fffffa80`31fafab0  fffffa80`31fafc30

fffffa80`31fafad0  fffffa60`00becb58 mpio!MPIOPathVerify

 

dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

fffffa80`31fafc30  fffffa80`5183f9a0

fffffa80`31fafc50  fffffa60`00becb58 mpio!MPIOPathVerify

 

dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

fffffa80`5183f9a0  fffffa80`4ea092e0

fffffa80`5183f9c0  fffffa60`00becb58 mpio!MPIOPathVerify

 

dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

fffffa80`4ea092e0  fffffa80`5183f9f0

fffffa80`4ea09300  fffffa60`00becb58 mpio!MPIOPathVerify

 

Voila! Here are my SMB work items in the Delayed WorkQueue waiting to be processed. Another interesting observation was that this queue had around 18,023 work items in this queue waiting to be picked up by the worker threads. Most of the work items were related to MPIO. Investigation of the worker threads in the Delayed WorkQueue revealed that the worker threads were blocked processing MPIO work item on the DSM driver for the storage.

 

I picked one of the worker entry from the _KQUEUE->ThreadListHead for the Delayed WorkQueue and validated that it is a thread pool allocation with !pool. As the size of _ETHREAD structure is 0x450 bytes and the pool allocation was for 0x490 bytes, this indicates that the pointer to the actual thread object would be 0x40 bytes into that pool allocation from the beginning of that pool block.

 

1: kd> !pool 0xfffffa80`30a398d8  2

Pool page fffffa8030a398d8 region is Nonpaged pool

*fffffa8030a396e0 size:  490 previous size:   90  (Allocated) *Thre (Protected)

            Pooltag Thre : Thread objects, Binary : nt!ps

 

1: kd> ?? sizeof(nt!_ETHREAD)

unsigned int64 0x450

 

1: kd> !thread fffffa8030a396e0+40

THREAD fffffa8030a39720  Cid 0004.0044  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable

    fffffa8031e8fe58  SynchronizationEvent

Not impersonating

DeviceMap                 fffff880000073a0

Owning Process            fffffa80309ca110       Image:         System

Attached Process          N/A            Image:         N/A

Wait Start TickCount      2015907        Ticks: 2178462 (0:09:26:24.225)

Context Switch Count      42461            

UserTime                  00:00:00.000

KernelTime                00:00:14.976

Win32 Start Address nt!ExpWorkerThread (0xfffff80001ec07c8)

Stack Init fffffa600232bdb0 Current fffffa600232b800

Base fffffa600232c000 Limit fffffa6002326000 Call 0

Priority 15 BasePriority 12 PriorityDecrement 0 IoPriority 2 PagePriority 5

Child-SP          RetAddr           Call Site

fffffa60`0232b840 fffff800`01ebb6fa nt!KiSwapContext+0x7f

fffffa60`0232b980 fffff800`01eb035b nt!KiSwapThread+0x13a

fffffa60`0232b9f0 fffffa60`00c9c698 nt!KeWaitForSingleObject+0x2cb

fffffa60`0232ba80 fffffa60`00c97a06 MyBadDSM+0xb698

fffffa60`0232bb10 fffffa60`00becca3 MyBadDSM+0x6a06

fffffa60`0232bb60 fffff800`02119693 mpio!MPIOPathVerify+0x14b

fffffa60`0232bcc0 fffff800`01ec08c3 nt!IopProcessWorkItem+0x27

fffffa60`0232bcf0 fffff800`020c3ee7 nt!ExpWorkerThread+0xfb

fffffa60`0232bd50 fffff800`01ef6616 nt!PspSystemThreadStartup+0x57

fffffa60`0232bd80 00000000`00000000 nt!KxStartSystemThread+0x16

 

At this point it was evident that something has gone wrong with either the storage or the HBA adapter, Inspection of the MyBadDSM.sys code at this point could reveal if it was the problem with the driver or something has gone wrong with the HBA or the Fiber connection to the storage.

 

This brings me to the end of this post. I hope you enjoyed reading this post. Goodbye. Until next time!

Leave a Comment
  • Please add 6 and 6 and type the answer here:
  • Post
  • That was beautiful Pushkar! BTW, MSDN and Windows Internals 5th Ed have little information regarding Work Item Processing. Would you refer to any other source?

    [Apart from MSDN and Windows Internals, the only reference that is available is the Help file which ships with the Windows Driver Kit, which can be downloaded from the URL below

    http://www.microsoft.com/whdc/DevTools/WDK/WDKpkg.mspx ]

  • Thanks for sharing, very detailed and explanatory for me.

  • Excellent article. This was very educational and I liked the links. I had no idea about the changes in Win2008 such as “logonui” and “lsm”. I also loved the details provided about UNC Mapping, MUP, Network Redirector (UNC Provider), and the fact that the SMB driver creates work item and use of the System Delayed Worker Queue.

  • Great article! BTW, `!exqueue` will dump the worker queue.

    [I know !exqueue will dump the worker queue but getting the count of a Pending Work Items in that queue is not possible with that extension, it just keeps on dumping all the pending work items. The methodology that I used gives me the exact number of Pending Work items in a particular queue] 

  • Great post; thanks for so much detail.

Page 1 of 1 (5 items)