Hello my name is Roy; I’m an EE on the Microsoft global escalation services / CPR team. This blog post continues my previous posts on LPC.
Below I will discuss scenarios that are simulated to depict actual issues encountered by Microsoft Customer Support Services. The two following scenarios deal with client threads waiting for LPC reply and hence in some way leading to user experience of a hung system or component.
CASE1 – When everything in LPC looks intact
In this scenario tracing the reason for client threads wait is possible through LPC data structures and extensions.
The issue is a hung SMS (Systems Management Server) service. The OS (operating system) itself is working fine. It is just SMS that has hung.
Taking a user mode hang dump (adplus.vbs hang mode dump) does not help since all we see is that threads in SMS are making RPC calls and are stuck. We need to investigate where these RPC calls are stuck. From the user dump we can see one important thing about the RPC calls is that they are implemented over LPC. LPC is completely implemented in the kernel and hence taking a kernel dump would be the next step in this case.
02e9e698 7c821c94 ntdll!KiFastSystemCallRet
02e9e69c 77c72700 ntdll!NtRequestWaitReplyPort+0xc <--- RPC implemented over LPC
02e9e6f4 77c713ba RPCRT4!LRPC_CCALL::SendReceive+0x230
02e9e700 77788056 RPCRT4!I_RpcSendReceive+0x24 <--- RPC calls made
02e9e71c 77788011 ole32!ThreadSendReceive+0xf5
02e9e738 77787ed7 ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x112
02e9e818 77695349 ole32!CRpcChannelBuffer::SendReceive2+0xc1
02e9e884 776952e1 ole32!CAptRpcChnl::SendReceive+0xab
02e9e8d8 77ce127e ole32!CCtxComChnl::SendReceive+0x113
02e9e8f4 77ce13ca RPCRT4!NdrProxySendReceive+0x43
02e9ecdc 77ce11bd RPCRT4!NdrClientCall2+0x206
02e9ecfc 77c53a12 RPCRT4!ObjectStublessClient+0x8b
02e9ed0c 77688fd2 RPCRT4!ObjectStubless+0xf
02e9ed70 7768915a ole32!CRpcResolver::CreateInstance+0x14e
02e9efbc 776a9b68 ole32!CClientContextActivator::CreateInstance+0xfa
02e9effc 776a9a9a ole32!ActivationPropertiesIn::DelegateCreateInstance+0xf7
02e9f7b0 776b1275 ole32!ICoCreateInstanceEx+0x3cc
02e9f7d8 776b1244 ole32!CComActivator::DoCreateInstance+0x28
02e9f7fc 74cf1a51 ole32!CoCreateInstanceEx+0x23
02e9f830 74cf19fe wbemprox!CDCOMTrans::DoActualCCI+0x3d
02e9f874 74cf16b1 wbemprox!CDCOMTrans::DoCCI+0x12d
02e9f934 74cf152e wbemprox!CDCOMTrans::DoActualConnection+0x25f
02e9f960 74cf22ab wbemprox!CDCOMTrans::DoConnection+0x25
02e9f9a0 1806eaa9 wbemprox!CLocator::ConnectServer+0x7c
WARNING: Stack unwind information not available. Following frames may be wrong.
02e9fa50 186c31b6 ccmcore!CCM::Utility::Wmi::Namespace::Open+0x1a2
02e9fb44 186c357f PolicyAgent!DllUnregisterServer+0x10d8
02e9fbb0 18706a63 PolicyAgent!DllUnregisterServer+0x14a1
02e9fc8c 18708899 PolicyAgentEndpoint!DllGetClassObject+0x73d2
02e9fd6c 187097d9 PolicyAgentEndpoint!DllGetClassObject+0x9208
02e9fde0 01068bb8 PolicyAgentEndpoint!DllGetClassObject+0xa148
02e9fe6c 7c83b332 CcmExec+0x68bb8
02e9fec8 7c8161d0 ntdll!RtlpWaitOrTimerCallout+0x74
02e9feec 7c81a3c5 ntdll!RtlpAsyncWaitCallbackCompletion+0x37
02e9ff44 7c8200fc ntdll!RtlpWorkerCallout+0x71
02e9ff64 7c81a3fa ntdll!RtlpExecuteWorkerRequest+0x4f
02e9ff78 7c82017f ntdll!RtlpApcCallout+0x11
02e9ffb8 77e6608b ntdll!RtlpWorkerThread+0x61
02e9ffec 00000000 kernel32!BaseThreadStart+0x34
We examine threads belonging to ccmexec.exe from the kernel dump.
0: kd> !process 0 0 ccmexec.exe
PROCESS 895a6430 SessionId: 0 Cid: 01c8 Peb: 7ffde000 ParentCid: 0278
DirBase: 6f3c7000 ObjectTable: e22d6b00 HandleCount: 832.
0: kd> .process /p /r 895a6430
Implicit process is now 895a6430
Loading User Symbols
0: kd> !thread 88e75b10
THREAD 88e75b10 Cid 01c8.0de0 Teb: 7ffa6000 Win32Thread: e2365ea8 WAIT: (WrLpcReply) UserMode Non-Alertable
88e75cfc Semaphore Limit 0x1
Waiting for reply to LPC MessageId 00039236:
Current LPC port e2327022
Owning Process 895a6430 Image: CcmExec.exe
Wait Start TickCount 99831 Ticks: 46825 (0:00:12:11.640)
Context Switch Count 14119 LargeStack
Win32 Start Address ntdll!RtlpWorkerThread (0x7c83ad38)
Start Address kernel32!BaseThreadStartThunk (0x77e6b5f3)
Stack Init b9425000 Current b9424c08 Base b9425000 Limit b9421000 Call 0
Priority 13 BasePriority 8 PriorityDecrement 0
Kernel stack not resident.
ChildEBP RetAddr Args to Child
b9424c20 8083e6a2 88e75b88 88e75b10 88e75bb8 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
b9424c4c 8083f164 88e75cfc 88e75cd0 88e75b10 nt!KiSwapThread+0x284 (FPO: [Non-Fpo])
b9424c94 8093983f 88e75cfc 00000011 b9424d01 nt!KeWaitForSingleObject+0x346 (FPO: [Non-Fpo])
b9424d50 80834d3f 00000114 0017db40 0017db40 nt!NtRequestWaitReplyPort+0x776 (FPO: [Non-Fpo])
b9424d50 7c82ed54 00000114 0017db40 0017db40 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b9424d64)
02e9e698 7c821c94 77c72700 00000114 0017db40 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
02e9e69c 77c72700 00000114 0017db40 0017db40 ntdll!NtRequestWaitReplyPort+0xc (FPO: [3,0,0])
02e9e6f4 77c713ba 00127fd4 02e9e71c 77788056 RPCRT4!LRPC_CCALL::SendReceive+0x230 (FPO: [Non-Fpo])
02e9e700 77788056 00127fd4 02e9e800 0009f488 RPCRT4!I_RpcSendReceive+0x24 (FPO: [Non-Fpo])
02e9e71c 77788011 00000000 00000000 00000000 ole32!ThreadSendReceive+0xf5 (FPO: [Non-Fpo])
02e9e738 77787ed7 02e9e800 02e9e900 0009f488 ole32!CRpcChannelBuffer::SwitchAptAndDispatchCall+0x112 (FPO: [Non-Fpo])
02e9e818 77695349 0009f488 02e9e910 02e9e900 ole32!CRpcChannelBuffer::SendReceive2+0xc1 (FPO: [Non-Fpo])
02e9e884 776952e1 0009f488 02e9e910 02e9e900 ole32!CAptRpcChnl::SendReceive+0xab (FPO: [Non-Fpo])
02e9e8d8 77ce127e 0009f488 02e9e910 02e9e900 ole32!CCtxComChnl::SendReceive+0x113 (FPO: [Non-Fpo])
02e9e8f4 77ce13ca 000a1934 02e9e93c 0600016e RPCRT4!NdrProxySendReceive+0x43 (FPO: [Non-Fpo])
02e9ecdc 77ce11bd 77676298 7767951e 02e9ed14 RPCRT4!NdrClientCall2+0x206 (FPO: [Non-Fpo])
02e9ecfc 77c53a12 00000014 00000004 02e9ed70 RPCRT4!ObjectStublessClient+0x8b (FPO: [Non-Fpo])
02e9ed0c 77688fd2 000a1934 00000000 02e9f1b4 RPCRT4!ObjectStubless+0xf
02e9ed70 7768915a 02e9f1b4 02e9f704 00000010 ole32!CRpcResolver::CreateInstance+0x14e (FPO: [Non-Fpo])
02e9efbc 776a9b68 7778fcd0 00000000 00000000 ole32!CClientContextActivator::CreateInstance+0xfa (FPO: [Non-Fpo])
02e9effc 776a9a9a 02e9f1b4 00000000 02e9f704 ole32!ActivationPropertiesIn::DelegateCreateInstance+0xf7 (FPO: [Non-Fpo])
02e9f7b0 776b1275 74cf1a8c 00000000 00000014 ole32!ICoCreateInstanceEx+0x3cc (FPO: [Non-Fpo])
02e9f7d8 776b1244 74cf1a8c 00000000 00000014 ole32!CComActivator::DoCreateInstance+0x28 (FPO: [Non-Fpo])
02e9f7fc 74cf1a51 74cf1a8c 00000000 00000014 ole32!CoCreateInstanceEx+0x23 (FPO: [Non-Fpo])
02e9f830 74cf19fe 02e9f898 00000001 00000000 wbemprox!CDCOMTrans::DoActualCCI+0x3d (FPO: [Non-Fpo])
02e9f874 74cf16b1 02e9f898 00000001 00000000 wbemprox!CDCOMTrans::DoCCI+0x12d (FPO: [Non-Fpo])
02e9f934 74cf152e 01801974 00000000 00000000 wbemprox!CDCOMTrans::DoActualConnection+0x25f (FPO: [Non-Fpo])
02e9f960 74cf22ab 01801974 00000000 00000000 wbemprox!CDCOMTrans::DoConnection+0x25 (FPO: [Non-Fpo])
02e9f9a0 1806eaa9 00b594d0 01801974 00000000 wbemprox!CLocator::ConnectServer+0x7c (FPO: [Non-Fpo])
02e9fa50 186c31b6 0062bdaa 00000001 00000000 ccmcore!CCM::Utility::Wmi::Namespace::Open+0x1a2
02e9fb44 186c357f 02e9fb68 00000001 00000000 PolicyAgent!DllUnregisterServer+0x10d8
02e9fbb0 18706a63 0056a6f0 00000001 00000000 PolicyAgent!DllUnregisterServer+0x14a1
02e9fc8c 18708899 0063cd50 02e9fd0c 02e9fcec PolicyAgentEndpoint!DllGetClassObject+0x73d2
02e9fd6c 187097d9 ffffffff 00000864 00000000 PolicyAgentEndpoint!DllGetClassObject+0x9208
02e9fde0 01068bb8 00566a28 006af398 00000000 PolicyAgentEndpoint!DllGetClassObject+0xa148
02e9fe6c 7c83b332 000008b6 0011c700 0011c7f0 CcmExec+0x68bb8
02e9fec8 7c8161d0 01068a29 000008b6 0011c700 ntdll!RtlpWaitOrTimerCallout+0x74 (FPO: [Non-Fpo])
02e9feec 7c81a3c5 0011c7f0 7c889880 01e89b18 ntdll!RtlpAsyncWaitCallbackCompletion+0x37 (FPO: [Non-Fpo])
02e9ff44 7c8200fc 7c816199 0011c7f0 00000000 ntdll!RtlpWorkerCallout+0x71 (FPO: [Non-Fpo])
02e9ff64 7c81a3fa 00000000 0011c7f0 01e89b18 ntdll!RtlpExecuteWorkerRequest+0x4f (FPO: [Non-Fpo])
The !thread output contains some important information.
The next step is to find the server thread that is processing this LPC message. We will use LPC debugger extensions to trace the server thread.
0: kd> !lpc message 00039236
Searching message 39236 in threads ...
Server thread 89531020 is working on message 39236
Client thread 88e75b10 waiting a reply from 39236
Searching thread 88e75b10 in port rundown queues ...
Server communication port 0xe2300a38
Handles: 1 References: 1
The LpcDataInfoChainHead queue is empty
Connected port: 0xe2327020 Server connection port: 0xe13734b0
Client communication port 0xe2327020
Handles: 1 References: 9
Server connection port e13734b0 Name: epmapper
Handles: 1 References: 93
Server process : 896e1450 (svchost1.exe)
Queue semaphore : 8995ac10
Semaphore state 0 (0x0)
The message queue is empty
Messages in LpcDataInfoChainHead:
0000 e2365188 - Busy Id=0003832f From: 04a8.13f4 Context=80050003 [e1373530 . e297ab00]
Length=0044002c Type=00380001 (LPC_REQUEST)
Data: 00000001 00050242 421e5fa2 9b3b48b4 a695daf5 00000001
0000 e297ab00 - Busy Id=00039232 From: 01c8.14e4 Context=8021000d [e2365188 . e12cbea0]
Length=0044002c Type=00380001 (LPC_REQUEST)
Data: 00010001 00040242 b8d06b40 b8d06b00 000003f4 000010a0
0000 e12cbea0 - Busy Id=00039236 From: 01c8.0de0 Context=8021000d [e297ab00 . e1297430]
Data: 00010001 00040242 4195c844 8ff87eb3 2b70d0f6 001e001c
0000 e1297430 - Busy Id=0003c96d From: 0204.0224 Context=8022000e [e12cbea0 . e1534a48]
Data: 00010001 00040242 b8bb1b40 b8bb1b00 000003f4 00000fbc
0000 e1534a48 - Busy Id=0003db3b From: 01c8.10e0 Context=8021000d [e1297430 . e2515ca0]
Data: 00010001 00040242 0000dc04 000001c8 7b6ee790 9e8a0590
0000 e2515ca0 - Busy Id=0003db3f From: 01c8.0f5c Context=8021000d [e1534a48 . e29279e8]
0000 e29279e8 - Busy Id=0003db5a From: 01c8.1618 Context=8021000d [e2515ca0 . e1061008]
Data: 00010001 00040242 00000000 00000000 00000000 00000000
0000 e1061008 - Busy Id=0004617f From: 01c8.1220 Context=8021000d [e29279e8 . e295fa80]
Data: 00010001 00040242 00010c20 000001c8 37bd6287 35ebbd53
0000 e295fa80 - Busy Id=0004a38a From: 01c8.1708 Context=8021000d [e1061008 . e2966ea0]
Data: 00010001 00040242 0001a41a 000001c8 cb786dae 60cdf739
0000 e2966ea0 - Busy Id=0004d0c2 From: 04a8.04d4 Context=80050003 [e295fa80 . e1261008]
Data: 00020001 00030242 5415989d 3d8048c1 00070005 00000000
0000 e1261008 - Busy Id=000544d7 From: 01c8.0758 Context=8021000d [e2966ea0 . e1373530]
The LpcDataInfoChainHead queue contains 11 messages
Threads in RunDown queue : 0xe2364fc8 0xe297a940 0xe12cbce0 0xe1297270 0xe1534888 0xe2515ae0 0xe2927828 0xe1060e48 0xe295f8c0 0xe2966ce0 0xe1260e48
When a client thread sends a message to a server the message ID is stored in the clients ETHREAD.LpcReplyMessageId field. !lpc message searches the message ID in all threads in the system. Once client thread is identified ETHREAD.LpcWaitingOnPort will give the client communication port used for the LPC transaction. This has references to the server communication port and connection port.
!lpc message dumps the message queue for all the port objects. Important thing to note here is that even though the message queue is empty for connection port there are 11 messages chained to LpcDataInfoChainHead. This tells us that 11 messages were dequeued from the server message queue but the server has not replied to any of the messages since once the server responds to a message it is unchained from the LpcDataInfoChainHead list and freed.
Figure  LPC Message Queuing stages
The above figure shows messages are first queued to the MsgQueue and then chained to LpcDataInfoChainHead. This kind of chaining only happens when the message contains extra data. This does not happen for a LPC_CONNECTION_REQUEST or if the message does not contain extra data. In our scenario server thread has dequeued and started processing the hence we see it in the LpcDataInfoChainHead.
We now dump the server thread to see why it has not responded to the message for 12mins.
0: kd> .process /p /r 896e1450
Implicit process is now 896e1450
0: kd> !thread 89531020
THREAD 89531020 Cid 03f4.10a0 Teb: 7ffac000 Win32Thread: 00000000 WAIT: (WrLpcReply) UserMode Non-Alertable
8953120c Semaphore Limit 0x1
Waiting for reply to LPC MessageId 0003923d:
Current LPC port e22d29d8
Impersonation token: e15315d0 (Level Impersonation)
Owning Process 896e1450 Image: svchost1.exe
Wait Start TickCount 144629 Ticks: 2027 (0:00:00:31.671)
Context Switch Count 219
Win32 Start Address 0x00039236
LPC Server thread working on message Id 39236
Stack Init b8d07000 Current b8d06c08 Base b8d07000 Limit b8d04000 Call 0
Priority 10 BasePriority 8 PriorityDecrement 0
b8d06c20 8083e6a2 89531098 89531020 895310c8 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
b8d06c4c 8083f164 8953120c 895311e0 89531020 nt!KiSwapThread+0x284 (FPO: [Non-Fpo])
b8d06c94 8093983f 8953120c 00000011 b8d06d01 nt!KeWaitForSingleObject+0x346 (FPO: [Non-Fpo])
b8d06d50 80834d3f 0000056c 000f3fc0 000f3fc0 nt!NtRequestWaitReplyPort+0x776 (FPO: [Non-Fpo])
b8d06d50 7c82ed54 0000056c 000f3fc0 000f3fc0 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b8d06d64)
0097f0a8 7c821c94 77c72700 0000056c 000f3fc0 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0097f0ac 77c72700 0000056c 000f3fc0 000f3fc0 ntdll!NtRequestWaitReplyPort+0xc (FPO: [3,0,0])
0097f118 77c713ba 0097f154 0097f138 77c72c7f RPCRT4!LRPC_CCALL::SendReceive+0x230 (FPO: [Non-Fpo])
0097f124 77c72c7f 0097f154 76d320d8 0097f540 RPCRT4!I_RpcSendReceive+0x24 (FPO: [Non-Fpo])
0097f138 77ce219b 0097f180 000db5f8 00000000 RPCRT4!NdrSendReceive+0x2b (FPO: [Non-Fpo])
0097f520 76d41213 76d320d8 76d402c8 0097f540 RPCRT4!NdrClientCall2+0x22e (FPO: [Non-Fpo])
0097f538 76d3fc12 000b5618 000ce248 000ce268 rpcss!LocalCreateInstance+0x1c (FPO: [Non-Fpo])
0097f5cc 76d3f033 0097f7d8 0097f654 0097f824 rpcss!CServerListEntry::CallServer+0x127 (FPO: [Non-Fpo])
0097f604 76d3bc7f 00000001 00000008 00000000 rpcss!CServerTableEntry::CallRunningServer+0x181 (FPO: [Non-Fpo])
0097f670 76d3bac5 0097f7d8 00000000 000e415c rpcss!Activation+0x337 (FPO: [Non-Fpo])
0097f6cc 76d3dd08 000e415c 0097f7bc 0097f71c rpcss!ActivateFromProperties+0x213 (FPO: [Non-Fpo])
0097f6dc 76d3dd5c 76d738d0 00000000 000e415c rpcss!CScmActivator::CreateInstance+0x10 (FPO: [Non-Fpo])
0097f71c 76d3c162 000e415c 00000000 0097f7bc rpcss!ActivationPropertiesIn::DelegateCreateInstance+0xf7 (FPO: [Non-Fpo])
0097f764 76d3bdb6 000e415c 0097f7bc 00000000 rpcss!ActivateFromPropertiesPreamble+0x4e3 (FPO: [Non-Fpo])
0097f7ac 76d3dcdd 00000002 000ce2ac 00000000 rpcss!PerformScmStage+0xbb (FPO: [Non-Fpo])
0097f8c8 77c70f3b 000cdbb0 000ce248 000ce268 rpcss!SCMActivatorCreateInstance+0xa2 (FPO: [Non-Fpo])
0097f8f8 77ce23f7 76d3dc3e 0097fae0 00000007 RPCRT4!Invoke+0x30
0097fcf8 77ce26ed 00000000 00000000 000cdcdc RPCRT4!NdrStubCall2+0x299 (FPO: [Non-Fpo])
0097fd14 77c709be 000cdcdc 00094ef0 000cdcdc RPCRT4!NdrServerCall2+0x19 (FPO: [Non-Fpo])
0097fd48 77c7093f 76d3156c 000cdcdc 0097fdec RPCRT4!DispatchToStubInCNoAvrf+0x38 (FPO: [Non-Fpo])
0097fd9c 77c70865 00000004 00000000 76d73538 RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0x117 (FPO: [Non-Fpo])
0097fdc0 77c734b1 000cdcdc 00000000 76d73538 RPCRT4!RPC_INTERFACE::DispatchToStub+0xa3 (FPO: [Non-Fpo])
0097fdfc 77c71bb3 000b44d8 0008f0e8 000dea40 RPCRT4!LRPC_SCALL::DealWithRequestMessage+0x42c (FPO: [Non-Fpo])
0097fe20 77c75458 0008f120 0097fe38 000b44d8 RPCRT4!LRPC_ADDRESS::DealWithLRPCRequest+0x127 (FPO: [Non-Fpo])
0097ff84 77c5778f 0097ffac 77c5f7dd 0008f0e8 RPCRT4!LRPC_ADDRESS::ReceiveLotsaCalls+0x430 (FPO: [Non-Fpo])
0097ff8c 77c5f7dd 0008f0e8 00000000 00000000 RPCRT4!RecvLotsaCallsWrapper+0xd (FPO: [Non-Fpo])
0097ffac 77c5de88 00084298 0097ffec 77e6608b RPCRT4!BaseCachedThreadRoutine+0x9d (FPO: [Non-Fpo])
0097ffb8 77e6608b 000c0070 00000000 00000000 RPCRT4!ThreadStartRoutine+0x1b (FPO: [Non-Fpo])
0097ffec 00000000 77c5de6d 000c0070 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])
The server thread is waiting for a reply to another LPC message it issued while servicing the client request. Tracing the LPC message as above -
0: kd> !lpc message 0003923d
Searching message 3923d in threads ...
Client thread 89531020 waiting a reply from 3923d
Server thread 898cd998 is working on message 3923d
Searching thread 89531020 in port rundown queues ...
Server communication port 0xe22d58c8
Connected port: 0xe22d29d8 Server connection port: 0xe17a95f8
Client communication port 0xe22d29d8
Handles: 1 References: 15
Server connection port e17a95f8 Name: OLE71892D01F993498EA1C3757C71D5
Handles: 1 References: 37
Server process : 89bedd88 (svchost.exe)
Queue semaphore : 89718210
0000 e188c3b0 - Busy Id=00039239 From: 03f4.111c Context=80030001 [e17a9678 . e28b00f8]
Data: 00000001 00040342 0000d401 000004a8 2e6c8db6 4131cb31
0000 e28b00f8 - Busy Id=0003923d From: 03f4.10a0 Context=80030001 [e188c3b0 . e2846008]
0000 e2846008 - Busy Id=0003c96e From: 03f4.0df0 Context=80030001 [e28b00f8 . e27dcb70]
0000 e27dcb70 - Busy Id=0003db3c From: 03f4.1158 Context=80030001 [e2846008 . e28c0788]
Data: 00000001 00040342 0000d401 000004a8 2e6c8db6 4131cb31
0000 e28c0788 - Busy Id=0003db41 From: 03f4.0fbc Context=80030001 [e27dcb70 . e2983650]
0000 e2983650 - Busy Id=0003db5c From: 03f4.0f40 Context=80030001 [e28c0788 . e2596ae0]
0000 e2596ae0 - Busy Id=00046180 From: 03f4.03d8 Context=80030001 [e2983650 . e28c3ea0]
0000 e28c3ea0 - Busy Id=0004a38d From: 03f4.07c4 Context=80030001 [e2596ae0 . e294aea0]
0000 e294aea0 - Busy Id=0004d0c6 From: 03f4.1714 Context=80030001 [e28c3ea0 . e2977ea0]
Data: 00000001 00030342 0000d401 000004a8 2e6c8db6 4131cb31
0000 e2977ea0 - Busy Id=000544d8 From: 03f4.162c Context=80030001 [e294aea0 . e241f008]
0000 e241f008 - Busy Id=0006077e From: 03f4.031c Context=80030001 [e2977ea0 . e2940820]
0000 e2940820 - Busy Id=000638da From: 03f4.1754 Context=80030001 [e241f008 . e2576270]
0000 e2576270 - Busy Id=000638f1 From: 03f4.1794 Context=80030001 [e2940820 . e261c268]
0000 e261c268 - Busy Id=000638fe From: 03f4.17a8 Context=80030001 [e2576270 . e17a9678]
The LpcDataInfoChainHead queue contains 14 messages
Threads in RunDown queue : 0xe188c1f0 0xe28aff38 0xe2845e48 0xe27dc9b0 0xe28c05c8 0xe2983490 0xe2596920 0xe28c3ce0 0xe294ace0 0xe2977ce0 0xe241ee48 0xe2940660 0xe25760b0 0xe261c0a8
0: kd> .process /p /r 89bedd88
Implicit process is now 89bedd88
0: kd> !thread 898cd998
THREAD 898cd998 Cid 04a8.0508 Teb: 7ffd5000 Win32Thread: e23a6008 WAIT: (UserRequest) UserMode Non-Alertable
88eb89e0: (0006,0094) Flags: 00000800 Mdl: 00000000
Owning Process 89bedd88 Image: svchost.exe
Context Switch Count 7330 LargeStack
Win32 Start Address 0x0003923d
LPC Server thread working on message Id 3923d
Stack Init b9a55000 Current b9a54c60 Base b9a55000 Limit b9a51000 Call 0
b9a54c78 8083e6a2 898cda10 898cd998 898cda40 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
b9a54ca4 8083f164 00000000 00000000 00000000 nt!KiSwapThread+0x284 (FPO: [Non-Fpo])
b9a54cec 8092db70 89006340 00000006 80834d01 nt!KeWaitForSingleObject+0x346 (FPO: [Non-Fpo])
b9a54d50 80834d3f 00000d64 00000000 00000000 nt!NtWaitForSingleObject+0x9a (FPO: [Non-Fpo])
b9a54d50 7c82ed54 00000d64 00000000 00000000 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b9a54d64)
0109f248 7c822124 77e6bad8 00000d64 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0109f24c 77e6bad8 00000d64 00000000 00000000 ntdll!NtWaitForSingleObject+0xc (FPO: [3,0,0])
0109f2bc 77e6ba42 00000d64 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac (FPO: [Non-Fpo])
0109f2d0 77719486 00000d64 ffffffff 00000000 kernel32!WaitForSingleObject+0x12 (FPO: [Non-Fpo])
0109f2e4 7769c09b 00086803 00000000 00086660 ole32!CComApartment::WaitForAccess+0x5c (FPO: [0,0,0])
0109f2f8 776a417d 0109f344 02163088 02145ac0 ole32!CComApartment::GetPreRegMOID+0x41 (FPO: [Non-Fpo])
0109f310 776a40d2 0109f344 77790858 00000001 ole32!GetPreRegMOID+0x24 (FPO: [Non-Fpo])
0109f358 776a48e3 00000001 00000003 0109f3c4 ole32!CIDObject::GetOrCreateStdID+0x11e (FPO: [Non-Fpo])
0109f37c 776a4419 02163088 00000000 00086860 ole32!ObtainStdIDFromUnk+0x92 (FPO: [Non-Fpo])
0109f3a4 776a43c1 0109f468 00162690 02b8bd30 ole32!StdMarshalObject+0x49 (FPO: [Non-Fpo])
0109f400 776a3974 0109f42c 0109f468 00162690 ole32!CDestObjectWrapper::MarshalInterface+0x150 (FPO: [Non-Fpo])
0109f43c 77686f83 0109f468 00162690 02b8bd30 ole32!CoMarshalInterface+0x93 (FPO: [Non-Fpo])
0109f490 77686e38 02b8bd30 00162690 00162690 ole32!MarshalHelperMulti+0x48 (FPO: [Non-Fpo])
0109f4bc 77686d46 58b10040 00000001 00162690 ole32!GetInstanceHelperMulti+0x4d (FPO: [Non-Fpo])
0109f584 77c70f3b 000a7938 00000000 0218b774 ole32!CObjServer::CreateInstance+0x30d (FPO: [Non-Fpo])
0109f5a8 77ce23f7 77686b34 0109f790 00000004 RPCRT4!Invoke+0x30
0109f9a8 77ce1a95 000b4398 000b2074 05cc596c RPCRT4!NdrStubCall2+0x299 (FPO: [Non-Fpo])
0109fa00 77788ec1 000b4398 05cc596c 000b2074 RPCRT4!CStdStubBuffer_Invoke+0xc6 (FPO: [Non-Fpo])
0109fa44 77788e74 05cc596c 000b4428 00000000 ole32!SyncStubInvoke+0x33 (FPO: [Non-Fpo])
0109fa8c 7769b2b7 05cc596c 000b4288 000b4398 ole32!StubInvoke+0xa7 (FPO: [Non-Fpo])
0109fb64 7769b0b5 000b2074 00000000 000b4398 ole32!CCtxComChnl::ContextInvoke+0xe3 (FPO: [Non-Fpo])
0109fb80 77789926 05cc596c 00000001 000b4398 ole32!MTAInvoke+0x1a (FPO: [Non-Fpo])
0109fbb0 77788c33 d0908070 000b2074 000b4398 ole32!AppInvoke+0xa3 (FPO: [Non-Fpo])
0109fc84 77789936 05cc5918 000b22f8 000b4410 ole32!ComInvokeWithLockAndIPID+0x2c5 (FPO: [Non-Fpo])
0109fcd0 77c709be 020ccc24 000b4410 020ccc24 ole32!ThreadInvoke+0x1e0 (FPO: [Non-Fpo])
0109fd04 77c7093f 777887e0 020ccc24 0109fdec RPCRT4!DispatchToStubInCNoAvrf+0x38 (FPO: [Non-Fpo])
0109fd58 77c70865 00000000 00000000 7767bfa8 RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0x117 (FPO: [Non-Fpo])
0109fd7c 77c657eb 020ccc24 00000000 7767bfa8 RPCRT4!RPC_INTERFACE::DispatchToStub+0xa3 (FPO: [Non-Fpo])
0109fdbc 77c71e26 020ccc24 020ccbdc 00000000 RPCRT4!RPC_INTERFACE::DispatchToStubWithObject+0xc0 (FPO: [Non-Fpo])
0109fdfc 77c71bb3 000c7db0 000b1498 00097c08 RPCRT4!LRPC_SCALL::DealWithRequestMessage+0x42c (FPO: [Non-Fpo])
0109fe20 77c75458 000b14d0 0109fe38 000c7db0 RPCRT4!LRPC_ADDRESS::DealWithLRPCRequest+0x127 (FPO: [Non-Fpo])
0109ff84 77c5778f 0109ffac 77c5f7dd 000b1498 RPCRT4!LRPC_ADDRESS::ReceiveLotsaCalls+0x430 (FPO: [Non-Fpo])
0109ff8c 77c5f7dd 000b1498 00000000 00000000 RPCRT4!RecvLotsaCallsWrapper+0xd (FPO: [Non-Fpo])
0109ffac 77c5de88 0008aa48 0109ffec 77e6608b RPCRT4!BaseCachedThreadRoutine+0x9d (FPO: [Non-Fpo])
0109ffb8 77e6608b 000b1678 00000000 00000000 RPCRT4!ThreadStartRoutine+0x1b (FPO: [Non-Fpo])
The above LPC server thread is waiting for allocation of COM OID (Object Identifier) since we have run out of OIDs. The way OIDs are allocated is by calling CComApartment::CallTheResolver( ). Searching threads in the system with CComApartment::CallTheResolver on the stack we get the below thread within svchost.exe.
0: kd> !thread 88d76db0
THREAD 88d76db0 Cid 04a8.13f4 Teb: 7ffd6000 Win32Thread: e295c758 WAIT: (Unknown) UserMode Non-Alertable
88d76f9c Semaphore Limit 0x1
Waiting for reply to LPC MessageId 0003832f:
Current LPC port e15791b8
Owning Process 89bedd88 Image: svchost.exe
Wait Start TickCount 99073 Ticks: 47583 (0:00:12:23.484)
Context Switch Count 6947 LargeStack
Win32 Start Address 0x0003832e
LPC Server thread working on message Id 3832e
Stack Init b8af6000 Current b8af5c08 Base b8af6000 Limit b8af2000 Call 0
Priority 9 BasePriority 8 PriorityDecrement 0
b8af5c20 8083e6a2 88d76e28 88d76db0 88d76e58 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
b8af5c4c 8083f164 88d76f9c 88d76f70 88d76db0 nt!KiSwapThread+0x284 (FPO: [Non-Fpo])
b8af5c94 8093983f 88d76f9c 00000011 b8af5d01 nt!KeWaitForSingleObject+0x346 (FPO: [Non-Fpo])
b8af5d50 80834d3f 00000094 000dde10 000dde10 nt!NtRequestWaitReplyPort+0x776 (FPO: [Non-Fpo])
b8af5d50 7c82ed54 00000094 000dde10 000dde10 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b8af5d64)
0251ed2c 7c821c94 77c72700 00000094 000dde10 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0251ed30 77c72700 00000094 000dde10 000dde10 ntdll!NtRequestWaitReplyPort+0xc (FPO: [3,0,0])
0251ed88 77c713ba 0251edc4 0251eda8 77c72c7f RPCRT4!LRPC_CCALL::SendReceive+0x230 (FPO: [Non-Fpo])
0251ed94 77c72c7f 0251edc4 776720f8 0251f1b0 RPCRT4!I_RpcSendReceive+0x24 (FPO: [Non-Fpo])
0251eda8 77ce219b 0251edf0 05c6b44c 0251f360 RPCRT4!NdrSendReceive+0x2b (FPO: [Non-Fpo])
0251f190 7769c087 776720f8 77671df0 0251f1b0 RPCRT4!NdrClientCall2+0x22e (FPO: [Non-Fpo])
0251f1a8 7769c04e 0008b0a8 0008bc30 0251f1f4 ole32!ServerAllocateOIDs+0x1c (FPO: [Non-Fpo])
0251f1d8 7769c01b 0251f1f4 0251f368 0251f220 ole32!CRpcResolver::ServerAllocOIDs+0x2a (FPO: [Non-Fpo])
0251f1fc 7769bf93 0251f368 0251f220 00000014 ole32!OXIDEntry::AllocOIDs+0x40 (FPO: [Non-Fpo])
0251f36c 7769c0b2 00086803 00000000 00086660 ole32!CComApartment::CallTheResolver+0x6b (FPO: [Non-Fpo])
0251f380 776a417d 0251f3cc 05c67490 05d59ee0 ole32!CComApartment::GetPreRegMOID+0x54 (FPO: [Non-Fpo])
0251f398 776a40d2 0251f3cc 77790858 00000001 ole32!GetPreRegMOID+0x24 (FPO: [Non-Fpo])
0251f3e0 776a48e3 00000001 00000003 0251f44c ole32!CIDObject::GetOrCreateStdID+0x11e (FPO: [Non-Fpo])
0251f404 776a4419 05c67490 00000000 00086860 ole32!ObtainStdIDFromUnk+0x92 (FPO: [Non-Fpo])
0251f42c 776a43c1 05e39b00 0251f504 02b30234 ole32!StdMarshalObject+0x49 (FPO: [Non-Fpo])
0251f488 776a3974 0251f4b4 05e39b00 0251f504 ole32!CDestObjectWrapper::MarshalInterface+0x150 (FPO: [Non-Fpo])
0251f4c4 77c544a3 05e39b00 0251f504 02b30234 ole32!CoMarshalInterface+0x93 (FPO: [Non-Fpo])
0251f518 77c54e85 0251f5e8 02b30234 74ce6210 RPCRT4!NdrpInterfacePointerMarshall+0x9a (FPO: [Non-Fpo])
0251f538 77c691ad 0251f5e8 000f2898 02b30234 RPCRT4!NdrpPointerMarshall+0x87 (FPO: [Non-Fpo])
0251f550 77c690fc 0251f5e8 02b30234 74ce6210 RPCRT4!NdrPointerMarshall+0x30 (FPO: [Non-Fpo])
0251f570 77c691ad 0251f5e8 00000000 00112e50 RPCRT4!NdrpPointerMarshall+0x116 (FPO: [Non-Fpo])
0251f588 77c715ac 0251f5e8 00112e50 74ce620c RPCRT4!NdrPointerMarshall+0x30 (FPO: [Non-Fpo])
0251f5ac 77ce249e 00000000 00000001 020ce8c8 RPCRT4!NdrpServerMarshal+0x82 (FPO: [Non-Fpo])
0251f9a8 77ce1a95 020ce8c8 0011fa08 000f1454 RPCRT4!NdrStubCall2+0x374 (FPO: [Non-Fpo])
0251fa00 77788ec1 020ce8c8 000f1454 0011fa08 RPCRT4!CStdStubBuffer_Invoke+0xc6 (FPO: [Non-Fpo])
0251fa44 77788e74 000f1454 00137868 020debf0 ole32!SyncStubInvoke+0x33 (FPO: [Non-Fpo])
0251fa8c 7769b2b7 000f1454 0014aaf8 020ce8c8 ole32!StubInvoke+0xa7 (FPO: [Non-Fpo])
0251fb64 7769b0b5 0011fa08 00000000 020ce8c8 ole32!CCtxComChnl::ContextInvoke+0xe3 (FPO: [Non-Fpo])
0251fb80 77789926 000f1454 00000001 020ce8c8 ole32!MTAInvoke+0x1a (FPO: [Non-Fpo])
0251fbb0 77788c33 d0908070 0011fa08 020ce8c8 ole32!AppInvoke+0xa3 (FPO: [Non-Fpo])
0251fc84 77789936 000f1400 020d3b48 00137850 ole32!ComInvokeWithLockAndIPID+0x2c5 (FPO: [Non-Fpo])
0251fcd0 77c709be 0216202c 00137850 0216202c ole32!ThreadInvoke+0x1e0 (FPO: [Non-Fpo])
0251fd04 77c7093f 777887e0 0216202c 0251fdec RPCRT4!DispatchToStubInCNoAvrf+0x38 (FPO: [Non-Fpo])
0251fd58 77c70865 00000000 00000000 7767bfa8 RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0x117 (FPO: [Non-Fpo])
0251fd7c 77c657eb 0216202c 00000000 7767bfa8 RPCRT4!RPC_INTERFACE::DispatchToStub+0xa3 (FPO: [Non-Fpo])
0: kd> !lpc message 0003832f
Searching message 3832f in threads ...
Server thread 8900edb0 is working on message 3832f
Client thread 88d76db0 waiting a reply from 3832f
Searching thread 88d76db0 in port rundown queues ...
Server communication port 0xe188a6e8
Connected port: 0xe15791b8 Server connection port: 0xe13734b0
Client communication port 0xe15791b8
Handles: 1 References: 5
Data: 00010001 00040242 00000000 00000000 00000000 00000000
Length=0044002c Type=00380001 (LPC_REQUEST)
0: kd> !thread 8900edb0
THREAD 8900edb0 Cid 03f4.0598 Teb: 7ffd4000 Win32Thread: 00000000 WAIT: (Unknown) UserMode Non-Alertable
Context Switch Count 5892
Win32 Start Address 0x0003832f
LPC Server thread working on message Id 3832f
Stack Init b95d1000 Current b95d0c60 Base b95d1000 Limit b95ce000 Call 0
b95d0c78 8083e6a2 8900ee28 8900edb0 8900ee58 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
b95d0ca4 8083f164 00000000 00000000 00000000 nt!KiSwapThread+0x284 (FPO: [Non-Fpo])
b95d0cec 8092db70 89014cc0 00000006 80935d01 nt!KeWaitForSingleObject+0x346 (FPO: [Non-Fpo])
b95d0d50 80834d3f 000004e8 00000000 00000000 nt!NtWaitForSingleObject+0x9a (FPO: [Non-Fpo])
b95d0d50 7c82ed54 000004e8 00000000 00000000 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b95d0d64)
0082f7a4 7c822124 77e6bad8 000004e8 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0082f7a8 77e6bad8 000004e8 00000000 00000000 ntdll!NtWaitForSingleObject+0xc (FPO: [3,0,0])
0082f818 77e6ba42 000004e8 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac (FPO: [Non-Fpo])
0082f82c 77cab1e0 000004e8 ffffffff 000bc314 kernel32!WaitForSingleObject+0x12 (FPO: [Non-Fpo])
0082f840 77cab333 0082f89c 000ad626 000bc314 RPCRT4!SWMRLock::LockSharedOrExclusiveOnLastWriter+0x32 (FPO: [Non-Fpo])
0082f868 77c72e00 0082f89c 00000001 00000002 RPCRT4!SWMRLock::LockSharedOrExclusive+0xa6 (FPO: [Non-Fpo])
0082f890 77c72f71 00000000 00000000 00000010 RPCRT4!NDRSContextUnmarshall2+0x1b9 (FPO: [Non-Fpo])
0082f8bc 77c72eb6 4182f938 76d32c22 0082f938 RPCRT4!NdrServerContextNewUnmarshall+0xa6 (FPO: [Non-Fpo])
0082f8d0 77c70f93 0082f938 0082fae4 76d32c22 RPCRT4!NdrUnmarshallHandle+0x52 (FPO: [Non-Fpo])
0082f900 77ce2345 0082fae4 00094618 00000000 RPCRT4!NdrpServerUnMarshal+0x13d (FPO: [Non-Fpo])
0082fcf8 77ce26ed 00000000 00000000 000bbd9c RPCRT4!NdrStubCall2+0x19f (FPO: [Non-Fpo])
0082fd14 77c709be 000bbd9c 00094618 000bbd9c RPCRT4!NdrServerCall2+0x19 (FPO: [Non-Fpo])
0082fd48 77c7093f 76d3156c 000bbd9c 0082fdec RPCRT4!DispatchToStubInCNoAvrf+0x38 (FPO: [Non-Fpo])
0082fd9c 77c70865 00000005 00000000 76d736c0 RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0x117 (FPO: [Non-Fpo])
0082fdc0 77c734b1 000bbd9c 00000000 76d736c0 RPCRT4!RPC_INTERFACE::DispatchToStub+0xa3 (FPO: [Non-Fpo])
0082fdfc 77c71bb3 00099188 0008f0e8 000adfd8 RPCRT4!LRPC_SCALL::DealWithRequestMessage+0x42c (FPO: [Non-Fpo])
0082fe20 77c75458 0008f120 0082fe38 00099188 RPCRT4!LRPC_ADDRESS::DealWithLRPCRequest+0x127 (FPO: [Non-Fpo])
0082ff84 77c5778f 0082ffac 77c5f7dd 0008f0e8 RPCRT4!LRPC_ADDRESS::ReceiveLotsaCalls+0x430 (FPO: [Non-Fpo])
0082ff8c 77c5f7dd 0008f0e8 00000000 00000000 RPCRT4!RecvLotsaCallsWrapper+0xd (FPO: [Non-Fpo])
0082ffac 77c5de88 00084298 0082ffec 77e6608b RPCRT4!BaseCachedThreadRoutine+0x9d (FPO: [Non-Fpo])
0082ffb8 77e6608b 000bc298 00000000 00000000 RPCRT4!ThreadStartRoutine+0x1b (FPO: [Non-Fpo])
0082ffec 00000000 77c5de6d 000bc298 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])
The above thread belongs to the RPC service and is waiting for a lock to be acquired.
0: kd> x /v svchost1!*service*name*
pub global 01004078 0 svchost1!ServiceNames = <no type information>
0: kd> du poi(01004078)
To find why this thread is not able to acquire the lock, investigation of the threads within the RPC service is required. The reason the lock is not available is because there is a thread that has acquired the lock and is trying to reach a client that does not exist.
The chain of events is summarized in the below figure. Note that all horizontal links were traversed using LPC data structures and debugger extensions. The vertical links were traversed by investigation beyond the scope of LPC. Thus the issue that started in a component of SMS was finally traced to RPC through the use of LPC data structures and debugger extension. LPC is one of the core inter-process communication mechanisms. Understanding LPC and learning to use its extensions helps a lot to debug issues where client threads are waiting for reply from the server.
Hello my name is Roy, I’m an EE on the Microsoft global escalation services / CPR team. This blog is a follow on to my first LPC blog. We will be discussing debugger extensions that allow you to look at LPC related issues.
Disclaimer: The purpose of this blog is to illustrate debugging techniques with LPC. Please do not rely on this information as documentation for the APIs when writing your own code. As always, this is subject to change in the future.
LPC Kernel Debugger Extensions
Display the list and description of all the !lpc commands
!lpc message [MessageId]
Display the message with a given ID and all related information by attempting to match the given Message ID to the EHTREAD->LpcReceivedMessageId and to the ETHREAD->LpcReplyMessageId of all threads in the system.
If the MessageId parameter is omitted then it attempts to display all the outstanding messages in the system by searching for the tag ‘LpcM’ in the pools.
!lpc port [PortAddress]
Displays port information. If a server connection port address is specified then only information about that port is displayed. If either a client or server communication port is specified it prints information about the specified communication port, the peer communication port and the server connection port.
If the PortAddress parameter is omitted then it attempts to walk the list of all objects of type “Port” and “WaitablePort” and display them. Note that for this feature to work the GFlags option “+otl” i.e. “Maintain a list of objects for each type” must be enabled.
!lpc scan PortAddress
Displays port information. It attempts to walk the list of all objects of type “Port” and “WaitablePort” and display the one matching the specified port address. Note that for this feature to work the GFlags option “+otl” i.e. “Maintain a list of objects for each type” must be enabled.
!lpc thread [ThreadAddr]
If ThreadAddr is specified it walks the list of threads in the ETHREAD-> LpcReplyChain to locate the list head i.e. a “Port” or “WaitablePort” object on which the thread is waiting for a reply.
If the ThreadAddr parameter is omitted then it attempts to find all LPC server threads by looking for threads with a non-NULL EHTREAD->LpcReceivedMessageId and all client threads by looking for threads with a non-NULL ETHREAD->LpcReplyMessageId and displays them.
Toggles a setting that controls whether the “lpc message” command will search for LPC message tag (‘LpcM’) in the kernel pools or not.
LPC Kernel Debugger Extension Usage
LPC Connection Port information from call stack
On the call stack of any client or server threads blocked on LPC data transfer or LPC connection there will be a frame containing either one of the functions NtRequestWaitReplyPort() or NtReplyWaitReceivePortEx() . The first parameter to either one of these functions is the handle to the port they are blocked on.
kd> !thread 810de2a8
THREAD 810de2a8 Cid 01dc.01f4 Teb: 7ffde000 Win32Thread: 00000000 WAIT: (WrLpcReceive) UserMode Non-Alertable
81131188 Semaphore Limit 0x7fffffff
Owning Process 810ddda0 Image: rpclpcs.exe
Wait Start TickCount 64555 Ticks: 402 (0:00:00:04.025)
Context Switch Count 2
Win32 Start Address 0x77e76bf0
Start Address 0x7c810856
Stack Init f8e28000 Current f8e27c4c Base f8e28000 Limit f8e25000 Call 0
Priority 8 BasePriority 8 PriorityDecrement 0 DecrementCount 0
f8e27c64 804dc6a6 810de318 810de2a8 804dc6f2 nt!KiSwapContext+0x2e (FPO: [Uses EBP] [0,0,4])
f8e27c70 804dc6f2 e1084108 8055a540 e1084108 nt!KiSwapThread+0x46 (FPO: [0,0,0])
f8e27c98 8056a50a 00000001 00000010 00000001 nt!KeWaitForSingleObject+0x1c2 (FPO: [Non-Fpo])
f8e27d48 804df06b 000007c4 002bff70 00000000 nt!NtReplyWaitReceivePortEx+0x3dc (FPO: [Non-Fpo])
f8e27d48 7c90eb94 000007c4 002bff70 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ f8e27d64)
002bff80 00000000 00000000 00000000 00000000 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
kd> !handle 7c4 3 810ddda0
processor number 0, process 810ddda0
PROCESS 810ddda0 SessionId: 0 Cid: 01dc Peb: 7ffd9000 ParentCid: 01b4
DirBase: 058cd000 ObjectTable: e1a13278 HandleCount: 18.
Handle table at e107d000 with 18 Entries in use
07c4: Object: e1084108 GrantedAccess: 001f0001 Entry: e107df88
Object: e1084108 Type: (812b5c80) Port
ObjectHeader: e10840f0 (old version)
HandleCount: 1 PointerCount: 4
Directory Object: e14c72c8 Name: rpclpc
kd> !lpc port e1084108
Server connection port e1084108 Name: rpclpc
Handles: 1 References: 4
Server process : 810ddda0 (rpclpcs.exe)
Queue semaphore : 81131188
LPC Messages that are waiting to be picked up by the server thread
When a message is queued to the server connection port and the server thread has not yet not been signaled to pull out the message the ‘!lpc port <Port>” displays the following output that includes the message.
kd> !lpc port e10d8388
Client communication port 0xe10d8388
Handles: 1 References: 2
Connected port: 0xe106dc00 Server connection port: 0xe15512e0
Server communication port 0xe106dc00
Server connection port e15512e0 Name: rpclpc
Server process : ffbbebe8 (rpclpcs.exe)
Queue semaphore : 81250848
Messages in queue:
0000 e10513f8 - Busy Id=000048d8 From: 05f4.0108 Context=80020000 [e15512f0 . e15512f0]
Length=0054003c Type=00000001 (LPC_REQUEST)
Data: 00000001 00000241 00000000 00000000 f877bc04 804ec10e
The message queue contains 1 messages
LPC Messages being processed by the server
When an LPC message has been queued by the client, dequeued by the sever and being worked upon by the server both the client and the server thread have the message ID associated with them
kd> !lpc message 16fa
Searching message 16fa in threads ...
Server thread 810de2a8 is working on message 16fa
Client thread 810dc930 waiting a reply from 16fa
Searching thread 810dc930 in port rundown queues ...
Handles: 1 References: 3
kd> !thread 810de2a8
THREAD 810de2a8 Cid 01dc.01f4 Teb: 7ffde000 Win32Thread: 00000000 RUNNING on processor 0
Owning Process 810ddda0 Image: rpclpcs.exe
Wait Start TickCount 65732 Ticks: 10 (0:00:00:00.100)
Context Switch Count 4
Win32 Start Address 0x000016fa
LPC Server thread working on message Id 16fa
Start Address kernel32!BaseThreadStartThunk (0x7c810856)
Stack Init f8e28000 Current f8e27728 Base f8e28000 Limit f8e25000 Call 0
Priority 9 BasePriority 8 PriorityDecrement 0 DecrementCount 0
f8e27d64 7c90eb94 badb0d00 002bfe24 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ f8e27d64)
002bfe18 7c90e399 77e76703 000007c4 002bff70 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
002bfe1c 77e76703 000007c4 002bff70 00000000 ntdll!NtReplyWaitReceivePortEx+0xc (FPO: [5,0,0])
002bff80 77e76c22 002bffa8 77e76a3b 00084540 RPCRT4!LRPC_ADDRESS::ReceiveLotsaCalls+0xf4 (FPO: [Non-Fpo])
002bff88 77e76a3b 00084540 7c90ee18 0006fb10 RPCRT4!RecvLotsaCallsWrapper+0xd (FPO: [Non-Fpo])
002bffa8 77e76c0a 00083f20 002bffec 7c80b50b RPCRT4!BaseCachedThreadRoutine+0x79 (FPO: [Non-Fpo])
002bffb4 7c80b50b 00084718 7c90ee18 0006fb10 RPCRT4!ThreadStartRoutine+0x1a (FPO: [Non-Fpo])
002bffec 00000000 77e76bf0 00084718 00000000 kernel32!BaseThreadStart+0x37 (FPO: [Non-Fpo])
kd> !thread 810dc930
THREAD 810dc930 Cid 01e4.01ec Teb: 7ffde000 Win32Thread: 00000000 READY
Waiting for reply to LPC MessageId 000016fa:
Pending LPC Reply Message:
Owning Process 810e9860 Image: rpclpcc.exe
Wait Start TickCount 65731 Ticks: 11 (0:00:00:00.110)
Context Switch Count 26
Win32 Start Address rpclpcs!pre_c_init (0x01001a91)
Start Address kernel32!BaseProcessStartThunk (0x7c810867)
Stack Init fbe0e000 Current fbe0dc28 Base fbe0e000 Limit fbe0b000 Call 0
Priority 8 BasePriority 8 PriorityDecrement 0 DecrementCount 16
fbe0dc40 804ea3a4 00000000 810dc930 e107f388 nt!KiUnlockDispatcherDatabase+0x77 (FPO: [Uses EBP] [0,0,4])
fbe0dc58 80586255 81131188 00000001 00000001 nt!KeReleaseSemaphore+0x70 (FPO: [Non-Fpo])
fbe0dd10 80598c58 000847b0 0006f8c0 0006f8c8 nt!NtSecureConnectPort+0x635 (FPO: [Non-Fpo])
fbe0dd3c 804df06b 000847b0 0006f8c0 0006f8c8 nt!NtConnectPort+0x24 (FPO: [Non-Fpo])
fbe0dd3c 7c90eb94 000847b0 0006f8c0 0006f8c8 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ fbe0dd64)
0006f8d0 7c93040b ffffffff 00000000 7c859e48 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0006f9b0 7c8107fd 00000000 00084718 7c90e642 ntdll!DbgPrint+0x1b (FPO: [Non-Fpo])
0006fe68 77ea11a8 00000022 00084540 00000000 kernel32!CreateRemoteThread+0x284 (FPO: [Non-Fpo])
000847a8 00000000 00000001 00000000 00000000 RPCRT4!LRPC_ADDRESS::CompleteListen+0x84 (FPO: [Non-Fpo])
Finding all threads in the system involved in LPC communication
To get a consolidated list of all client threads in the system waiting on LPC replies and a list of server threads working on received LPC requests use the “!lpc threads” command without any parameters
0: kd> !lpc thread
Searching message 0 in threads ...
Client thread 89a4cdb0 waiting a reply from 15f6
Server thread 898c6a50 is working on message 63f09
Server thread 89671020 is working on message 63f75
Server thread 89904a48 is working on message 63f10
Server thread 88d6bdb0 is working on message 3d6b1
Server thread 89973db0 is working on message 63f32
Server thread 896d32b8 is working on message 3454
Server thread 8995d020 is working on message 63f76
Server thread 8960f020 is working on message 63f77
Server thread 898cd350 is working on message 61bd9
Server thread 8900fbf0 is working on message 33c2e
Server thread 88d539a8 is working on message 3343f
Client thread 89be4020 waiting a reply from 6077e
Client thread 89012990 waiting a reply from 39239
Server thread 89012990 is working on message 39232
Client thread 88d13b40 waiting a reply from 3c96e
8 No thread in the system is doing a NtListenPort( ) does that mean no LPC connection port is open for connections?
!stacks 2 NtListenPort does not return any thread because NtListenPort( ) calls NtRequestWaitReplyPort( ) to wait for an incoming LPC_CONNECTION_REQUEST message. Instead of calling NtListenPort( ) for listening for incoming connections most LPC connection threads directly call NtRequestWaitReplyPort( ).
8 !lpc message does not show any message?
!poolfind LpcM 1 should do the trick.
0: kd> !lpc message
Scanning large pool allocation table for Tag: LpcM (8974e000 : 8976e000)
Searching Paged pool (e1000000 : f7000000) for Tag: LpcM
0: kd> !poolfind LpcM 1
e1061000 size: 168 previous size: 0 (Allocated) LpcM
e1090758 size: 168 previous size: 8 (Allocated) LpcM
e10a7a40 size: 168 previous size: 8 (Allocated) LpcM
e1188468 size: 8 previous size: 20 (Free) LpcM
e1261000 size: 168 previous size: 0 (Allocated) LpcM
e1297428 size: 168 previous size: 8 (Allocated) LpcM
Once you have the pool blocks listed with the “LpcM” tag dump the ones that are “Allocated”. On a 32-bit machine add 8bytes to the pool address listed on the left most column. On a 64-bit machine add 16bytes i.e. 0x10. Dump the message as below to retrieve the message id.
0: kd> dt nt!_LPCP_MESSAGE e1061000+8
+0x000 Entry : _LIST_ENTRY [ 0xe295fa80 - 0xe29279e8 ]
+0x000 FreeEntry : _SINGLE_LIST_ENTRY
+0x004 Reserved0 : 0xe29279e8
+0x008 SenderPort : 0xe2327020
+0x00c RepliedToThread : (null)
+0x010 PortContext : 0x8021000d
+0x018 Request : _PORT_MESSAGE
0: kd> dt nt!_LPCP_MESSAGE e1061000+8 Request.
+0x018 Request :
+0x000 u1 : __unnamed
+0x004 u2 : __unnamed
+0x008 ClientId : _CLIENT_ID
+0x008 DoNotUseThisField : 9.8460604703041575e-311
+0x010 MessageId : 0x4617f
+0x014 ClientViewSize : 0
+0x014 CallbackId : 0
0: kd> !lpc message 0x4617f
Searching message 4617f in threads ...
Server thread 88f91c08 is working on message 4617f
Client thread 88f0f5f8 waiting a reply from 4617f
Searching thread 88f0f5f8 in port rundown queues ...
8 The number of server threads shown by !lpc thread processing requests is much more than the number of client threads waiting for reply.
!lpc thread interprets stale data in ETHREAD structure as valid LPC information. An indication of the stale data is the ETHREAD.LpcExitThreadCalled will be set. This field is only set when the thread is exiting.
Server thread 89671020 is working on message 63f75
Server thread 89973db0 is working on message 63f32
Server thread 8960f020 is working on message 63f77
Server thread 88d539a8 is working on message 3343f
0: kd> !lpc message 3454
Searching message 3454 in threads ...
0: kd> dt nt!_ETHREAD 896d32b8 Lpc.
+0x1c0 LpcReplyChain : [ 0x896d3478 - 0x896d3478 ]
+0x000 Flink : 0x896d3478 _LIST_ENTRY [ 0x896d3478 - 0x896d3478 ]
+0x004 Blink : 0x896d3478 _LIST_ENTRY [ 0x896d3478 - 0x896d3478 ]
+0x1ec LpcReplySemaphore :
+0x000 Header : _DISPATCHER_HEADER
+0x010 Limit : 1
+0x200 LpcReplyMessage :
+0x200 LpcWaitingOnPort :
+0x220 LpcReceivedMessageId : 0x3454
+0x234 LpcReplyMessageId : 0
+0x248 LpcReceivedMsgIdValid : 0x1 ''
+0x248 LpcExitThreadCalled : 0x1 ''
8 Finding all client threads in the system waiting for LPC reply.
Hello, my name is Roy. I’m an Escalation Engineer in the CPR platforms team. I’ll be doing a four part series on LPC over the coming month. You’re sure to find this interesting. That being said let’s get started.
Disclaimer: The purpose of this blog is to illustrate debugging techniques with LPC. Please do not rely on this information as documentation for the LPC APIs when writing your own code. As always, this is subject to change in the future.
LPC (Local Inter-Process Communication and NOT Local Procedure Calls) is a high speed message based communication mechanism implemented in the NT kernel. LPC can be used for communication between two user mode processes, between a user mode process and a kernel mode driver or between two kernel mode drivers. One example would be two user mode processes communicating via LPC. Like CSRSS.exe talking to SMSS.exe over SmssWinStationApiPort while creating a logon session or any process talking to LSASS.exe over LsaAuthenticationPort for security reasons. Another example of a user mode process communicating with a kernel mode driver would be KSecDD.sys talking to LSASS.exe for EFS key encryption and decryption during read/write of an encrypted file.
LPC uses two different mechanisms for passing data between the client and the server process. It uses the LPC message buffer (for data size less than 304 bytes) or it uses a shared memory Section mapped to both client and server address space (for data size more than 304 bytes).
Apart from being used as the protocol of choice of Remote Procedure Calls between processes running on the same system, LPC is also used throughout the system e.g. for Win32 Application’s communication with CSRSS.exe, Security Reference Monitor’s communication with LSASS, WinLogon’s communication with LSASS etc.
LPC enforces synchronous communication model between the client and the server processes. Vista deprecates the use of LPC using a new mechanism called Asynchronous Local Inter-Process Communication (ALPC). ALPC has an inherent advantage over LPC in that all calls from client to the server are asynchronous i.e. the client does not need to block/wait for the server to respond to a message. In Vista, legacy application calls to LPC APIs are automatically redirected to the newer ALPC APIs.
LPC APIs are native APIs i.e. they are exported in user mode by NTDLL.dll and in kernel mode by NTOSKRNL.exe. LPC APIs are not exposed at the Win32 level hence win32 applications cannot use the LPC facility directly. Win32 applications can however use LPC indirectly when using RPC by specifying LPC as its underling transport via the protocol sequence “ncalrpc”. All LPC APIs end in the word "Port" which implies an LPC communication endpoint.
Used by server to create a connection port
Used by client to connect to a connection port
Used by server to listen for connection requests on the connection port.
Used by server to accept connection requests on the connection port
Used by server to complete the acceptance of a connection request
Used to send a datagram message that does not have a reply
Used to send a message and wait for a reply
Used to send a reply to a particular message
Used to send a reply to a particular message and wait for a reply to a previous message
Used by server to send a reply to the client and wait to receive a message from the client
Used by server thread to temporarily borrow the security context of a client thread
The following diagram illustrates the steps taken by an LPC server process to listen to inbound connection request from potential client and the steps taken by clients to connect to listening servers.
Figure  LPC Client Server Connection Establishment Sequence
NOTE: Many server processes use the NtReplyWaitReceivePort( ) API instead of NtListenPort( ). NtListenPort( ) drops all LPC messages except connection requests. Hence NtListenPort( ) can only be used for the first connection. For later connection requests NtReplyWaitReceivePort( ) is used.
The following diagram illustrates the steps taken by an LPC client to send a request to an LPC server that is has already established a connection to and the steps taken by the server to responds to the message.
Figure  LPC Client Server Data Transfer Sequence
LPC Data Structures
LPC Port Data Structure
LPC endpoints are referred to as ports. LPC implementation uses the same port structure to represent the various types of ports. The ports used by LPC are Server Connection Ports which are named ports created by the server process to accept incoming connections from clients. Client Communication Ports which are created by the client process to connect to a server process and Server Communication Port created by the server process.
Figure  LPC Port types and their relationships
LPCP_PORT_OBJECT is the internal data structure used by LPC to represent a LPC port. LPCP_PORT_OBJECTs are allocated out of paged pool with the tag ‘Port’.
kd> dt nt!_LPCP_PORT_OBJECT
+0x000 ConnectionPort : Ptr32 _LPCP_PORT_OBJECT
+0x004 ConnectedPort : Ptr32 _LPCP_PORT_OBJECT
+0x008 MsgQueue : _LPCP_PORT_QUEUE
+0x018 Creator : _CLIENT_ID
+0x020 ClientSectionBase : Ptr32 Void
+0x024 ServerSectionBase : Ptr32 Void
+0x028 PortContext : Ptr32 Void
+0x02c ClientThread : Ptr32 _ETHREAD
+0x030 SecurityQos : _SECURITY_QUALITY_OF_SERVICE
+0x03c StaticSecurity : _SECURITY_CLIENT_CONTEXT
+0x078 LpcReplyChainHead : _LIST_ENTRY
+0x080 LpcDataInfoChainHead : _LIST_ENTRY
+0x088 ServerProcess : Ptr32 _EPROCESS
+0x088 MappingProcess : Ptr32 _EPROCESS
+0x08c MaxMessageLength : Uint2B
+0x08e MaxConnectionInfoLength : Uint2B
+0x090 Flags : Uint4B
+0x094 WaitEvent : _KEVENT
Points to the Server Communication Port
Points to the Server Connection Port
Used to signal the server thread about the presence of a message in MsgQueue.ReceiveHead
Head of a doubly linked list containing all the messages that are waiting to be dequeued by the server.
Points to the LPCP_NONPAGED_PORT_QUEUE structure for the client communication port for tracking lost replies.
Head of a doubly linked list containing all the threads that are waiting for replies to messages sent to this port.
LPC Message Data Structure
LPC message are data structures that carry information from the LPC client to the LPC server and can be of various types like connection, request, close etc.
LPCP_MESSAGE is the internal data structure used by LPC to represent a message. LPCP_MESSAGE structures are allocated from a system wide lookaside list with the tag ‘LpcM’.
kd> dt nt!_LPCP_MESSAGE
+0x000 Entry : _LIST_ENTRY
+0x004 Reserved0 : Uint4B
+0x008 SenderPort : Ptr32 Void
+0x00c RepliedToThread : Ptr32 _ETHREAD
+0x010 PortContext : Ptr32 Void
Generated from the value of a global epoch LpcpNextMessageId. Used to uniquely identify a message.
Points to the LPCP_PORT_OBJECT of the client communication port
Is the list entry that is used to queue the message to the Server Communication/Connection Port’s MsgQueue.ReceiveHead
Is a copy of the message buffer that was provided as the Request parameter to the call to NtRequestWaitReplyPort() or a copy the message buffer that was provided as the Reply parameter to NtReplyWaitRecivePortEx().
LPC related fields in Thread Data Structure
kd> dt nt!_ETHREAD -y Lpc
+0x1c8 LpcReplyChain : _LIST_ENTRY
+0x1f4 LpcReplySemaphore : _KSEMAPHORE
+0x208 LpcReplyMessage : Ptr32 Void
+0x208 LpcWaitingOnPort : Ptr32 Void
+0x228 LpcReceivedMessageId : Uint4B
+0x23c LpcReplyMessageId : Uint4B
+0x250 LpcReceivedMsgIdValid : Pos 0, 1 Bit
+0x250 LpcExitThreadCalled : Pos 1, 1 Bit
The following table describes the fields of the ETHREAD data structure that are used for communication between LPC Client and Server process.
Contains the ID of the message sent to the server while the client thread is waiting on the response.
Used by the LPC Client Thread to stores the LPC client communication port on which it is waiting for a reply.
Used to block the LPC Client Thread while the server is processing the LPC message.
Used by the Lpc APIs to determine if the thread is currently in the process of exiting and if so returns STATUS_THREAD_IS_TERMINATING from the API call.
Used to queue the thread to the list of threads waiting for replies from the server communication/connection port. The list head of this list is in LPCP_PORT_OBJECT->LpcReplyChainHead.
Important global LPC connection ports
Used for the following tasks:
This port is used by the CSRSS to manage Window Stations i.e. multiple sessions.
This port is used by IO manager and other components to send error log entries to the EventLog service.
Important LPC port fields in the Process
LSA uses this port for EFS (encrypted file system) and authentication.
When a thread does not catch an exception an LPC_EXCEPTION message is sent to this port. A subsystem might register an LPC port in this field to receive second chance exception information of process running in the subsystem. Default action of CSRSS is to terminate the process.
The kernel dispatches user mode exceptions to the process debugger on this port. Debug outputs using OutputDebugString( ) are passed to this port as DBG_PRINTEXCEPTION_C.
Important LPC port fields in the Thread
This is not a single port rather a chain of ports registered with the Process Manager. The process manager notifies all ports when the thread terminates.
This field has the address of the client communication port when the client thread waits for a reply to a LPC message from a server thread.
Stay tuned...in the next post we will roll our sleeves up and dig into the debug extensions for LPC and some common states where you may find these calls stuck.
Matthew here again – I want to provide some follow-up information on desktop heap. In the first post I didn’t discuss the size of desktop heap related memory ranges on 64-bit Windows, 3GB, or Vista. So without further ado, here are the relevant sizes on various platforms...
Windows XP (32-bit)
· 48 MB = SessionViewSize (default registry value, set for XP Professional, x86)
· 20 MB = SessionViewSize (if no registry value is defined)
· 3072 KB = Interactive desktop heap size (defined in the registry, SharedSection 2nd value)
· 512 KB = Non-interactive desktop heap size (defined in the registry, SharedSection 3nd value)
· 128 KB = Winlogon desktop heap size
· 64 KB = Disconnect desktop heap size
Windows Server 2003 (32-bit)
· 48 MB = SessionViewSize (default registry value)
· 20 MB = SessionViewSize (if no registry value is defined; this is the default for Terminal Servers)
Windows Server 2003 booted with 3GB (32-bit)
· 20 MB = SessionViewSize (registry value has no effect)
You may also see reduced heap sizes when running 3GB. During the initialization of the window manager, an attempt is made to reserve enough session view space to accommodate the expected number of desktops heaps for a given session. If the heap sizes specified in the SharedSection registry value have been increased, the attempt to reserve session view space may fail. When this happens, the window manager falls back to a pair of “safe” sizes for desktop heaps (512KB for interactive, 128KB for non-interactive) and tries to reserve session space again, using these smaller numbers. This ensures that even if the registry values are too large for the 20MB session view space, the system will still be able to boot.
Windows Server 2003 (64-bit)
· 104 MB = SessionViewSize (if no registry value is defined; which is the default)
· 20 MB = Interactive desktop heap size (defined in the registry, SharedSection 2nd value)
· 768 KB = Non-interactive desktop heap size (defined in the registry, SharedSection 3nd value)
· 192 KB = Winlogon desktop heap size
· 96 KB = Disconnect desktop heap size
Windows Vista RTM (32-bit)
· Session View space is now a dynamic kernel address range. The SessionViewSize registry value is no longer used.
· 3072 KB = Interactive desktop heap size (defined in the registry, SharedSection 2nd value)
· 512 KB = Non-interactive desktop heap size (defined in the registry, SharedSection 3nd value)
Windows Vista (64-bit) and Windows Server 2008 (64-bit)
Windows Vista SP1 (32-bit) and Windows Server 2008 (32-bit)
· 12288 KB = Interactive desktop heap size (defined in the registry, SharedSection 2nd value)
· 64 KB = Disconnect desktop heap size
Windows Vista introduced a new public API function: CreateDesktopEx, which allows the caller to specify the size of desktop heap.
Additionally, GetUserObjectInformation now includes a new flag for retrieving the desktop heap size (UOI_HEAPSIZE).
[Update: 3/20/2008 - Added Vista SP1 and Server 2008 info. More information can be found here.]
Howdy folks, David here again with part two of How Windows Starts Up. Today we’ll be covering the Boot Loader Phase. Let’s take a moment to recap where we are at this point. So far, the computer has completed POST, executed the MBR code, located the active partition, executed the Boot Sector code in that partition, found NTLDR and loaded it into memory. So what happens next? During this phase, we’ll be discussing the following files located in the root of your system volume: NTLDR, ntdetect.com and boot.ini.
The Boot Loader Phase begins when NTLDR executes:
1. NTLDR switches the processor from real mode to 32-bit flat memory mode.
How does it do this you ask? NTLDR contains two parts – a 16-bit component which handles switching the processor to Protected Mode and a 32-bit component which handles everything else.
When NTLDR first begins execution, the processor is in an x86-operating mode called Real-Mode. In Real-Mode, no virtual to physical translation of memory addresses occurs, which means that all memory addresses use the 16-bit Segment:Offset format, and only the first 1MB of the computer's physical memory is accessible. However, we don’t have to concern ourselves with this too much because the first action NTLDR takes is to switch the CPU to Protected-Mode.
*Trivia Alert* before we enter Protected-Mode, we enable the A20 Line – you IBM guys know what I’m talking about…
Still no virtual-to-physical translation occurs at this point in the boot process, but a full 32 bits of memory addressing becomes accessible, and now NTLDR can access up to 4GB of physical memory.
2. NTLDR creates enough page tables to make memory below 16MB accessible with paging turned on. This number has been increased with Windows Server 2003.
3. NTLDR enables paging. “Protected-Mode with Paging Enabled” is the mode in which Windows executes during normal operation.
4. NTLDR then starts the FAT and NTFS mini file system drivers contained within its own code. Unlike the Boot Sector's file system code, NTLDR's code is sophisticated enough to read subdirectories.
Note: All disk access from this point until the Kernel Phase begins uses BIOS Interrupt 13h, Extended Interrupt 13h or NTBOOTDD.SYS
5. NTLDR locates and parses Hiberfil.sys to determine if it contains a valid hibernator file. If a valid hibernator file is found, then that file is loaded into memory and execution continues from the point the computer was placed into hibernation.
Hibernation files are interesting. The file either contains a full memory image, or it contains a pointer (ARC path) to the boot partition of the last operating system that entered hibernation. That boot partition will contain another Hiberfil.sys file, which contains a full memory image of the hibernating operating system. If the Hiberfil.sys file is a pointer, the file will contain an ARC path using one of the following syntaxes:
6. If Hiberfil.sys is not found, NTLDR parses the Boot.ini file. If the Boot.ini contains a reference to more than one operating system or boot option, then the boot loader screen is displayed.
This is a sample of a default Boot.ini file from a Windows 2000 Server-based computer:
multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows 2000 Server" /fastdetect
Let’s talk a bit about what’s in the Boot.ini file. Boot.ini contains a list of ARC paths. Ok, quick – five bonus points to the first person who can tell me what ‘ARC’ stands for… come on… Yes, you there in the red shirt… correct! It stands for ‘Advanced RISC Computing’; now beam down with the captain. The rest of you come with me…
ARC paths may use the “Multi”, “SCSI” or “Signature” syntax.
The “Multi” syntax instructs Windows to rely on the system BIOS to load system files. This means that NTLDR will be using INT13 BIOS calls to find and load NTOSKRNL and any other files on the boot partition.
The “SCSI” syntax tells Windows to load a SCSI device driver to access the boot partition. A copy of the driver must be renamed to NTBOOTDD.SYS and placed in the root of the system partition.
The “Signature” syntax is equivalent to the SCSI syntax, but is used instead to support the Plug and Play architecture in Windows 2000 or later. Because Windows 2000 is a PnP OS, the SCSI controller number instance might vary each time you start Windows 2000, especially if you add new SCSI controller hardware after setup.
The Signature syntax instructs NTLDR to locate the drive with a disk signature that matches the value in the parentheses, regardless of which SCSI controller number that the drive is connected to. Keep in mind that the Signature syntax also requires the presence of an Ntbootdd.sys file in the root of the system partition. If you recall from the previous post, the disk signature value is located in the MBR.
Signature() syntax is used only if one of the following conditions exists:
· The System BIOS or controller hosting the boot partition on which Windows is installed, does not support INT13 Extensions, or has the INT13 Extensions option disabled, and the partition on which you installed Windows is either larger than ~7.8 gigabytes (GB) in size, or the ending cylinder number for that partition is higher than 1024.
· The drive on which you installed Windows is connected to a SCSI controller with SCSI BIOS disabled, so INT13 BIOS calls cannot be used during the boot process.
227704 Windows May Use Signature() Syntax in the Boot.ini File
7. If Boot.ini is not found, NTLDR attempts to continue the boot process by assuming that Windows was installed in the default location on the same partition (c:\windows). If the c:\windows directory is not found, NTLDR displays this error message: “Windows could not start because the following file is missing or corrupt: \winnt root\system32\ntoskrnl.exe”
8. If the boot loader screen is displayed, the user must select an operating system to boot.
a. If the user chooses to run an operating system other than Windows 2003/XP such as DOS or Windows 95/98, NTLDR loads and executes the Bootsect.dos file located in the root of the boot drive.
(1) If the Bootsect.dos file is found, NTLDR loads it into RAM and then passes control to it.
(2) Bootsect.dos then loads IO.sys, MSDOS.sys, Command.com, etc., as necessary and proceeds to boot the downlevel operating system.
(3) If the file is not found in the root, NTLDR displays:
“I/O Error accessing boot sector file multi(0)disk(0)rdisk(0)partition(1): \bootsect.dos”
b. If the user selects a Windows NT-based operating system, NTLDR executes Ntdetect.com.
(1) If the ARC path points to a hard drive or partition that is not on the system, the following is displayed:
Windows could not start because of a computer disk hardware configuration problem. Could not read from the selected boot disk. Check boot path and disk hardware. Please check the Windows documentation about hardware disk configuration and your hardware reference manuals for additional information.
(2) The above message may also be displayed if you are using Signature() ARC path notation and the disk signature in the MBR is overwritten (for example, by a virus) or changed (corrupted) so that it no longer matches the Signature() value in the Boot.ini file. The following error message may also be displayed when there are problems with the Signature () value:
Windows could not start because of the following ARC firmware boot configuration problem: did not properly generate ARC name for HAL and system paths. Please check the Windows documentation about ARC configuration options and your hardware reference manuals for additional information.
9. Ntdetect.com initializes and detects all supported hardware that it can find. (Troubleshooting tip: There is an Ntdetect.chk on the Windows Resource Kit CD in the Debug.cab file, which can be renamed and used in place of Ntdetect.com to view what Ntdetect.com is detecting during startup. You know you want to – go ahead, I’ll wait…)
Ntdetect.com detects the following components:
a) Computer ID
b) Bus/adapter type
c) Video adapter
e) Communications ports
f) Floppy disks
g) Mouse or other pointing devices
h) Parallel ports
10. Ntdetect.com passes its list of hardware components to NTLDR.
11. If the /SOS switch was used in the Boot.ini for the OS that is loading, NTLDR will display the files it loads into memory from this point on. Keep in mind – none of these files are being initialized. They are just being loaded into RAM at this point. I can’t tell you how many times someone has incorrectly concluded that there is a problem with MUP.SYS because it’s the last file displayed on this screen before their computer hangs. Poor MUP – always getting a bad rap. This list can be useful, however; it can help you identify potential malware that may be loading.
12. NTLDR loads Ntoskrnl.exe and Hal.dll into memory. Again, *loading*, not initializing.
13. NTLDR loads the System registry hive into memory. If Windows cannot load the registry hive, it may be necessary to restore the System hive from the ERD or by using the Windows Recovery Console to copy the files from the Windows\Repair or Windows\Repair\RegBack folder.
14. NTLDR reads the HKLM\System\Select registry key.
a. If the Last Known Good option was not selected, NTLDR looks for the “Default” value in the SELECT key to select which ControlSet to use. (0x1 = ControlSet001, 0x2 = ContolSet002, 0x3 = ControlSet003)
b. If the Last Known Good option was selected, NTLDR looks for the “LastKnownGood” value in the SELECT key to select the ControlSet to use.
15. NTLDR sets the “Current” value in the HKLM\System\Select key to the ControlSet being used.
Note: At this point, NTLDR has determined which ControlSet it needs to use, but we don’t actually create the “CurrentControlSet” symbolic link until the kernel initializes (more on that in our next installment). Therefore, for the purposes of this discussion, I’ll use *ActiveControlSet* to refer to whichever registry ControlSet has been identified as the one NTLDR needs to use.
16. NTLDR scans HKLM\System\*ActiveControlSet*\Control\Nls\Codepage for the “Acp”, “Oemcp”, and “Oemhal” values. These entries are pointers to the table of codepage information stored in the same key. The table has the filenames that correspond to the codepage numbers. NTLDR loads these files into memory. The defaults are C_1252.nls, C_437.nls, and Vgaoem.fon.
A code page is a character set specific to an operating system or hardware platform. It maps character codes to individual characters. Different code pages include different special characters, typically customized for a language or a group of languages. The system uses code pages to translate keyboard input into character values for non-Unicode based applications, and to translate character values into characters for non-Unicode based output displays.
17. NTLDR scans HKLM\System\*ActiveControlSet*\Control\Nls\Language for the “Default” value. This entry is a pointer to the table of language information stored in the same key. NTLDR loads this file into memory. The default is L_intl.nls.
18. NTLDR scans HKLM\System\*ActiveControlSet*\Services for device drivers with a “Start” value of 0x0. It finds their “Group” and “Tag” values also.
19. NTLDR scans HKLM\System\*ActiveControlSet*\Control\ServiceGroupOrder for the order that service groups should be loaded.
20. NTLDR sorts the drivers by “Group” and then by their “Tag” value. Each service or driver has a particular group to which it belongs. All drivers within a specific group are ordered by tag.
21. NTLDR loads the 0x0 device drivers into memory. It does not initialize or start these drivers.
22. The board is set, and the pieces are moving. NTLDR executes Ntoskrnl.exe.
Troubleshooting the Boot Loader Phase
Okay, now that we’ve seen what is supposed to happen, what could possibly go wrong? Well, several things…
First, all file-system activity during this phase uses BIOS Int13, Extended Int13 or Ntbootdd.sys.
1. If your computer uses Extended Int 13, verify that Ext. Int13 is enabled in your system’s BIOS or SCSI BIOS as appropriate.
2. If you are using a SCSI driver, verify that you are using the correct SCSI driver (Ntbootdd.sys). If you are using "SCSI()" syntax in the Boot.ini file, copy the correct device driver for the SCSI controller to the root and then rename it to Ntbootdd.sys. This is the driver that is stored in the Winnt.sbs\System32\Drivers folder (for example, the Aic78xx.sys file). If you use "MULTI()" syntax in the Boot.ini file, you do not need to do this. Use the Windows Recovery Console or a Windows NT boot floppy to complete this step.
301680 HOW TO: Create a Boot Disk for an NTFS or FAT Partition
307654 How to install and use the Recovery Console in Windows XP
Significant Errors during this phase:
· Windows could not start because of a computer disk hardware configuration problem. Could not read from the selected boot disk. Check boot path and disk hardware. Please check the Windows documentation about hardware disk configuration and your hardware reference manuals for additional information.
This error indicates that the ARC path is pointing to a hard drive or partition that is not on the system. This message may also be displayed if you are using Signature() ARC path notation and the disk signature in the MBR is overwritten (for example, by a virus) or changed (corrupted) so that it no longer matches the Signature() value in the Boot.ini file.
· Windows could not start because of the following ARC firmware boot configuration problem: did not properly generate ARC name for HAL and system paths. Please check the Windows documentation about ARC configuration options and your hardware reference manuals for additional information.
This error message may be displayed when there are problems with the Signature () ARC path syntax.
Ntoskrnl.exe missing or corrupt. This error could be the result of an incorrect ARC path in the Boot.ini file. Use the information in the following knowledge base articles to correct the Boot.ini information.
102873 BOOT.INI and ARC Path Naming Conventions and Usage
155222 How to Determine the ARC Path
130921 Creating an FT Boot Disk With Scsi( ) and Multi( ) Identifiers
It is also possible to use the Windows XP Recovery Console to create a new Boot.ini for a Windows NT, Windows 2000 or .NET installation. Please see the following knowledge base article for a description of this process:
291980 A Discussion About the Bootcfg Command and Its Uses
Specify the /SOS switch in the Boot.ini file and monitor the output to the screen. If the system stops responding before anything is output to the screen, then the problem may be with hardware detection. Replace Ntdetect.com with Ntdetect.chk to view the hardware detected by the system during this phase.
103659 Setup Hangs While Inspecting Hardware; How to Use Ntdetect.com
320040 HOW TO: Use Ntdetect.chk to Identify Hardware Problems
If after specifying the /SOS option, you see that an incorrect version of the HAL.DLL is loading, use the following KB articles to replace the HAL or troubleshoot other HAL-related issues:
314477 "Hardware Configuration Problem" Err Msg Starting Windows
237556 Troubleshooting Windows 2000 Hardware Abstraction Layer Issues
330184 "Invalid Boot.ini" or "Windows could not start" error messages
Unable to load System hive
Several problems could happen here. First, consider that the amount of memory we’re dealing with is only 16MB. We have to squeeze NTLDR, the System Registry Hive, the kernel, the HAL, all the boot drivers and the PFN database into a relatively small area. So, with Windows Server 2003, we’ve given ourselves a bit more room.
302594 The System hive memory limitation is improved in Windows Server 2003
If Windows is unable to load the System registry hive into memory, then it may be necessary to restore the System hive from the ERD or use the Windows Recovery Console to copy the files from the Winnt\Repair or Winnt\Repair\RegBack folder whichever version is most recent.
307545 How to Recover from a Corrupted Registry
Table of Startup Type Values
0x0 = Boot
Loaded into RAM by the Boot loader (NTLDR) during Boot Loader Phase, then initialized by the Kernel (NTOSKRNL) during Kernel Phase
0x1 = System
Loaded and initialized by the Kernel during Kernel Phase
0x2 = Auto
Loaded or started automatically at system startup
0x3 = Manual
Driver is manually started by the user or another process
0x4 = Disabled
Driver is not to be started (There are exceptions, such as NTFS)
Well folks, we certainly have covered a lot of ground so far. Any questions? Yes, you there in the front… Ah, why have we not discussed any blue screens? Very good question. We haven’t seen any BSOD’s up to this point in the boot sequence because the kernel isn’t running yet and STOP errors are generated by the kernel. No kernel, no blue screens.
That’s all the time we have this week boys & girls. Join us next time for How Windows Starts Up, The Kernel Phase.
Hi my name is Bob, I’m an Escalation engineer with the Microsoft critical problem resolution team. We had one of our readers ask how much we deal with hardware. Well in response we recently worked on an interesting problem I thought I would share with you. In this case it was interesting because it demonstrated an issue specific to multi-processor machines and something that probably sounded innocuous to the driver writer who caused the problem.
What was the problem?
The problem was that everything worked except the time in the system was not being updated. The RTC had stopped. We spent time reviewing the semantics of how the RTC should function and even hooked up an oscilloscope to the RTC on the motherboard and were able to turn it off and on with the debugger by writing out the correct port. The trace on the scope validated our understanding of what had to be written to the port to turn the clock off. One we had a clear understanding of this we understood what we were looking for in a driver that might cause the problem. Note the clock typically fires every 10ms so you do not need a fast scope to do this.
Special keyboard driver written
In order to catch a dump in state we had to modify the keyboard driver. It would cause an “Int 3” in its ISR instead of calling “bug check” for an E2 stop. Because the RTC was not running the idle thread was not getting quantums and as a result a normal break in would not work. However the system would respond to ISRs.
What was found
All RTC interrupts were stopped - the clock was not running. We checked all the obvious places to see if the RTC was disabled.
We looked at the ICR in the I/O APIC. This is the interrupt configuration register. There is a register for every interrupt pin on the APIC. These registers are used to tell the APIC what vector to send the processor so the processor can service the interrupt. It also has configuration information about level and if it edge triggered and a mask bit. The mask bit was not set.
Below is a reenactment.
0: kd> ed ffd01000
ffd01000 00000034 20 ß Select register 20 which is pin 8.
0: kd> ed ffd01010
ffd01010 000008d1 ß Contents ß Vect D1 Bit 16 the interrupt mask bit is not set so it is OK.
Next check the RTC status register which are I/O ports 70 and 71. Port 70 is the address port. Port 71 is the data port. This information is from an old BIOS book.
0: kd> ob 70 b ß ‘B’ is a control register.
0: kd> ib 71
00000071: 42 ß The value 42 means that the RTC is enabled. Bit 6 is the enable.
So what was it?
The way the RTC works is it will interrupt at a certain interval. When the interrupt is serviced, the status register has to be read to start it again.
We discovered another driver that was reading the clock, this was done by disassembly various drivers in the dump and looking for the I/O operation to ports 70 or 71. The lower addresses selected by port 70 will yield the time when read. That is what the driver was doing.
You would think that simply reading the time in this way would not hurt anything. However, in a multi-processor system, access has to be serialized. There is only one set of I/O ports for the system.
Since it takes two accesses to perform an operation on the clock, one address & one data, a collision between two processors can cause undetermined results.
Below is a timing diagram of the issue;
Proc 0 running OS RTC handler Proc 1 running XYZ driver
T1 Set register select to status register
T2 Set register select to read time
T3 Read status register to restart clock
So at T3 the OS RTC handler reads the wrong register so the clock does not start.
I thought this was an interesting problem that illustrates the need for serialization. And it demonstrates what to look out for in a multi-proc environment. You always have to think “What happens if the other processor does…….”
For more information consult any BIOS developer manuals you may have lying around or this link we found http://www.geocities.com/SiliconValley/Campus/1671/docs/rtc.htm
See the “Status Register C” section, “All future interrupts are disabled until this register is read - your interrupt handler *must* do it.”
This information applies specifically to Windows 2000, Windows XP, and Windows Server 2003. I will blog separately on the boot changes in Windows Vista.
Additional information about this topic may be found in Chapter 5 of Microsoft Windows Internals by Russinovich and Solomon and also on TechNet: http://technet.microsoft.com/en-us/library/bb457123.aspx
The key to understanding and troubleshooting system startup issues is to accurately ascertain the point of failure. To facilitate this determination, I have divided the boot process into the following four phases.
2. Boot Loader
Over the next few weeks, I’ll be describing each of these phases in detail and providing appropriate guidance for troubleshooting relevant issues for each phase.
The Initial Phase of boot is divided into the Power-On Self Test (POST) and Initial Disk Access.
Power-On Self Test
POST activities are fully implemented by the computer’s BIOS and vary by manufacturer. Please refer to the technical documentation provided with your hardware for details. However, regardless of manufacturer, certain generic actions are performed to ensure stable voltage, check RAM, enable interrupts for system usage, initialize the video adapter, scan for peripheral cards and perform a memory test if necessary. Depending on manufacturer and configuration, a single beep usually indicates a successful POST.
Troubleshooting the POST
ü Make sure you have the latest BIOS and firmware updates for the hardware installed in the system.
ü Replace the CMOS battery if it has failed.
ü Investigate recently added hardware (RAM, Video cards, SCSI adapters, etc.)
ü Remove recently added RAM modules.
ü Remove all adapter cards, then replace individually, ensuring they are properly seated.
ü Move adapter cards to other slots on the motherboard.
ü If the computer still will not complete POST, contact your manufacturer.
Initial Disk Access
Depending on the boot device order specified in your computer’s BIOS, your computer may attempt to boot from a CD-ROM, Network card, Floppy disk, USB device or a hard disk. For the purposes of this document, we’ll assume that we’re booting to a hard disk since that is the most common scenario.
Before we discuss the sequence of events that occur during this phase of startup, we need to understand a little bit about the layout of the boot disk. The structure of the hard disk can be visualized this way: (Obviously, these data areas are not to scale)
Hard disks are divided into Cylinders, Heads and Sectors. A sector is the smallest physical storage unit on a disk and is almost always 512 bytes in size. For more information about the physical structure of a hard disk, please refer to the following Resource Kit chapter: http://www.microsoft.com/resources/documentation/windowsnt/4/server/reskit/en-us/resguide/diskover.mspx
There are two disk sectors critical to starting the computer that we’ll be discussing in detail:
· Master Boot Record (MBR)
· Boot Sector
The MBR is always located at Sector 1 of Cylinder 0, Head 0 of each physical disk. The Boot Sector resides at Sector 1 of each partition. These sectors contain both executable code and the data required to run the code.
Please note that there is some ambiguity involved in sector numbering. Cylinder/Head/Sector (CHS) notation begins numbering at C0/H0/S1. However, Absolute Sector numbering begins numbering at zero. Absolute Sector numbering is often used in disk editing utilities such as DskProbe. These differences are discussed in the following knowledge base article:
Q97819 Ambiguous References to Sector One and Sector Zero
Now that we have that straight, when does this information get written to disk? The MBR is created when the disk is partitioned. The Boot Sector is created when you format a volume. The MBR contains a small amount of executable code called the Master Boot Code, the Disk Signature and the partition table for the disk. At the end of the MBR is a 2-byte structure called a Signature Word or End of Sector marker, which should always be set to 0x55AA. A Signature Word also marks the end of an Extended Boot Record (EBR) and the Boot Sector.
The Disk Signature, a unique number at offset 0x1B8, identifies the disk to the operating system. Windows 2000 and higher operating systems use the disk signature as an index to store and retrieve information about the disk in the registry subkey HKLM\System\MountedDevices.
The Partition Table is a 64-byte data structure within the MBR used to identify the type and location of partitions on a hard disk. Each partition table entry is 16 bytes long (four entries max). Each entry starts at a predetermined offset from the beginning of the sector as follows:
Partition 1 0x1BE (446)
Partition 2 0x1CE (462)
Partition 3 0x1DE (478)
Partition 4 0x1EE (494)
The following is a partial example of a sample MBR showing three partition table entries in-use and one empty:
000001B0: 80 01
000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00 00 00
000001D0: 81 0A 07 FE FF FF 8A F5 – 7F 00 3D 26 9C 00 00 00
000001E0: C1 FF 05 FE FF FF C7 1B – 1C 01 D6 96 92 00 00 00
000001F0: 00 00 00 00 00 00 00 00 – 00 00 00 00 00 00
Let’s take a look at each of the fields of a partition table entry individually. For each of these explanations, I’ll use the first partition table entry above and highlight the relevant section. Keep in mind that these values are little-endian.
00=Do Not Use for Booting
80=Active partition (Use for Booting)
000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00
000001C0: 01 00 07 FE BF 09 3F 00 - 00 00 4B F5 7F 00
Only the first 6 bits are used. The upper 2 bits of this byte are used by the Starting Cylinder field.
Uses 1 byte + 2 bits from the Starting Sector field to make up the cylinder value. The Starting Cylinder is a 10-bit number with a maximum value of 1023.
Defines the volume type. 0x07=NTFS
Other Possible System ID Values:
FAT12 primary partition or logical drive (fewer than 32,680 sectors in the volume)
FAT16 partition or logical drive (32,680–65,535 sectors or 16 MB–33 MB)
BIGDOS FAT16 partition or logical drive (33 MB–4 GB)
Installable File System (NTFS partition or logical drive)
FAT32 partition or logical drive
FAT32 partition or logical drive using BIOS INT 13h extensions
BIGDOS FAT16 partition or logical drive using BIOS INT 13h extensions
Extended partition using BIOS INT 13h extensions
Dynamic disk volume
Legacy FT FAT16 disk *
Legacy FT NTFS disk *
Legacy FT volume formatted with FAT32 *
Legacy FT volume using BIOS INT 13h extensions formatted with FAT32 *
Partition types denoted with an asterisk (*) indicate that they are also used to designate non-FT configurations such as striped and spanned volumes.
Ending Head (0xFE=254 decimal)
As with the Starting Sector, it only uses the first 6 bits of the byte. The upper 2 bits are used by the Ending Cylinder field.
Uses 1 byte in addition to the upper 2 bits from the Ending Sector field to make up the cylinder value. The Ending Cylinder is a 10-bit number with a maximum value of 1023.
The offset from the beginning of the disk to the beginning of the volume, counting by sectors.
0x0000003F = 63
The total number of sectors in the volume.
0x007FF54B = 8,385,867 Sectors = 4GB
Are you with me so far? Good! Now, Cylinder/Sector encoding can be a bit tricky, so let’s take a closer look.
Cylinder - Sector Encoding
Cylinder bits 1-8
9 & 10
As you can see, the Sector value occupies the lower 6 bits of the word and the Cylinder occupies the upper 10 bits of the word. In our example, the starting values for Cylinder and Sector are 01 00. Values in the MBR use reverse-byte ordering, which is also referred to as ‘little-endian’ notation. Therefore, we swap the bytes and find that our starting values are Cyl 0, Sec 1.
Our ending values are more interesting: BF 09. First, we swap the bytes and obtain a hex value of 0x09BF. This value in binary notation is 100110111111. The following table illustrates how we derive the correct partition table values from these bytes:
Example: BF 09
10 Cylinder value bits 1-8
The 6 low bits are all set to 1, therefore our Sector value is 111111 or 63. You can see above how the bits are arranged for the Cylinder value. The value above is 1000001001 (521). Since both Cylinder and Head values begin numbering at zero, we have a total of 522 Cylinders and 255 Heads represented here. This gives us an ending CHS value of: 522 x 255 x 63 = 8,385,930 sectors.
Subtracting the starting CHS address (Cylinder 0, Head 1, Sector 1) (63) gives us the total size of this partition: 8,385,867 sectors or 4GB. We can verify this number by comparing it to the Total Sectors represented in the partition table: 4B F5 7F 00. Applying reverse-byte ordering gives us 00 7F F5 4B which equals 8,385,867 sectors.
So, now that we have an understanding of what is contained within the structures on the disk, let’s look at the sequence of events that occur. Remember, this is just after POST has successfully completed.
1. The motherboard ROM BIOS attempts to access the first boot device specified in the BIOS. (This is typically user configurable and can be edited using the BIOS configuration utility.)
2. The ROM BIOS reads Cylinder 0, Head 0, and Sector 1 of the first boot device.
3. The ROM BIOS loads that sector into memory and tests it.
a. For a floppy drive, the first sector is a FAT Boot Sector.
b. For a hard drive, the first sector is the Master Boot Record (MBR).
4. When booting to the hard drive, the ROM BIOS looks at the last two signature bytes of Sector 1 and verifies they are equal to 55AA.
a. If the signature bytes do not equal 55AA the system assumes that the MBR is corrupt or the hard drive has never been partitioned. This invokes BIOS Interrupt 18, which displays an error that is BIOS-vendor specific such as “Operating System not found”.
b. If the BIOS finds that the last two bytes are 55AA, the MBR program executes.
5. The MBR searches the partition table for a boot indicator byte 0x80 indicating an active partition.
a. If no active partition is found, BIOS Interrupt 18 is invoked and a BIOS error is displayed such as “Operating System not found”.
b. If any boot indicator in the MBR has a value other than 0x80 or 0x00, or if more than one boot indicator indicates an active partition (0x80), the system stops and displays “Invalid partition table”.
c. If an active partition is found, that partition’s Boot Sector is loaded and tested.
6. The MBR loads the active partition’s Boot Sector and tests it for 55AA.
a. If the Boot Sector cannot be read after five retries, the system halts and displays “Error loading operating system”.
b. If the Boot Sector can be read but is missing its 55AA marker, “Missing operating system” is displayed and the system halts.
c. The bootstrap area is made up of a total of 16 sectors (0-15). If sector 0 for the bootstrap code is valid, but there is corruption in sectors 1-15, you may get a black screen with no error. In this case, it may be possible to transplant sectors 1-15 (not Sector 0) from another NTFS partition using DskProbe.
d. If the Boot Sector is loaded and it passes the 55AA test, the MBR passes control over to the active partition’s Boot Sector.
7. The active partition’s Boot Sector begins executing and looks for NTLDR. The contents of the Boot Sector are entirely dependent on the format of the partition. For example, if the boot partition is a FAT partition, Windows writes code to the Boot Sector that understands the FAT file system. However, if the partition is NTFS, Windows writes NTFS-capable code. The role of the Boot Sector code is to give Windows just enough information about the structure and format of a logical drive to enable it to read the NTLDR file from the root directory. The errors at this point of the boot process are file system specific. The following are possible errors with FAT and NTFS Boot Sectors.
a. FAT: In all cases it displays “press any key to restart” after either of the following errors.
(1) If NTLDR is not found “NTLDR is missing” is displayed
(2) If NTLDR is on a bad sector, “Disk Error” displays.
b. NTFS: In all cases it displays “Press CTRL+ALT+DEL to restart” after any of the following errors.
(1) If NTLDR is on a bad sector, “A disk read error occurred” is displayed. This message may also be displayed on Windows 2000 or higher systems if Extended Int13 calls are required, but have been disabled in the CMOS or SCSI BIOS. This behavior may also be seen if an FRS (File Record Segment) cannot be loaded or if any NTFS Metadata information is corrupt.
(2) If NTLDR is not found, “NTLDR is missing” displays.
(3) If NTLDR is compressed, “NTLDR is compressed” displays.
8. Once NTLDR is found, it is loaded into memory and executed. At this point the Boot Loader phase begins.
Troubleshooting the Initial Phase (Initial Disk Access)
If you are unable to boot Windows and the failure is occurring very early in the boot process, begin by creating a Windows boot floppy and using it to attempt to boot the system.
Q119467 How to Create a Bootable Disk for an NTFS or FAT Partition
If you are unable to boot to Windows NT with a floppy, skip to the section titled “Unable to boot using floppy”.
If you can successfully start the computer from the boot disk, the problem is limited to the Master Boot Record, the Boot Sector, or one or more of the following files: NTLDR, NTDetect.com, or Boot.ini. After Windows is running, you should immediately back up all data before you attempt to fix the Boot Sector or Master Boot Record.
1. Run a current virus scanning program to verify that no virus is present.
2. Select the method of troubleshooting based upon the error generated.
· Operating System not found
Typical BIOS-specific error indicating no 55AA signature on the MBR. This must be repaired manually using a disk editor such as DskProbe.
Warning: Running FixMBR or FDISK /MBR in this case will clear the partition table.
· Invalid Partition Table
May be repaired using DiskProbe – check for multiple 0x80 boot indicators.
· Error Loading Operating System
This error indicates an invalid Boot Sector. Use the “Repairing the Boot Sector” section below.
· Missing Operating System
This error indicates that the 55AA signature is missing from the Boot Sector. This may be replaced using DskProbe or by using the “Repairing the Boot Sector” section below.
· NTLDR is missing
Use the “Repairing the NTLDR file” section below.
· Disk error (FAT only)
Indicates that NTLDR is located on a bad sector.
· A disk read error occurred (NTFS only)
Indicates that NTLDR is on a bad sector. Use the “Repairing the NTLDR file” section below.
· NTLDR is compressed
Indicates that NTLDR is compressed. Uncompress or replace NTLDR. Note: This error is rare.
· Computer boots to a black screen with blinking cursor
Could be an issue with the MBR or the active partition’s boot code (Q228734). Use the appropriate section below.
Repairing the MBR
Use the FIXMBR command from the Windows Recovery Console to re-write the boot code located in the first 446 bytes of the MBR, but to leave the partition table intact.
326215 How To Use the Recovery Console on a Windows Server 2003-Based Computer That Does Not Start
WARNING: DO NOT use this method if you are receiving “Operating System not found” or other BIOS-specific error. FIXMBR will zero out the partition table if the 55AA signature is missing from the MBR. The data will be unrecoverable.
Repairing the Boot Sector
Use the FIXBOOT command from the Windows Recovery Console to write a new Boot Sector to the system partition.
If you are not able to repair the Boot Sector by using this method, you may repair it manually using the following knowledge base article:
Q153973 Recovering NTFS Boot Sector on NTFS Partitions
Repairing the NTLDR file
Use the Recovery Console to copy NTLDR from \i386 directory of the Windows CD-ROM to the root of the hard drive.
Unable to boot using floppy
If you are unable to start the computer with a boot floppy and you are not receiving an error message, then the problem is most likely with your partition tables. If invalid partitions are present and you are unable to start your computer with a boot disk, formatting the disk, reinstalling Windows and restoring from backup may be your only option. It may be possible to recreate the partition table using DskProbe if a backup of the partition information is available. It also may be possible to manually reconstruct the partition table however this is outside the scope of this blog post. We may cover this later.
If you do not have a current backup of the data on the computer, as a last resort a data recovery service may be able to recover some of your information.
Helpful knowledge base articles:
Q272395 Error Message: Boot Record Signature AA55 Not Found
Q155892 Windows NT Boot Problem: Kernel File Is Missing From the Disk
Q228004 Changing Active Partition Can Make Your System Unbootable
Q155053 Black Screen on Boot
Q314503 Computer Hangs with a Black Screen When You Start Windows
Q153973 Recovering NTFS Boot Sector on NTFS Partitions
Okay boys & girls, that’s it for now. Next time, I’ll continue with the boot sequence and cover the Boot Loader phase. This is where things really get interesting… J
Hi again! This is Tate from the CPR team and I’m going to show you how to debug a Server Service hang and the sometimes dreaded Event ID: 2021 and Event ID: 2022. There is much Voodoo about troubleshooting these two events but never fear, it’s possible to debug quickly given the right approach.
Recall, I’ve talked about two other common Server Service events prior, 2019 and 2020. (Event ID: 20xx usually denotes a srv.sys series event) Also, from now on I’m going to refer to the Server Service as srv. It’s in the driver srv.sys that the bulk of the work is happening and as such that’s just what we call it.
Anyway, if you have seen these events before you likely have queried the web with the ID or description and have seen the big KB317249 article.
This is an oldie but goodie, but the problem is that, well it’s big and it would help to have perspective on how srv is supposed to work because all the root causes for the event seem quite varied…but often there is only one of two things happening here.
1.) Blocking in the I/O stack preventing normal amounts of SMB requests from getting processed in a timely fashion. (this is the most common that we see, the SMB pump failure case)
2.) Abnormally large amount of requests directed at srv(this is usually a really chatty SMB application or usually accumulation of ChangeNotify requests).
I’ll show you how to determine which of these two are happening but first let’s start with the event text as a refresher…
Event ID: 2022Source: Srv Description: Server was unable to find a free connection n times in the last s seconds.
Event ID: 2021Source: SrvDescription: Server was unable to create a work item %2 times in the last %3 seconds.
What does this mean?
We really have to start with an explanation of Connections and Work Items. So, internally to srv these are just structures used to represent a connection into srv and an item of work…couldn’t be that straightforward right? It is, when a request comes into srv it has to use an Endpoint, Connection, a Work Item, and other structures to know about where the request came from, what the work is, and how to get the response back to what client.
At its core, srv is just a Kernel Mode Driver that does SMB processing. It takes in network SMB/CIFS requests, converts them into local I/O, and pumps the return back out to the client. But the problem here is why would you be “unable to find” or “unable to create” one of these or what breaks this pump/processing?
But first, let’s comment on storage. Kernel memory is not free and as mentioned in prior blogs these structures take up space in the kernel pools and or may have preconfigured srv nominal limits (see MaxWorkItems in KB317249, etc.) if you hit either of these limits the event will be logged. So as we learned prior running out of Paged or NonPaged pool can hang a machine no problem but can also log these events should Srv not be able to allocate memory on behalf of client requests. (See the Event ID: 2019 2020 blog for that info and troubleshooting). What usually happens here though are these WorkItems build up due to blocking or high load and cause the Event ID: 2019 or 2020…(with the top consumer usually LSwn – LanmanServer WorkItems) and then you will see these along with the 2021/2022s. An event party if you will with the 2021/2022 issue in srv actually exhausting kernel memory pools.
Could you bump up the limits of everything in \LanmanServer via KB317249 not read further and be okay?
Maybe, but it depends how much blocking you may have or how high the load gets before the point of failure. If you are logging this even after the server has been working fine under your normal ebb and flow of load in the past, it may be likely that this article will not completely fix your problem so further work is needed. Keep in mind too that if you bump up the limits in the registry per the above explanation you could actually hit the pool limits sooner! The values in KB317249 are pretty conservative so they shouldn’t hurt but again it might not completely help. Also, if you don’t like changing things until you understand more…read on!
REASON #1 for Event ID: 2021 and 2022
The pump mentioned above is really the Srv!WorkerThread and KQUEUE architecture. When requests come in they are put in KQUEUEs via KeInsertQueue and as you might imagine, the processing/draining of those queues is done by the Srv!WorkerThread(s) via a call to KeRemoveQueue. These threads are the heart of Srv and are the key to determining what is usually happening to break the SMB processing here, so we usually look there for the answer.
To check in on srv you gather a kernel memory.dmp via KB244139 during the problem or with livekd (Sysinternals) perhaps…but the best way is KB244139 or by attaching the kernel debugger. The best time to gather this dump is when the server is somewhat unresponsive via the network (SMB) or you are constantly logging the 2021/2022 events…or if the machine is completely hung after hitting these Events.
With memory.dmp in hand, just load it up and do the following to get the system process’ EPROCESS address via !process 0 0 system
0: kd> !process 0 0 system
PROCESS 89067520 SessionId: 0 Cid: 0008 Peb: 00000000 ParentCid: 0000
DirBase: 00030000 ObjectTable: 89067008 TableSize: 490.
Now change process context using that address with .process /r /p <system EPROCESS address>
0: kd> .process /r /p 89067520
Implicit process is now 89067520
Now let’s dump out all the system threads using !process <system EPROCESS address>
0: kd> !process 89067520
Dumping out the threads…
Once the output of threads is finished, I like to search up from the bottom for “Start Address srv!WorkerThread” looking at the call stacks…
Note: There’s a shortcut here for Windows 2003 to just do a “!stacks 2 Srv!WorkerThread” right away. This output will give you a really quick look at the threads without having to .process into system, etc.
Here’s an example of a “good thread” just waiting for work to come in…sitting in KeRemoveQueue
THREAD 887c14e0 Cid 8.474 Teb: 00000000 Win32Thread: 00000000 WAIT: (WrEventPairLow) KernelMode Non-Alertable
Owning Process 89067520
Wait Start TickCount 160832116 Elapsed Ticks: 203335
Context Switch Count 134294
Start Address srv!WorkerThread (0xbea06880)
Stack Init beb11000 Current beb10d3c Base beb11000 Limit beb0e000 Call 0
Priority 10 BasePriority 9 PriorityDecrement 0 DecrementCount 0
ChildEBP RetAddr Args to Child
beb10d54 8042ef5f 88a10eb8 be9fd320 00000001 nt!KiSwapThread+0x1b1
beb10d78 bea068ec 00000000 00000000 00000000 nt!KeRemoveQueue+0x197
beb10da8 804578c2 be9fd300 00000000 00000000 srv!WorkerThread+0x6c
beb10ddc 8046c966 bea06880 be9fd320 00000000 nt!PspSystemThreadStartup+0x54
00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16
What does a “bad thread” look like?
Here is a blocked Srv!WorkerThread that is trying to process a FindFirst SMB request…
THREAD 887dd4e0 Cid 8.45c Teb: 00000000 Win32Thread: 00000000 WAIT: (UserRequest) KernelMode Non-Alertable
86941288: (0006,01fc) Flags: 00000884 Mdl: 00000000
Impersonation token: e5d56990 (Level Impersonation)
Wait Start TickCount 160783273 Elapsed Ticks: 252178
NOTE the Tick count here on this thread….it’s been waiting a while, not good.
Context Switch Count 12993
Stack Init bec29000 Current bec284f4 Base bec29000 Limit bec26000 Call 0
bec2850c 8042d893 00000000 882549c8 00000000 nt!KiSwapThread+0x1b1
bec28534 f5ae3408 88255db8 00000006 00000000 nt!KeWaitForSingleObject+0x1a3
bec285bc f5ae4066 bec285e4 00000000 00000000 SomeFiltr+0x3408<<<<<<<making a blocking call here…
bec285fc f5ae19fb 884ce3a0 01941288 8041eecb SomeFiltr+0x4066
bec287a4 804c4994 884ce3a0 00000000 bec28894 SomeFiltr+0x19fb
bec287dc 804531c2 884766c8 00000000 bec28894 nt!IopParseFile+0x44
bec28854 804da4d8 0000051c bec28950 00000040 nt!ObpLookupObjectName+0xf8
bec28964 804a4495 00000000 00000000 00000000 nt!ObOpenObjectByName+0xc8
bec28a40 804a403a 8690d008 00000001 bec28b4c nt!IopCreateFile+0x407
bec28a88 bea0282c 8690d008 00000001 bec28b4c nt!IoCreateFile+0x36
bec28af8 bea050f6 88a0d2f8 8690d008 00000001 srv!SrvIoCreateFile+0x32a
bec28b8c bea0acdb 88a0d2f8 00000001 8046b000 srv!SrvQueryDirectoryFile+0x1c8
bec28c5c bea0936d 88a0d2f8 00000001 00000000 srv!SrvFind2Loop+0x171
bec28cd0 bea07d2f 88a0d201 bea0a9c8 88a0d2f8 srv!DoFindFirst2+0x373
bec28cd8 bea0a9c8 88a0d2f8 88a0d2f8 804636f0 srv!SrvSmbFindFirst2+0x31
bec28cf0 bea089b7 88994f38 88a0d2f8 00000000 srv!ExecuteTransaction+0xdc
bec28d68 be9f27ae 88a0d2f8 be9f2cde 88a0d300 srv!SrvSmbTransaction+0x5af
bec28d70 be9f2cde 88a0d300 88a144a0 bec28da8 srv!SrvProcessSmb+0x46
bec28d84 bea069a9 00000000 00000000 00000000 srv!SrvRestartReceive+0xa2
bec28da8 804578c2 88a14400 00000000 00000000 srv!WorkerThread+0x129
bec28ddc 8046c966 bea06880 88a144a0 00000000 nt!PspSystemThreadStartup+0x54
Note: The tick count (Elapsed Ticks) was not bad for the prior “good thread” because we’ve just been waiting in the call to KeRemoveQueue for that long for work to arrive. No work to do in that queue, no problem. However, in this “bad thread “case we’ve grabbed a WorkItem to process and been waiting for a long time (about 1hr 5 min.) stuck while processing it, yikes.
Why is this bad?
So if the Srv!WorkerThread is the worker thread that drains the WorkItems from this list of request in the KQUEUEs and number of these threads are finite(they are), then we have a problem! As more requests come into srv they will continue to get placed in the KQUEUEs by different event handler threads at the transport layer but with no WorkerThread available to process the queue on the back end (and considering these Connections and WorkItems are stored in Pool and the number is finite as well) this is how and usually why the pump stops and we log these events! So not only might we run out of the configured limits in the registry (KB317249) for srv, we might run out of Non Paged Pool, and we are definitely not giving clients a timely response if at all in this state.
What if I see “good” and “bad” threads in my !process of the system process?
This is likely because of the two queues that we have in service. There are blocking and a non-blocking queues so you may see the threads that are waiting in KeRemoveQueue are actually for the blocking queue whereas your other bad threads are blocked on things like CreateFile, Find, etc. requests…supposed to be shorter time requests, appropriately so, on the the non-blocking work queues. In general, if you see any Srv!WorkerThread blocked in anything but KeRemoveQueue for very long, it’s bad.
So what is the solution?
Referencing the above “bad thread” for SomeFiltr.sys note that the filter driver in question has chosen to make a blocking call to KeWaitForSingleObject. This is blocking a FindFirst request from a client but more importantly it’s consuming an Srv!WorkerThread in the process as stated….so it’s actually the filter driver making the decision to stop the processing so the fault lies with it in this case.
Now it’s worth mentioning that this is a fairly simple 1st tier blocking example but suffice to say any blocking call taking too long in a srv!WorkerThread can have bad consequences. This could have easily been any filter driver or device driver in the I/O stack (between the network and srv and the disk) which is blocking doing some work to “help” process the I/O or more difficult to debug the filter or function driver could be waiting on some other queuing mechanism that it implements…so you see how srv can encounter a dependency failure here.
Another trick that we use is looking at all other thread stacks containing the name of the filter driver to see if it’s doing something recognizable like waiting on another Kernel api, filter, etc…in other words guessing through the black box of it’s decision to wait on this thread and making an educated guess of what work it is trying to do on another…A common case is when the blocked thread may be waiting on a response from an LPC to LSASS which is performing some security check. Then it’s necessary to jump …..
In summary, since we can’t see further into the blocking here through SomeFiltr.sys and there’s no other threads in process with this driver in the call stack we’d recommend testing updating, removing, or contact the vendor of the SomeFiltr.sys filter to see if this is fixed, etc. You can do an “lmvm SomeFiltr” to get the time/date information for the file.
So, what if all the threads are just waiting like in the good thread example?
This could be the unfortunate side of gathering a single memory.dmp to represent a problem that may be transient. If a fraction of a second before you hit the Ctrl+Scroll+Scroll to issue the dump the Srv!WorkerThread unblocks, it may look completely normal. What you can do for this class of issue that has temporary blocking is either try again to get it in state, or have a look at the device stack that srv has to work through to get things done(see devstack below). Usually you will find out what filter drivers you are running, etc. and since that is the #1 reason for this event updating them is usually the solution because we or the vendor have likely already debugged this for you and with the filter vendor’s help have likely fixed it in a later version.
This is why usually the first thing we target is filter drivers before debugging anything for hangs or if a dump is not available. By frequency this is far and away the usual root cause for hangs of many many (did I say many?) symptom types.
How do I check the members of device stack that Srv.sys has to work through to get I/O requests processed, filter drivers I need to update, etc?
This is cool…so did you notice that there was an Irp listed in the prior “bad thread”?
Dump it out…
0: kd> !irp 86941288
Irp is active with 11 stacks 11 is current (= 0x86941460)
No Mdl Thread 887dd4e0: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
Args: 00000000 00000000 00000000 00000000
>[ 0, 0] 1 0 884ce3a0 878977a8 00000000-00000000
Args: bec28778 01000000 00070000 00000000
Now grab the Device Object for SomeFiltr…
0: kd> !devstack 884ce3a0
!DevObj !DrvObj !DevExt ObjectName
> 884ce3a0 \FileSystem\SomeFiltr884ce458
88ba5c80 \Driver\InnocentDriver 88ba5d38
88ba5460 \FileSystem\Ntfs 88ba5518
Now you can see how this thing works….ntfs talks to InnocentDriver and it to SomeFiltr, etc….all these guys have to work flawlessly to not kill the srv request dependent upon this device stack.
What if you didn’t have an Irp, no problem…everthing hangs off Object Manager so start at Ntfs’s device objects…This is cooler…
0: kd> !drvobj \FileSystem\Ntfs <<<<note that’s !DRVobj for the Driver Object
Driver object (88dd2710) is for:
Driver Extension List: (id , addr)
Device Object list:
882b4120 8717a620 88b91800 88b981e0
88b9e600 88ba5460 88f583e0 88dd25f0
0: kd> !devstack 882b4120
!DevObj !DrvObj !DevExt ObjectName
86576020 \FileSystem\SomeFiltr865760d8 <<<<<<there you are Mr. SomeFiltr
> 882b4120 \FileSystem\Ntfs 882b41d8
0: kd> !devstack 8717a620
8792ae00 \FileSystem\SomeFiltr8792aeb8 <<<<<<and here
> 8717a620 \FileSystem\Ntfs 8717a6d8
0: kd> !devstack 88dd25f0
886c9020 \FileSystem\SomeFiltr886c90d8 <<<<<<<and here
88f74020 \Driver\InnocentDriver 88f740d8
> 88dd25f0 \FileSystem\Ntfs 00000000 Ntfs
The prudent thing to do if you aren’t sure of root cause by debugging the Srv!WorkerThreads you have captured in a memory.dmp is at least make sure all the filter drivers in these stacks are updated and perhaps even your storage drivers that are being used under ntfs serving your file share(s).
REASON #2 for Event ID: 2021 and 2022
What we have just discussed was the most common reason for hitting the Event ID: 2021 and 2022 event in blocking by a filter or function driver on the I/O stack and how to identify. There are of course other reasons like bursts of demand and a lower than the maximum configured Srv (the nominal limits mentioned above and discussed in KB317249) or just simply a lot of long term requests that are on the system such as ChangeNotify requests. How would you find out about these? Analyze the network traffic.
There are tons of network sniffers which allow you to gauge what type of SMB traffic is going in between servers so I won’t get into that here. I will only say that if you have the ability to generate Statistics of top SMB commands, that’s the way to go. You could however just filter on SMB and ballpark what kind of request and from which clients are occurring most and given your client configuration if that’s normal. What we see here usually is a misbehaving network application in a loop, etc. hitting the file server with repeated requests for the same file, etc. This is also a great strategy for high system process CPU issues where when you look at the thread call stacks generating the CPU via Process Explorer (sysinternals) you find the base of the activity in Srv!WorkerThreads coming and going rapidly.
I will mention however there are a few known ways to reduce the cumulative load of ChangeNotifications and QueryFile and QueryPathInfo requests that Windows Clients actually generate. You can check out the following KBs for these configuration changes on the client side:
For the client SMB side’s shell: NoRemoteChangeNotify.
“An update is available in Windows to turn off directory notification SMB requests”
For the client SMB side (redirector): InfoCacheLevel
(also has shell change NoRemoteRecursiveEvents )
“Windows XP Explorer Pane flickers on mapped network drives”
I hope you’ve enjoyed this post and hopefully the next time you see these srv events you can nail the root cause with windbg no problem!
Keep that device stack updated and until next time…
Hello - Matthew here again. Today I'll be discussing in detail hang scenario #1 that Tate first mentioned a few blogs posts ago. From a debugging perspective, in an ideal world an application would always provide some kind of feedback when a failure occurs. The reality is that sometimes an application just doesn't do anything in response to a user-initiated action (such as a mouse-click) that results in a failure of some kind. When this happens, how can you determine what is going wrong?
In general, Process Monitor from Sysinternals is a great tool for gathering clues about this kind of failure. If the failure is due to a file system or registry problem, Process Explorer will catch the problematic event. Or if you suspect that the problem is due to something failing over the network, a network trace could help.
Let's assume that the techniques described above do not move us any closer to understanding the failure. What next? If you don't know what the app does in the background, and you don't have source code for the app, how do you proceed? Let's answer those questions by debugging a sample application...
Download the sample application here.
Here's what we know:
1. Clicking the “Button 1” button is supposed to display a useful dialog box
2. For certain users, clicking “Button 1” results in no observable response
3. You don't have access to the source code or debug symbols for the app
4. No one seems to know what the “Button 1” button does behind the scenes
5. The developer of the application hasn't been heard from in 3 years
We need to understand what exactly happens when the user clicks the “Button 1” button. Every window has a WindowProc function that receives messages upon user input. Since a button is considered a "control", the WindowProc for the main application window will get a WM_COMMAND message indicating that the button was clicked. To differentiate which specific control sent the message, each WM_COMMAND message has a control identifier associated with it.
So we need to:
a) Find the control identifier for “Button 1”
b) Find the WindowProc for the main application window
c) Determine where WM_COMMAND messages for the “Button 1” button are handled
d) Figure out what is failing in that code
Let’s get started...
Find the control identifier for “Button 1”
Spy++ (available as part of Visual Studio) is the tool for this job. Click Spy -> Log Messages. Use the Finder Tool to select the main window of ntdbghang1.exe. On the messages tab, click “Clear All” then select “WM_COMMAND”. Click “OK” and SPY++ will start logging. Click “Button 1” in ntdbghang1.exe and you should see one line of text appear in the output window. For good measure, click “Button 2” also, and you should see a second line of text. At this point, your SPY++ window should look something like this....
We can tell from this output, that the control identifier for “Button 1” is 257 (0x101), and the control identifier for “Button 2” is 258 (0x102). We’ll need this information later.
Find the WindowProc for the main application window
We can find the WindowProc address by using SPY++ again. Click Spy -> Find Window and use the Finder Tool to select the main window of ntdbghang1.exe. Make sure “Show properties” is selected, and then click “OK”. You will get a dialog that looks like this....
Note the value of Window Proc. This is the address of the window procedure for the main window (the exact value may differ on your system). We know that this function is a WindowProc, which means it is of the form...
LRESULT CALLBACK WindowProc(
The meaning of wParam and lParam differ depending on what uMsg is being processed. When a WM_COMMAND message is passed, the low-word of wParam is the control identifier, which we determined to be 0x101. The high-word of wParam is the control notification code, which in the case of a button being clicked is BN_CLICKED (literally 0).
So, we specifically are interested in the case where:
uMsg = WM_COMMAND (literally 0x111)
wParam = 0x101
While looking at assembly language for the WindowProc, the stack frame will look like this...
ebp = “old ebp”
ebp+4 = “return address”
ebp+8 = hwnd
ebp+c = uMsg
ebp+10 = wParam
ebp+14 = lParam
Determine where WM_COMMAND messages for the “Button 1” button are handled
Once you have the address of the WindowProc, we can examine the assembly code for that function using Windbg. Launch windbg.exe and select “File -> Attach to a Process”. Select “ntdbghang1.exe” from the list and click “OK”. When the debugger breaks in, you can unassemble the beginning of the WindowProc by typing “u <address>”. Based on the address on my system, that command will be “u 01002830”. To unassemble more just type “u” again. I’m now going to unassemble the relevant code and explain what it means along the way...
0:001> u 01002830
The first 3 instructions are a function prologue, just getting everything set up.
01002830 8bff mov edi,edi
01002832 55 push ebp
01002833 8bec mov ebp,esp
Here is the code that moves the value of uMsg into ecx
01002835 8b4d0c mov ecx,dword ptr [ebp+0Ch]
Now the app enters the assembly equivalent of a switch statement.
We are interested in the check for uMsg = WM_COMMAND
if uMsg = WM_CREATE (1) goto 01002893
01002838 49 dec ecx
01002839 7458 je ntdbghang1+0x2893 (01002893)
if uMsg = WM_ DESTROY (2) goto 01002889
0100283b 49 dec ecx
0100283c 744b je ntdbghang1+0x2889 (01002889)
if uMsg = WM_CLOSE (0x10) goto 01002889
0100283e 83e90e sub ecx,0Eh
01002841 743b je ntdbghang1+0x287e (0100287e)
if uMsg = WM_COMMAND (0x111) goto 01002853
01002843 b801010000 mov eax,101h
01002848 2bc8 sub ecx,eax
0100284a 7407 je ntdbghang1+0x2853 (01002853)
From the above switch statement we see that when uMsg = WM_COMMAND, execution moves to 01002853. So let’s go there...
0:001> u 01002853
Move the value of wParam into edx
01002853 8b5510 mov edx,dword ptr [ebp+10h]
If LOWORD(wParam) == 0x101 (control ID for button 1) goto 0100286f
01002856 0fb7ca movzx ecx,dx
01002859 2bc8 sub ecx,eax
0100285b 7412 je ntdbghang1+0x286f (0100286f)
From the above assembly we can see that right away we are checking for the control ID for button 1. This is the code path we want to follow....
0:001> u 0100286f
If HIWORD(wparam) != BN_CLICKED (0) then goto 0100289b
Otherwise, call the function at 010027f6
0100286f c1ea10 shr edx,10h
01002872 6685d2 test dx,dx
01002875 7524 jne ntdbghang1+0x289b (0100289b)
01002877 e87affffff call ntdbghang1+0x27f6 (010027f6)
Now it is clear that when the BN_CLICKED occurs for control id 0x101, we are going to execute the function at address 010027f6.
Figure out what is failing
So now we know what code is going to run every time the button is clicked (the function at 010027f6). Let’s examine that code and figure out what may be failing. We’ll use “uf” to unassemble the entire function, as it is rather short. During this first pass we’ll make up names for our local variables and functions that are called, and then we’ll revisit and figure out what they are.
0:001> uf 010027f6
010027f6 8bff mov edi,edi
010027f8 55 push ebp
010027f9 8bec mov ebp,esp
010027fb 51 push ecx
010027fc 56 push esi
call func1(0x20) [func1 address at 01002dca]
010027fd 6a20 push 20h
010027ff c745fc10000000 mov dword ptr [ebp-4],10h
01002806 e8bf050000 call ntdbghang1+0x2dca (01002dca)
Store the result of func1 in esi (call it localvar2)
0100280b 8bf0 mov esi,eax
Call func2(localvar2, &localvar1) [func2 address at 0100b484]
0100280d 59 pop ecx
0100280e 8d45fc lea eax,[ebp-4]
01002811 50 push eax
01002812 56 push esi
01002813 e86c8c0000 call ntdbghang1+0xb484 (0100b484)
If func2 returns 0, then goto 01002821
01002818 85c0 test eax,eax
0100281a 7405 je ntdbghang1+0x2821 (01002821)
Call func3() [func3 address at 0100278d]
0100281c e86cffffff call ntdbghang1+0x278d (0100278d)
Call func4(localvar2) [func4 address at 01002ce3]
01002821 56 push esi
01002822 e8bc040000 call ntdbghang1+0x2ce3 (01002ce3)
Clean up and exit the function
01002827 59 pop ecx
01002828 5e pop esi
01002829 c9 leave
0100282a c3 ret
So let’s translate this to some pseudo-code...
localvar1 = 0x10;
localvar2 = func1(0x20);
So the behavior of the function will differ, depending on the results of func2. Let’s see if we can figure out what func2 does...
0:001> u 0100b484
0100b484 ff2568110001 jmp dword ptr [ntdbghang1+0x1168 (01001168)]
0100b48a cc int 3
0100b48b cc int 3
0100b48c cc int 3
0:001> dps 01001168 L1
01001168 70b88cb1 WINSPOOL!GetDefaultPrinterW
GetDefaultPrinterW is a public API function. Here’s the function prototype from MSDN...
LPTSTR pszBuffer, // printer name buffer
LPDWORD pcchBuffer // size of name buffer
This fits our pseudo-code, since the function takes two parameters and returns a BOOL. Let’s update our pseudo code to match what we know, replacing func2 with GetDefaultPrinter, etc...
DWORD cchBuffer = 0x10;
LPWSTR pszBuffer = func1(0x20);
Based on the above, we can make a couple of assumptions. Func1 is likely an allocator function of some kind (such as malloc) and func4 is likely a memory free function (such as free). Since the app is using the Unicode version of GetDefaultPrinter (it ends with a W) it make sense that func1 is an allocator asking for 0x20 bytes, and the size of the buffer, in characters, passed to GetDefaultPrinterW is 0x10. So that accounts for all of the functions, except for func3. We could unassemble func3 to see what it does, but that may not be needed. From the pseudo-code we’ve already created, we can see that a failure in GetDefaultPrinterW will prevent the button click function from doing anything except allocating and freeing some memory. So GetDefaultPrinterW is a likely potential point of failure.
Let’s test this theory out. We’ll set a breakpoint on GetDefaultPrinterW and see if it is failing upon the button click.
0:001> bp WINSPOOL!GetDefaultPrinterW
<Now click Button 1>
Breakpoint 0 hit
eax=0006fb84 ebx=00000000 ecx=00000020 edx=00dc0e98 esi=00dc0e70 edi=0006fc0c
eip=70b88cb1 esp=0006fb74 ebp=0006fb88 iopl=0 nv up ei pl nz na po nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000202
70b88cb1 8bff mov edi,edi
// Dump out the call stack...
0006fb70 01002818 00dc0e70 0006fb84 00000111 WINSPOOL!GetDefaultPrinterW
0006fb88 0100287c 0006fbbc 75d41a10 00320f78 ntdbghang1+0x2818
0006fb90 75d41a10 00320f78 00000111 00000101 ntdbghang1+0x287c
0006fbbc 75d41ae8 01002830 00320f78 00000111 USER32!InternalCallWinProc+0x23
0006fc34 75d4286a 00000000 01002830 00320f78 USER32!UserCallWinProcCheckWow+0x14b
0006fc74 75d42bba 00a90b80 0095ee08 00000101 USER32!SendMessageWorker+0x4b7
// Go to the return address. Note that eax=0, meaning GetDefaultPrinterW returned FALSE.
eax=00000000 ebx=00000000 ecx=76f22033 edx=00e10178 esi=00dc0e70 edi=0006fc0c
eip=01002818 esp=0006fb80 ebp=0006fb88 iopl=0 nv up ei pl zr na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
01002818 85c0 test eax,eax
// Check the last error...
LastErrorValue: (Win32) 0x7a (122) - The data area passed to a system call is too small.
// Resume execution
The error returned was 0x7a = ERROR_INSUFFICIENT_BUFFER. It looks like the buffer of 0x20 bytes being passed to GetDefaultPrinterW is too small to hold the name of the default printer for that user. This explains why the app works for some users and fails for others. A quick change of the default printer name to something less than 0x10 characters (16 decimal) will work around this issue.
To wrap things up, here is the C source for the Button1_OnClick function (at address 010027f6 in the assembly) so you can see how this was actually written...
DWORD cch = 16;
pPrinterName = (LPTSTR) malloc(16 * sizeof(TCHAR));
Look for an upcoming blog post that shows another way this problem could have been approached – using the “wt” command in the debugger. I hope you found this useful and can apply to real-world debugging scenarios. Please post any questions / comments!
Written by Jeff Dailey
Hello, my name is Jeff, I’m a escalation engineer on the Microsoft CPR (critical problem resolution) platforms team. This blog entry is part 2 of my Hung Window?, No source?, No problem!! Part 1 blog. In this lab we will be debugging a problem involving multi threaded applications and synchronization objects and the types of things that can go wrong, and how to track them down. This process and training lab is right out of our CPR Training curriculum. In order to do the lab I have prepared for you, you will need to have downloaded the dumphungwindow and then badwindow.exe from my earlier blog post. You will also need to install the debugging tools for windows.
Previous blog http://blogs.msdn.com/ntdebugging/archive/2007/05/29/detecting-and-automatically-dumping-hung-gui-based-windows-applications.aspx
After you have both of these installed we can get started. We are going to debug and figure out why the window stops repainting and does not respond.
Step 1 start badwindow.exe
Step 2 run dumphungwindow.exe
Step 3 select Hang \ Hang Type 2 from the BadWindow.exe menu.
You should see dump hung window detect your window no processing messages and as a result it will dump the badwindow.exe process
************ OUTPUT *************
C:\source\dumphungwindow\release>dumphungwindow.exeDumps will be saved in C:\Users\jeffda\AppData\Local\Temp\scanning for hung windows
Hung Window found dumping process (12912) badwindow.exeDumping unresponsive processC:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_14_2007_Time7_34_5_Pid12912_badwindow.exe.dmp
Hung Window found dumping process (12912) badwindow.exe
Dumping unresponsive processC:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_14_2007_Time7_34_24_Pid12912_badwindow.exe.dmp\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_56_Pid7924_badwindow.exe.dmp
Step 4 create a local symbol directory at C:\websymbols
Step 5 set your symbol path under file \ symbols in windbg to SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols
See http://www.microsoft.com/whdc/devtools/debugging/debugstart.mspx for details.
Step 6 start windbg select file\open crash dump and select the first dump file.
Your initial output should look like this.
Microsoft (R) Windows Debugger Version 6.7.0001.0
Copyright (c) Microsoft Corporation. All rights reserved.
***** WARNING: Your debugger is probably out-of-date.
***** Check http://dbg for updates.
Loading Dump File [C:\Users\jeffda\AppData\Local\Temp\HWNDDump_Day6_12_2007_Time9_53_34_Pid7924_badwindow.exe.dmp]
User Mini Dump File with Full Memory: Only application data is available
Symbol search path is: SRV*c:\websymbols*http://msdl.microsoft.com/download/symbols;srv
Executable search path is:
Windows Vista Version 6000 MP (2 procs) Free x86 compatible
Product: WinNt, suite: SingleUserTS
Debug session time: Tue Jun 12 09:53:35.000 2007 (GMT-4)
System Uptime: 11 days 18:41:43.089
Process Uptime: 0 days 0:00:32.000
Loading unloaded module list
eax=00000000 ebx=00000002 ecx=00000000 edx=00000000 esi=00000000 edi=00000000eip=777faec5 esp=0017faf4 ebp=0017fb8c iopl=0 nv up ei pl nz na po nccs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202ntdll!ZwWaitForMultipleObjects+0x15:777faec5 c21400 ret 14h
Step 7 from the debugger prompt (Locate a prompt at the bottom of windbg that has a 0:000> next to it. Type ~* k
You will most likely see output similar to this.
. 0 Id: 3270.2b10 Suspend: 0 Teb: 7efdd000 Unfrozen
0017faf0 76e4edb5 ntdll!ZwWaitForMultipleObjects+0x15
0017fb8c 76e430c3 kernel32!WaitForMultipleObjectsEx+0x11d
0017fba8 00401502 kernel32!WaitForMultipleObjects+0x18
0017fbc8 0040139b badwindow!hangtype2+0x42 [c:\source\badwindow\badwindow\badwindow.cpp @ 340]
0017fc24 772a87af badwindow!WndProc+0x17b [c:\source\badwindow\badwindow\badwindow.cpp @ 274]
0017fc50 772a8936 user32!InternalCallWinProc+0x23
0017fcc8 772a8a7d user32!UserCallWinProcCheckWow+0x109
0017fd2c 772a8ad0 user32!DispatchMessageWorker+0x380
0017fd3c 004010fb user32!DispatchMessageW+0xf
0017ff0c 00401817 badwindow!wWinMain+0xfb [c:\source\badwindow\badwindow\badwindow.cpp @ 124]
0017ffa0 76eb19f1 badwindow!__tmainCRTStartup+0x150 [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]
0017ffac 7782d109 kernel32!BaseThreadInitThunk+0xe
0017ffec 00000000 ntdll!_RtlUserThreadStart+0x23
1 Id: 3270.2cd0 Suspend: 0 Teb: 7efda000 Unfrozen
026ffebc 777ecfad ntdll!ZwWaitForSingleObject+0x15
026fff20 777ecf78 ntdll!RtlpWaitOnCriticalSection+0x154
026fff48 0040153c ntdll!RtlEnterCriticalSection+0x152
026fff64 757c2848 badwindow!hangtype2threada+0x2c [c:\source\badwindow\badwindow\badwindow.cpp @ 358]
026fff9c 757c28c8 msvcr80!_endthread+0x4b
026fffa0 76eb19f1 msvcr80!_endthread+0xcb
026fffac 7782d109 kernel32!BaseThreadInitThunk+0xe
026fffec 00000000 ntdll!_RtlUserThreadStart+0x23Note the [NUMBER] Id: indicates the thread number, to the right of this you have the process id and thread id > PROCESS 3270.2b10 < THREAD | THREAD STATE > Suspend: 0 Teb: 7efdd000 Unfrozen
Each of these threads represents a call stack. The most recent call is at the TOP of the stack. As each call is made the stack grows larger. Looking at thread 0 you will see that our winproc appears to be blocked on a call to hangtype2, hangtype2 is making a call to WaitForMultipleObjects Lets look more closely at WaitForMultipleObjects
Docs for WaitForMultipleObjects
DWORD WINAPI WaitForMultipleObjects( DWORD nCount, const HANDLE* lpHandles, BOOL bWaitAll, DWORD dwMilliseconds
Lets look at the parameters passed to
0017faf0 76e4edb5 00000002 0017fb40 00000000 ntdll!ZwWaitForMultipleObjects+0x15 (FPO: [5,0,0])
0017fb8c 76e430c3 0017fb40 0017fbc4 00000001 kernel32!WaitForMultipleObjectsEx+0x11d (FPO: [Non-Fpo])
0017fba8 00401502 00000002 0017fbc4 00000001 kernel32!WaitForMultipleObjects+0x18 (FPO: [Non-Fpo])
0017fbc8 0040139b 00401220 0017fbfc 00401220 badwindow!hangtype2+0x42 (FPO: [0,2,0]) (CONV: cdecl) [c:\source\badwindow\badwindow\badwindow.cpp @ 340]
0017fc24 772a87af 00063d36 00000111 00008004 badwindow!WndProc+0x17b (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 274]
0017fc50 772a8936 00401220 00063d36 00000111 user32!InternalCallWinProc+0x23
0017fcc8 772a8a7d 00000000 00401220 00063d36 user32!UserCallWinProcCheckWow+0x109 (FPO: [Non-Fpo])
0017fd2c 772a8ad0 00401220 00000000 0017ff0c user32!DispatchMessageWorker+0x380 (FPO: [Non-Fpo])
0017fd3c 004010fb 0017fd54 00403938 00000001 user32!DispatchMessageW+0xf (FPO: [Non-Fpo])
0017ff0c 00401817 00400000 00000000 00280f8c badwindow!wWinMain+0xfb (CONV: stdcall) [c:\source\badwindow\badwindow\badwindow.cpp @ 124]
0017ffa0 76eb19f1 7efde000 0017ffec 7782d109 badwindow!__tmainCRTStartup+0x150 (FPO: [Non-Fpo]) (CONV: cdecl) [f:\sp\vctools\crt_bld\self_x86\crt\src\crtexe.c @ 589]
0017ffac 7782d109 7efde000 0017fb9e 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
0017ffec 00000000 00401987 7efde000 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo])
The first parameter is 00000002 this is the number of objects we are waiting on.
The second parmeter is the address of the array of objects, Lets dump it out and take a look at the objects
0:000> dd 0017fbc4
0017fbc4 000000c4 000000c8 0040139b 00401220
0017fbd4 0017fbfc 00401220 00063d36 0017fc48
0017fbe4 772a8989 772a894d 53ca28e7 00000000
0017fbf4 00063d36 00401220 00000000 00000000
0017fc04 00000000 0017fca0 00000001 00000000
0017fc14 ffffffff 772a88e5 53c4f4b4 75c12459
0017fc24 0017fc50 772a87af 00063d36 00000111
0017fc34 00008004 00000000 00401220 dcbaabcd
0:000> !handle 000000c4
0:000> !handle 000000c8
Type <Error retrieving type>
Looking at the second value it would appear as if all the info needed to get the handle type info is not in the dump for some reason. Handles are a index into the handle table in kernel. It’s possible when the dump was created that no all the handle info was included. However that’s ok. We have a simple way to work around this and see what happened.
We can use UF from part 1 of this blogs on badwindow.exe, All we need to do is UF the return address of
WaitForMultipleObjects. Lets run through the assembly and see what we are waiting on.
0:000> uf 00401502
badwindow!hangtype2 [c:\source\badwindow\badwindow\badwindow.cpp @ 334]:
Reserving space on the stack by decrementing ESP (The stack pointer, remember the stack grows down in memory)
334 004014c0 83ec08 sub esp,8
Save the state of ESI so it can be restored later.
334 004014c3 56 push esi
Get the pointer to _beginthread from the import table and store it in ESI Docs on being thread http://msdn2.microsoft.com/en-us/library/kdzttdcb(VS.80).aspx 1 (start address), 2 (stack size), 3 (arglist)
337 004014c4 8b3580204000 mov esi,dword ptr [badwindow!_imp___beginthread (00402080)]
Push the last arg to _beginthread on the stack this is the arg list for _beginthread in this case 0 we are passing no args.
337 004014ca 6a00 push 0
This is our stack space. Note in the debugger you can do a ? 2EE0h and it will show value in Hex and Dec, this value is 12000 dec.
337 004014cc 68e02e0000 push 2EE0h
This is the start address for our thread function, in this case hangtype2threada
337 004014d1 6810154000 push offset badwindow!hangtype2threada (00401510)
Here we call _beginthread this starts the thread up. The return value is a thread handle.
337 004014d6 ffd6 call esi
338 004014d8 6a00 push 0
This is our stack space arg for _beingthread
338 004014da 68e02e0000 push 2EE0h
This is the start address for our thread function, in this case hangtype2threadb
338 004014df 6870154000 push offset badwindow!hangtype2threadb (00401570)
We are now storing EAX (The return from the first _beginthreadcall ) into ESP+1ch (on our stack)
338 004014e4 8944241c mov dword ptr [esp+1Ch],eax
338 004014e8 ffd6 call esi
Any time we add to ESP We are shrinking or cleaning up the stack.
338 004014ea 83c418 add esp,18h
We are pushing our wait time for WaitForMultipleObjects in this case 0FFFFFFFFh (-1) Wait forever.
340 004014ed 6aff push 0FFFFFFFFh
Storing EAX on the stack, this is the thread handle from our last _beginthread call.
340 004014ef 8944240c mov dword ptr [esp+0Ch],eax
This is our wait logic, in this case it’s WaitAll, so we will only unblock once all handles are signaled or in this case threads complete running.
340 004014f3 6a01 push 1
Here we are loading the pointer of the stack location that contains our handles that we will wait on into EAX.
340 004014f5 8d44240c lea eax,[esp+0Ch]
Now we push the pointer to our handles / objects on the to the stack.
340 004014f9 50 push eax
And this is the count of objects, 2 in this case both of them threads.
340 004014fa 6a02 push 2
Now we call our WaitForMultipleObjects call to wait on hangtype2threadb and hangtype2threada to finish executing.
340 004014fc ff1510204000 call dword ptr [badwindow!_imp__WaitForMultipleObjects (00402010)]
Restore our ESI register, this will happen when we return.
340 00401502 5e pop esi
Dec our stack pointer.
342 00401503 83c408 add esp,8
Return we are done.
342 00401506 c3 ret
Here is the source.
handles = (HANDLE)_beginthread(hangtype2threada, 12000, NULL);
handles = (HANDLE)_beginthread(hangtype2threadb, 12000, NULL);
So what went wrong? Let’s look at our threads again.
We have our main message pump thread thread 0 waiting on two threads, One is still running badwindow!hangtype2threada and the other one is gone or has completed hangtype2threadb.
0 Id: 3270.2b10 Suspend: 0 Teb: 7efdd000 Unfrozen
Looking at hangtype2threada it would seem that it is blocked on RtlEnterCriticalSection.
026fffec 00000000 ntdll!_RtlUserThreadStart+0x23
Lets look and see what is happening with this critical section call..
First lets set our thread context to thread id 1
eax=00000000 ebx=00000000 ecx=00000000 edx=00000000 esi=00403780 edi=00000000
eip=777fa69d esp=026ffec0 ebp=026fff20 iopl=0 nv up ei pl nz na po nc
cs=0023 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00000202
777fa69d c20c00 ret 0Ch
Lets get our call stack and get the first and only arg for entercriticalsection.
ChildEBP RetAddr Args to Child
026ffebc 777ecfad 000000cc 00000000 00000000 ntdll!ZwWaitForSingleObject+0x15 (FPO: [3,0,0])
026fff20 777ecf78 00000000 00000000 76e61d5a ntdll!RtlpWaitOnCriticalSection+0x154 (FPO: [Non-Fpo])
026fff48 0040153c 00403780 00000000 00000000 ntdll!RtlEnterCriticalSection+0x152 (FPO: [Non-Fpo])
026fff64 757c2848 00000000 51b22bb2 00000000 badwindow!hangtype2threada+0x2c (FPO: [Uses EBP] [1,1,0]) (CONV: cdecl) [c:\source\badwindow\badwindow\badwindow.cpp @ 358]
026fff9c 757c28c8 76eb19f1 02274358 026fffec msvcr80!_endthread+0x4b (FPO: [Non-Fpo])
026fffa0 76eb19f1 02274358 026fffec 7782d109 msvcr80!_endthread+0xcb (FPO: [Non-Fpo])
026fffac 7782d109 02274358 026ffb9e 00000000 kernel32!BaseThreadInitThunk+0xe (FPO: [Non-Fpo])
026fffec 00000000 757c286e 02274358 00000000 ntdll!_RtlUserThreadStart+0x23 (FPO: [Non-Fpo])
We can do a couple of things at this point first lets look at the CS. (Critical Section)
0:001> !cs 00403780
Critical section = 0x00403780 (badwindow!csCritSec1+0x0)
DebugInfo = 0x0029bd40
LOCKED < It’s LOCKED.
LockCount = 0x1
WaiterWoken = No
OwningThread = 0x00002048 < This is the owning thread.
RecursionCount = 0x14
LockSemaphore = 0xCC
SpinCount = 0x00000000
Are there any other locked critical sections? !locks will tell us and no this is the only one.
CritSec badwindow!csCritSec1+0 at 00403780
Scanned 156 critical sections
What thread are running in our process and what is 2048 doing?
# 0 Id: 3270.2b10 Suspend: 0 Teb: 7efdd000 Unfrozen
. 1 Id: 3270.2cd0 Suspend: 0 Teb: 7efda000 Unfrozen
Ok here is our problem. Apparently both threads hangtype2threada and hangtype2threadb were using this same critical section however something happened to hangtype2threadb. We need to figure out what happened so let’s go take a look at that function.
Looking back where we unassembled badwindow!hangtype2 we got it’s address, lets verify it with a ln (list near), we are lucky enough to have symbols in this case.
0:001> ln 00401570
(00401570) badwindow!hangtype2threadb | (004015e0) badwindow!hangtype3thread
badwindow!hangtype2threadb (void *)
Looks like we have an exact match. Now lets unassemble it and see what went wrong.
0:001> uf 00401570
badwindow!hangtype2threadb [c:\source\badwindow\badwindow\badwindow.cpp @ 370]:
370 00401570 51 push ecx
370 00401571 53 push ebx
Move the pointer to sprint into EBX
371 00401572 8b1d7c204000 mov ebx,dword ptr [badwindow!_imp__sprintf (0040207c)]
371 00401578 55 push ebp
Move the pointer to outputdebugstring into ebp
371 00401579 8b2d14204000 mov ebp,dword ptr [badwindow!_imp__OutputDebugStringA (00402014)]
371 0040157f 56 push esi
Move the pointer to EnterCriticalSection into ESI
371 00401580 8b351c204000 mov esi,dword ptr [badwindow!_imp__EnterCriticalSection (0040201c)]
371 00401586 57 push edi
Move the pointer for Sleep into EDI
371 00401587 8b3d0c204000 mov edi,dword ptr [badwindow!_imp__Sleep (0040200c)]
Save 14h or 20dec to ESP+10h (A local on the stack) Maybe this is a counter?
371 0040158d c744241014000000 mov dword ptr [esp+10h],14h
Push the address of the critical section csCritSec1 00403780 onto the stack.
374 00401595 6880374000 push offset badwindow!csCritSec1 (00403780)
374 0040159a ffd6 call esi
Push 0xFA, 250Dec on the stack
376 0040159c 68fa000000 push 0FAh
Call Sleep (Wait for 250ms)
376 004015a1 ffd7 call edi
Push pointer to value on the stack. When in doubt dump it out.
0:001> db 004021e4
004021e4 57 65 20 61 72 65 20 69-6e 20 68 61 6e 67 74 79 We are in hangty
004021f4 70 65 32 74 68 72 65 61-64 62 00 00 48 00 00 00 pe2threadb..H...
00402204 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
00402214 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
377 004015a3 68e4214000 push offset badwindow!`string' (004021e4)
Push pointer on the stack what is it?
0:001> db 00403380
00403380 57 65 20 61 72 65 20 69-6e 20 68 61 6e 67 74 79 We are in hangty
00403390 70 65 32 74 68 72 65 61-64 62 00 00 00 00 00 00 pe2threadb......
004033a0 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
004033b0 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
377 004015a8 6880334000 push offset badwindow!szTrace (00403380)
377 004015ad ffd3 call ebx
Clean up the stack
377 004015af 83c408 add esp,8
This is out buffer we just did the sprintf to
378 004015b2 6880334000 push offset badwindow!szTrace (00403380)
378 004015b7 ffd5 call ebp
Push csCritSec2’s address on the stack
380 004015b9 6868374000 push offset badwindow!csCritSec2 (00403768)
Call LeaveCriticalSection for csCritSec2’s
380 004015be ff1524204000 call dword ptr [badwindow!_imp__LeaveCriticalSection (00402024)]
Decrement a counter on the stack this is a local counting down to zero..
382 004015c4 836c241001 sub dword ptr [esp+10h],1
Check counter local counting down to zero if we are not ZERO yet dump to the top of the loop.
382 004015c9 75ca jne badwindow!hangtype2threadb+0x25 (00401595)
Restore all the registers and then return
382 004015cb 5f pop edi
382 004015cc 5e pop esi
382 004015cd 5d pop ebp
382 004015ce 5b pop ebx
387 004015cf 59 pop ecx
387 004015d0 c3 ret
Did you see the BUG?, Look closely, If you need it here is the source.
void __cdecl hangtype2threadb(void *)
sprintf(szTrace, "We are in hangtype2threadb");
We are entering one critical section and leaving another. Then we drop out of the function once we dec our counter to zero and the thread terminates leaving csCritSec1 entered but never left. The fix for this seems rather simple, we just need to leave critsec1 vis leave creatsec2. That should fix it. But it we don’t have the source how can we verify that?
SIMPLE! We just modify the machine code in the debugger! Often if we think we know how to fix something we will edit the code bytes to make the machine code do the right thing and let it run.
Do do this, from the command line in your debuggers directory run windbg.exe C:\source\badwindow\release\badwindow.exe asuming you have your bad window sampl