• Ntdebugging Blog

    Understanding Storage Timeouts and Event 129 Errors

    • 6 Comments

    Greetings fellow debuggers, today I will be blogging about Event ID 129 messages.  These warning events are logged to the system event log with the storage adapter (HBA) driver’s name as the source.  Windows’ STORPORT.SYS driver logs this message when it detects that a request has timed out, the HBA driver’s name is used in the error because it is the miniport associated with storport.

     

    Below is an example 129 event:

     

    Event Type:       Warning

    Event Source:     <HBA_Name>

    Event Category:   None

    Event ID:         129

    Date:             4/9/2009

    Time:             1:15:45 AM

    User:             N/A

    Computer:         <Computer_Name>

    Description:

    Reset to device, \Device\RaidPort1, was issued.

     

    So what does this mean?  Let’s discuss the Windows I/O stack architecture to answer this.

     

    Windows I/O uses a layered architecture where device drivers are on a “device stack.”  In a basic model, the top of the stack is the file system.  Next is the volume manager, followed by the disk driver.  At the bottom of the device stack are the port and miniport drivers.  When an I/O request reaches the file system, it takes the block number of the file and translates it to an offset in the volume. The volume manager then translates the volume offset to a block number on the disk and passes the request to the disk driver.  When the request reaches the disk driver it will create a Command Descriptor Block (CDB) that will be sent to the SCSI device.  The disk driver imbeds the CDB into a structure called the SCSI_REQUEST_BLOCK (SRB).  This SRB is sent to the port driver as part of the I/O request packet (IRP).

     

    The port driver does most of the request processing.  There are different port drivers depending on the architecture.  For example, ATAPORT.SYS is the port driver for ATA devices, and STORPORT.SYS is the port driver for SCSI devices.  Some of the responsibilities for a port driver are:

    • Providing timing services for requests.
    • Enforcing queue depth (making sure that a device does not get more requests that it can handle).
    • Building scatter gather arrays for data buffers.

    The port driver interfaces with a driver called the “miniport”.  The miniport driver is designed by the hardware manufacturer to work with a specific adapter and is responsible for taking requests from the port driver and sending them to the target LUN.  The port driver calls the HwStorStartIo() function to send requests to the miniport, and the miniport will send the requests to the HBA so they can be sent over the physical medium (fibre, ethernet, etc) to the LUN.  When the request is complete, the miniport will call StorPortNotification() with the NotificationType parameter value set to RequestComplete, along with a pointer to the completed SRB.

     

    When a request is sent to the miniport, STORPORT will put the request in a pending queue.  When the request is completed, it is removed from this queue.  While requests are in the pending queue they are timed. 

     

    The timing mechanism is simple.  There is one timer per logical unit and it is initialized to -1.  When the first request is sent to the miniport the timer is set to the timeout value in the SRB.  The disk timeout value is a tunable parameter in the registry at: HKLM\System\CurrentControlSet\Services\Disk\TimeOutValue.  Some vendors will tune this value to best match their hardware, we do not advise changing this value without guidance from your storage vendor.

     

    The timer is decremented once per second.  When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue.  So, as long as requests complete the timer will never go to zero.  If the timer does go to zero, it means the device has stopped responding.  That is when the STORPORT driver logs the Event ID 129 error.  STORPORT then has to take corrective action by trying to reset the unit.  When the unit is reset, all outstanding requests are completed with an error and they are retried.  When the pending queue empties, the timer is set to -1 which is its initial value.

    image002

    Each SRB has a timer value set.  As requests are completed the queue timer is refreshed with the timeout value of the SRB at the head of the list.

     

    The most common causes of the Event ID 129 errors are unresponsive LUNs or a dropped request.  Dropped requests can be caused by faulty routers or other hardware problems on the SAN.

     

    I have never seen software cause an Event ID 129 error.  If you are seeing Event ID 129 errors in your event logs, then you should start investigating the storage and fibre network.

  • Ntdebugging Blog

    The Mystery of Lsass.exe Memory Consumption, (When all components get involved)

    • 4 Comments

    Hi All, this is Karim Elsaid and I’m a Support Escalation Engineer working with the Dubai platforms support team.  Recently I was working on a very challenging and interesting case, and I wanted to share that experience with you.

     

    One of our customers was experiencing a problem on all his Domain Controllers (Running x86 Windows Server 2003), where intermittently the DCs ran out of resources.  When one of the servers ran out of resources it had to be restarted to restore its functionality.  The following event was logged on the DC when the problem occurred:

     

    Event Type: Error
    Event Source: NTDS General
    Event Category: Internal Processing
    Event ID: 1169
    Date:
    Time:
    User: N/A
    Computer:
    Description:
    Active Directory could not allocate the needed amount of memory.

    Memory (bytes):
    4568

    Active Directory will continue to operate, but may not function correctly.

    User Action
    Restart this domain controller. If this condition continues, increase the available
    physical or virtual memory.

    Additional Data
    Internal ID:
    30205bd
    For more information, see Help and Support Center at
    http://go.microsoft.com/fwlink/events.asp.

     

    This event gets logged when the Local Security Authority Subsystem (lsass.exe) process is unable to allocate memory from the heap.

     

    When this problem occurs, users are not able to authenticate and Exchange servers cease functioning until the DC gets rebooted.

     

    By the time we got the call, the DC that was exhibiting the issue had been rebooted, hence when we checked the lsass.exe memory usage it was very low, but when we checked another DC on the same site, the lsass.exe memory usage was around 1.7GB.  Well, that’s pretty high but it is worth mentioning that by default lsass.exe may utilize a large amount of virtual memory for the ESENT Cache (Active Directory database), nevertheless this should not cause any issues on DCs as it should back-off when needed.

     

    The issue here is that the memory usage for lsass.exe kept growing and never got released back again - this is called a process memory leak.

     

    To confirm our theory we obtained a Perfmon Log from the DC and checked how Lsass.exe is doing with memory, which proved the theory that lsass.exe is actually leaking memory.

    Performance Monitor

     

    We can see from the graph above a steady increase in lsass.exe private bytes.  The DC was booted without /3GB and hence the limit of the user mode process address space is 2GB.

     

    For more information on this topic, please check the previous post “The Memory Shell Game

     

    Now, we need to explore the options on how to track and address such leaks. We could obtain an lsass.exe process dump and check the heap, or obtain a Full memory dump of the server while in the failed state, to check the lsass.exe heap usage and overall system state, but the thing is none of the above may immediately reveal what’s going on.  We would need to be very lucky to get the dump while the actual memory call that happens inside lsass.exe process is executing, which is very slim.

     

    What we are looking for is a “pattern”, the pattern which is causing the actual leak inside the lsass.exe process.

     

    For that we will turn to a tool that I see many of the customers are not aware of, but in fact it is very handy and easy to troubleshoot such a memory leak problem. We will use the “User Mode Dump Heap” or UMDH. By using this tool it will do the job for us, it will monitor the all heap calls that happen in the lsass.exe process.  All we need to do is to download the “Debugging Tools For Windows”, enable user mode stack acquisition from the command line, download the symbols from Microsoft public symbol servers, reboot the server and we are good to go.

     

    You can find information on UMDH; how to set it up and use it “Here

     

    So now, all heap blocks that the lsass.exe process allocates will get monitored by UMDH, and the tool will tell us information about that specific call stack.  For example, the number of allocations using this stack, number of bytes consumed, etc…

     

    Back to our main problem: We have setup the DC for UMDH and started monitoring the lsass.exe process.  We verified UMDH is working after the reboot by generating a sample log -  it is very critical to generate this simple log to make sure that you are able to resolve the symbols correctly.

     

    We kept generating UMDH logs daily and then the interesting pattern started to appear, which shows:

     

    + 123097760 ( 123098480 -    720) 1538731 allocs      BackTrace4BA0

    + 1538722 ( 1538731 -      9) BackTrace4BA0     allocations

     

          ntdll!RtlAllocateHeapSlowly+00000041

          ntdll!RtlAllocateHeap+00000E9F

          msvcrt!malloc+0000006C

          NTDSA!ABBind_local+0000031C

          NTDSA!NspiBind+0000013D

          RPCRT4!Invoke+00000030

          RPCRT4!NdrStubCall2+00000299

          RPCRT4!NdrServerCall2+00000019

          RPCRT4!DispatchToStubInCNoAvrf+00000038

          RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0000011F

          RPCRT4!RPC_INTERFACE::DispatchToStub+000000A3

          RPCRT4!OSF_SCALL::DispatchHelper+00000149

          RPCRT4!OSF_SCALL::DispatchRPCCall+0000010D

          RPCRT4!OSF_SCALL::ProcessReceivedPDU+0000057F

          RPCRT4!OSF_SCALL::BeginRpcCall+00000194

          RPCRT4!OSF_SCONNECTION::ProcessReceiveComplete+00000435

          RPCRT4!ProcessConnectionServerReceivedEvent+00000021

          RPCRT4!LOADABLE_TRANSPORT::ProcessIOEvents+000001B8

          RPCRT4!ProcessIOEventsWrapper+0000000D

          RPCRT4!BaseCachedThreadRoutine+0000009D

          RPCRT4!ThreadStartRoutine+0000001B

          kernel32!BaseThreadStart+00000034

     

    From the backtrace 4BA0 we can see that there are ~1.5 heap allocation calls that have occurred without being freed from the first day we started to monitor lsass using UMDH.

     

    There is also another call stack with exactly the same symptoms:

     

    + 51525199 ( 51525506 -    307) 1538802 allocs  BackTrace4B92

    + 1538793 ( 1538802 -      9) BackTrace4B92     allocations

     

          ntdll!RtlAllocateHeapSlowly+00000041

          ntdll!RtlAllocateHeap+00000E9F

          LSASRV!LsapAllocatePrivateHeapNoZero+0000004F

          RPCRT4!operator new+0000000D

          RPCRT4!W2AAttachHelper+00000032

          RPCRT4!CNlDelUnicodeAnsi::Convert+00000014

          NTDSA!ABBind_local+0000005D

          NTDSA!NspiBind+0000013D

          RPCRT4!Invoke+00000030

          RPCRT4!NdrStubCall2+00000299

          RPCRT4!NdrServerCall2+00000019

          RPCRT4!DispatchToStubInCNoAvrf+00000038

          RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0000011F

          RPCRT4!RPC_INTERFACE::DispatchToStub+000000A3

          RPCRT4!OSF_SCALL::DispatchHelper+00000149

          RPCRT4!OSF_SCALL::DispatchRPCCall+0000010D

          RPCRT4!OSF_SCALL::ProcessReceivedPDU+0000057F

          RPCRT4!OSF_SCALL::BeginRpcCall+00000194

          RPCRT4!OSF_SCONNECTION::ProcessReceiveComplete+00000435

          RPCRT4!ProcessConnectionServerReceivedEvent+00000021

          RPCRT4!LOADABLE_TRANSPORT::ProcessIOEvents+000001B8

          RPCRT4!ProcessIOEventsWrapper+0000000D

          RPCRT4!BaseCachedThreadRoutine+0000009D

          RPCRT4!ThreadStartRoutine+0000001B

          kernel32!BaseThreadStart+00000034

     

    From both call stacks above we can see that the originating function that leads to the heap allocation is “NTDSA!NspiBind

     

    NSPI is the interface which the Exchange Address Book provider uses to talk to the Global Catalogs. Typically when Outlook clients want to talk to the directory for address book, they talk to the Exchange server directly and then the Exchange server will look up the appropriate GC, forward the request to it and finally send the result back to the originating clients.  This method is called the DSproxy method.  Outlook clients can also send referral requests (RFR) to the Exchange server to get a referral to a GC and then start talking to this GC directly.  You can find more information about this operation “Here

     

     

    With this new information in hand, we turned our attention to the Exchange server for a while. We did so by obtaining a netmon trace from the Exchange server, and started examining it to reveal a massive number of RFR requests originating from outlook client machines toward the Exchange server:

    3795  11:50:59.1426170  10.3876170        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:535, TCP:488, IPv4:118}

    3796  11:50:59.1426170  10.3876170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:502, TCP:496, IPv4:326}

    3797  11:50:59.1426170  10.3876170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:509, TCP:508, IPv4:326}

    3798  11:50:59.1426170  10.3876170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:494, TCP:484, IPv4:326}

    9807  11:51:13.1894920  24.4344920        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1144, TCP:1142, IPv4:326}

    10084 11:51:13.3769920  24.6219920        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:1145, TCP:1143, IPv4:118}

    10085 11:51:13.3769920  24.6219920        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:1145, TCP:1143, IPv4:118}

    10086 11:51:13.3926170  24.6376170        EXCHSERVER 10.208.164.176    NSPI      NSPI:NspiBind Response, ReturnValue=0x0   {MSRPC:1144, TCP:1142, IPv4:326}

    10103 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1182, TCP:1180, IPv4:326}

    10108 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1151, TCP:1149, IPv4:326}

    10111 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1200, TCP:1196, IPv4:326}

    10115 11:51:13.3926170  24.6376170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1164, TCP:1163, IPv4:326}

    10148 11:51:13.4863670  24.7313670        10.34.51.69 EXCHSERVER NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:447, TCP:446, IPv4:367}

    10149 11:51:13.4863670  24.7313670        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:473, TCP:448, IPv4:69}

    10172 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1199, TCP:1197, IPv4:326}

    10174 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1187, TCP:1186, IPv4:326}

    10180 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1170, TCP:1168, IPv4:326}

    10183 11:51:13.4863670  24.7313670        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1153, TCP:1152, IPv4:326}

    10185 11:51:13.4863670  24.7313670        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:473, TCP:448, IPv4:69}

    10186 11:51:13.4863670  24.7313670        EXCHSERVER 10.34.51.69 NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:447, TCP:446, IPv4:367}

    10224 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1209, TCP:1208, IPv4:326}

    10227 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1192, TCP:1190, IPv4:326}

    10232 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1159, TCP:1157, IPv4:326}

    10235 11:51:13.5176170  24.7626170        10.208.164.176    EXCHSERVER NSPI      NSPI:NspiBind Request, Flags=0x0    {MSRPC:1175, TCP:1173, IPv4:326}

    10240 11:51:13.5176170  24.7626170        10.34.51.69 EXCHSERVER NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:447, TCP:446, IPv4:367}

    10241 11:51:13.5176170  24.7626170        EXCHSERVER GC_SERVER   NSPI  NSPI:NspiBind Request, Flags=0x0      {MSRPC:473, TCP:448, IPv4:69}

    10251 11:51:13.5332420  24.7782420        GC_SERVER   EXCHSERVER NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:473, TCP:448, IPv4:69}

    10255 11:51:13.5332420  24.7782420        EXCHSERVER 10.34.51.69 NSPI  NSPI:NspiBind Response, ReturnValue=0x0     {MSRPC:447, TCP:446, IPv4:367}

    Snipp…..

     

    In the netmon capture we saw an excessive number of NspiBind requests been sent from the Outlook client machines to the Exchange server and Exchange server in turn sending these NspiBind requests to the GC, in the netmon capture we didn’t find any frames for NspiUnBind requests, which should be used to destroy that connection.

     

    When we checked the Exchange server security logs we found a massive number of logon events that are occurring for every single client:

    Event Type: Success Audit

    Event Source:     Security

    Event Category:   (2)

    Event ID:   540

    Date:       5/26/2010

    Time:       3:49:09 PM

    User:      

    Computer:   EXCHSERVER

    Description:

    Successful Network Logon:

          User Name:  User_Name

          Domain:           CONTOSO.COM

          Logon ID:         (0x0,0x3DA1F42)

          Logon Type: 3

          Logon Process:    Kerberos

          Authentication Package: Kerberos

          Workstation Name:

          Logon GUID: {b03f9ad2-3150-e1be-f50a-b3abb9d2c09d}

          Caller User Name: -

          Caller Domain:    -

          Caller Logon ID:  -

          Caller Process ID: -

          Transited Services: -

          Source Network Address: 10.11.12.13

          Source Port:      0

     

    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

     

    So, now we know the problem of lsass.exe memory consumption on the DCs is actually occurring because of a massive number of NspiBind requests that are being sent to the DCs without proper NspiUnbind requests.

     

    So we turned our attention to the client machines that have Outlook installed and started to list all 3rd party add-ins that are installed on them.

     

    We found that all affected client machines are actually sharing 3rd party add-ins for email content archival. Further investigation on this software revealed a strange MAPI Wrapper DLL library, and our customer confirmed that the start time of the problem approximately matches the time they have rolled out the content archival software on the client machines.

     

    We uninstalled this content archival add-in from a couple of machines and these machines stopped hammering the Exchange Server and GC with NspiBind requests.

     

    Later we uninstalled it from the rest of the client machines and the lsass.exe process on the DCs has finally stabilized and we no longer have any issues with its memory usage.

     

    I’d like to mention that starting with Windows Server 2008 we have implemented a “NSPI Max Sessions Per User” where you can limit the number of NSPI requests from a client to the server.

     

    You can read more information about this new functionality “Here

     

    That’s it! I hope you enjoyed reading and hopefully this should help you to troubleshoot any memory leaks you may encounter.  The information contained here can be applied to other processes when you suspect heap leaks.

     

    See you again on another adventure.

     

    Best Regards,

    Karim Elsaid

  • Ntdebugging Blog

    Control Panel Policy and the Missing Icon

    • 1 Comments

    A customer recently reported a rather peculiar problem.  They were working on a roll out of Windows 7, and one of the policies they employed on their domain was the “Show only specified control panel applets” setting.  As its name implies, this policy allows an administrator to specify the names of any Control Panel icons they want their users to see, and all others are not shown.  This company chose to allow a few of the Control Panel items included with Windows as well as the Mail icon that is added when Microsoft Office (specifically Outlook) is installed, which allows users to configure their Outlook profiles.

     

    The problem the customer noticed was that, with the policy was in place, the first time a user opened the Control Panel on a computer, all of the allowed icons appeared except for Mail.  If the user closed Control Panel and reopened it, the icon suddenly appeared.  In fact, from that point on, the Mail icon would appear for that user on that computer.  However, if that user logged on to a different computer, or a different user logged in to the first user’s computer, the icon was missing until Control Panel was opened for a second time.

     

    Figure 1: Policy setting on our test system

    Figure 1: Policy setting on our test system

     

    Figure 2: Result from first open

    Figure 2: Result from first open

     

    Figure 3: Result on subsequent opens (desired)

    Figure 3: Result on subsequent opens (desired)

     

    Control Panel isn’t something we get many cases on, so I had to figure out how it worked, and what would cause it to fail only the first time.  One of the first things I discovered as I debugged the issue was that the Control Panel is actually loaded in two passes – a “fast” pass and a “slow” pass.  This is done so icons we know will be loaded quickly can be displayed first so a user doesn’t get stuck with an empty window waiting for some of the slower icons to load.  What makes an icon fast or slow?  It turns out shell maintains a cache of Control Panel items, so it can display them quickly without having to query the actual item for its name, description, icon, etc.  If we have to ask the item for its information, that will take longer, so it is loaded during the second, slow phase:

     

    shell32!CControlPanelDataWorkItem::_LoadSlowData

    shell32!CControlPanelDataWorkItem::DoWork

    shell32!CFrameTask::InternalResumeRT

    shell32!CRunnableTask::Run

    shell32!CShellTask::TT_Run

    shell32!CShellTaskThread::ThreadProc

    shell32!CShellTaskThread::s_ThreadProc

    shlwapi!ExecuteWorkItemThreadProc

    ntdll!RtlpTpWorkCallback

    ntdll!TppWorkerThread

    kernel32!BaseThreadInitThunk

    ntdll!__RtlUserThreadStart

    ntdll!_RtlUserThreadStart

     

    Luckily for these icons, they won’t stay relegated to slow loading status forever.  During the slow loading phase, one of the things we do is add the item’s information to the Control Panel cache, so we can load it during the fast phase in the future.

     

    shell32!CPLD_AddControlToReg

    shell32!CControlPanelEnum::_NextNonCachedCpl

    shell32!CControlPanelEnum::Next

    shell32!CRegFolderEnum::Next

    shell32!CControlPanelAppletList::_AddAppletsToCategories

    shell32!CControlPanelAppletList::LoadSlowApplets

    shell32!CControlPanelDataWorkItem::_LoadSlowData

    shell32!CControlPanelDataWorkItem::DoWork

    shell32!CFrameTask::InternalResumeRT

    shell32!CRunnableTask::Run

    shell32!CShellTask::TT_Run

    shell32!CShellTaskThread::ThreadProc

    shell32!CShellTaskThread::s_ThreadProc

    shlwapi!ExecuteWorkItemThreadProc

    ntdll!RtlpTpWorkCallback

    ntdll!TppWorkerThread

    kernel32!BaseThreadInitThunk

    ntdll!__RtlUserThreadStart

    ntdll!_RtlUserThreadStart

     

    This explains why the item was being loaded on subsequent opens of Control Panel – the item was added to the user’s cache on the first run, so the next time, it was loaded during the fast pass:

    shell32!CControlPanelEnum::_NextCachedCpl

    shell32!CControlPanelEnum::Next

    shell32!CRegFolderEnum::Next
    shell32!CControlPanelAppletList::_AddAppletsToCategories

    shell32!CControlPanelAppletList::EnsureLoaded

    shell32!CControlPanelDataWorkItem::_LoadFastData

    shell32!CControlPanelDataWorkItem::DoWork

    shell32!CFrameTask::InternalResumeRT

    shell32!CRunnableTask::Run

    shell32!CShellTask::TT_Run

    shell32!CShellTaskThread::ThreadProc

    shell32!CShellTaskThread::s_ThreadProc

    shlwapi!ExecuteWorkItemThreadProc

    ntdll!RtlpTpWorkCallback

    ntdll!TppWorkerThread

    kernel32!BaseThreadInitThunk

    ntdll!__RtlUserThreadStart

    ntdll!_RtlUserThreadStart

     

    But that only answers half the question – while we now know why the icon appears every other time, we still don’t know why it doesn’t show up the first time.  To figure this out, I debugged a system with the “Show only specified control panel applets” policy applied and one without the policy set, both as they opened Control Panel for the first time, and compared where the return codes from the functions diverged. 

     

    Starting with CControlPanelEnum::_NextNonCachedCpl, in the non-working case, we can see that it returns 0:

    0:003> bp shell32!CControlPanelEnum::_NextNonCachedCpl

    0:003> g

    Breakpoint 0 hit

    eax=14c5fa8c ebx=14c5fa7c ecx=11224630 edx=00000000 esi=11224630 edi=14c5fa8c

    eip=76977e11 esp=14c5f9b4 ebp=14c5f9c8 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    shell32!CControlPanelEnum::_NextNonCachedCpl:

    76977e11 8bff            mov     edi,edi

    0:020> k

    ChildEBP RetAddr 

    14c5f9b0 76977dff shell32!CControlPanelEnum::_NextNonCachedCpl

    14c5f9c8 768a854d shell32!CControlPanelEnum::Next+0x5a

    14c5fa18 769710d2 shell32!CRegFolderEnum::Next+0x22e

    14c5fa94 769783e7 shell32!CControlPanelAppletList::_AddAppletsToCategories+0x35

    14c5fac4 76978353 shell32!CControlPanelAppletList::LoadSlowApplets+0x68

    14c5faec 76974db4 shell32!CControlPanelDataWorkItem::_LoadSlowData+0x2e

    14c5fb00 769e06ad shell32!CControlPanelDataWorkItem::DoWork+0x4b

    14c5fb18 768d83a6 shell32!CFrameTask::InternalResumeRT+0x14

    14c5fb38 7691639b shell32!CRunnableTask::Run+0xce

    14c5fb54 76918c1f shell32!CShellTask::TT_Run+0x167

    14c5fb9c 76918d53 shell32!CShellTaskThread::ThreadProc+0xa3

    14c5fba4 760cb2b1 shell32!CShellTaskThread::s_ThreadProc+0x1b

    14c5fbb4 77c8d877 shlwapi!ExecuteWorkItemThreadProc+0xe

    14c5fc28 77c90842 ntdll!RtlpTpWorkCallback+0x11d

    14c5fd80 00000000 ntdll!TppWorkerThread+0x572

    0:020> gu  <-- This steps out to the instruction after the call to CControlPanelEnum::_NextNonCachedCpl in CControlPanelEnum::Next.

    ModLoad: 35c70000 35c8b000   c:\progra~1\common~1\system\msmapi\1033\mlcfg32.cpl

    eax=00000000 ebx=14c5fa7c ecx=bea347f1 edx=006d2904 esi=11224630 edi=14c5fa8c

    eip=76977dff esp=14c5f9bc ebp=14c5f9c8 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    shell32!CControlPanelEnum::Next+0x5a:

    76977dff 84c0            test    al,al

     

    But on the machine without the policy set, it returns 1:

    eax=0c83fc34 ebx=0c83fc24 ecx=04ab1060 edx=00000000 esi=04ab1060 edi=0c83fc34

    eip=76977e11 esp=0c83fb5c ebp=0c83fb70 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    shell32!CControlPanelEnum::_NextNonCachedCpl:

    76977e11 8bff            mov     edi,edi

    0:012> k

    ChildEBP RetAddr 

    0c83fb58 76977dff shell32!CControlPanelEnum::_NextNonCachedCpl

    0c83fb70 768a854d shell32!CControlPanelEnum::Next+0x5a

    0c83fbc0 769710d2 shell32!CRegFolderEnum::Next+0x22e

    0c83fc3c 769783e7 shell32!CControlPanelAppletList::_AddAppletsToCategories+0x35

    0c83fc6c 76978353 shell32!CControlPanelAppletList::LoadSlowApplets+0x68

    0c83fc94 76974db4 shell32!CControlPanelDataWorkItem::_LoadSlowData+0x2e

    0c83fca8 769e06ad shell32!CControlPanelDataWorkItem::DoWork+0x4b

    0c83fcc0 768d83a6 shell32!CFrameTask::InternalResumeRT+0x14

    0c83fce0 7691639b shell32!CRunnableTask::Run+0xce

    0c83fcfc 76918c1f shell32!CShellTask::TT_Run+0x167

    0c83fd44 76918d53 shell32!CShellTaskThread::ThreadProc+0xa3

    0c83fd4c 760cb2b1 shell32!CShellTaskThread::s_ThreadProc+0x1b

    0c83fd5c 77c8d877 shlwapi!ExecuteWorkItemThreadProc+0xe

    0c83fdd0 77c90842 ntdll!RtlpTpWorkCallback+0x11d

    0c83ff30 778b3c45 ntdll!TppWorkerThread+0x572

    0c83ff3c 77cc37f5 kernel32!BaseThreadInitThunk+0xe

    0c83ff7c 77cc37c8 ntdll!__RtlUserThreadStart+0x70

    0c83ff94 00000000 ntdll!_RtlUserThreadStart+0x1b

    0:012> gu

    ModLoad: 35c70000 35c8b000   c:\progra~1\common~1\system\msmapi\1033\mlcfg32.cpl

    eax=00000001 ebx=0c83fc24 ecx=bea347f1 edx=00002bba esi=04ab1060 edi=0c83fc34

    eip=76977dff esp=0c83fb64 ebp=0c83fb70 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    shell32!CControlPanelEnum::Next+0x5a:

    76977dff 84c0            test    al,al

     

    So we know something in this function is causing us to return FALSE in the failing case and TRUE in the working case.  An easy way to see where the return result of function calls is the wt (watch trace) debugger command with the -oR switch to see the return values.  I set the level (-l) to 1 so I’d only see functions directly called from CControlPanelEnum::_NextNonCachedCpl.  Here were the results.

     

    Failing case:

     

    eax=14c5fa8c ebx=14c5fa7c ecx=11224630 edx=00000000 esi=11224630 edi=14c5fa8c

    eip=76977e11 esp=14c5f9b4 ebp=14c5f9c8 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    shell32!CControlPanelEnum::_NextNonCachedCpl:

    76977e11 8bff            mov     edi,edi

    0:020> wt -oR -l 1

    Tracing shell32!CControlPanelEnum::_NextNonCachedCpl to return address 76977dff

       31     0 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       10     0 [  1]   shell32!DSA_GetItemPtr

       17     0 [  1]   comctl32!DSA_GetItemPtr eax = 1123aa78

       48    27 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

    ModLoad: 35c70000 35c8b000   c:\progra~1\common~1\system\msmapi\1033\mlcfg32.cpl

       34     0 [  1]   shell32!CPLD_InitModule eax = 1

       55    61 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       15     0 [  1]   shell32!CDPA_Base<RESULT_FOLDER_ITEM,CTContainer_PolicyUnOwned<RESULT_FOLDER_ITEM> >::AppendPtr eax = 0

       59    76 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       14     0 [  1]   shell32!CPL_AddModuleReference eax = 0

       61    90 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       21     0 [  1]   shell32!CPL_ReleaseModuleReference eax = 0

       76   111 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

      128     0 [  1]   shell32!CPLD_AddControlToReg eax = 1

       78   239 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       12     0 [  1]   shell32!FindCPLModuleInList eax = c666998

       84   251 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       10     0 [  1]   shell32!DSA_GetItemPtr

       17     0 [  1]   comctl32!DSA_GetItemPtr eax = c5e7410

       91   278 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       90     0 [  1]   shell32!CControlPanelEnum::_CanEnumerateApplet eax = 0

       97   368 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       21     0 [  1]   shell32!CPL_ReleaseModuleReference eax = 0

      108   389 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

      143     0 [  1]   shell32!CPLD_FlushRegModules eax = 0

      116   532 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

        3     0 [  1]   shell32!__security_check_cookie eax = 0

      118   535 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

     

    Working case:

     

    eax=0c83fc34 ebx=0c83fc24 ecx=04ab1060 edx=00000000 esi=04ab1060 edi=0c83fc34

    eip=76977e11 esp=0c83fb5c ebp=0c83fb70 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    shell32!CControlPanelEnum::_NextNonCachedCpl:

    76977e11 8bff            mov     edi,edi

    0:012> wt -oR -l 1

    Tracing shell32!CControlPanelEnum::_NextNonCachedCpl to return address 76977dff

       31     0 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       10     0 [  1]   shell32!DSA_GetItemPtr

       17     0 [  1]   comctl32!DSA_GetItemPtr eax = 4a732c8

       48    27 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

    ModLoad: 35c70000 35c8b000   c:\progra~1\common~1\system\msmapi\1033\mlcfg32.cpl

       34     0 [  1]   shell32!CPLD_InitModule eax = 1

       55    61 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       15     0 [  1]   shell32!CDPA_Base<RESULT_FOLDER_ITEM,CTContainer_PolicyUnOwned<RESULT_FOLDER_ITEM> >::AppendPtr eax = 0

       59    76 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       14     0 [  1]   shell32!CPL_AddModuleReference eax = 0

       61    90 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       21     0 [  1]   shell32!CPL_ReleaseModuleReference eax = 0

       76   111 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

      128     0 [  1]   shell32!CPLD_AddControlToReg eax = 1

       78   239 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       12     0 [  1]   shell32!FindCPLModuleInList eax = 2ff3968

       84   251 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       10     0 [  1]   shell32!DSA_GetItemPtr

       17     0 [  1]   comctl32!DSA_GetItemPtr eax = 4aa81a8

       91   278 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

      111     0 [  1]   shell32!CControlPanelEnum::_CanEnumerateApplet eax = 1

      104   389 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

       21     0 [  1]   shell32!CPL_ReleaseModuleReference eax = 0

      112   410 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

        3     0 [  1]   shell32!__security_check_cookie eax = 1

      114   413 [  0] shell32!CControlPanelEnum::_NextNonCachedCpl

     

    Here, we can see that the function CControlPanelEnum::_CanEnumerateApplet returns true in the working case, but false in the failing case.  This is what is causing the return value from _NextNonCachedCpl to differ, which is ultimately what causes the icon to load/not load.  So let’s take a look at _CanEnumerateApplet.  Using wt inside this function showed something rather interesting:

     

    Failing:

     

    eax=0c5e7410 ebx=769298fd ecx=11224630 edx=00000000 esi=11224630 edi=0c666998

    eip=7693a72f esp=14c5f980 ebp=14c5f9b0 iopl=0         nv up ei pl nz na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202

    shell32!CControlPanelEnum::_CanEnumerateApplet:

    7693a72f 8bff            mov     edi,edi

    0:020> wt -oR -l 1

    Tracing shell32!CControlPanelEnum::_CanEnumerateApplet to return address 7693a68e

       27     0 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

      737     0 [  1]   shlwapi!PathFindFileNameW eax = c6669f8

       40   737 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

       59     0 [  1]   shell32!IDControlCreate eax = 0

       44   796 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

       43     0 [  1]   shlwapi!SHWindowsPolicy eax = 1

       53   839 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

        2     0 [  1]   shell32!CRegFolder::QueryInterface

        5     0 [  1]   shell32!CKnownFoldersFolder::QueryInterface

       12     0 [  1]   shell32!CAggregatedUnknown::QueryInterface eax = 0

       71   858 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

       48     0 [  1]   shell32!CRegFolder::GetDetailsEx eax = ffffffff`80070002

       77   906 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

       24     0 [  1]   shell32!CRegFolder::Release eax = 3

       79   930 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

        6     0 [  1]   shell32!ATL::CComVariant::Clear eax = 0

       88   936 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

        3     0 [  1]   shell32!__security_check_cookie eax = 0

       90   939 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

     

    Working:

     

    0:012> wt -oR -l 1

    Tracing shell32!CControlPanelEnum::_CanEnumerateApplet to return address 7693a68e

       27     0 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

      737     0 [  1]   shlwapi!PathFindFileNameW eax = 2ff39c8

       40   737 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

       59     0 [  1]   shell32!IDControlCreate eax = 0

       44   796 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

       43     0 [  1]   shlwapi!SHWindowsPolicy eax = 1

       53   839 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

        2     0 [  1]   shell32!CRegFolder::QueryInterface

        5     0 [  1]   shell32!CKnownFoldersFolder::QueryInterface

       12     0 [  1]   shell32!CAggregatedUnknown::QueryInterface eax = 0

       71   858 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

       48     0 [  1]   shell32!CRegFolder::GetDetailsEx eax = 0

       77   906 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

       28     0 [  1]   shell32!CPL_DoesPolicyAllow eax = 1

       82   934 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

       24     0 [  1]   shell32!CRegFolder::Release eax = 3

       84   958 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

        6     0 [  1]   shell32!ATL::CComVariant::Clear eax = 0

       98   964 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

      358     0 [  1]   shell32!CControlPanelFolder::GetCustomAttributes eax = 0

      109  1322 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

        3     0 [  1]   shell32!__security_check_cookie eax = 1

      111  1325 [  0] shell32!CControlPanelEnum::_CanEnumerateApplet

     

    Looking at the two, we can see that they both call GetDetailsEx, but in the failing case we seem to get an error result: 80070002, and in the working case we get 0, and then proceed to call CPL_DoesPolicyAllow.  CPL_DoesPolicyAllow is not called on the machine with the policy applied.  A quick code review of CPL_DoesPolicyAllow showed that this function checks if a policy prevents or allows an item from being displayed.  So if we can just get to that function, the icon will load.  Now we need to know why GetDetailsEx fails, preventing us from calling CPL_DoesPolicyAllow.   If we look up the error code, it is pretty generic:

     

    0:020> !error 80070002

    Error code: (HRESULT) 0x80070002 (2147942402) - The system cannot find the file specified.

     

    Next I stepped through and into GetDetailsEx.  I’ll save you the tedious steps and output, since the procedure is similar to what I’ve already shown above.  Anyway, I was able to trace the error 2 as being returned by a call to the registry a few functions deep into GetDetailsEx:

     

    kernel32!RegOpenKeyExW

    shell32!CControlPanelFolder::_GetExtPropRegKey

    shell32!CControlPanelFolder::_InitExtPropRegValNameCache

    shell32!CControlPanelFolder::_PropertyFromPidl

    shell32!CControlPanelFolder::GetDetailsEx

    shell32!CRegFolder::GetDetailsEx

    shell32!CControlPanelEnum::_CanEnumerateApplet

    shell32!CControlPanelEnum::_NextNonCachedCpl

     

    This is good news, because it means we can find out what we’re looking for that we aren’t finding.  First I stepped out to get back to CControlPanelFolder::_InitExtPropRegValNameCache.  Sure enough, eax was 2, which is our ‘cannot find the file’ error.

     

    0:020> gu

    eax=00000002 ebx=00000000 ecx=778ac2da edx=00000002 esi=14c5eff0 edi=00000000

    eip=76975ead esp=14c5ed40 ebp=14c5efc4 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    shell32!CControlPanelFolder::_GetExtPropRegKey+0xb1:

    76975ead 50              push    eax

    0:020> !error 2

    Error code: (Win32) 0x2 (2) - The system cannot find the file specified

     

    Since we know that RegOpenKeyEx takes the subkey it is looking for as the second parameter, let’s look back at the assembly before the call to see what we passed in:

     

    0:020> ub @eip

    shell32!CControlPanelFolder::_GetExtPropRegKey+0x98:

    76975e94 7c1f            jl      shell32!CControlPanelFolder::_GetExtPropRegKey+0xb9 (76975eb5)

    76975e96 56              push    esi

    76975e97 6a01            push    1

    76975e99 53              push    ebx

    76975e9a 8d8590fdffff    lea     eax,[ebp-270h]

    76975ea0 50              push    eax

    76975ea1 ffb588fdffff    push    dword ptr [ebp-278h]

    76975ea7 ff15a0128776    call    dword ptr [shell32!_imp__RegOpenKeyExW (768712a0)]

     

    Remember that function parameters are passed in in reverse order, so the second parameter is the second to last value passed to the function.  In this case, we can see that it was stored in eax.  Of course, eax has been overwritten by our return value of 2, but we can see that just before pushing eax, we get the value from ebp-0x270.  Dumping that out as a Unicode string, we get the key we were looking for:

     

    0:020> du @ebp-270h

    14c5ed54  "Software\Microsoft\Windows\Curre"

    14c5ed94  "ntVersion\Control Panel\Extended"

    14c5edd4  " Properties\System.ApplicationNa"

    14c5ee14  "me"

     

    The first parameter is the root key, which was:

     

    0:020> dc @ebp-278h L1

    14c5ed4c  80000002 

     

    We can look up this value in the Windows SDK header files and see that it is for HKLM:

    #define HKEY_LOCAL_MACHINE                  (( HKEY ) (ULONG_PTR)((LONG)0x80000002) )

     

    Aha!  So now we know we were trying to access “HKLM\ Software\Microsoft\Windows\CurrentVersion\Control Panel\Extended

     Properties\System.ApplicationName” which didn’t exist, so we were unable to proceed with loading the icon. 

     

    What’s that key for, you ask?  MSDN tells us that it is used to store the Canonical Name of the Control Panel item.  Canonical Names were added to the Control Panel in Windows Vista as a better way to organize and name the items in the Control Panel.  By convention, the canonical name of “Mail” is “Microsoft.Office.Outlook.Profiles.”  Because the “Show only specified control panel applets” policy can now accept either the legacy name (“Mail”) or the canonical name, we are checking the registry for this string.  Since the key is not present, GetDetailsEx cannot determine the canonical name, and the icon does not load.

     

    As a test, I created the key HKLM\ Software\Microsoft\Windows\CurrentVersion\Control Panel\Extended

     Properties\System.ApplicationName and added a REG_SZ to it with a name of the file location of the Mail CPL file ("C:\Program Files\Common Files\SYSTEM\MSMAPI\1033\MLCFG32.CPL") and a value of "Microsoft.Office.Outlook.Profiles".  Sure enough, the icon loads up on the first try for new users.

     

    While new Control Panel items should implement a canonical name, that doesn’t work for existing CPLs, like Mail from Outlook 2003.  Since the canonical name isn’t actually required (CPL_DoesPolicyAllow works with both name formats), we’re following up with the development team to allow CPL_DoesPolicyAllow to be called even if a canonical name is not found.

     

    -Matt Burrough

  • Ntdebugging Blog

    Windows 7 XP Mode – Auto Publish ‘Not Available’

    • 3 Comments

    Hello All, Ron Riddle here again to share another mystery relating to the XP Mode Auto Publish feature for Windows 7 that had an unexpected root cause.  As you may know, the Auto Publish feature allows applications installed within the guest to be automagically available from the start menu of the Windows 7 host.  Of course, one requirement for this to work properly is that the feature must be enabled; however, I have encountered two such instances where the feature is listed as ‘Not Available’ within the Virtual PC Settings UI once the guest is started.

     

    image001

     

    Debugging the Issue

     

    The Virtual PC Integration Components Services Application (1-vmsrvc) service, which runs within a vmsrvc.exe instance on the guest, decides whether the Auto Publish feature should be enabled.  For non-Windows 7 guests, the first thing 1-vmsrvc does is issue a WMI query to determine whether KB961742 (for an XP guest) or KB961741 (for a Vista guest), which provide Remote Applications Integrated Locally (RAIL) support, have been applied.

     

    Root Cause Analysis

     

    Before we issue the WMI query, we must first activate the CLSID_WbemLevel1Login component.  Here’s an excerpt below:

     

    0:009> k

    ChildEBP RetAddr 

    00b1f730 74ef186e wbemprox!CDCOMTrans::DoActualCCI+0x3d

    00b1f774 74ef15db wbemprox!CDCOMTrans::DoCCI+0x12d

    00b1f830 74ef17e4 wbemprox!CDCOMTrans::DoActualConnection+0x25c

    00b1f85c 74ef1ee1 wbemprox!CDCOMTrans::DoConnection+0x25

    00b1f89c 01018283 wbemprox!CLocator::ConnectServer+0x7c

    00b1fae8 010182da vmsrvc!VPCRAILUpdates::Connect+0xa8

    00b1faf4 0101842b vmsrvc!VPCRAILUpdates::QueryInstalledFixes+0xb

    00b1fe34 0100b61f vmsrvc!VPCRAILUpdates::CheckIfUpdatesArePresent+0x91

    00b1ffb4 7c80b729 vmsrvc!Win32VPCAppPublisherService::AllowListNotificationThreadProc+0x90

    00b1ffec 00000000 kernel32!BaseThreadStart+0x37

     

    However, I noticed that the activation attempt failed with WBEM_E_CRITICAL_ERROR(0x8004100a).

     

    0:009> r eax

    eax=8004100a

     

    So, I proceeded to debug the activation attempt from within the Windows Management Instrumentation(winmgmt) service, since it provides the class factory for this component.  Notice the call to LoadLibraryExW passing a relative path for the lpFileName parameter.  This means that a search strategy must be applied which leverages the PATH environment variable.

     

    0:002> k

    ChildEBP RetAddr             

    0086f438 594976e2 kernel32!LoadLibraryExW

    0086f494 7751d8a7 wmisvc!CForwardFactory::CreateInstance+0xf8

    0086f4b8 7751daac ole32!GetInstanceHelperMulti+0x20

    0086f578 77e799f4 ole32!CObjServer::CreateInstance+0x251

    0086f59c 77ef421a RPCRT4!Invoke+0x30

    0086f9a8 77ef4bf3 RPCRT4!NdrStubCall2+0x297

    0086fa00 77600c15 RPCRT4!CStdStubBuffer_Invoke+0xc6

    0086fa40 77600bbf ole32!SyncStubInvoke+0x33

    0086fa88 7752ad31 ole32!StubInvoke+0xa7

    0086fb60 7752ac56 ole32!CCtxComChnl::ContextInvoke+0xe3

    0086fb7c 776007f5 ole32!MTAInvoke+0x1a

    0086fbac 77602df3 ole32!AppInvoke+0x9c

    0086fc80 77600715 ole32!ComInvokeWithLockAndIPID+0x2c2

    0086fccc 77e794bd ole32!ThreadInvoke+0x1cd

    0086fd00 77e79422 RPCRT4!DispatchToStubInC+0x38

    0086fd54 77e7934e RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0x113

    0086fd78 77e8a384 RPCRT4!RPC_INTERFACE::DispatchToStub+0x84

    0086fdb8 77e8a3c5 RPCRT4!RPC_INTERFACE::DispatchToStubWithObject+0xc0

    0086fdf8 77e7bcc1 RPCRT4!LRPC_SCALL::DealWithRequestMessage+0x2cd

    0086fe1c 77e7bc05 RPCRT4!LRPC_ADDRESS::DealWithLRPCRequest+0x16d

     

    0:002> du poi(esp+4)

    59491668  "wbemcore.dll"

     

    Surprisingly, I found that the requested library could not be found using the standard search strategy.

     

    0:002> !gle

    LastErrorValue: (Win32) 0x7e (126) - The specified module could not be found.

    LastStatusValue: (NTSTATUS) 0xc0000135 - {Unable To Locate Component}  This application has failed to start because %hs was not found. Re-installing the application may fix this problem.

     

    I then decided to enable boot logging within the Process Monitor tool to catch a glimpse into why the load for wbemcore.dll was failing.  Here’s an excerpt from the Process Monitor log:

     

    8:18:11.1652951 PM      svchost2.exe      1628  QueryOpen   C:\WINDOWS\system32\%SystemRoot%\system32\wbemcore.dll  PATH NOT FOUND    

    8:18:11.1653627 PM      svchost2.exe      1628  QueryOpen   C:\WINDOWS\system32\%SystemRoot%\wbemcore.dll        PATH NOT FOUND 

    8:18:11.1654161 PM      svchost2.exe      1628  QueryOpen   C:\WINDOWS\system32\%SystemRoot%\System32\Wbem\wbemcore.dll  PATH NOT FOUND     

     

    At this point, it became clear to me that something must be wrong with the PATH environment variable configuration because the SystemRoot environment variable was not being properly expanded.  Sure enough, the registry showed that the Path value was of type REG_SZ rather than REG_EXPAND_SZ.

     

    image003

     

    Once I saved off the data for the Path value and recreated it specifying the proper type(REG_EXPAND_SZ), the issue was resolved!

     

    Sidenote on the Debugging Strategy

     

    I chose to debug the services by attaching ntsd.exe and redirecting the session to the kernel debugger.  Alternatively, I could have chosen to use a remote debugging session since it’s much more natural than redirecting to the kernel debugger, in my opinion.  However, this issue was further complicated by the fact that introducing a debugger in the mix had potential to change the timing enough such that I struggled to reproduce the issue using a remote session.  The reason for this is that if we slow down the 1-vmsrvc execution enough, the winmgmt service itself will attempt to load wbemcore.dll, albeit through an activation request where an absolute path is specified, thereby avoiding application of a search strategy, which will succeed.  Thus, when 1-vmsrvc attempts to activate CLSID_WbemLevel1Login, it will now succeed because wbemcore.dll is already loaded and the Auto Publish feature will now be ‘Enabled’!

     

    Configuration

     

    I leveraged the Image File Execution Options key, creating a new entry for vmsrvc.exe and configuring the Debugger value with the following command line:

     

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\vmsrvc.exe]

    "Debugger"="c:\\debuggers\\ntsd.exe -d -y srv*c:\\vmsrvc -c \"bu vmsrvc!Win32VPCAppPublisherService::AllowListNotificationThreadProc;bu wbemprox!CDCOMTrans::DoActualCCI;g\""

     

    I also isolated the winmgmt service into its own svchost.exe, copied %systemroot%\system32\svchost.exe to %systemroot%\system32\svchost2.exe, and then created a new key for svchost2.exe with the following command line:

     

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\svchost2.exe]

    "Debugger"="c:\\debuggers\\ntsd.exe -d -y srv*c:\\winmgmt -c \"bu wmisvc!CForwardFactory::CreateInstance;g\""

     

    Although redirecting ntsd.exe to the kernel debugger was cumbersome from a usability perspective, I found the ability to debug both processes from a central facility very appealing and worthwhile.

     

    Conclusion

     

    For issues like these that ultimately resolve to a misconfiguration of the OS, I can’t help but think how unfortunate it was that I didn’t stumble onto root cause sooner via the routine task of launching some executable from a command shell.  Surely this would have been a red flag and could have saved me a lot of time debugging!

     

    image005

     

    While this was a rather extreme example of how a misconfiguration of the OS can affect other seemingly unrelated parts such as the Auto Publish feature of XP Mode, the take-away here is when you detect that the Auto Publish feature is ‘Not Available’, you should begin with standard WMI troubleshooting; and, as we’ve just seen here, a quick sanity check of the environment might not be a bad idea either! :)

     

    Until next time, happy debugging!

  • Ntdebugging Blog

    How Queue Deadlocks Happen

    • 3 Comments

    I recently worked on an interesting system hang issue with a deadlocked work queue and wanted to share some information about how we resolved the issue.

     

    In this example, we will demonstrate how a cascaded work item can deadlock a work queue.  As you can see from the illustration, we have three worker threads.  Most work queues have 16 or more threads servicing them, but to simplify this example, three worker threads will be used.  In the first illustration, all worker threads are waiting for the queue’s event to be signaled (indicating that a work item was inserted into the queue).

     

    Thread ~ac0.001 is a process or service related thread that is currently in a running state (indicated by the color blue).  This thread needs some work to be accomplished in an arbitrary context.  To do this, the thread constructs an IO_WORK_ITEM and specifies a callback routine to be performed by the worker thread(s). 
     Fig. 1 – All worker threads available

    Fig. 1 – All worker threads available

     

    Once the required work items have been defined, our thread would call IoQueueWorkItem for each item.  This would signal the queue’s notification KEVENT that there is work to be done.  In figure 2, we can see the work items inserted into the queue (Fig. 2).

    Fig. 2 – Thread inserts work items into the queue

    Fig. 2 – Thread inserts work items into the queue

     

    At this point, the work queue’s event is signaled to wake up the worker(s).  Once running, each of the threads removes a work item from the queue and runs the function associated with the work item.

     

    Since each work item has a completion event, this event will be signaled when the worker is finished processing the item.   Because of this, thread ~ac0.001 would normally call KeWaitForMultipleObjects to wait for the work item completion events to be signaled (Fig. 3).

    Fig. 3 – Requesting thread waits for completion

    Fig. 3 – Requesting thread waits for completion

     

    Upon completion of a work item, each of the worker threads would normally signal the corresponding work item event(s) and then go back into a wait state for new work items and thread ~ac0.001 would continue execution.

     

    Now let’s look at what happens when workers need to enqueue one or more additional work items to the same queue.  This is not a problem as long as there is at least one available worker thread to service these cascaded work items.  The available worker(s) would simply handle these and eventually this situation would resolve itself (Fig. 4).

    Fig. 4 – Work items 4, 5, and 6 will be handled by Thread ~008.004

    Fig. 4 – Work items 4, 5, and 6 will be handled by Thread ~008.004

     

    If there are no more available workers, work items 4, 5, and 6 will never be handled.  This results in an unresolvable situation (Fig. 5).  No more items can be removed from the queue.

    Fig. 5 – No available worker threads to handle items 4, 5, and 6

    Fig. 5 – No available worker threads to handle items 4, 5, and 6

     

    The result is a deadlock on the queue (Fig. 6).  To avoid this type of bug, a developer must take cascaded work items into consideration.  It is not ideal for a worker routine to enqueue work to its own queue, but it is a common practice in the software industry.  To avoid deadlocks, a developer may want to consider using a separate work queue for these cascaded items.  If this is not possible, the developer should ensure that there are sufficient worker threads to handle any reasonable load scenario.

    Fig. 6 – Deadlocked work queue

    Fig. 6 – Deadlocked work queue

     

    In summary, exercise caution when writing your worker routines and be sure to do plenty of stress testing to ensure your driver’s design is feasible in real world scenarios.

     

     

    Example

    A few weeks ago, I received a memory dump from a customer who was reporting a daily system hang.

     

    One of the first things I do with a hang dump is to do a quick assessment with !VM, !Locks, and !exqueue 6.  If there's something obvious going on, those three debugger commands will usually get things moving in the right direction.

     

    In this case, the output of !exqueue 6 caught my eye.  In the Critical Work Queue, all critical system worker threads were blocking on a wait operation issued by backyard_driver.  It had called KeWaitForSingleObject and was waiting for notification events.  I also noticed a large number of pending work items for backyard_driver and there were two different worker routines in these items.

     

    PENDING: IoWorkItem backyard_driver+0x830768F2 (0001e228)

                                    IoObject (8823e148) Context (b40e4384)

     

    PENDING: IoWorkItem backyard_driver+0x830767da (0001e110)

                                    IoObject (8823e148) Context (b40be9f0)

     

    From looking at the thread stacks, I could tell that the waiting worker routines were from the worker routine at +0x1e110, and the pending items were for a worker routine starting at offset +0x1e228.

     

    8df6fb8c 96e3cccd 85801aa0 00000000 00000000 nt!KeWaitForSingleObject+0x393 (CONV: stdcall) 

    WARNING: Stack unwind information not available. Following frames may be wrong.

    8df6fbe4 96e3d5bd 88266698 00000103 8df6fc0c backyard_driver+0x1cccd

    8df6fbf4 82e924bc 88266698 8af1cb20 88267f00 backyard_driver+0x1d5bd

    8df6fc0c 96eeb0ac 82e924bc 88267f00 8af1cb20 nt!IofCallDriver+0x63 (CONV: fastcall) 

    8df6fc28 96f08bbc 96ef64d0 859c7220 859893c8 filterdriver1+0x50ac

    8df6fc40 96ef6516 85aea488 858b2e30 00000000 filterdriver2+0x5cc

    8df6fc60 96f07cd1 859893c8 00000000 00000000 filterdriver2+0x2516

    8df6fc7c 82ebeb33 88228c18 85afb628 859893c8 filterdriver2+0x2f1

    8df6fcc4 96e3e0fe b40d625c b40d615c 8df6fcec nt!IopfCompleteRequest+0x128 (CONV: fastcall) 

    8df6fcd4 96e3e14d b40d6100 8b1cbaa8 8823e148 backyard_driver+0x1e0fe

    8df6fcec 830767fd 8823e148 b40d615c 857524f8 backyard_driver+0x1e14d

    8df6fd00 82ec3f3b 8b1cbaa8 00000000 857524f8 nt!IopProcessWorkItem+0x23 (CONV: stdcall) 

    8df6fd50 830646bb 00000000 ae892e65 00000000 nt!ExpWorkerThread+0x10d (CONV: stdcall) 

    8df6fd90 82f160f9 82ec3e2e 00000000 00000000 nt!PspSystemThreadStartup+0x9e (CONV: stdcall) 

    00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x19 

     

    This means that one worker routine was calling IoQueueWorkItem for the same queue and then waiting for completion.  This is fine as long as there are one or more worker threads available to process the cascaded work items, but in this case all threads were busy waiting for other work items in the same queue to complete.  The queue was essentially deadlocked.

     

    0: kd> !exqueue 6

    Dumping ExWorkerQueue: 82F90580

     

    **** Critical WorkQueue( current = 0 maximum = 4 )

    THREAD 857524f8  Cid 0004.0018  Teb: 00000000 Win32Thread: 00000000 WAIT: (Executive) KernelMode Non-Alertable

        85801aa0  NotificationEvent

    Not impersonating

    DeviceMap                 8cc084b0

    Owning Process            8568b600       Image:         System

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      30540          Ticks: 63029 (0:00:16:23.258)

    Context Switch Count      27640            

    UserTime                  00:00:00.000

    KernelTime                00:00:00.327

    Win32 Start Address nt!ExpWorkerThread (0x82ec3e2e)

    Stack Init 8df6ffd0 Current 8df6fa98 Base 8df70000 Limit 8df6d000 Call 0

    Priority 13 BasePriority 13 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

    *** ERROR: Module load completed but symbols could not be loaded for backyard_driver.sys

    *** ERROR: Module load completed but symbols could not be loaded for filterdriver1.sys

    *** ERROR: Symbol file could not be found.  Defaulted to export symbols for filterdriver2.SYS -

    ChildEBP RetAddr  Args to Child              

    8df6fab0 82ec4b25 857524f8 00000000 82f7fd20 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4]) 

    8df6fae8 82ec3423 857525b8 857524f8 85801aa0 nt!KiSwapThread+0x266 (CONV: fastcall) 

    8df6fb10 82ebd2cf 857524f8 857525b8 00000000 nt!KiCommitThreadWait+0x1df (CONV: stdcall) 

    8df6fb8c 96e3cccd 85801aa0 00000000 00000000 nt!KeWaitForSingleObject+0x393 (CONV: stdcall) 

    WARNING: Stack unwind information not available. Following frames may be wrong.

    8df6fbe4 96e3d5bd 88266698 00000103 8df6fc0c backyard_driver+0x1cccd

    8df6fbf4 82e924bc 88266698 8af1cb20 88267f00 backyard_driver+0x1d5bd

    8df6fc0c 96eeb0ac 82e924bc 88267f00 8af1cb20 nt!IofCallDriver+0x63 (CONV: fastcall) 

    8df6fc28 96f08bbc 96ef64d0 859c7220 859893c8 filterdriver1+0x50ac

    8df6fc40 96ef6516 85aea488 858b2e30 00000000 filterdriver2+0x5cc

    8df6fc60 96f07cd1 859893c8 00000000 00000000 filterdriver2+0x2516

    8df6fc7c 82ebeb33 88228c18 85afb628 859893c8 filterdriver2+0x2f1

    8df6fcc4 96e3e0fe b40d625c b40d615c 8df6fcec nt!IopfCompleteRequest+0x128 (CONV: fastcall) 

    8df6fcd4 96e3e14d b40d6100 8b1cbaa8 8823e148 backyard_driver+0x1e0fe

    8df6fcec 830767fd 8823e148 b40d615c 857524f8 backyard_driver+0x1e14d

    8df6fd00 82ec3f3b 8b1cbaa8 00000000 857524f8 nt!IopProcessWorkItem+0x23 (CONV: stdcall) 

    8df6fd50 830646bb 00000000 ae892e65 00000000 nt!ExpWorkerThread+0x10d (CONV: stdcall) 

    8df6fd90 82f160f9 82ec3e2e 00000000 00000000 nt!PspSystemThreadStartup+0x9e (CONV: stdcall) 

    00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x19 

     

    .

    <additional threads were found here with similar stacks>

    .

     

    THREAD 856be798  Cid 0004.01d0  Teb: 00000000 Win32Thread: 00000000 WAIT: (Executive) KernelMode Non-Alertable

        85bc4ac8  NotificationEvent

    Not impersonating

    DeviceMap                 8cc084b0

    Owning Process            8568b600       Image:         System

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      30540          Ticks: 63029 (0:00:16:23.258)

    Context Switch Count      1            

    UserTime                  00:00:00.000

    KernelTime                00:00:00.000

    Win32 Start Address nt!ExpWorkerThread (0x82ec3e2e)

    Stack Init bedddfd0 Current beddda98 Base bedde000 Limit beddb000 Call 0

    Priority 13 BasePriority 13 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

    ChildEBP RetAddr  Args to Child              

    bedddab0 82ec4b25 856be798 00000000 82f7fd20 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4]) 

    bedddae8 82ec3423 856be858 856be798 85bc4ac8 nt!KiSwapThread+0x266 (CONV: fastcall) 

    bedddb10 82ebd2cf 856be798 856be858 00000000 nt!KiCommitThreadWait+0x1df (CONV: stdcall) 

    bedddb8c 96e3cccd 85bc4ac8 00000000 00000000 nt!KeWaitForSingleObject+0x393 (CONV: stdcall) 

    WARNING: Stack unwind information not available. Following frames may be wrong.

    bedddbe4 96e3d5bd 88266698 00000103 bedddc0c backyard_driver+0x1cccd

    bedddbf4 82e924bc 88266698 85b2a8d8 88267f00 backyard_driver+0x1d5bd

    bedddc0c 96eeb0ac 82e924bc 88267f00 85b2a8d8 nt!IofCallDriver+0x63 (CONV: fastcall) 

    bedddc28 96f08bbc 96ef64d0 85b33218 8b056f00 filterdriver1+0x50ac

    bedddc40 96ef6516 85e27c90 85e2b9f0 00000000 filterdriver2+0x5cc

    bedddc60 96f07cd1 8b056f00 00000000 00000000 filterdriver2+0x2516

    bedddc7c 82ebeb33 88228c18 85aae320 8b056f00 filterdriver2+0x2f1

    bedddcc4 96e3e0fe b40e2ae8 b40e29e8 bedddcec nt!IopfCompleteRequest+0x128 (CONV: fastcall) 

    bedddcd4 96e3e14d b40e2900 859cdee0 8823e148 backyard_driver+0x1e0fe

    bedddcec 830767fd 8823e148 b40e29e8 856be798 backyard_driver+0x1e14d

    bedddd00 82ec3f3b 859cdee0 00000000 856be798 nt!IopProcessWorkItem+0x23 (CONV: stdcall) 

    bedddd50 830646bb 80000000 9da20e65 00000000 nt!ExpWorkerThread+0x10d (CONV: stdcall) 

    bedddd90 82f160f9 82ec3e2e 80000000 00000000 nt!PspSystemThreadStartup+0x9e (CONV: stdcall) 

    00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x19 

     

     

    PENDING: IoWorkItem backyard_driver+0x830768F2 (0001e228)

                                    IoObject (8823e148) Context (b40e4384)

     

    PENDING: IoWorkItem backyard_driver+0x830768F2 (0001e228)

                                    IoObject (8823e148) Context (b40e45b8)

     

    PENDING: IoWorkItem backyard_driver+0x830767da (0001e110)

                                    IoObject (8823e148) Context (b40be9f0)

     

    PENDING: IoWorkItem backyard_driver+0x830768F2 (0001e228)

                                    IoObject (8823e148) Context (b40e47ec)

     

    PENDING: IoWorkItem backyard_driver+0x830767da (0001e110)

                                    IoObject (8823e148) Context (b40e4d88)

     

    PENDING: IoWorkItem backyard_driver+0x830768F2 (0001e228)

                                    IoObject (8823e148) Context (882aea8c)

     

    .

    .

    PENDING: IoWorkItem backyard_driver+0x830767da (0001e110)

                                    IoObject (8823e148) Context (85807bd4)

     

    PENDING: IoWorkItem backyard_driver+0x830767da (0001e228)

                                    IoObject (8823e148) Context (8580cca0)

     

    PENDING: IoWorkItem backyard_driver+0x830767da (0001e228)

                                    IoObject (8823e148) Context (85811e54)

     

    PENDING: IoWorkItem backyard_driver+0x830767da (0001e228)

                                    IoObject (8823e148) Context (8581713c)

     

    PENDING: IoWorkItem backyard_driver+0x830767da (0001e228)

                                    IoObject (8823e148) Context (8581c424)

     

    It turns out that this issue was triggered by a sudden surge of IO related to a PNP arrival.  To prevent the issue from occurring, we added a registry value to allow for additional critical worker threads to be created.

     

    HKLM\SYSTEM\CurrentControlSet\Control\SessionManager\Executive

    AdditionalCriticalWorkerThreads : 0x14

     

    For more information about this registry key, refer to http://msdn.microsoft.com/en-us/library/ee377058(BTS.10).aspx

     

    That’s all for now.

     

    Dennis Middleton

    “The NTFS Doctor”

     

  • Ntdebugging Blog

    Interpreting a WHEA error for a MCA fault

    • 3 Comments

    Howdy fellow debuggers! This is Graham McIntyre, I am an Escalation Engineer in Platforms Global Escalation Services.  We get questions from time to time from customers who experience a WHEA bugcheck 0x124, or system event, for help in interpreting the error record. The information applies to Windows Server 2008 / Vista SP1 and Windows 2008 R2 / Windows 7.

     

    I thought I would go through an example error record, point out some commonly asked questions, and show you how to find specific information on the error.  In many cases, the information is specific to a particular processor / hardware vendor, the customer will need to follow up with them. But, we can help to some extent to parse the data.

     

    For an initial primer on WHEA and hardware error reporting, I suggest reading this whitepaper:

    http://www.microsoft.com/whdc/system/pnppwr/WHEA/wheaintro.mspx

     

    I’ll provide some further links to some specific WHEA information along the way.

     

    Getting Started:

    A WHEA bug check 0x124, WHEA_UNCORRECTABLE_ERROR, indicates that a fatal hardware error has occurred.  The bug check parameters give you further information on the WHEA error record generated.

     

    In this example case, the first parameter was 0 so this indicates that this is a Machine Check Exception (MCE).  An MCE is generated by certain classes of processors, such as Intel and AMD 64-bit processors.

     

    Checking the help included with the Debugging Tools For Windows for Bug Ch 0x124 shows this meaning for the parameters:

    Parameter 1 Parameter 2 Parameter 3 Parameter 4 Cause of Error

    0x0

    Address of WHEA_ERROR_RECORD structure

    High 32 bits of MCi_STATUS MSR for the MCA bank that has the error.

    Low 32 bits of MCi_STATUS MSR for the MCA bank that has the error.

    A machine check exception occurred.

    These parameter descriptions apply if the processor is based on the x64 architecture, or the x86 architecture that has the MCA feature available (for example, Intel Pentium Pro, Pentium IV, or Xeon).

     

    There are 2 useful debugger commands for debugging a WHEA error:

    !whea – displays top level WHEA information

    !errrec – dumps a specific WHEA error record

     

    Since we already have an address of the error record in Parameter 2, we can dump it out directly with !errrec. 

    31: kd> !errrec fffffa8064341028
    ===============================================================================
    Common Platform Error Record @ fffffa8064341028
    -------------------------------------------------------------------------------
    Record Id     : 01cb65718c829130
    Severity      : Fatal (1)
    Length        : 928
    Creator       : Microsoft
    Notify Type   : Machine Check Exception
    Timestamp     : 10/11/2010 7:11:22
    Flags         : 0x00000000

    ===============================================================================
    Section 0     : Processor Generic
    -------------------------------------------------------------------------------
    Descriptor    @ fffffa80643410a8
    Section       @ fffffa8064341180
    Offset        : 344
    Length        : 192
    Flags         : 0x00000001 Primary
    Severity      : Fatal

    Proc. Type    : x86/x64
    Instr. Set    : x64
    Error Type    : Micro-Architectural Error
    Flags         : 0x00
    CPU Version   : 0x00000000000206e6
    Processor ID  : 0x0000000000000037


    ===============================================================================
    Section 1     : x86/x64 Processor Specific
    -------------------------------------------------------------------------------
    Descriptor    @ fffffa80643410f0
    Section       @ fffffa8064341240
    Offset        : 536
    Length        : 128
    Flags         : 0x00000000
    Severity      : Fatal

    Local APIC Id : 0x0000000000000037

    CPU Id
           : e6 06 02 00 00 08 20 37 - bd e3 bc 00 ff fb eb bf
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

    Proc. Info 0  @ fffffa8064341240

    ===============================================================================
    Section 2     : x86/x64 MCA
    -------------------------------------------------------------------------------
    Descriptor    @ fffffa8064341138
    Section       @ fffffa80643412c0
    Offset        : 664
    Length        : 264
    Flags         : 0x00000000
    Severity      : Fatal

    Error         : Internal unclassified (Proc 31 Bank 5)

    Status      : 0xfa00000000400405

     

    As you can see from the output, a WHEA error record is made of several sections.  Each section is actually a sub-section of the one above it. The sections go from most generic, to most specific, based on the exact type of error which occurred.

    CPER / WHEA record – this is defined in Appendix N of the UEFI spec version 2.2 (these can be obtained from www.uefi.org)

    The format of most of the sections is defined in the UEFI Spec version 2.2 as part of the Common Platform Error Record (CPER) definition.  The last section decribes a Machine Check Archtecture (MCA) which is defined by the processor manufacturer.  In this case, it is an Intel processor

    MCA information - The format of the last part of the record (MCA) is defined in the Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A
    Section 15 describes the MCA format and structure. Appendix E in Volume 3B has additional details on interpreting Machine-Check error codes

    Let’s take a look at what each of the sections represents:

    An error record is described by a WHEA_ERROR_RECORD structure, the error record header is described by a WHEA_ERROR_RECORD_HEADER structure, and the error record section descriptors are each described by a WHEA_ERROR_RECORD_SECTION_DESCRIPTOR structure.


    The CPER record header is a WHEA_ERROR_PACKET_V2, and describes the severity and type of error.  In this case it is a fatal Machine Check Exception (MCE)

    Section 0 is a Generic Processor error. This error record section contains processor error data that is not specific to a particular processor architecture. The data that is contained in this section is described by the WHEA_PROCESSOR_GENERIC_ERROR_SECTION structure.

    Section 1 is an x86/x64 Processor Error. This error record section contains processor error data that is specific to the x86 or x64 processor architecture. The data that is contained in this section is described by the WHEA_XPF_PROCESSOR_ERROR_SECTION structure.

    Section 2 is of type WHEA_XPF_MCA_SECTION and contains the machine check and other machine-specific register information. The actual structure which holds the MCA data is a Microsoft specific extension of the CPER specification.  We build this record by reading the Machine Specific Registers (MSRs) which are processor specific, and filling in the fields.  These (and many of the above) are defined in the header file cper.h in the SDK.

    Some of the questions which I was asked about this record, and their answers:

    1.  Why is the processor number (31) listed in the MCA record (Section 2) different than the processor id / APIC ID (37) in sections 0 and 1?

    The answer to this one is that the numbers have different meanings, and different sources.  The one in sections 0 and 1 is the initial APIC ID of the CPU which reported the machine check.  The APIC ID for a logical CPU is set by the hardware on boot.  The processor number in Section 2 is the logical processor number (the value returned from KeGetCurrentProcessorNumberEx) of the processor which is creating the WHEA error record. This may or may not be the same processor which reported the machine check error, depending on the IRQL which the processor generating the error was running.  If the IRQL was < DISPATCH_LEVEL, then it is scheduled to run on the reporting processor.  Otherwise, it may be logged on a different processor.

    How do you map APIC IDs to logical IDs?
    One way is using the !smt debugger extension.  This shows the APIC IDs and logical CPU number for all CPUs.

    No PRCB             SMT Set                                                                             APIC Id
    0 fffff8000da3ee80 **-------------------------------------------------------------- (0000000000000003) 0x00000080
    1 fffff8800260e180 **-------------------------------------------------------------- (0000000000000003) 0x00000081

    2.  How do you interpret the MCA error  Internal unclassified (Proc 31 Bank 5)”?

    In order to make sense of these, you need to determine a few pieces of information, then refer to the specific processor guide.

    As I mentioned previously, for this particular system, it is an Intel system so these are the resources you need to use:

    Section 15 in the  Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 3A
    Appendix E in Volume 3B
    has additional details on interpreting Machine-Check error codes

     

    a. CPU ID – What Family, Model, and Stepping is the CPU?

    !cpuid can show you this.  Or, you can parse it from the CPU ID in section 1.  In this case it is:
    CPU Id        : e6 06 02 00 00 08 20 37 - bd e3 bc 00 ff fb eb bf  // Family 6, Model 2e, stepping 6

    Table B-1 in Appendix B of the Intel guide says that this Family and Model is a “Intel Xeon Processor 7500 Series”

     

    b. What is the MCA Error code?

    In order to find this out, we need to parse the MCi_STATUS structure.  The ‘i’ is used in the Intel guides as a placeholder for the bank number.  An error bank is a processor specific set of MSRs.  Some banks are publically documented in what the type of error represents, and some are not.  If the bank is not documented, then you will need to check with the processor manufacturer.

     

    Now that we know the processor family and model, we can look up the meaning of specific bank of registers.  These are listed in this form: MSR_MCi_STATUS.  So since we know the bank number is 5, we can find the meaning of MSR_MC5_STATUS.  Here’s what the Intel guide shows:

     

    Table B5 MSRS IN THE INTEL® MICROARCHITECTURE CODENAME NEHALEM

     

    Table B-5

    Register (hex) Register (dec) Register Name Scope Bit Description
    414H 1044 MSR_MC5_STATUS Core See Section 15.3.2.1, “IA32_MCi_CTL MSRs.”
    415H 1045 MSR_MC5_STATUS Core See Section 15.3.2.2, “IA32_MCi_STATUS MSRS.”
    416H 1046 MSR_MC5_ADDR Core See Section 15.3.2.3, “IA32_MCi_ADDR MSRs.”
    417H 1047 MSR_MC5_MISC Core See Section 15.3.2.4, “IA32_MCi_MISC MSRs.”

     

    Now,referring to section 15.3.2.2, we can decode the value:

    MCI_STATUS
    +0x000 McaErrorCode     : 0x405  // binary:  0000 0100 0000 0101
    +0x002 ModelErrorCode   : 0x40  // binary: 0000 0000 0100 0000 // bit 22
    +0x004 OtherInformation : 0y00000000000000000000000 (0)
    +0x004 ActionRequired   : 0y0
    +0x004 Signalling       : 0y0
    +0x004 ContextCorrupt   : 0y1
    +0x004 AddressValid     : 0y0
    +0x004 MiscValid        : 0y1
    +0x004 ErrorEnabled     : 0y1
    +0x004 UncorrectedError : 0y1
    +0x004 StatusOverFlow   : 0y1
    +0x004 Valid            : 0y1
    +0x000 QuadPart         : 0xfa000000`00400405

     

    Section 15.9 discusses how to interpret these error codes.  From Table 8, “IA32_MCi_Status [15:0] Simple Error Code Encoding”, the meaning is given as:

    Internal Unclassified 0000 01xx xxxx xxxx Internal unclassified errors.

     

    This is why the error shows as “Internal Unclassified”.  Since this is not a publicly documented code, the next step would be to contact Intel for further information.  But, at least now you have verified the information and will have good data to send to the hardware manufacturer.  In other cases, the bank and MCA code may be more clearly documented and further action could be taken.

     

    Further Reading:

    There is more information regarding WHEA on MSDN and in several WinHEC conference presentations on the Microsoft site:

    WHEA Platform Implementation

    WHEA System Design and Implementation

     

    I hope this information was useful to understand how to interpret WHEA and MCA error codes. Until next time!

  • Ntdebugging Blog

    Hunting for Bugs, but Found a Worm

    • 5 Comments

    Hi All, my name is Ron Riddle and I’m an Escalation Engineer on the core Windows team.  I worked an issue recently wherein a svchost.exe was crashing due to heap corruption; so, after enabling Page Heap and breaking out the services as needed, I received a user-mode dump that would show me the culprit.  I was expecting to find a legitimate bug either in our code or a third-party module; but, much to my surprise, I found that malware had caused a buffer overrun and the subsequent crash.  With that, I would like to share the simple approach I took in identifying the malware within the dump file.

     

    1. I start by dumping out the offending call stack.  Notice that the debugger wasn’t able to map the code addresses to a loaded or unloaded module.

    0:003> kbn

     # ChildEBP RetAddr  Args to Child             

    WARNING: Frame IP not in any known module. Following frames may be wrong.

    00 02bcfdcc 7c81a35f 02b7ae40 7c81a3ab 00000004 0x2b685b0

    01 02bcfde4 02b68bfe 02b7ae40 00000000 77e424ee ntdll!LdrpCallInitRoutine+0x21

    02 02bcfde8 02b7ae40 00000000 77e424ee 02b7ae10 0x2b68bfe

    03 02bcfdec 00000000 77e424ee 02b7ae10 00000000 0x2b7ae40

     

    2. Next, I try to learn more about the mystery address, such as what larger allocation it was a part of.

    0:003> !address 0x2b685b0

    Usage:                  <unclassified>

    Allocation Base:        02b60000

    Base Address:           02b61000

    End Address:            02b81000

    Region Size:            00020000

    Type:                   00020000    MEM_PRIVATE

    State:                  00001000    MEM_COMMIT

    Protect:                00000040    PAGE_EXECUTE_READWRITE

     

    3. By now, I am suspicious of a rogue module, so I proceed in searching the aforementioned address range for a DOS Signature(i.e. 0x5A4D or “MZ”) that I know any Portable Executable file must contain.  I start with the Base Address from the above output and use the Region Size to specify my range.

    0:003> s -a 02b61000 l20000/4 "MZ"

    02b615d8  4d 5a 90 00 03 00 00 00-04 00 00 00 ff ff 00 00  MZ..............

    02b61bd0  4d 5a 75 f4 5f 83 c4 08-c2 04 00 55 8d 44 24 0c  MZu._......U.D$.

    02b67cd0  4d 5a 0f 85 69 01 00 00-8b 4d 7c 8b 46 3c 81 c1  MZ..i....M|.F<..

    02b681bf  4d 5a 74 07 33 c0 e9 c9-01 00 00 8b 45 0c 56 8b  MZt.3.......E.V.

     

    4. Now that I have some hits, I’ll start with the first one and verify whether it’s a valid module.  Bingo!

    0:003> !dh -a 02b615d8

     

    File Type: DLL

    FILE HEADER VALUES

         14C machine (i386)

           5 number of sections

    37304740 time date stamp Wed May 05 08:27:28 1999

     

           0 file pointer to symbol table

           0 number of symbols

          E0 size of optional header

        2102 characteristics

                Executable

                32 bit word machine

                DLL

     

    OPTIONAL HEADER VALUES

         10B magic #

        7.00 linker version

         600 size of code

         600 size of initialized data

           0 size of uninitialized data

        10B0 address of entry point

        1000 base of code

             ----- new -----

    10000000 image base

        1000 section alignment

         200 file alignment

           1 subsystem (Native)

        4.00 operating system version

        0.00 image version

        4.00 subsystem version

        6000 size of image

         400 size of headers

        41AE checksum

    00100000 size of stack reserve

    00001000 size of stack commit

    00100000 size of heap reserve

    00001000 size of heap commit

           0  DLL characteristics

           0 [       0] address [size] of Export Directory

        4000 [      28] address [size] of Import Directory

           0 [       0] address [size] of Resource Directory

           0 [       0] address [size] of Exception Directory

           0 [       0] address [size] of Security Directory

        5000 [      4C] address [size] of Base Relocation Directory

           0 [       0] address [size] of Debug Directory

           0 [       0] address [size] of Description Directory

           0 [       0] address [size] of Special Directory

           0 [       0] address [size] of Thread Storage Directory

           0 [       0] address [size] of Load Configuration Directory

           0 [       0] address [size] of Bound Import Directory

        2000 [      44] address [size] of Import Address Table Directory

           0 [       0] address [size] of Delay Import Directory

           0 [       0] address [size] of COR20 Header Directory

           0 [       0] address [size] of Reserved Directory

     

     

    SECTION HEADER #1

       .text name

         3CC virtual size

        1000 virtual address

         400 size of raw data

         400 file pointer to raw data

           0 file pointer to relocation table

           0 file pointer to line numbers

           0 number of relocations

           0 number of line numbers

    68000020 flags

             Code

             Not Paged

             (no align specified)

             Execute Read

     

    SECTION HEADER #2

      .rdata name

          68 virtual size

        2000 virtual address

         200 size of raw data

         800 file pointer to raw data

           0 file pointer to relocation table

           0 file pointer to line numbers

           0 number of relocations

           0 number of line numbers

    48000040 flags

             Initialized Data

             Not Paged

             (no align specified)

             Read Only

     

    SECTION HEADER #3

       .data name

          56 virtual size

        3000 virtual address

         200 size of raw data

         A00 file pointer to raw data

           0 file pointer to relocation table

           0 file pointer to line numbers

           0 number of relocations

           0 number of line numbers

    C8000040 flags

             Initialized Data

             Not Paged

             (no align specified)

             Read Write

     

    SECTION HEADER #4

        INIT name

         1D4 virtual size

        4000 virtual address

         200 size of raw data

         C00 file pointer to raw data

           0 file pointer to relocation table

           0 file pointer to line numbers

           0 number of relocations

           0 number of line numbers

    E2000020 flags

             Code

             Discardable

             (no align specified)

             Execute Read Write

     

    SECTION HEADER #5

      .reloc name

          82 virtual size

        5000 virtual address

         200 size of raw data

         E00 file pointer to raw data

           0 file pointer to relocation table

           0 file pointer to line numbers

           0 number of relocations

           0 number of line numbers

    42000040 flags

             Initialized Data

             Discardable

             (no align specified)

             Read Only

     

    5. Because I’m not sure which sections might contain identifying characteristics, I decide to go spelunking through all the sections (except for the relocation section) looking for said characteristics that might help me to identify the rogue module.  I start with the relative virtual address of the .text section @ 0x1000 and continue through the INIT section @ 0x4000.

    0:003> dc 02b615d8+0x1000 l4000/4

    02b63c58  00000065 646c6977 73737265 72756365  e...wilderssecur

    02b63c68  00797469 65726874 78657461 74726570  ity.threatexpert

    02b63c78  00000000 74736163 6f63656c 00007370  ....castlecops..

    02b63c88  6d617073 73756168 00000000 65737063  spamhaus....cpse

    02b63c98  65727563 00000000 61637261 00746962  cure....arcabit.

    02b63ca8  69736d65 74666f73 00000000 626e7573  emsisoft....sunb

    02b63cb8  00746c65 75636573 6f636572 7475706d  elt.securecomput

    02b63cc8  00676e69 69736972 0000676e 76657270  ing.rising..prev

    02b63cd8  00000078 6f746370 00736c6f 6d726f6e  x...pctools.norm

    02b63ce8  00006e61 6f63376b 7475706d 00676e69  an..k7computing.

    02b63cf8  72616b69 00007375 72756168 00000069  ikarus..hauri...

    02b63d08  6b636168 74666f73 00000000 74616467  hacksoft....gdat

    02b63d18  00000061 74726f66 74656e69 00000000  a...fortinet....

    02b63d28  64697765 0000006f 6d616c63 00007661  ewido...clamav..

    02b63d38  6f6d6f63 00006f64 63697571 6165686b  comodo..quickhea

    02b63d48  0000006c 72697661 00000061 73617661  l...avira...avas

    02b63d58  00000074 66617365 00000065 6c6e6861  t...esafe...ahnl

    02b63d68  00006261 746e6563 636c6172 616d6d6f  ab..centralcomma

    02b63d78  0000646e 65777264 00000062 73697267  nd..drweb...gris

    02b63d88  0074666f 74657365 00000000 33646f6e  oft.eset....nod3

    02b63d98  00000032 72702d66 0000746f 74746f6a  2...f-prot..jott

    02b63da8  00000069 7073616b 6b737265 00000079  i...kaspersky...

    02b63db8  65732d66 65727563 00000000 706d6f63  f-secure....comp

    02b63dc8  72657475 6f737361 74616963 00007365  uterassociates..

    02b63dd8  7774656e 616b726f 636f7373 65746169  networkassociate

    02b63de8  00000073 75727465 00007473 646e6170  s...etrust..pand

    02b63df8  00000061 68706f73 0000736f 6e657274  a...sophos..tren

    02b63e08  63696d64 00006f72 6661636d 00006565  dmicro..mcafee..

    02b63e18  74726f6e 00006e6f 616d7973 6365746e  norton..symantec

    02b63e28  00000000 7263696d 666f736f 00000074  ....microsoft...

    02b63e38  65666564 7265646e 00000000 746f6f72  defender....root

    02b63e48  0074696b 776c616d 00657261 77797073  kit.malware.spyw

    02b63e58  00657261 75726976 00000073 304ce942  are.virus...B.L0

    02b64348  54464f53 45524157 63694d5c 6f736f72  SOFTWARE\Microso

    02b64358  575c7466 6f646e69 435c7377 65727275  ft\Windows\Curre

    02b64368  6556746e 6f697372 78655c6e 726f6c70  ntVersion\explor

    02b64378  415c7265 6e617664 5c646563 646c6f46  er\Advanced\Fold

    02b64388  485c7265 65646469 48535c6e 4c41574f  er\Hidden\SHOWAL

    02b64398  0000004c 63656843 5664656b 65756c61  L...CheckedValue

    02b63ee8  ffffffff 02b6a44f 02b6a453 70747468  ....O...S...http

    02b63ef8  772f2f3a 672e7777 796d7465 6f2e7069  ://www.getmyip.o

    02b63f08  00006772 70747468 772f2f3a 772e7777  rg..http://www.w

    02b63f18  73746168 7069796d 72646461 2e737365  hatsmyipaddress.

    02b63f28  006d6f63 70747468 772f2f3a 772e7777  com.http://www.w

    02b63f38  69746168 69796d73 726f2e70 00000067  hatismyip.org...

    02b63f48  70747468 632f2f3a 6b636568 642e7069  http://checkip.d

    02b63f58  6e646e79 726f2e73 00000067 61207069  yndns.org...ip a

    02b63f68  65726464 00007373 ffffffff 02b6a55e  ddress......^...

    02b64858  00000020 74666f53 65726177 63694d5c   ...Software\Mic

    02b64868  6f736f72 575c7466 6f646e69 435c7377  rosoft\Windows\C

    02b64878  65727275 6556746e 6f697372 75525c6e  urrentVersion\Ru

    02b64888  0000006e 646e7572 32336c6c 6578652e  n...rundll32.exe

    02b64898  73252220 73252c22 00000000 0065006e   "%s",%s....n.e.

    02b648a8  00730074 00630076 00000073 00000020  t.s.v.c.s... ...

     

    6. The list of anti-malware software vendors was a dead give-away that I was dealing with malware.  Finally, I conducted a Bing search using various artifacts from the preceding spew.  In the end, I was able to confirm that the rogue module was, in fact, the Conficker worm by simply running a full scan of the system using a signature-based scanner.

     

     

    I hope this walk-through provided you with techniques that you can leverage to identify rogue modules within your dump files, should that become necessary.  Until next time, happy bug-hunting and watch out for the worms!

  • Ntdebugging Blog

    New Joint Citrix/Microsoft Facebook Forum

    • 0 Comments

    Got a virtualization issue?.. Not sure if it’s a Microsoft or Citrix problem?.

    Jump on the new Facebook forum to discuss your issue, share ideas, opinions and information related to Citrix and Microsoft virtualization products.

    http://www.facebook.com/#!/pages/Citrix-and-Microsoft-Support/113004065426683

     

  • Ntdebugging Blog

    Debug Sleuth at Work: Hung Server…..Mystery of the unprocessed SMB work item.

    • 5 Comments

    Hello folks, Pushkar here. I recently worked on a case where the server was hung at “Applying User Settings” during the logon phase. You might ask what’s going to be new in this post, NTDebugging has bunch of posts covering such debug scenarios J. In my defense, this case was particularly interesting because it defied the general logic that I had about processing of network connection request. I am still a newbie in this area getting my feet wet in the arena of debugging. I had to seek help from one of the GES guru Aman Jain. He is an expert in the Windows Networking Stack. He helped me in confirming my suspicion.

     

    As we were stuck at “Applying User Settings”, the obvious path was to look at the activity of Winlogon to understand why is it stuck? I could see that there were seven logon requests which haven’t been processed yet.

     

    1: kd> !session

    Sessions on machine: 7

    Valid Sessions: 0 1 2 3 4 5 6

     

    1: kd> !process 0 0 winlogon.exe

    PROCESS fffffa806e132c10

        SessionId: 1  Cid: 0384    Peb: 7fffffdf000  ParentCid: 0298

        DirBase: db59d000  ObjectTable: fffff88007fd0430  HandleCount: 139.

        Image: winlogon.exe

    <snip>

     

    The above command returns 7 instances of Winlogon.exe, one instance per session. I picked one of the sessions as all the 7 sessions were in the same state.

     

    As this was a Windows 2008 Server, winlogon would open an LPC Channel with Logonui.exe to process the user’s logon and then it will call into Local Session Manager (LSM.EXE) to spawn the users’ session. I could see that winlogon is waiting on message reply from Logonui.exe.

     

    1: kd> !thread fffffa80`73d3d770

    THREAD fffffa8073d3d770  Cid 0384.1e2c  Teb: 000007fffffd7000 Win32Thread: 0000000000000000 WAIT: (WrLpcReply) UserMode Non-Alertable

        fffffa8073d3db00  Semaphore Limit 0x1

    Waiting for reply to ALPC Message fffff8801ffd9cf0 : queued at port fffffa804e9d9370 : owned by process fffffa806e134c10

    Not impersonating

    DeviceMap                 fffff880000073a0

    Owning Process            fffffa806e132c10       Image:         winlogon.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      4192487        Ticks: 1882 (0:00:00:29.359)

    Context Switch Count      2            

    UserTime                  00:00:00.000

    KernelTime                00:00:00.000

    Win32 Start Address 0x0000000076f67cb0

    Stack Init fffffa60110a9db0 Current fffffa60110a9670

    Base fffffa60110aa000 Limit fffffa60110a4000 Call 0

    Priority 13 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`110a96b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`110a97f0 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`110a9860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb

    fffffa60`110a98f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92

    fffffa60`110a9980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44

    fffffa60`110a99e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f

    fffffa60`110a9b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f

    fffffa60`110a9bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13

    00000000`01f8f098 00000000`00000000 0x76f9756a

     

    1: kd> !alpc /m fffff8801ffd9cf0

     

    Message @ fffff8801ffd9cf0

      MessageID             : 0x0464 (1124)

      CallbackID            : 0x2844F0 (2639088)

      SequenceNumber        : 0x0000000B (11)

      Type                  : LPC_REQUEST

      DataLength            : 0x0064 (100)

      TotalLength           : 0x008C (140)

      Canceled              : No

      Release               : No

      ReplyWaitReply        : No

      Continuation          : Yes

      OwnerPort             : fffffa80481ba9b0 [ALPC_CLIENT_COMMUNICATION_PORT]

      WaitingThread         : fffffa8073d3d770

      QueueType             : ALPC_MSGQUEUE_PENDING

      QueuePort             : fffffa804e9d9370 [ALPC_CONNECTION_PORT]

      QueuePortOwnerProcess : fffffa806e134c10 (LogonUI.exe)

      ServerThread          : fffffa804f96d5a0                  <----------- LPC Message Server Thread

      QuotaCharged          : No

      CancelQueuePort       : 0000000000000000

      CancelSequencePort    : 0000000000000000

      CancelSequenceNumber  : 0x00000000 (0)

      ClientContext         : 0000000000243d00

      ServerContext         : 0000000000000000

      PortContext           : 0000000001c27600

      CancelPortContext     : 0000000000000000

      SecurityData          : 0000000000000000

      View                  : 0000000000000000

     

    1: kd> !thread fffffa804f96d5a0

    THREAD fffffa804f96d5a0  Cid 01ac.09fc  Teb: 000007fffffd7000 Win32Thread: fffff900c067dd50 WAIT: (UserRequest) UserMode Non-Alertable

        fffffa80516be920  NotificationEvent

    Not impersonating

    DeviceMap                 fffff880000073a0

    Owning Process            fffffa806e134c10       Image:         LogonUI.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      4192487        Ticks: 1882 (0:00:00:29.359)

    Context Switch Count      13                 LargeStack

    UserTime                  00:00:00.000

    KernelTime                00:00:00.000

    Win32 Start Address 0x000007fefed47780

    Stack Init fffffa60114e4db0 Current fffffa60114e4940

    Base fffffa60114e5000 Limit fffffa60114df000 Call 0

    Priority 14 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`114e4980 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`114e4ac0 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`114e4b30 fffff800`02119608 nt!KeWaitForSingleObject+0x2cb

    fffffa60`114e4bc0 fffff800`01eb8ef3 nt!NtWaitForSingleObject+0x98

    fffffa60`114e4c20 00000000`76f96d5a nt!KiSystemServiceCopyEnd+0x13

    00000000`0444ef18 00000000`00000000 0x76f96d5a

     

    This server thread is processing a message, but we can’t tell what it is waiting for. If this thread was finished processing a message we would have seen “WAIT: (WrLpcReceive)” for the thread wait reason. Often this type of wait is for a critical section or other synchronization object that is waiting for another thread in the process. This begged an investigation of Logonui.exe, to see what the other threads were doing.

     

    I noticed that another Logonui.exe thread is waiting on LPC Message processing by a thread in LSM.EXE.

     

    1: kd> !thread fffffa80`4e9e5060

    THREAD fffffa804e9e5060  Cid 01ac.0438  Teb: 000007fffffd9000 Win32Thread: fffff900c06db2e0 WAIT: (WrLpcReply) UserMode Non-Alertable

        fffffa804e9e53f0  Semaphore Limit 0x1

    Waiting for reply to ALPC Message fffff8801ffaacf0 : queued at port fffffa804e9b6060 : owned by process fffffa806e130c10

    Not impersonating

    DeviceMap                 fffff880000073a0

    Owning Process            fffffa806e134c10       Image:         LogonUI.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      4192488        Ticks: 1881 (0:00:00:29.343)

    Context Switch Count      165                 LargeStack

    UserTime                  00:00:00.000

    KernelTime                00:00:00.000

    Win32 Start Address 0x000007fefbaa151c

    Stack Init fffffa6008d26db0 Current fffffa6008d26670

    Base fffffa6008d27000 Limit fffffa6008d20000 Call 0

    Priority 15 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`08d266b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`08d267f0 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`08d26860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb

    fffffa60`08d268f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92

    fffffa60`08d26980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44

    fffffa60`08d269e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f

    fffffa60`08d26b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f

    fffffa60`08d26bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13

    00000000`0272ea78 00000000`00000000 0x76f9756a

     

    At this point I decided to follow the LPC message chain trail to see where it ends and who is ultimately blocking it.

     

    1: kd> !alpc /m fffff8801ffaacf0

     

    Message @ fffff8801ffaacf0

      MessageID             : 0x0474 (1140)

      CallbackID            : 0x2844F6 (2639094)

      SequenceNumber        : 0x0000000D (13)

      Type                  : LPC_REQUEST

      DataLength            : 0x0058 (88)

      TotalLength           : 0x0080 (128)

      Canceled              : No

      Release               : No

      ReplyWaitReply        : No

      Continuation          : Yes

      OwnerPort             : fffffa804ea03c80 [ALPC_CLIENT_COMMUNICATION_PORT]

      WaitingThread         : fffffa804e9e5060

      QueueType             : ALPC_MSGQUEUE_PENDING

      QueuePort             : fffffa804e9b6060 [ALPC_CONNECTION_PORT]

      QueuePortOwnerProcess : fffffa806e130c10 (lsm.exe)

      ServerThread          : fffffa805180c700            <---- LPC Server thread for this message in LSM.EXE

      QuotaCharged          : No

      CancelQueuePort       : 0000000000000000

      CancelSequencePort    : 0000000000000000

      CancelSequenceNumber  : 0x00000000 (0)

      ClientContext         : 0000000001cf0e90

      ServerContext         : 0000000000000000

      PortContext           : 0000000001dc9df0

      CancelPortContext     : 0000000000000000

      SecurityData          : 0000000000000000

      View                  : 0000000000000000

     

    I noticed that the server thread which is supposed to process the LPC message for above thread is itself waiting on a message reply from a thread in Svchost.exe. LSM had bunch of other threads which were waiting on LPC message processing from server threads in Svchost.exe. I dumped out all the threads and their LPC message structure to locate the server threads in Svchost.exe which seems to be doing some activity and found that only for the thread below the server thread had some interesting information as other LPC Server threads in svchost.exe were Notification Timers.

     

    1: kd> !thread fffffa80`50910060

    THREAD fffffa8050910060  Cid 02e0.1a70  Teb: 000007fffffda000 Win32Thread: 0000000000000000 WAIT: (WrLpcReply) UserMode Non-Alertable

        fffffa80509103f0  Semaphore Limit 0x1

    Waiting for reply to ALPC Message fffff8801fe00cf0 : queued at port fffffa804e876b60 : owned by process fffffa804ea4fc10

    Not impersonating

    DeviceMap                 fffff880000073a0

    Owning Process            fffffa806e130c10       Image:         lsm.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      3187894        Ticks: 1006475 (0:04:21:41.110)

    Context Switch Count      97            

    UserTime                  00:00:00.000

    KernelTime                00:00:00.000

    Win32 Start Address 0x000007fefed47780

    Stack Init fffffa60125b1db0 Current fffffa60125b1670

    Base fffffa60125b2000 Limit fffffa60125ac000 Call 0

    Priority 9 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Kernel stack not resident.

    Child-SP          RetAddr           Call Site

    fffffa60`125b16b0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`125b17f0 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`125b1860 fffff800`01ee54e2 nt!KeWaitForSingleObject+0x2cb

    fffffa60`125b18f0 fffff800`0212ef34 nt!AlpcpSignalAndWait+0x92

    fffffa60`125b1980 fffff800`0212b9c6 nt!AlpcpReceiveSynchronousReply+0x44

    fffffa60`125b19e0 fffff800`0211b52f nt!AlpcpProcessSynchronousRequest+0x24f

    fffffa60`125b1b00 fffff800`01eb8ef3 nt!NtAlpcSendWaitReceivePort+0x19f

    fffffa60`125b1bb0 00000000`76f9756a nt!KiSystemServiceCopyEnd+0x13

    00000000`0205e538 00000000`00000000 0x76f9756a

     

    I looked at the server thread for this LPC message and that was also waiting on a notification event doing nothing. It again begged an investigation of the state of Svchost.exe to find out why the server thread hasn’t replied yet?

     

    1: kd> !alpc /m fffff8801fe00cf0

     

    Message @ fffff8801fe00cf0

      MessageID             : 0x03C4 (964)

      CallbackID            : 0x2140C2 (2179266)

      SequenceNumber        : 0x00000021 (33)

      Type                  : LPC_REQUEST

      DataLength            : 0x00C0 (192)

      TotalLength           : 0x00E8 (232)

      Canceled              : No

      Release               : No

      ReplyWaitReply        : No

      Continuation          : Yes

      OwnerPort             : fffffa804ef15d30 [ALPC_CLIENT_COMMUNICATION_PORT]

      WaitingThread         : fffffa8050910060

      QueueType             : ALPC_MSGQUEUE_PENDING

      QueuePort             : fffffa804e876b60 [ALPC_CONNECTION_PORT]

      QueuePortOwnerProcess : fffffa804ea4fc10 (svchost.exe)

      ServerThread          : fffffa8051435940

      QuotaCharged          : No

      CancelQueuePort       : 0000000000000000

      CancelSequencePort    : 0000000000000000

      CancelSequenceNumber  : 0x00000000 (0)

      ClientContext         : 00000000021a0f60

      ServerContext         : 0000000000000000

      PortContext           : 00000000033834d0

      CancelPortContext     : 0000000000000000

      SecurityData          : 0000000000000000

      View                  : 0000000000000000

     

    1: kd> !thread fffffa8051435940

    THREAD fffffa8051435940  Cid 0574.1ea0  Teb: 000007fffffd3000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Alertable

        fffffa601086d3e8  SynchronizationEvent

    IRP List:

        fffffa80514558b0: (0006,03a0) Flags: 00000884  Mdl: 00000000

    Impersonation token:  fffff8801fc4b050 (Level Impersonation)

    DeviceMap                 fffff8800e7c50e0

    Owning Process            fffffa804ea4fc10       Image:         svchost.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      3187932        Ticks: 1006437 (0:04:21:40.517)

    Context Switch Count      111            

    UserTime                  00:00:00.015

    KernelTime                00:00:00.000

    Win32 Start Address 0x000007fefed47780

    Stack Init fffffa601086ddb0 Current fffffa601086d040

    Base fffffa601086e000 Limit fffffa6010868000 Call 0

    Priority 10 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`1086d080 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`1086d1c0 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`1086d230 fffff800`020d2a2e nt!KeWaitForSingleObject+0x2cb

    fffffa60`1086d2c0 fffff800`020d2a87 nt!FsRtlCancellableWaitForMultipleObjects+0x62

    fffffa60`1086d320 fffffa60`015da75a nt!FsRtlCancellableWaitForSingleObject+0x27

    fffffa60`1086d360 fffffa60`015d8588 mup!MupLocateUncProviderForPath+0x40a

    fffffa60`1086d4a0 fffffa60`015d8130 mup!MupiCallUncProvider+0xe8

    fffffa60`1086d510 fffffa60`015d90ef mup!MupStateMachine+0x120

    fffffa60`1086d560 fffffa60`00f31e91 mup!MupCreate+0x2c3

    fffffa60`1086d5e0 fffffa60`00f4e26c fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x211

    fffffa60`1086d650 fffff800`021267d9 fltmgr!FltpCreate+0x25d

    fffffa60`1086d700 fffff800`0211ecb2 nt!IopParseDevice+0x5f9

    fffffa60`1086d8a0 fffff800`0211f842 nt!ObpLookupObjectName+0x593

    fffffa60`1086d9b0 fffff800`02124e47 nt!ObOpenObjectByName+0x2f2

    fffffa60`1086da80 fffff800`0212ecc8 nt!IopCreateFile+0x287

    fffffa60`1086db20 fffff800`01eb8ef3 nt!NtCreateFile+0x78

    fffffa60`1086dbb0 00000000`76f9726a nt!KiSystemServiceCopyEnd+0x13

    00000000`2674d738 00000000`00000000 0x76f9726a

     

    Bingo! I found two threads in Svchost.exe which showed some activity and were blocked for over 4 hours on a UNC Path access. From the IRP which was issued by the thread above I found out that it was attempting to contact one of the DCs.

     

    1: kd> !irp 0xfffffa80`514558b0

    Irp is active with 4 stacks 3 is current (= 0xfffffa8051455a10)

    No Mdl: No System Buffer: Thread fffffa8051435940:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

    [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

    [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

    >[  0, 0]   0 e0 fffffa8031d1a840 fffffa805145aec0 fffffa6000f332b0-fffffa8047b6c930 Success Error Cancel

                 \FileSystem\Mup  fltmgr!FltpSynchronizedOperationCompletion

                      Args: fffffa601086d848 01000040 00030000 00000000

    [  0, 0]   0  0 fffffa8031d1b880 fffffa805145aec0 00000000-00000000   

                 \FileSystem\FltMgr

                      Args: fffffa601086d848 01000040 00030000 00000000

     

    1: kd> dt nt!_FILE_OBJECT fffffa805145aec0 FileName

       +0x058 FileName : _UNICODE_STRING "\DC005.contoso.com\PIPE\lsarpc"

     

    For the details on MUP you can refer to Support for UNC Naming and MUP.

     

    At first my thought was that something is wrong with the network stack because we are waiting on domain controller to respond. The thought was that either we had a networking issue or the Domain Controller wasn’t responding, but redirector would have timed out as it won’t wait for over 4 hours for an unresponsive connection. I started digging into the source code and found that we build a query context in MupLocateUncProviderForPath () function and issue another IRP which is sent down the I/O Stack. I searched for other IRPs which was issued by the above thread and found two IRPs.

     

    1: kd> !irpfind 0 0 thread fffffa8051435940

    Looking for IRPs with thread == fffffa8051435940

     

    Scanning large pool allocation table for Tag: Irp? (fffffa804ea83000 : fffffa804ed83000)

     

     

    Searching NonPaged pool (fffffa8030605000 : fffffa8c27e00000) for Tag: Irp?

     

      Irp    [ Thread ] irpStack: (Mj,Mn)   DevObj  [Driver]         MDL Process

    fffffa80514558b0 [fffffa8051435940] irpStack: ( 0, 0)  fffffa8031d1a840 [ \FileSystem\Mup]

    fffffa8073bc65e0 [fffffa8051435940] irpStack: ( e, 0)  fffffa804e842610 [ \FileSystem\mrxsmb]

      

    1: kd> !irp fffffa8073bc65e0

    Irp is active with 2 stacks 1 is current (= 0xfffffa8073bc66b0)

     No Mdl: No System Buffer: Thread fffffa8051435940:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

    >[  e, 0]   0 e1 fffffa804e842610 fffffa804e843f20 fffffa60015d4024-fffffa601086d3b0 Success Error Cancel pending

                 \FileSystem\mrxsmb     mup!MupiQueryPathCompletionRoutine

                      Args: 00000004 00000050 00140193 fffffa601086d408

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

     

    This is the IRP that MUP has issued an IOCTL IOCTL_REDIR_QUERY_PATH to the redirector.  If the status returned by the driver is anything other than STATUS_PENDING then it will wait by calling FsRtlCancellableWaitForSingleObject.

     

    The 0xfffffa8051435940 thread doesn’t have any redirector function in its stack, which made me suspicious that there could be some other thread which is already working on similar operation and redirector is currently busy processing that request. I started looking for a worker thread which would be calling the redirector.

     

    1: kd> !stacks 0 mrxsmb

    Proc.Thread  .Thread  Ticks   ThreadState Blocker

                                [fffffa804e9c7c10 svchost.exe]

     248.000e84  fffffa804ee72bb0 ffe0d0ad Blocked    nt!FsRtlCancellableWaitForMultipleObjects+0x62

                                [fffffa804e9e12e0 svchost.exe]

     404.000f18  fffffa804edad4f0 ffe03e7e Blocked    nt!FsRtlCancellableWaitForMultipleObjects+0x62

     

    The above command will parse through all the kernel stacks for all the process and dump out one line summary info for the threads where it finds reference to that module in its stack. I have snipped the output to keep things in context.

     

    One of the sample threads where redirector is in action is shown below.

     

    1: kd> !thread fffffa804ee72bb0

    THREAD fffffa804ee72bb0  Cid 0248.0e84  Teb: 000007fffffa8000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Alertable

        fffffa8048f6c6b8  SynchronizationEvent

    IRP List:

        fffffa8073ac3980: (0006,03a0) Flags: 00000884  Mdl: 00000000

        fffffa807257fc60: (0006,03a0) Flags: 00000884  Mdl: 00000000

        fffffa80511627b0: (0006,03a0) Flags: 00000884  Mdl: 00000000

    Impersonation token:  fffff8800d515c40 (Level Impersonation)

    Owning Process            fffffa804e9c7c10       Image:         svchost.exe

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      2043731        Ticks: 2150638 (0:09:19:10.167)

    Context Switch Count      893            

    UserTime                  00:00:00.031

    KernelTime                00:00:00.109

    Win32 Start Address 0x000007fefb96c534

    Stack Init fffffa60125a1db0 Current fffffa60125a0ab0

    Base fffffa60125a2000 Limit fffffa601259c000 Call 0

    Priority 10 BasePriority 1 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`125a0af0 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`125a0c30 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`125a0ca0 fffff800`020d2a2e nt!KeWaitForSingleObject+0x2cb

    fffffa60`125a0d30 fffff800`020d2a87 nt!FsRtlCancellableWaitForMultipleObjects+0x62

    fffffa60`125a0d90 fffffa60`0f625c14 nt!FsRtlCancellableWaitForSingleObject+0x27

    fffffa60`125a0dd0 fffffa60`0f634e9e rdbss!RxWaitForStableCondition+0x11c

    fffffa60`125a0e10 fffffa60`0f63599f rdbss!RxFindOrCreateConnections+0x572

    fffffa60`125a0ef0 fffffa60`0f627318 rdbss!RxConstructVirtualNetRoot+0xb7

    fffffa60`125a0f90 fffffa60`0f63815a rdbss!RxFindOrConstructVirtualNetRoot+0x594

    fffffa60`125a1100 fffffa60`0f627eee rdbss!RxCreateTreeConnect+0x13e

    fffffa60`125a1190 fffffa60`0f6041f6 rdbss!RxCommonCreate+0x20a

    fffffa60`125a1250 fffffa60`0f62568a rdbss!RxFsdCommonDispatch+0x786

    fffffa60`125a1340 fffffa60`1031f2cf rdbss!RxFsdDispatch+0x21a

    fffffa60`125a13b0 fffffa60`015d85f9 mrxsmb!MRxSmbFsdDispatch+0xbf

    fffffa60`125a13f0 fffffa60`015d8130 mup!MupiCallUncProvider+0x159

    fffffa60`125a1460 fffffa60`015d90ef mup!MupStateMachine+0x120

    fffffa60`125a14b0 fffffa60`00f4e342 mup!MupCreate+0x2c3

    fffffa60`125a1530 fffff800`021267d9 fltmgr!FltpCreate+0x333

    fffffa60`125a15e0 fffff800`0211ecb2 nt!IopParseDevice+0x5f9

    fffffa60`125a1780 fffff800`0211f842 nt!ObpLookupObjectName+0x593

    fffffa60`125a1890 fffff800`02124e47 nt!ObOpenObjectByName+0x2f2

    fffffa60`125a1960 fffff800`0212ecc8 nt!IopCreateFile+0x287

    fffffa60`125a1a00 fffff800`01eb8ef3 nt!NtCreateFile+0x78

    fffffa60`125a1a90 fffff800`01eb9400 nt!KiSystemServiceCopyEnd+0x13

    fffffa60`125a1c98 fffffa60`0f664a5e nt!KiServiceLinkage

    <snip>

     

    The above thread is waiting for the condition of the remote connection to become stable and that’s the reason it called the function RxWaitForStableCondition(), which accepts RX_CONTEXT as its first argument and the reason for why we entered into the wait as second argument. As we can see that the thread above was waiting for over 9 hours and 19 minutes.

     

    For additional reading you can go through RX_CONTEXT and IRP Management.

     

    The unassembled code for the function below is to find out where we store the pointer to the RX_CONTEXT structure. You may remember about the challenges of x64 debugging from this post.

     

    1: kd> .frame /r 05

    05 fffffa60`11e42dd0 fffffa60`0f634e9e rdbss!RxWaitForStableCondition+0x11c

    rax=0000000000000000 rbx=fffffa804f4de010 rcx=0000000000000000

    rdx=0000000000000000 rsi=0000000000000001 rdi=0000000000000000

    rip=fffffa600f625c14 rsp=fffffa6011e42dd0 rbp=fffffa805113a490

    r8=0000000000000000  r9=0000000000000000 r10=0000000000000000

    r11=0000000000000000 r12=fffffa805113a4a0 r13=fffffa6011e42fd8

    r14=fffffa804f4de010 r15=0000000000000000

    iopl=0         nv up di pl nz na pe nc

    cs=0000  ss=0000  ds=0000  es=0000  fs=0000  gs=0000             efl=00000000

    rdbss!RxWaitForStableCondition+0x11c:

    fffffa60`0f625c14 85c0            test    eax,eax

     

    1: kd> u rdbss!RxWaitForStableCondition

    rdbss!RxWaitForStableCondition:

    fffffa60`0f625af8 488bc4          mov     rax,rsp

    fffffa60`0f625afb 48895808        mov     qword ptr [rax+8],rbx

    fffffa60`0f625aff 48896810        mov     qword ptr [rax+10h],rbp

    fffffa60`0f625b03 48897018        mov     qword ptr [rax+18h],rsi

    fffffa60`0f625b07 48897820        mov     qword ptr [rax+20h],rdi

    fffffa60`0f625b0b 4154            push    r12

    fffffa60`0f625b0d 4883ec30        sub     rsp,30h

    fffffa60`0f625b11 33ff            xor     edi,edi

    1: kd> u

    rdbss!RxWaitForStableCondition+0x1b:

    fffffa60`0f625b13 833a03          cmp     dword ptr [rdx],3

    fffffa60`0f625b16 418af1          mov     sil,r9b

    fffffa60`0f625b19 4d8be0          mov     r12,r8

    fffffa60`0f625b1c 488bea          mov     rbp,rdx

    fffffa60`0f625b1f 488bd9          mov     rbx,rcx     <--------- RX_CONTEXT is in RBX register

    fffffa60`0f625b22 7c07            jl      rdbss!RxWaitForStableCondition+0x33 (fffffa60`0f625b2b)

    fffffa60`0f625b24 33c0            xor     eax,eax

    fffffa60`0f625b26 e968010000      jmp     rdbss!RxWaitForStableCondition+0x19b (fffffa60`0f625c93)

     

    1: kd> !pool fffffa804f4de010 2           <---- [Argument 1] Here is pointer to RX_CONTEXT Structure

    Pool page fffffa804f4de010 region is Nonpaged pool

    *fffffa804f4de000 size:  670 previous size:    0  (Allocated) *RxIr

                Pooltag RxIr : RDBSS RxContext, Binary : rdbss.sys

     

    1: kd> !pool fffffa805113a490 2           <---- [Argument 2] Here is the Pointer to the Reason for Wait

    Pool page fffffa805113a490 region is Nonpaged pool

    *fffffa805113a3b0 size:  280 previous size:   10  (Allocated) *RxSc

                Pooltag RxSc : RDBSS SrvCall, Binary : rdbss.sys

     

    1: kd> dc fffffa805113a490 L1

    fffffa80`5113a490  00000001               <---------------- Transport is in Transition state

     

    We have bunch of RX_CONTEXT describing the IRPs pending on redirector for the transport of the remote connection to stabilize.

     

    Next obvious thought that came to my mind was to investigate the list where we store the IRPs which have been issued by the redirector for the transport driver to process and understand the state of those IRPs and understand why we are not progressing further? RxCeSend is the API which submits the IRPs to the Transport driver which is maintained in the list below. For more information about Redirector related APIs you can refer Connection Engine Management.

     

    1: kd> x rdbss!RxIrpsList

    fffffa60`0f620110 rdbss!RxIrpsList = <no type information>

     

    The rdbss!RxIrpsList is a pointer to the structure of type LIST_ENTRY.

     

    1: kd> dq fffffa60`0f620110 L2

    fffffa60`0f620110  fffffa80`510fd0d0 fffffa80`4edb5590

     

    1: kd> dq fffffa80`510fd0d0 L2            <---------------- LIST_ENTRY.FLINK 

    fffffa80`510fd0d0  fffffa80`50d2e300 fffffa60`0f620110

     

    1: kd> dq fffffa80`4edb5590 L2            <---------------- LIST_ENTRY.BLINK

    fffffa80`4edb5590  fffffa60`0f620110 fffffa80`4ff450d0

     

    This being a x64 windows each entry in this list is 0x30 bytes, this can change with OS version or driver updates, so don’t rely on this list or any of the offset that I referenced here in your driver code. Entry in the first column of this list is a pointer to the list entry. At offset 10 we store the pointer to the IRP. Add 0x10 bytes to get the IRP.

     

    1: kd> dl 0xfffffa80`510fd0d0

    fffffa80`510fd0d0  fffffa80`50d2e300 fffffa60`0f620110

    fffffa80`510fd0e0  fffffa80`7274e5a0 00000000`00000000

    fffffa80`50d2e300  fffffa80`50ee4490 fffffa80`510fd0d0

    fffffa80`50d2e310  fffffa80`50f16160 00000000`00000000

    fffffa80`50ee4490  fffffa80`727f1140 fffffa80`50d2e300

    fffffa80`50ee44a0  fffffa80`51491760 00000000`00000000

    fffffa80`727f1140  fffffa80`73a95860 fffffa80`50ee4490

    fffffa80`727f1150  fffffa80`511b9c60 00000000`00000000

    fffffa80`73a95860  fffffa80`73a8e3e0 fffffa80`727f1140

    fffffa80`73a95870  fffffa80`73a77c60 00000000`00000000

    fffffa80`73a8e3e0  fffffa80`73a9c290 fffffa80`73a95860

    fffffa80`73a8e3f0  fffffa80`73a46a30 00000000`00000000

    fffffa80`73a9c290  fffffa80`725ca490 fffffa80`73a8e3e0

    <Snip>

     

    1: kd> !irp poi(fffffa80`510fd0d0+10)

    Irp is active with 2 stacks 2 is current (= 0xfffffa807274e6b8)

     No Mdl: No System Buffer: Thread 00000000:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

    >[  f, 3]   0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa601030df70-fffffa805119fda0 Success Error Cancel pending

                 \Driver\Smb      mrxsmb!RxTdiAsynchronousConnectCompletion

                      Args: 00000000 fffffa8050d127e0 00000000 fffffa601031b6f0

     

    1: kd> !irp poi(fffffa80`50d2e300+10)

    Irp is active with 2 stacks 2 is current (= 0xfffffa8050f16278)

     No Mdl: System buffer=fffffa8050d59c88: Thread 00000000:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

    >[  e, 0]   0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa60103063a4-fffffa8050d59c60 Success Error Cancel pending

                 \Driver\Smb      mrxsmb!RxTdiAsynchronousRequestCompletion

                      Args: 00000000 00000030 0021c0e8 00000000

     

    1: kd> !irp poi(fffffa80`50ee4490+10)

    Irp is active with 2 stacks 2 is current (= 0xfffffa8051491878)

     No Mdl: System buffer=fffffa805115dcb8: Thread 00000000:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                      Args: 00000000 00000000 00000000 00000000

    >[  e, 0]   0 e1 fffffa804b1fd8d0 fffffa8050f012f0 fffffa60103063a4-fffffa805115dc90 Success Error Cancel pending

                 \Driver\Smb      mrxsmb!RxTdiAsynchronousRequestCompletion

                      Args: 00000000 00000030 0021c0e8 00000000

     

    All these IRPs are waiting on SMB Transport Driver (responsible for handling TCP/IP port 445 related traffic) for further processing. The obvious feeling was to find out if there are threads which are making call to SMB function to see what they are they doing?

     

    Note: In Windows 2008 R2 and Windows 7 onwards redirector doesn’t use SMB as the transport driver directly.

     

    I searched through the dump but I couldn’t find a single thread which was calling any of the SMB function. This is the point where it gets more interesting. Why we don’t have any thread which is calling SMB function when we have bunch of IRPs pending on that driver?

     

    This was the time for me to collaborate with the Network Stack Guru Aman Jain to get some insight to figure out the answer to my question.

     

    It turned out that SMB driver will create a work item and it will use the System Delayed Worker Queue to queue the work item that it created and will populate the routine in the work Item structure which is supposed to process the work item once the worker thread picks up the work item.

     

    1: kd> x nt!ExWorkerQueue

    fffff800`01ff08a0 nt!ExWorkerQueue = <no type information>

     

    Above pointer is the list head for the 3 queues that system has namely the “Critical WorkQueue” , “Delayed WorkQueue” and the “HyperCritical WorkQueue” as see in the output of !exqueue command in the debugger. Each entry in this list is 0x58 bytes long and is of type _KQUEUE.

     

    Structure referenced below decides how a various work queues are enumerated. This structure is documented in this MSDN article.

     

    typedef enum _WORK_QUEUE_TYPE {

      CriticalWorkQueue        = 0,

      DelayedWorkQueue         = 1,

      HyperCriticalWorkQueue   = 2,

      MaximumWorkQueue         = 3

    } WORK_QUEUE_TYPE;

     

    Critical WorkQueue

    1: kd> dt nt!_KQUEUE fffff800`01ff08a0   

       +0x000 Header           : _DISPATCHER_HEADER

       +0x018 EntryListHead    : _LIST_ENTRY [ 0xfffff800`01ff08b8 - 0xfffff800`01ff08b8 ]

       +0x028 CurrentCount     : 0

       +0x02c MaximumCount     : 0x10

       +0x030 ThreadListHead   : _LIST_ENTRY [ 0xfffffa80`30a4da08 - 0xfffffa80`30a39d68 ]

     

    Delayed WorkQueue

    1: kd> dt nt!_KQUEUE fffff800`01ff08a0+58

       +0x000 Header           : _DISPATCHER_HEADER

       +0x018 EntryListHead    : _LIST_ENTRY [ 0xfffffa80`50226f00 - 0xfffffa80`5183f9f0 ]

       +0x028 CurrentCount     : 0

       +0x02c MaximumCount     : 0x10

       +0x030 ThreadListHead   : _LIST_ENTRY [ 0xfffffa80`30a398d8 - 0xfffffa80`51157d68 ]

     

    HyperCritical WorkQueue

    1: kd> dt nt!_KQUEUE fffff800`01ff08a0+58+58

       +0x000 Header           : _DISPATCHER_HEADER

       +0x018 EntryListHead    : _LIST_ENTRY [ 0xfffff800`01ff0968 - 0xfffff800`01ff0968 ]

       +0x028 CurrentCount     : 0

       +0x02c MaximumCount     : 0x10

       +0x030 ThreadListHead   : _LIST_ENTRY [ 0xfffffa80`30a4e1f8 - 0xfffffa80`30a4e1f8 ]

     

    1: kd> dd fffff80001ff08a0+58+40 L1

    fffff800`01ff0938  00000010 <----------- We created 0x10 (16) dynamic threads

     

    We create additional dynamic threads as in the event when the default number of threads in a particular work queue is unable to handle the Work Items being queued. In this case we created 16 additional threads, which is the maximum number of dynamic thread which can be created.

     

    1: kd> dd fffff80001ff08a0+58+4c L1

    fffff800`01ff0944  00004667

     

    1: kd> ? 00004667

    Evaluate expression: 18023 = 00000000`00004667  <-------- 18,023 work Items pending for processing

     

    I could see that there are work items pending in the Delayed WorkQueue because of the following relation which is not present in any other WorkQueue that we dumped.

     

    nt!_KQUEUE->EntryListHead.Flink != nt!_QUEUE->EntryListHead.Blink

     

    For details about Work Item Processing and Work Items you can refer following docs:

    System Worker Threads

    IO_WORKITEM

    IoSizeofWorkItem

     

    The next obvious steps involved dumping all work items pending in the list to see if I have work items related to SMB driver? And if the Work Items are present for SMB then why they haven’t been picked up by the worker thread?

     

    1: kd> !list "-t nt!_LIST_ENTRY.Flink -e -x\"dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1\" fffffa80`50226f00"

     

    <Snip>

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`4b7c7e30  fffffa80`50a720e0

    fffffa80`4b7c7e50  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`72544e00  fffffa80`31aeb580

    fffffa80`72544e20  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`72759d50  fffffa80`51184690

    fffffa80`72759d70  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`73b416e0  fffffa80`5143c9c0

    fffffa80`73b41700  fffffa60`0f5002e4 smb!SmbDelayedStartTcpSession

     

    <snip>

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`518392e0  fffffa80`31fafa60

    fffffa80`51839300  fffffa60`00becb58 mpio!MPIOPathVerify

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`31fafa60  fffffa80`31fafab0

    fffffa80`31fafa80  fffffa60`00becb58 mpio!MPIOPathVerify

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`31fafab0  fffffa80`31fafc30

    fffffa80`31fafad0  fffffa60`00becb58 mpio!MPIOPathVerify

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`31fafc30  fffffa80`5183f9a0

    fffffa80`31fafc50  fffffa60`00becb58 mpio!MPIOPathVerify

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`5183f9a0  fffffa80`4ea092e0

    fffffa80`5183f9c0  fffffa60`00becb58 mpio!MPIOPathVerify

     

    dq poi(@$extret) L1; dps poi(@$extret)+0x20 L1

    fffffa80`4ea092e0  fffffa80`5183f9f0

    fffffa80`4ea09300  fffffa60`00becb58 mpio!MPIOPathVerify

     

    Voila! Here are my SMB work items in the Delayed WorkQueue waiting to be processed. Another interesting observation was that this queue had around 18,023 work items in this queue waiting to be picked up by the worker threads. Most of the work items were related to MPIO. Investigation of the worker threads in the Delayed WorkQueue revealed that the worker threads were blocked processing MPIO work item on the DSM driver for the storage.

     

    I picked one of the worker entry from the _KQUEUE->ThreadListHead for the Delayed WorkQueue and validated that it is a thread pool allocation with !pool. As the size of _ETHREAD structure is 0x450 bytes and the pool allocation was for 0x490 bytes, this indicates that the pointer to the actual thread object would be 0x40 bytes into that pool allocation from the beginning of that pool block.

     

    1: kd> !pool 0xfffffa80`30a398d8  2

    Pool page fffffa8030a398d8 region is Nonpaged pool

    *fffffa8030a396e0 size:  490 previous size:   90  (Allocated) *Thre (Protected)

                Pooltag Thre : Thread objects, Binary : nt!ps

     

    1: kd> ?? sizeof(nt!_ETHREAD)

    unsigned int64 0x450

     

    1: kd> !thread fffffa8030a396e0+40

    THREAD fffffa8030a39720  Cid 0004.0044  Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (Executive) KernelMode Non-Alertable

        fffffa8031e8fe58  SynchronizationEvent

    Not impersonating

    DeviceMap                 fffff880000073a0

    Owning Process            fffffa80309ca110       Image:         System

    Attached Process          N/A            Image:         N/A

    Wait Start TickCount      2015907        Ticks: 2178462 (0:09:26:24.225)

    Context Switch Count      42461            

    UserTime                  00:00:00.000

    KernelTime                00:00:14.976

    Win32 Start Address nt!ExpWorkerThread (0xfffff80001ec07c8)

    Stack Init fffffa600232bdb0 Current fffffa600232b800

    Base fffffa600232c000 Limit fffffa6002326000 Call 0

    Priority 15 BasePriority 12 PriorityDecrement 0 IoPriority 2 PagePriority 5

    Child-SP          RetAddr           Call Site

    fffffa60`0232b840 fffff800`01ebb6fa nt!KiSwapContext+0x7f

    fffffa60`0232b980 fffff800`01eb035b nt!KiSwapThread+0x13a

    fffffa60`0232b9f0 fffffa60`00c9c698 nt!KeWaitForSingleObject+0x2cb

    fffffa60`0232ba80 fffffa60`00c97a06 MyBadDSM+0xb698

    fffffa60`0232bb10 fffffa60`00becca3 MyBadDSM+0x6a06

    fffffa60`0232bb60 fffff800`02119693 mpio!MPIOPathVerify+0x14b

    fffffa60`0232bcc0 fffff800`01ec08c3 nt!IopProcessWorkItem+0x27

    fffffa60`0232bcf0 fffff800`020c3ee7 nt!ExpWorkerThread+0xfb

    fffffa60`0232bd50 fffff800`01ef6616 nt!PspSystemThreadStartup+0x57

    fffffa60`0232bd80 00000000`00000000 nt!KxStartSystemThread+0x16

     

    At this point it was evident that something has gone wrong with either the storage or the HBA adapter, Inspection of the MyBadDSM.sys code at this point could reveal if it was the problem with the driver or something has gone wrong with the HBA or the Fiber connection to the storage.

     

    This brings me to the end of this post. I hope you enjoyed reading this post. Goodbye. Until next time!

  • Ntdebugging Blog

    Expanding Netmon to aid in debugging

    • 2 Comments

    A walk-through of creating a Netmon parser in the context of a real case

     

    As is obvious to frequent readers of our blog, our team logs a lot of time in our debugger of choice (for some windbg, for others kd).  However, a debugger is not always the best tool for the job, and sometimes even the best tool has limitations. I found this to especially true when working a few Internet Printing Protocol (IPP) cases recently.

     

    Probably the biggest challenge of many IPP cases is the mixed environments you usually find IPP running in.  The benefit customers see in IPP over other print providers is it works natively or with minimal configuration on Windows, Mac, and Linux clients.  This makes it popular in places like college computer labs, where there isn’t one standard client system.   Unfortunately, this also means that we can’t really debug both sides of the communication when something goes wrong.

     

    In a recent case, a customer was having problems printing from their Windows Vista clients to a Linux Common Unix Printing System (CUPS) IPP server.  If the CUPS server was set to allow anonymous connections, everything worked perfectly. When the administrator enabled authentication, he found that most print jobs failed to print.  After a bit more testing, he found that small jobs (perhaps a page of printed text) worked fine, but larger, multi-page documents failed.

     

    For situations like this, I prefer to use a network trace to get a feeling for where the problem is occurring.  The problem was – IPP wasn’t one of the protocols built in to Netmon (and I find Wireshark’s IPP parser to not always work well – especially with Windows clients/servers).  I decided that the amount of time it would take to decode the traffic by hand could be better spent creating a Netmon IPP parser that I could use every time I ran into one of these issues.

     

    One of the great things about Netmon is you can view the source of every included parser.  This was a big help, as I hadn’t written a parser before.  [Note: all steps noted are written using Netmon 3.4.There might be slight differences in Netmon 3.3.]  To do this, open Netmon and click the Parsers tab. Under Object View, expand parser files and double click any of the .npl files.  The source will appear on the right.

     

    The language for Netmon parsers is similar to C++, with a limited set of statements.  These are all documented in the Netmon help file, but the ones I found useful are described below.  To begin, I started by defining a number of tables.  The basic idea of a table is to provide a way to convert a value to a string.  For example, one field in an IPP packet is the status of a printer, which is represented by an integer.  In order to allow Netmon to show printer states in a readable form, I created a table to convert the values as seen in Figure 1 below.

     

    Table IPPPrinterState //2911-4.4.11

    {

          switch(value)

          {

                case 3 : "idle";

                case 4 : "processing";

                case 5 : "stopped";

                default : "Unknown Code";

          }

    }

    Figure 1: Netmon Table

     

    Each table is defined with the Table keyword, followed by a name for the table.  It may optionally be followed by a list of parameters, which I’ll use later.  In this case, I added a comment that specified which RFC and section this information comes from.  A table consists of a switch statement with a case for each value, and a default for all other cases, much like other programming languages.  I created tables like IPPPrinterState for each field that could be represented in an IPP packet from information I found in each of IPP’s RFCs.

     

    Once the tables were complete, I moved on to creating the Protocol portion of the parser.  This section of the code provides the logic that iterates through a packet and calls the tables for the appropriate data.  This section starts with either the RegisterBefore or RegisterAfter keyword.  These are used to determine how your parser is called.  Essentially, Netmon takes all of the parsers it has, and compiles them into one continuous binary (.npb) and registration tells Netmon where your parser fits.  For my case, I used the following registration code.

     

    [ RegisterAfter (HTTPPayloadData.OCSP, Ipp, Property.HTTPContentType.Contains("application/ipp")) ]

     

    This tells Netmon that, when compiling the parser, it should insert my code right after the code for the OCSP protocol in its HTTPPayloadData parser, my protocol should be called IPP, and it should enter my code path if the HTTP Payload is of content type “application/ipp”.  This allows my parser to work a bit differently than the Wireshark IPP parser – Wireshark uses a port number (631) to identify IPP traffic, whereas my code looks at HTTP content types.  The advantage of this, for me, is that Windows servers use port 80 for IPP by default, not 631, so in cases with a Windows IPP server, this parser should correctly identify the packets.  You may be wondering how or why I chose to register after OCSP.  Basically, I knew I needed my code to be registered in the section of code where HTTP does its payload content type processing.  So I opened up HTTP’s parser, and searched for the content type analysis. OCSP was the first protocol I found in HTTP’s content type logic, so I used that as the place to insert my protocol.

     

    After the registration comes the Protocol statement. I chose the following.

     

    Protocol IPP = FormatString("Status/OpCode = %s", IPPOperationsSupported(OpID))

     

    This names my protocol IPP and specifies that I want the description of the protocol to display the IPP status code.  This way, a user doesn’t need to drill down to find out if this is a print job or a printer status request.  You’ll notice FormatString is a function in Netmon that is similar to printf.  In this case, I am passing a variable (OpID, which is defined lower in my code) to my IPPOperationsSupported table to determine what this OpCode means.  Before I had a parser, I would need to look up the operations supported values in the IPP RFC for each packet I examined.

     

    Next is the body of the protocol.  Basically, this consists of a series of fields (like variables) that define how a packet is laid out.  Creating a field is similar to declaring a variable in C++. You start by choosing a data type that matches the size of the data in the packet and provide a name for that field.  For example, Figure 2 shown below contains the first seven lines of my Protocol.

     

    struct Version = FormatString("%i.%i", Major, Minor)

    {

          INT8 Major;

          INT8 Minor;

    }

    INT16 OpId = FormatString("%s (0x%X)", IPPOperationsSupported(this), this);

    INT32 RequestId;

    Figure 2: Code in the protocol block

     

    The IPP specification states that all packets begin with two 8-bit values, the first value specifies the major protocol version in use, and the second value specifies the minor.  In this case, I wrapped both in a struct so Netmon will display them as “Version: 1.0”, instead of separately as “Major: 1” “Minor: 0” on two lines.  After the version is a 16-bit field that specifies the operation requested (for example, print-job or get-printer-state).  I choose to display this value by looking it up in the IPPOperationsSupported table, then printing it as the string, followed by the hex value (e.g. “Get-Printer-Attributes (0xB)”).  The ‘this’ keyword simply uses the value of the current field, which in this case is the OpId.  Even though Netmon parses through the packet sequentially, this kind of use of a Field before its value is retrieved is allowed.  Finally, I set the RequestId field, which is a 32-bit int value.  Since this field is just a transaction ID for this conversation, I don’t need to do any formatting to it.

     

    After that, things got a little more complicated.  IPP’s structure allows for a variable number of attribute groups, each of which can contain a variable number of attributes.  For example, in response to the request “Get-Printer-Attributes” from the client, the server responds with the Printer Attributes group, which contains a number of attributes like printer-state, queued-job-count, and so on.  First, I needed to deal with the attribute groups in a loop until I’d read each one.  IPP specifies that the end of the attribute groups is specified with the value of 0x03, so I wrote a while loop to create attribute groups until FrameData[FrameOffset] is equal to 3 (See Figure 3 below).  FrameData and FrameOffset are special values provided by Netmon.  FrameData is an array of the entire contents of the frame, and FrameOffset is Netmon’s current location in the packet.  I use this instead of declaring a field here because referencing FrameData[FrameOffset] does not advance the parser frame location.  This is important because I want to consume that value further down.

     

    Inside that loop, I declared another struct that contains an attribute group.  Much like the Protocol IPP line above, we reference a field here that will be declared lower down.  This line does not advance the FrameOffset, since we don’t declare a field here.  The first line of this struct is the field declaration line that finally consumes the attribute group tag.  Below that is another While loop to process all attributes in the attribute group.  IPP differentiates between attributes and attribute groups by making all attribute group identifiers smaller than 0x10, and all attribute identifiers 0x10 or higher.  I use this as the condition for my loop.  Finally, I declare an Attribute struct inside this loop.  This struct is displayed after looking up how to properly print based on the Attribute Name and Value in the AttribDisplayTable.

     

    IPP declares attributes as an 8-bit type identifier (int, bool, string, etc.), a 16-bit int specifying the attribute name’s length, the name (a string), a 16-bit in value length, and a value.  Since I want to look up the value in various tables, depending on the Attribute Name, I store the Attribute Name as “AttName” in a property.  This way, I can continue to reference it while processing continues.  Properties are declared in brackets just above the field they will store.  In my case, I prepend the ‘Post.’ evaluation keyword to the property name.  This instructs Netmon to use the end result of the next line as its value, but before advancing the FrameOffset.  I do this again for the actual value, which I call Val.  If I did not use the Post evaluation keyword, Val would contain the unsigned int32 value of printer state, instead of the formatted string result I get by looking up printer state in its table.

     

    While [FrameData[FrameOffset] != 0x03]

    {

          struct AttributeGroup = FormatString("%s", IPPTags(TagGroup))

          {

                INT8 TagGroup = FormatString("%s (0x%X)", IPPTags(this), this);

                While [FrameData[FrameOffset] >= 0x10]

                {

                      struct Attribute = AttribDisplayTable(AttName, Val)

                      {

                            INT8 Type = FormatString("%s (0x%X)", IPPTags(this), this);

                            INT16 NameLength;

                            [Post.AttName]

                            AsciiString(NameLength) AttributeName;   

                            INT16 ValueLength;

                            switch(AttName)

                            {

                                  case "printer-state" :

                                  [Post.Val]

                                  UINT32 PrinterState = FormatString("%s (0x%X)", IPPPrinterState(this),this);

                                  ...

    Figure 3: Loops in protocol block

     

    My case statements continue like printer-state for all possible attributes of IPP.  At the very end of the protocol block, after I’ve closed my switch, structs, and whiles, I have one more line, which consumes any data remaining in the packet.  This would contain document data if the packet was a print job, and is required so all the packet data is consumed before Netmon moves on to the next packet.  That line is:

     

    BLOB (FrameLength - FrameOffset) Data;

     

    As you can see, it is a binary blob data type, set to the size of the frame, less our current location.

     

    Finally, after my Protocol block, I needed to define my own data type.  IPP defines its own data format to specify printer uptime, so I created a data type for it as shown below in Figure 4.

     

    //Uptime format spec

    Number UPTIME

    {

          Size = 4;

          DisplayFormat = (this != 0) ? FormatString("%i days %02i:%02i:%02i (%i seconds)",

                (this/86400), //Days

                (this-((this/86400)*86400))/3600, //Hours

                (this-(((this/86400)*86400)+(((this-((this/86400)*86400))/3600)*3600)))/60, //Minutes

                (this%60), //Seconds

                this) : "0"

    }

    Figure 4: Custom data type

     

    The first line of Figure 4 specifies this will be a data type composed of numeric data named UPTIME.  Size specifies how many bytes the type uses.  DisplayFormat is what Netmon displays for this type. In this case, I use the x ? y : z syntax.  Netmon doesn’t have if/then/else keywords, but instead uses this ternary operator.  I use a special case for 0 since it seems to be a common return value in the traces I’ve looked at, and having ‘Uptime: 0 days 00:00:00 (0 seconds)’ seemed excessive.

     

    Figures 5 and 6 below show what the result looks like in Netmon.

    Figure5

    Figure 5: Frame Summary

     

    Figure6

    Figure 6: Frame Detail

     

    So what did the trace show?  Windows attempts to send IPP requests with no authentication first, then if it receives an access denied, retries with authentication.  This is by design, as the IPP server replies with the authentication types it supports in the access denied message.  In the case of print jobs that are too large to fit in a single packet, IPP’s spec allows servers to either issue the access denied message as soon as it receives the first packet, or after it has received the entire job.  It turns out that the IPP Print Provider on Windows was designed to send the entire job before listening for a response, so it missed the access denied message that CUPS sent after it received the first packet.  See http://support.microsoft.com/kb/976988/ for related information.  Want a copy of the IPP parser? It will be included in a future release of the Netmon Parser Update.

     

    I hope this post have given you a better idea of how Netmon works, how IPP works, and helps if you ever need to write a parser for your protocol.

    -Matt Burrough

Page 8 of 24 (238 items) «678910»