• Ntdebugging Blog

    Transcript of Windows NT Debugging Blog Live Chat

    • 3 Comments

    For those of you that could not make the live chat on 8/13, here is the transcript of the chat session....

     

    Chat Topic: PGES-Windows NT Debugging Blog Live Chat
    Date: Wednesday, August 13, 2008

    Daniel (Moderator):
    Hello everyone-- thanks for coming to our chat on Platforms Global Escalation Services. The chat will officially get started at 1pm Eastern time. Only questions related to this topic will be addressed during this chat. Thanks!

    Daniel (Moderator):
    Hello everyone-- thanks for coming to our chat on Platforms Global Escalation Services. We'll get started in about 10 minutes.  You can start posting your questions now if you'd like and when the chat starts our Experts will begin answering them. Be sure to check the "Ask the Experts" box before you send your questions and please keep all questions on topic-- Thanks!

    Daniel (Moderator):

    Let's get started with our chat. Before we begin, though, I'd like to have our Experts introduce themselves and then they'll get started answering  your questions.

    Smoke [Windows Core] (Expert):
    Hi everyone, I'm an Escalation Engineer with the Window’s Core team.  I fix bugs for a living.

    Matthew [MSFT EE] (Expert):
    Hello, I am an Escalation Engineer with the Platforms Global Escalation Services (Windows Core) team.

    East - MSFT EE (Expert):
    I am East, an Escalation Engineer with the Microsoft Platforms Global Escalation Services. (Windows Core)



    Todd Webb - Msft (Expert):
    I am an Escalation Engineer with the Microsoft Platforms Global Escalation Services OEM hardware team...

    David (Expert):
    Hi, I'm an Escalation Engineer with Windows Core - reading code & debugging is my day-to-day.

    stheller (Expert):
    Hi, I'm a new Escalation Engineer with Platforms GES.

    Mr Ninja [MSFT EE] (Expert):
    Hi, I am an Escalation Engineer with Microsoft PGES.  I debug Windows for a living.

    Tate [MSFT EE] (Expert):
    Hi, I’m one of the EE’s on the Windows team.



    Jeff Dailey MSFT EE (Expert):
    Hi, my name is Jeff Dailey, I’m a Senior Escalation Engineer on the Microsoft Platforms Global Escalation Services team.



    Smoke [Windows Core] (Expert):
    Q: How can I track memory allocations through MmAllocateContiguousMemory?
    A: You could try poolhittag on MMCM or a breakpoint on MmAllocateContiguousMemory.  If you go with the break point, you can use a conditional breakpoint and dump the stack and anything else, then 'go' the system.  There will be a perf hit each time you break in.

    Tate [MSFT EE] (Expert):
    Q: For MmAllocatecontiguousMemory, will !poolused show the total amount used?
    A: !poolused 2 will show MmCm

    Matthew [MSFT EE] (Expert):
    Q: What's the best way to go about troubleshooting pool corruption dumps.
    A: Special Pool can be used to track down pool corruption problems.  http://msdn.microsoft.com/en-us/library/cc265889.aspx

    a-hstein (Expert):
    Greetings and sorry for the late message.  I am an intern in the GES group.

    Mr Ninja [MSFT EE] (Expert):
    Q: Could you explain the reasons why a memory dump analysis show an "illegal instruction" exception raised from a valid instruction?
    A: There are many reasons this could happen.  The instruction that was executed may not be what you see due to hardware problems such as a bit flip in the instruction when it was executed.  It is also possible for a hardware problem caused an exception to be raised on a valid instruction.  Sometimes software, or hardware, may trigger a jump to the middle of an instruction so that the instruction being executed is not what you think it is.I described a problem where we executed from the middle of an instruction in the blog http://blogs.msdn.com/ntdebugging/archive/2008/04/28/ntdebugging-puzzler-0x00000004-this-didn-t-puzzle-the-debug-ninja-how-about-you.aspx.

    Smoke [Windows Core] (Expert):
    Q: We use APC's to perform certain operations one of them is to have thread cleanup and exit.Is calling thread exit from an APC recommended ? This used to work fine, but with newer service packs we have threads exiting holding the heaplock!
    A: This sounds like a bad idea.  I would expect different ways that this could break (just like you have observed).

    David (Expert):
    Q: We use APC's to perform certain operations one of them is to have thread cleanup and exit.Is calling thread exit from an APC recommended ? This used to work fine, but with newer service packs we have threads exiting holding the heaplock!
    A: Part of the problem is that if ExitThread is called, any pending APCs on that thread's queue are lost.

    Matthew [MSFT EE] (Expert):
    Q: This question is in reference to special pool mentioned already. Is this article essentially the same as the MSDN reference?  http://support.microsoft.com/kb/188831/en-us
    A: The KB article documents enabling special pool via the registry, rather than verifier.  These are two different ways to accomplish the same thing.  Enabling it via the registry is sometimes preferred, since verifier enables additional checks beyond special pool.

    East - MSFT EE (Expert):
    Q: We use APC's to perform certain operations one of them is to have thread cleanup and exit.Is calling thread exit from an APC recommended ? This used to work fine, but with newer service packs we have threads exiting holding the heaplock!
    A: Would this help KB254956

     - If not we would need to follow-up with you for more information

    East - MSFT EE (Expert):
    Are there anything additional you want on the blog that we have not done?



    Jeff Dailey MSFT EE (Expert):
    Q: The final version of the Windows Internals Exam should be available before December 2008.  I’d like to thank all the community members that participated in the Beta.  Your feedback was very valuable.



    East - MSFT EE (Expert):
    Q: Are there anything additional you want on the blog that we have not done?



    Jeff Dailey MSFT EE (Expert):
    Q: When is the next Windows Internals exam scheduled? I would like to plan ahead.
    A: The final version of the Windows Internals Exam should be available before December 2008.  I’d like to thank all the community members that participated in the Beta.  Your feedback was very valuable.



    Matthew [MSFT EE] (Expert):
    Q: Will we get more puzzler on the blog?
    A: We’d like to do more puzzlers, but unfortunately they tend to take a lot of time, so I cannot say for sure when/if we’ll have more.



    Matthew [MSFT EE] (Expert):
    Q: How many of you in the audience are interested in more puzzlers on the ntdebugging blog?

    Smoke [Windows Core] (Expert):
    Q: Are you planning to write a book?
    A: Windows Internals is a great reference book that we all rely upon.  Additionally, you can check out: <http://www.amazon.com/Advanced-Debugging-Addison-Wesley-Microsoft-Technology/dp/0321374460>



    Tate [MSFT EE] (Expert):
    Q: As far as the blog is concerned I'm more a fan of the case studies type posts where you go through how you troubleshooted issues that you have enountered.
    A: So are we!!!

    Smoke [Windows Core] (Expert):
    Q: I'm very interested in puzzlers...
    A: Thanks for the feedback.  We will try to create some more in the future.

    Smoke [Windows Core] (Expert):
    Q: Debugging MPI apps - sometimes a crash happens on remote and the local smpd daemon will terminate the process being debugged. Using the debugger, is there a way to guard from TerminateProcess from the child? I guess that would break some security models.
    A: I'm not sure what MPI is, but this scenario sounds just like a service.  The service control manager will kill the service if it doesn't respond in a timely fashion.  With a service, there is a registry key to extend the timeout.  If such a mechanism isn't available for you, you should consider instrumentation/logging.

    East - MSFT EE (Expert):
    Q: I just skimmed over KB254956, we found APC to work. The issue here is that there are alertable waits in library modules like LSA/NDR/I_RPC calls where our APC fires which raises an user exception, gets handled and exits-thread exits holding the heap lock.
    A: We would need to discuss this further offline, how can I contact you?

    Matthew [MSFT EE] (Expert):
    Q: An award of puzzler like next edition of Windows Internals would definitely have my full attention. :)
    A: We'll consider it... thanks for the feedback!

    Jeff Dailey MSFT EE (Expert):
    Q: Have you ever found yourselves with an "unsolvable" case? :P
    A: No cases is unsolvable, nothing is truly random.  Some cases may take a very long time to resolve through multiple debugging passes, detailed code review, reverse engineering and multiple iterations of instrumentation.  In the end we find the problem.  



    Daniel (Moderator):

    Just a heads-up --we have about 15 minutes left in today's chat. Be sure to post your questions asap and our Experts will try to answer as many as possible before the chat ends. Thanks.

    Mr Ninja [MSFT EE] (Expert):
    Q: Tri-boot machine - XP, Server 2003 and Server 2000 with 2000 being the last one installed. After awhile, I got an error: "Windows 2000 could not start because the following file is missing or corrupt: \WINDOWS\SYSTEM32\CONFIG\SYSTEMd startup options for"..
    A: That is usually a known issue in Windows 2000 caused by the size of the system hive becoming too large.  We have several KB articles that describe this issue  KB269075, KB306038, KB323148, and KB277222 contain various resolutions you can try.  I have found that most often the steps in KB277222, using scrubber in a shutdown script, resolve this problem.  Starting with Windows 2003 we changed the boot architecture to prevent this problem, KB302594 describes this improvement.

    Tate [MSFT EE] (Expert):
    Q: Do you guys use USB debugging in Vista/2008? Why is that there is still one vendor that sells the debug dongle?
    A: Serial debugging works well enough most times.  Usually only if we have hardward that doesn't have a serial connection for some reason and only has USB or Firewire we try these alternates...

    East - MSFT EE (Expert):
    Q: I just skimmed over KB254956, we found APC to work. The issue here is that there are alertable waits in library modules like LSA/NDR/I_RPC calls where our APC fires which raises an user exception, gets handled and exits-thread exits holding the heap lock.
    A: On a better note it would be best to open a case with Microsoft Support - > <http://support.microsoft.com/> -> Need more help? -> Select a Product to start



    Jeff Dailey MSFT EE (Expert):
    Q: What companies are in attendance today?



    Graham (Expert):
    Q: There are lots of post mortum debuggers available, Dr Watson, NTSD, windbg, userdump, WER. Which ones do you usually recrommend your customers to use if you need to be sure to capture a dump from a crash?
    A: Userdump.exe is quite reliable for obtaining post-mortem dumps, and is easy to use.  It (along with ADPlus, which uses CDB) are good because they attach to the process and monitor exceptions, and can create dumps for times when a JIT debugger would not be able to create a thread in the process to obtain the dump.  Normally, I will set up drwtsn32 first, and if it cannot generate the dump, then I will go to userdump. 

    Smoke [Windows Core] (Expert):
    Q: How can I debug cases in which just I have the Minidump for CPU Hog? I tried !runaway and does not works
    A: The minidump alone may not be enough information.  You could try to look at the stacks and guess at what is using the CPU, but that require familiarity with the application.  You should capture a circular perfmon log with thread data.  Then get 3-5 dumps of the app.  From the perfmon log, you'll see what threads are active (and their activity profile).  From the dumps, you'll have a few snapshot of the process in motion.  Alternatively you could try a profiler like xperf.

    David (Expert):
    Q: Are there any free code coverage tools on Windows?
    A: This article describes how to obtain code coverage data:

    David (Expert):
    A: http://msdn.microsoft.com/en-us/library/ms182496.aspx

    stheller (Expert):
    http://www.microsoft.com/whdc/devtools/tools/prefast.mspx discusses the PREfast static source code analysis tool

    East - MSFT EE (Expert):
    Q: Are there any free code coverage tools on Windows?
    A: Please keep watching our blog site for the next chat - <http://blogs.msdn.com/ntdebugging>  or you can submit the question to the our blog site

    Daniel (Moderator):
    Well we're out of time for today's chat. Thank you very much to all of our guests who joined us today as well as to our Experts for answering so  many great questions. Have a great day!

     

  • Ntdebugging Blog

    New Facebook group: “Escalation Engineers”

    • 1 Comments

    Are you the final tier of escalation at a company or group that supports software?
    Are you fluent in assembly, C, C++, etc?

    Are you the voice of reason in critical situations?

    Do you spend more time debugging other people’s code than writing your own?

     

     

    If you answered the above questions with “yes”, then this new Facebook group is for you. "Escalation Engineers" http://www.facebook.com/group.php?gid=23477747996

    Jeff-

  • Ntdebugging Blog

    How to Access the User Mode Debugger from the Kernel Debugger

    • 4 Comments

    In certain cases you may want to use a user mode debugger to debug a process from within the kernel debugger.    It could be that you have an application that loads a kernel mode driver, and you want to be able to debug the user mode aspect of the application and then break into the kernel to follow the calls made to kernel.

    Here is how you do it!

    ·         Attach the kernel debugger via a serial cable (Null modem cable), USB cable or FireWire cable, and have your machine configured to be kernel debugged. The article located at  http://support.microsoft.com/kb/151981  is a good reference for pre-Vista systems.  To enable the debug options on Vista or Windows 2008 you must use bcdedit.exe because those OSes no longer use a boot.ini file. Here’s an example:

     

    bcdedit /debug {<guid>} <ON | OFF>
    bcdedit /dbgsettings SERIAL DEBUGPORT:1 BAUDRATE:115200

     

    ·         Add a new debugger key to the “Image File Execution Options” for your process.  In this case we will use notepad.exe as the target process. The new key will look like this:

     

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\notepad.exe

     

    ·         Add a string value under this key called “debugger” that contains the value “ntsd –d”. Here’s a screen shot of the registry changes for reference.

     

    ·         The –d option redirects the output of NTSD to the kernel debugger allowing remote control via the kernel debugger.

     

    ·         With the existence of this new key, the user mode debugger will automatically start and attach to your process when Notepad.exe starts.  Note: It’s important to remove the registry entry when you’re finished debugging.

     

    ·         You can now issue any standard NTSD Command via the kernel debugger.

     

    ·         When you are ready to break into the kernel and run under the kernel debugger simply type .breakin

     

     

    Jeff- 

  • Ntdebugging Blog

    Windows NT Debugging Blog Live Chat

    • 1 Comments

    Microsoft Platform Global Escalation Services is hosting our first live group debug chat session for the debugging community on August 13, 2008 at 10 AM PT.  We will be focusing on debugging techniques and any questions you may have about anything we’ve previously blogged about.  Also, we will try to cover some topics that were requested here.

     

    Details about the “PGES-Windows NT Debugging Blog Live Chat” can be found here: http://www.microsoft.com/communities/chats/default.mspx

  • Ntdebugging Blog

    How can I find out why the Cluster Resource Monitor dumped – Access Violation

    • 4 Comments

    Hello, my name is John Marlin, and I am a Support Escalation Engineer on the Microsoft Platform Cluster Services Support team.  I wanted to talk about the Windows 2003 Cluster Resource Monitor along with what happens when it crashes as well as how to debug it to find out why it crashed.

     

    We need to first understand what the Cluster Resource Monitor is and does.  Below is taken from the Microsoft MSDN site describing the Cluster Resource Monitor.

     

    A Resource Monitor provides a communication, monitoring, and processing layer between the Cluster service and one or more resources. Resource Monitors have the following characteristics:

    ·         A Resource Monitor always runs in a process separate from the Cluster service. If a resource fails, the Resource Monitor isolates the Cluster service from the effects. If the Cluster service fails, the Resource Monitor allows its resources to shut down gracefully.

    ·         To work with a resource, a Resource Monitor loads the resource DLL responsible for that resource type into its process.

    ·         When the Cluster service requests an operation on a resource, the Resource Monitor routes the request to the appropriate entry point function of the resource DLL responsible for the resource. The Resource Monitor performs default processing for some resource operations.

    ·         A Resource Monitor stores synchronized state data, allowing the Cluster service and resource DLLs to operate asynchronously, checking and updating resource status as needed.

    ·         A Resource Monitor periodically checks the operational status of all of its resources. For more information on this process, see Resource Failure.

     

    By default, the Cluster service creates one Resource Monitor per node.

     

    As the article states, everything currently running on the node is in the one Resource Monitor.  If the Resource Monitor crashes, the system will dump the Resource Monitor Process to a file called RESRCMON.DMP, and create a new instance of the process.  Because it must create a new one, all resources in the monitor are gone and needs to be restarted.  When this occurs, you would see the following entry in the Windows System Event Log. 

     

    Event ID:  1146

    Source:  ClusSvc

    Description:  The cluster resource monitor died unexpectedly, an attempt will be made to restart it

     

    After this, you could also see other resource failures (Event ID: 1069) as well as disk related events such as Lost Delayed Writes, etc.  You would see the disk related events because the disk(s) would be considered down and since there is data in the cache of the HBA, it has nowhere to write it.  Hence, lost delayed writes exist until the disk is brought back online.  For our examples here, we will ignore these disk related events as we will focus on why the Resource Monitor crashed.

     

    There are a couple reasons why a Resource Monitor would crash such as an Access Violation (0xC0000005) or a Deadlock (0xC0000194).  For this blog, we will talk about the Access Violation (0xC0000005).  An Access Violation will occur because a resource tried to do something it wasn’t supposed to or it is having an issue starting up.

     

    Along with the above System Event (Event ID: 1146) where the Resource Monitor died, you will see this in the Cluster Log file. 

     

    NOTE:

    The Cluster Log will convert times to Greenwich Mean Time (GMT), so you must ensure you do the proper GMT conversion of time to get to the location in the Cluster Log.

     

    00001d6c.00001b60::2008/03/04-05:28:46.114 ERR  [RM] Exception. Code = 0xc0000005, Address = 0x781449D1

    00001d6c.00001b60::2008/03/04-05:28:46.114 ERR  [RM] Exception parameters: 0, 0, 1003f, 0

    00001d6c.00001b60::2008/03/04-05:28:46.114 INFO [RM] GenerateMemoryDump: Start memory dump to file C:\WINDOWS\Cluster\resrcmon.dmp

     

    Now that we see this entry in the log, we should take a look at the Resource Monitor dump to see what caused the failure.  The first thing to examine is the register states, specifically the ESP (stack pointer) value.

     

    0:023> r

    eax=01bf0000 ebx=000f7b88 ecx=00000007 edx=7c8285ec esi=000f7b60 edi=000f7bb8

    eip=7c8285ec esp=01aed598 ebp=01aed5a8 iopl=0         nv up ei pl zr na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

    ntdll!KiFastSystemCallRet:

    7c8285ec c3              ret

     

    Starting at the stack pointer address 01aed598, we use the dds command to dump the raw stack.  We are looking for the value on the stack just below the routine resrcmon!GenerateMemoryDump.  It will take several iterations of the dds command to finally get to the value because the call was made much earlier in the stack.

     

    0:023> dds 01aed598

    01aed598  00740061 xpsp2res.dll

     xpsp2res+0x100061

    01aed59c  00720075 xpsp2res+0xe0075

    01aed5a0  00730065 xpsp2res+0xf0065

    01aed5a4  00610000

    *** pages removed ***

    01aedddc  0026afd8

    01aedde0  0026af28

    01aedde4  01aee034                                        <<-- pointer to Exception address stack

    01aedde8  0100e638 resrcmon!GenerateMemoryDump+0x180

    01aeddec  ffffffff

    01aeddf0  00001d6c

    01aeddf4  00000018

     

    Now that we have our value, we will use the kv command with the value 01aee034 to dump out the stack contents.

     

    0:023> kv=01aee034

    ChildEBP RetAddr  Args to Child             

    01aed628 7c826d2b 77e63eb3 000004e0 00080178 ntdll!KiFastSystemCallRet (FPO: [0,0,0])

    01aed62c 77e63eb3 000004e0 00080178 6d5b5af6 ntdll!ZwClose+0xc (FPO: [1,0,0])

    01aee034 0100e989 01aee300 01003024 00000000 kernel32!CloseHandle+0x59 (FPO: [Non-Fpo])

    01aee04c 01008b2c 01aee300 01003024 01aee300 resrcmon!GenerateExceptionReport+0x7e (FPO: [Non-Fpo])

    01aee060 76348d17 01aee300 01aee300 01aee080 resrcmon!RmpExceptionFilter+0x14 (FPO: [Non-Fpo])

    01aee070 7786d6d2 01aee300 77ecb7c0 01aee2d8 netshell!__CxxUnhandledExceptionFilter+0x4a (FPO: [Non-Fpo])

    01aee080 77e761b7 01aee300 00000000 00000000 netman!__CxxUnhandledExceptionFilter+0x4a (FPO: [Non-Fpo])

    01aee2d8 77e792a3 01aee300 77e61ac1 01aee308 kernel32!UnhandledExceptionFilter+0x12a (FPO: [Non-Fpo])

    01aee2e0 77e61ac1 01aee308 00000000 01aee308 kernel32!BaseThreadStart+0x4a (FPO: [SEH])

    01aee308 7c828752 01aee3ec 01aeffdc 01aee408 kernel32!_except_handler3+0x61 (FPO: [Uses EBP] [3,0,7])

    01aee32c 7c828723 01aee3ec 01aeffdc 01aee408 ntdll!ExecuteHandler2+0x26

    01aee3d4 7c82855e 01ace000 01aee408 01aee3ec ntdll!ExecuteHandler+0x24

    01aee3d4 781449d1 01ace000 01aee408 01aee3ec ntdll!KiUserExceptionDispatcher+0xe (FPO: [2,0,0]) (CONTEXT @ 01aee408)

    01aee6d0 10006d11 00000000 00f16914 01aeff58 msvcr80!wcslen+0x4 (FPO: [Non-Fpo])

    WARNING: Stack unwind information not available. Following frames may be wrong.

    01aee6f8 10001364 100096e4 00000000 744eecf8 JohnApp!Startup+0x5851

    01aeff78 781329bb 0009a6a8 75b03b60 00000000 JohnApp+0x1364

    01aeffb0 78132a47 00000000 77e64829 00f15d30 msvcr80!_endthreadex+0x3b (FPO: [Non-Fpo])

    01aeffb8 77e64829 00f15d30 00000000 00000000 msvcr80!_endthreadex+0xc7 (FPO: [Non-Fpo])

    01aeffec 00000000 781329e1 00f15d30 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

     

    Based on the stack above, we have an exception at address 0x01aee300 which we will use to set the failing context.

     

    0:023> dc 0x01aee300

    01aee300  01aee3ec 01aee408 01aee32c 7c828752  ........,...R..|     <<-- Exception and Context Records

    01aee310  01aee3ec 01aeffdc 01aee408 01aee3c8  ................

    01aee320  01aeff6c 7c828766 01aeffdc 01aee3d4  l...f..|........

    01aee330  7c828723 01aee3ec 01aeffdc 01aee408  #..|............

    01aee340  01aee3c8 77e61a60 00000000 01aee3ec  ....`..w........

    01aee350  01aeffdc 7c8315c2 01aee3ec 01aeffdc  .......|........

    01aee360  01aee408 01aee3c8 77e61a60 00000000  ........`..w....

    01aee370  01aee3ec 00000000 00000000 00000000  ................

     

    The first DWORD is the Exception Record (0x01aee3ec) and the second DWORD is the Context Record (0x01aee408) that holds our true stack where the problem occurred.  From the Exception Record, we can see it is an Access Violation.

     

    0:023> .exr 0x01aee3ec

    ExceptionAddress: 781449d1 (msvcr80!wcslen+0x00000004)

       ExceptionCode: c0000005 (Access violation)

      ExceptionFlags: 00000000

    NumberParameters: 2

       Parameter[0]: 00000000

       Parameter[1]: 00000000

     

    So we need to jump into the saved context to get the thread that caused the Resource Monitor to crash.

     

    0:023> .cxr 0x01aee408

    eax=00000000 ebx=00f15d30 ecx=00000000 edx=00000000 esi=00000000 edi=00000000

    eip=781449d1 esp=01aee6d4 ebp=01aee6f8 iopl=0         nv up ei pl nz na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202

    msvcr80!wcslen+0x4:

    781449d1 668b08          mov     cx,word ptr [eax]        ds:0023:00000000=????

     

    0:023> kv

      *** Stack trace for last set context - .thread/.cxr resets it

    ChildEBP RetAddr  Args to Child             

    01aee6d0 10006d11 00000000 00f16914 01aeff58 msvcr80!wcslen+0x4 (FPO: [Non-Fpo])

    WARNING: Stack unwind information not available. Following frames may be wrong.

    01aee6f8 10001364 100096e4 00000000 744eecf8 JohnApp!Startup+0x5851

    01aeff78 781329bb 0009a6a8 75b03b60 00000000 JohnApp+0x1364

    01aeffb0 78132a47 00000000 77e64829 00f15d30 msvcr80!_endthreadex+0x3b (FPO: [Non-Fpo])

    01aeffb8 77e64829 00f15d30 00000000 00000000 msvcr80!_endthreadex+0xc7 (FPO: [Non-Fpo])

    01aeffec 00000000 781329e1 00f15d30 00000000 kernel32!BaseThreadStart+0x34 (FPO: [Non-Fpo])

     

    This stack reveals that JohnApp’s .DLL was in the process of a calling one of its threads to do something and caused this problem.  Now we can find out what the specific resource (in case there are multiple) that caused the problem.  In the case of an Access Violation dump, it is going to be a resource that failed or is in the process of coming online.  You can do a search through the threads using ~*kb to find a current resource trying to startup during a resrcmon termination.

     

    0:023> ~*kb

     

      10  Id: 1d6c.1394 Suspend: 0 Teb: 7fff4000 Unfrozen

    ChildEBP RetAddr  Args to Child             

    0179f78c 7c827d0b 77e61d1e 00000464 00000000 ntdll!KiFastSystemCallRet

    0179f790 77e61d1e 00000464 00000000 00000000 ntdll!NtWaitForSingleObject+0xc

    0179f800 77e61c8d 00000464 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac

    0179f814 10002728 00000464 ffffffff 77e61d48 kernel32!WaitForSingleObject+0x12

    WARNING: Stack unwind information not available. Following frames may be wrong.

    0179f838 0100a352 0009a6a8 0179f888 0100864d JohnApp!Startup+0x1268

    0179f844 0100864d 000a88a0 0179f8a0 0179fa94 resrcmon!Resmon_Terminate+0x14          <<-- Resource we want

    0179f888 01009c3b 000a88a0 00000000 0179f8a4 resrcmon!RmpOfflineResource+0x2f1       <<-- Resource we want

    0179f89c 77c80193 00000003 02460246 00000001 resrcmon!s_RmTerminateResource+0x13

    0179f8b4 77ce33e1 01009c28 0179fa98 00000001 rpcrt4!Invoke+0x30

    0179fcb4 77ce35c4 00000000 00000000 000ef734 rpcrt4!NdrStubCall2+0x299

    0179fcd0 77c7ff7a 000ef734 0008f980 000ef734 rpcrt4!NdrServerCall2+0x19

    0179fd04 77c8042d 0100c24c 000ef734 0179fdec rpcrt4!DispatchToStubInCNoAvrf+0x38

    0179fd58 77c80353 00000005 00000000 01011458 rpcrt4!RPC_INTERFACE::DispatchToStubWorker+0x11f

    0179fd7c 77c7e0d4 000ef734 00000000 01011458 rpcrt4!RPC_INTERFACE::DispatchToStub+0xa3

    0179fdbc 77c7e080 000ef734 000ef6ec 00000000 rpcrt4!RPC_INTERFACE::DispatchToStubWithObject+0xc0

    0179fdfc 77c812f0 0008fb38 00083498 000efcf8 rpcrt4!LRPC_SCALL::DealWithRequestMessage+0x41e

    0179fe20 77c88678 000834d0 0179fe38 0008fb38 rpcrt4!LRPC_ADDRESS::DealWithLRPCRequest+0x127

    0179ff84 77c88792 0179ffac 77c8872d 00083498 rpcrt4!LRPC_ADDRESS::ReceiveLotsaCalls+0x430

    0179ff8c 77c8872d 00083498 00000000 00000000 rpcrt4!RecvLotsaCallsWrapper+0xd

    0179ffac 77c7b110 00086890 0179ffec 77e64829 rpcrt4!BaseCachedThreadRoutine+0x9d

     

    With the above information, and focusing on the resrcmon!RmpOfflineResource frame, the parameter 0x000a88a0 is our resource.  Using the DC command, you can confirm that this is the resource.

     

    0:023> dc 000a88a0

    000a88a0  63727352 00000001 000a7d90 000a9290  Rsrc.....}......

    000a88b0  000b37f0 000ad130 000a8318 000ad0f0  .7..0........... 

    000a88c0  00001388 0000ea60 10000000 0009a6a8  ....`...........

    000a88d0  00000000 00000000 00000000 00000000  ................

    000a88e0  00000000 00000001 10001540 10001910  ........@.......

    000a88f0  10001a00 10002280 100026b0 10002810  ....."...&...(..

    000a8900  100027c0 00000000 00000000 100028a0  .'...........(..

    000a8910  100029e0 00000003 00000000 0000000c  .)..............

     

    The DWORDS at offset 10, 14, 18, and 1C will reveal the information to confirm the resource such as its GUID in the registry, the .DLL being used, etc.

     

    0:005> du 0x000ad0f0                            <<-- Resource Displayed in Cluster Administrator

    000ad0f0  "Johns Resource"

     

    0:005> du 0x000a8318                            <<-- GUID in registry (HKLM\Cluster\Resources)

    000a8318  "35a73cba-6096-485e-a227-d4a8d06f"

    000a8358  "680a"

     

    0:005> du 0x000ad130                            <<-- Resource Type (HKLM\Cluster\ResourceTypes)

    000ad130  "Johns Customer Resource Type"

     

    0:005> du 0x000b37f0                            <<-- Specific DLL being Used

    000b37f0  "johnapp.dll"

     

    Now that you know the resource generating the access violation, you should consult with the vendor of this resource to find out what had happened.  It’s possible they can look at the threads with their symbols or get other information from the event logs, any dump it may create, etc.  It could be the resource has a known problem where an update is needed.

     

    The above steps will take you to where you want and I wanted to show those steps first.  In most cases, but not all, when the dump is first opened it could show the exception and the stack that we found above by simply entering the .ecxr command...

     

    0:023> .ecxr

    eax=00000000 ebx=00f15d30 ecx=00000000 edx=00000000 esi=00000000 edi=00000000

    eip=781449d1 esp=01aee6d4 ebp=01aee6f8 iopl=0         nv up ei pl nz na po nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202

    msvcr80!wcslen+0x4:

    781449d1 668b08          mov     cx,word ptr [eax]        ds:0023:00000000=????

     

    0:023> kb

      *** Stack trace for last set context - .thread/.cxr resets it

    ChildEBP RetAddr  Args to Child             

    01aee6d0 10006d11 00000000 00f16914 01aeff58 msvcr80!wcslen+0x4

    WARNING: Stack unwind information not available. Following frames may be wrong.

    01aee6f8 10001364 100096e4 00000000 744eecf8 JohnApp!Startup+0x5851

    01aeff78 781329bb 0009a6a8 75b03b60 00000000 JohnApp+0x1364

    01aeffb0 78132a47 00000000 77e64829 00f15d30 msvcr80!_callthreadstartex+0x1b

    01aeffb8 77e64829 00f15d30 00000000 00000000 msvcr80!_threadstartex+0x66

    01aeffec 00000000 781329e1 00f15d30 00000000 kernel32!BaseThreadStart+0x34

     

    You could also get to the same information using the above original steps (dds 01aed598) but stopping at the resrcmon!RmpExceptionFilter (Resource Monitor handles the exception) which has the exception as the first parameter.

     

    0:023> dds 01aed598

    01aed598  00740061 xpsp2res.dll

     xpsp2res+0x100061

    01aed59c  00720075 xpsp2res+0xe0075

    01aed5a0  00730065 xpsp2res+0xf0065

    01aed5a4  00610000

    *** pages removed ***

    01aedddc  0026afd8

    01aedde0  0026af28

    01aedde4  01aee034                                        <<-- pointer to Exception address stack

    01aedde8  0100e638 resrcmon!GenerateMemoryDump+0x180

    01aeddec  ffffffff

    01aeddf0  00001d6c

    01aeddf4  00000018

    *** pages removed ***

    01aee000  01005528 resrcmon!`string'+0xc

    01aee004  ffffffff

    01aee008  0100d27b resrcmon!ClRtlLogPrint+0x499

    01aee00c  0100e96c resrcmon!GenerateExceptionReport+0x61

    01aee010  00000001

    01aee014  01005bf4 resrcmon!`string'

    *** pages removed ***

    01aee048  7786d687 netman!__CxxUnhandledExceptionFilter

    01aee04c  01aee060                                        <<-- Frame 3 in kv=01aee034 above

    01aee050  01008b2c resrcmon!RmpExceptionFilter+0x14       <<-- Frame 4 in kv=01aee034 above

    01aee054  01aee300

    01aee058  01003024 resrcmon!`string'

    01aee05c  01aee300

     

  • Ntdebugging Blog

    What Are the Odds?

    • 2 Comments

     

    Hi NTDebuggers, something rarely talked about are the odds of a problem being in one piece of code vs. another.   From time to time we see some very strange debugs or symptoms reported by customers.  The problems can be associated with anything from an internally written application, a Microsoft product running on Windows, or an application written by a 3rd party vendor.  In fact we are often engaged to assist one of our customers or vendors with troubleshooting or debugging their applications. 

     

    One of the first things we do is assess the situation.  We ask questions like:

    ·         Where is the program crashing?

    ·         What binaries comprise the program?

    ·         How often are those various binaries used worldwide?

     

    Let’s use the following pseudo call stack and binaries as an example.

     

    NTDLL!VeryCommonFunction  << Crash happens in this function.

    ADVAPI32!RatherCommonFunction

    MYCustomApp!RarelyUsedCode

    MyCustomApp!Main

     

    If I see a crash in NTDLL!VeryCommonFunction I’m going to make some assumptions as I assess the domain of the problem.  This holds true for any operating system, product, or software in general.  The code that runs more than any other code is, by its nature, effectively tested more because it runs more.  Therefore it is less likely to be the root cause of the fault, and in some cases it is simply the victim.   This applies to all operating systems: UNIX, Mac OS, Windows... core code tends to be less buggy.

     

    Let’s look at a real world example of some very common code in Windows.  NTDLL!RtlAllocateHeap and NTDLL!RtlFreeHeap.  For those of you not familiar with NTDLL, it’s loaded in just about every process on every machine running a modern copy of Windows, worldwide.  The average machine has ~40-200+ process (applications, and miscellaneous services running), and there are hundreds of millions of PCs worldwide running Windows, so that gives us ~billions of processes running NTDLL,  give or take a few billion.  Collectively, those processes are going to call RtlFreeHeap or RtlAllocateHeap millions of times in the next second.

     

    So what are the odds?  Is it likely that this core API used by billions of processes is crashing because of a bug in the core API?  Or is it more likely that a smaller vertical market or custom application running on ~500 machines worldwide did something to destabilize one of the process heaps?   

     

    Typically when an application is crashing in a heap function inside of NTDLL, support engineers become suspicious of activity in the process space, and in this case it’s more likely to be a problem with heap corruption.  It is likely that code running in the host process that has NTDLL loaded has corrupted one of the heaps by overwriting a buffer, doing a double free, or some other problem.  Then when a call is made into the Microsoft heap API, NTDLL has to traverse the heap structures that are corrupted by the host application, so the process crashes.  And yes, the crash is in NTDLL.   In this case, I typically ask the customer to enable full page heap via gflags (this puts an additional page marked with the PAGE_NOACCESS attribute at the end of each allocation).  We then wait for the next crash and analyze it.  Enabling full page heap helps you catch the corruptor with “their hand in the cookie jar”.

     

    The same scenario holds true for other core functionality such as kernel pool allocations, invalid handles, leaks etc.  Again, core code tends to be rock solid because of sheer volume of use and exposure to a variety of environments. This being the case, it also tends to change less over time.  Of course there is code in the OS or other components that is not used as much, which is more likely to have problems.   We always take that into consideration when scoping an issue. 

     

    The good news is we are always happy to dig in and help our customers isolate these types of problems.

     

    Please feel free to chime in and share your stories.


    Good Luck and happy debugging.

     

    Jeff-

  • Ntdebugging Blog

    Data Execution Protection in Action

    • 5 Comments

    Hello, my name is Graham, and I’m an escalation engineer on the Platforms Global Escalation Team.  I recently worked a case where a group of Windows XP machines were hitting a bugcheck on boot, error 0xC000021A.   This error occurs when a critical usermode process such as winlogon or csrss crashes.  I had access to a failing machine, so I attached the kernel debugger to find out why winlogon was crashing.  I found the cause, and a little bit more about Data Execution Prevention (DEP) in the process.

     

    The initial debugger spew gave me this information:

     

    *** An Access Violation occurred in winlogon.exe:

     

    The instruction at 10030F90 tried to write to an invalid address, 10030F90

     

     *** enter .exr 0006F4AC for the exception record

     *** enter .cxr 0006F4C8 for the context

     *** then kb to get the faulting stack

     

     

    So I followed its cue and got the exception record and context record:

     

    1: kd> .exr 0006F4AC

    ExceptionAddress: 10030f90

       ExceptionCode: c0000005 (Access violation)

      ExceptionFlags: 00000000

    NumberParameters: 2

       Parameter[0]: 00000008

       Parameter[1]: 10030f90

    Attempt to execute non-executable address 10030f90

     

    Ahh, OK, so we know this is a DEP crash now.

     

    1: kd> .cxr 0006F4C8

    eax=00000400 ebx=00000000 ecx=00000000 edx=00010000 esi=00000000 edi=00084370

    eip=10030f90 esp=0006f794 ebp=0006f81c iopl=0         nv up ei pl nz na pe nc

    cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010206

    001b:10030f90 33c0            xor     eax,eax

     

     

    Let's check out the crashing stack to see what's going on:

     

    1: kd> kb

      *** Stack trace for last set context - .thread/.cxr resets it

    ChildEBP RetAddr  Args to Child             

    0006f81c 010297c1 00084370 01010ab4 00000000 3rdparty!nosymbols

    0006fcfc 010312a6 00072364 7c80b6a1 00000000 winlogon!ExecSystemProcesses+0x14d

    0006ff50 0103d4d0 01000000 00000000 00072364 winlogon!WinMain+0x2b6

    0006fff4 00000000 7ffd7000 000000c8 000001ec winlogon!WinMainCRTStartup+0x174

     

     

    The first thing I decided to look for was how we got to this address.  To begin, I unassembled the code right before the return address to winlogon!ExecSystemProcesses.

     

    kd> ub 010297c1

    winlogon!ExecSystemProcesses+0x12e

    010297a2 6a02            push    2

    010297a4 ffb594fbffff    push    dword ptr [ebp-46Ch]

    010297aa 6880000000      push    80h

    010297af 56              push    esi

    010297b0 56              push    esi

    010297b1 68b40a0101      push    offset winlogon!`string' (01010ab4)

    010297b6 ffb5a0fbffff    push    dword ptr [ebp-460h]

    010297bc e891fcffff      call    winlogon!StartSystemProcess (01029452)

     

     

    According to the stack, winlogon!ExecSystemProcesses didn't call the function currently running.  So, I suspected some hooking was going on.  Using !chkimg, I verified this was the case.  Note that chkimg requires a valid copy of the binary in the symbol path.

     

    1: kd> !chkimg -db kernel32

    10 errors : kernel32 (7c802332-7c80236b)

    7c802330  90  90 *e9 *59 *ec *82 *93  6a  00  ff  75  2c  ff  75  28  ff ...Y...j..u,.u(.

    ...

    7c802360  28  00  90  90  90  90  90 *e9 *d4 *eb *82 *93  6a  00  ff  75 (...........j..u

    1: kd> u 7c802330 

    kernel32!WriteProcessMemory+0x10d:

    7c802330 90              nop

    7c802331 90              nop

    kernel32!CreateProcessW

    7c802332 e959ec8293      jmp     3rdparty!nosymbols (10030f90)

     

     

    Aha! Something has hooked CreateProcessW to jump to our current instruction.  Now that we know how we got there, let's understand why we crashed.  Since DEP fired, that means this address is non-executable.  I verified this by dumping out the PTE for the address.

     

    1: kd> !pte 10030F90

                   VA 10030f90

    PDE at 00000000C0600400    PTE at 00000000C0080180

    contains 000000004E102867  contains 800000004E021867

    pfn 4e102 ---DA--UWEV    pfn 4e021 ---DA--UW-V

     

    Notice that in the protection flags for the PTE, the 'E' bit isn't set, saying this page isn't executable.   So, where is this address we were trying to execute?  Many times with DEP crashes this will be in stack or heap memory.  But not this time.  In this case, the address is actually in a module's memory mapped address space, as shown by the 'lm' command

     

    1: kd> lm m 3rdparty

    10000000 1003c000   3rdparty C (export symbols)       3rdparty.dll

     

    Hmm...  So the address falls in this module. Why isn't it executable?   Usually when I think of image files, I think of running code.  But, remembering back to how the PE images are laid out, a module is broken into subsections, with different types of data in each one, and different protection levels.  There's a place in the image for code, and for data, such as global variables and static data.  So, let's dump the image header and find which section offset 0x30F90 is in.

     

    1: kd>!dh 3rdparty

     

    <snip>

    SECTION HEADER #3

       .data name

       1EE3C virtual size

       1A000 virtual address   //  (1A000+1EE3C=0x38e3c so mem range for section is 1A000 to 0x38e3c)

        3000 size of raw data

       1A000 file pointer to raw data

           0 file pointer to relocation table

           0 file pointer to line numbers

           0 number of relocations

           0 number of line numbers

    C0000040 flags

             Initialized Data

             (no align specified)

             Read Write  // no Execute !

     

     

    This is our section, since the virtual address starts at 0x1A000 and is 0x1EE3C in size, putting the end of the section at 0x38e3c.  Our address of 0x30F90 falls between them.

    Sure enough, this section is labeled as "Initialized Data", and the protection flags show Read and Write, but no Execute!  So, this address is not in a code section of the module, and DEP will not allow it to run. 

     

    Knowing this, I was able to find an update on the 3rd party manufacturer's site that modified their DLL to prevent this from occurring.  Mystery solved!

  • Ntdebugging Blog

    Windows Internals Beta Exam 71-660

    • 13 Comments

    Correction, Windows Internals Beta Exam 71-660

    This was previously listed incorrectly as 70-660, the actual exam ID is now 71-660

     

    Update 7-18-2008: The link to the exam preparation information has been updated below.

     

    Update 7-21-2008: When registering, there is no need to use a credit card.  Be sure to instead use the following promotional code listed below when registering for the exam: WINT.  If you have already been charged for this exam, please contact Prometric and request a refund, and reference ticket number 3711938.

     

    Hello Windows Debugging Community,  We are super excited to announce a new certification.   With the help of industry partners, including some of our very talented community members,  we have developed a new Windows Internals Certification.   This Certification targets developers and IT professionals that require knowledge of Windows Internals as part of their profession.   Chance are if you are reading this blog post you are part of the target audience.  I encourage you to register and please take the Beta exam.  Please help us to improve the exam by providing feedback during the beta exam.

     

     

    Register for Beta Exam 71-660: TS: Windows® Internals

     

    You are invited to take beta exam 70-660: TS: Windows® Internals. If you pass the beta exam, the exam credit will be added to your transcript and you will not need to take the exam in its released form.

    By participating in beta exams, you have the opportunity to provide the Microsoft Certification program with feedback about exam content, which is integral to development of exams in their released version. We depend on the contributions of experienced IT professionals and developers as we continually improve exam content and maintain the value of Microsoft certifications.

    Please remember that participation in the beta process is completely voluntary and Microsoft makes no promises or guarantees regarding the beta exam process. You can expect to receive your score on the beta exam within 12 weeks of taking the exam, although in some instances, beta exams may take longer to score and your results may be delayed.

    70-660: TS: Windows® Internals counts as credit towards the following certification.

    ·          Microsoft Certified Technology Specialist (MCTS): Windows® Internals  


    Availability

    Registration begins: July 17, 2008

    Beta exam period runs: July 21, 2008August 5, 2008

     

    Receiving this invitation does not guarantee you a seat in the beta; we recommend that you register immediately. Beta exams have limited availability and are operated under a first-come-first-served basis. Once all beta slots are filled, no additional seats will be offered.


    Testing is held at Prometric testing centers worldwide, although this exam may not be available in all countries (see Regional Restrictions).  All testing centers will have the capability to offer this exam in its live version.

     

    Regional Restrictions: India, Pakistan, China


    Registration Information

    You must register at least 24 hours prior to taking the exam.
    Please use the following promotional code when registering for the exam: WINT
    Receiving this invitation does not guarantee you a seat in the beta; we recommend that you register immediately.


    To register in North America, please call:

    Prometric: (800) 755-EXAM (800-755-3926)

     

     

    Outside the U.S./Canada, please contact:

    Prometric: http://www.register.prometric.com/ClientInformation.asp

     

     


    Test Information and Support

    You are invited to take this beta exam at no charge.
    You will be given four hours to complete the beta exam. Please plan accordingly.
    Find exam preparation information: http://blogs.msdn.com/ntdebugging/pages/exam-preparation-information-for-exam-71-660.aspx

     


    Frequently Asked Questions

    For Microsoft Certified Professional (MCP) help and information, you may log in to the MCP Web site at http://www.microsoft.com/learning/mcp/or contact your Regional Service Center: http://www.microsoft.com/learning/support/worldsites.asp.

    What is a beta exam?

    Where can I learn more about the registration process?

    Where can I learn more about the beta exam invitation process?

    Where can I learn more about the new structure of Microsoft Certification?

    Who do I contact for help with this beta exam or other MCP questions?

     

    Thank you.


    Jeff Dailey, Microsoft Global Escalation Services.

  • Ntdebugging Blog

    If Cluster is Running Fine, Don’t Dump Cluadmin on a Hang

    • 3 Comments

    Hello, my name is East, and I’m an Escalation Engineer on the Microsoft Platform Global Escalation Services team.  Here are some tips for those wanting to review dumps from Cluster Administrator (Cluadmin.exe), the GUI version of cluster.exe that allows management of your cluster environment. Historically there were more Cluadmin hangs in Windows 2000 Cluster server than Windows 2003 Cluster server so the volume of these dumps submitted to my group has dropped significantly. People have submitted complete memory dumps thinking the entire Cluster server was hung when it turned out to be a Cluadmin hang. It’s important to determine what is really hung when troubleshooting these types of problems.

     

    So what causes Cluadmin to hang? In most cases a resource is working to come online, or a remote connection is misbehaving which results in the hang.  Cluadmin dumps are not as helpful in revealing why a resource is misbehaving in your Cluster environment, when compared to dumping the cluster service and resrcmon. In most cases the cluster log can help you narrow down a problematic resource that’s holding Cluadmin up. Later I will show how to find the possible resource that caused a Cluster Administrator hang by using the debugger.

     

    To help narrow down the hang you should start by testing that the Cluster environment is healthy.

     

    Let’s say at the time of the hang you were using Cluadmin to view properties of certain resources. At this point it would be useful to examine the status of the cluster by using the Cluster.exe tool. Here are some examples.

     

    Using the cluster.exe command you can check the Groups status with the following command:

     

    C:\>cluster group

    Listing status for all available resource groups:

     

    Group                    Node            Status

    -------------------- --------------- --------------

    Test1                   NODE1            Online

    Cluster Group           JNODE1           Online

     

    You can check the Node status using this command:

     

    C:\>cluster node

    Listing status for all available nodes:

     

    Node           Node ID          Status

    -------------- ------- ---------------------

    NODE1            1                 Up

    NODE2            2                 Down

     

    And you can check the Resource status with the resource command:

     

    C:\>cluster resource

    Listing status for all available resources:

     

    Resource                    Group              Node       Status

    -------------------- -------------------- --------------- ------

    Disk Q:                 Cluster Group         NODE1       Online

    tst1                    Test1                 NODE1       Online

    Cluster IP Address      Cluster Group         NODE1       Online

    Cluster Name            Cluster Group         NODE1       Online

    MSDTC                   Cluster Group         NODE1       Online

     

     

    You can get more granular by using the /priv switch to list out the private properties like this:

     

    C:\>cluster resource /priv

    Listing private properties for all resources:

                    <ouput not shown to save space>

     

    Since hangs can occur during startup of Cluadmin I recommend the following article which discusses different ways to start Cluadmin.

     

    280125  Cluster Administrator Switches for Connecting to a Cluster

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;280125

     

    Now let’s take a look at a Cluadmin dump to illustrate what you would see during a hang. Most dumps we gather from customers are snapped when Cluadmin is starting, resulting in limited information contained in the dump.  Here’s an example of the resource that Cluster was working with when the process hung.  The thread below is making an RPC call to the local cluster service.  This can be the culprit causing Cluadmin to be in the hung state:

     

    0:000> kb

    ChildEBP RetAddr  Args to Child             

    0006eddc 7c59a0a2 00000154 00000001 0006edfc NTDLL!ZwWaitForSingleObject+0xb

    0006ee04 77d7f41d 00000154 000007d0 00000001 KERNEL32!WaitForSingleObjectEx+0x71

    0006ee28 77d5fadf 00000000 0006ee58 0006ee50 rpcrt4!DG_ReceivePacket+0xd3

    0006ee68 77d5f998 0006f0c8 0006f07c 0006ee88 rpcrt4!DG_CCALL::ReceiveSinglePacket+0x5e

    0006ee7c 77d3b96c 0006f07c 0006f274 77d6a686 rpcrt4!DG_CCALL::SendReceive+0xd6

    0006ee88 77d6a686 0006f07c 00000009 7393270e rpcrt4!I_RpcSendReceive+0x2c

    0006ee9c 77d93b64 0006f0c8 000a425c 0006eeac rpcrt4!NdrSendReceive+0x31

    0006f274 73939b6d 73931778 739326c0 0006f28c rpcrt4!NdrClientCall2+0x512

    0006f284 73934bd5 00091ed8 01000059 00000000 CLUSAPI!ApiResourceControl+0x14

    0006f2c4 0102705f 000921e8 00000000 01000059 CLUSAPI!ClusterResourceControl+0xd0

    0006f2f8 01027b00 000921e8 01000059 00000000 CLUADMIN!CClusPropList::ScGetResourceProperties+0x3e

    0006f374 010129ae 00c988a0 00000080 00239770 CLUADMIN!CResource::ReadItem+0xb9

    0006f3b4 0101141d 002397c8 00239770 00000001 CLUADMIN!CClusterDoc::InitResources+0x9a

    0006f3ec 01010f45 00000001 00239770 00000000 CLUADMIN!CClusterDoc::CollectClusterItems+0xd5

    0006f410 01010e11 002352d0 00234960 00239770 CLUADMIN!CClusterDoc::OnOpenDocumentWorker+0x80

    0006f444 76fbaea3 002352d0 00234960 01044c28 CLUADMIN!CClusterDoc::OnOpenDocument+0xa2

    0006f46c 0101039f 002352d0 00000001 01044c28 mfc42u!CMultiDocTemplate::OpenDocumentFile+0xb5

    0006fc88 0100fff6 002352d0 002373a0 00000000 CLUADMIN!CClusterAdminApp::OpenDocumentFile+0xbf

    0006fcc0 01021c9e 00000001 00000000 76fb2089 CLUADMIN!CClusterAdminApp::OnRestoreDesktop+0xc4

    0006fccc 76fb2089 00000001 00000000 002373a0 CLUADMIN!CMainFrame::OnRestoreDesktop+0x17

     

    The first parameter passed to CLUSAPI!ClusterResourceControl in this Windows 2000 dump reveals the resource in question.

     

    Here I’m dumping out the raw data of the parameter which yields the name of the resource, Joseph.

     

    0:000> dc 000921e8+28

    00092210  006F004A 00650073 00680070 00000000  J.o.s.e.p.h.....

    00092220  00030007 000c0100 00091d40 00092780  ........@....'..

    00092230  000923e8 000923e8 00092238 00092238  .#...#..8"..8"..

    00092240  00000000 000805e8 00000000 00000000  ................

    00092250  00020019 00000000 00070005 00080100  ................

    00092260  00000000 00092378 00092b48 000925f0  ....x#..H+...%..

    00092270  00000000 00000000 d3cc9553 00000000  ........S.......

    00092280  00050005 00080100 000922b0 000922d8  ........."..."..

     

    0:000> du 000921e8+28

    00092210  "Joseph"

     

     

    On a Windows 2003 stack I again examined the first parameter to CLUSAPI!ClusterResourceControl...

     

    lpBytesReturned = 0x0006f2a8

     

    ChildEBP RetAddr  Args to Child             

    0006f278 0102bb53 000a03c8 00000000 01000059 CLUSAPI!ClusterResourceControl

    0006f2ac 0102ec09 000a03c8 01000059 00000000 cluadmin!CClusPropList::ScGetResourceProperties+0x78

    0006f328 010142c7 00000000 0026d6b8 01078468 cluadmin!CResource::ReadItem+0xcb

    0006f36c 010161dc 00268508 01078468 00000001 cluadmin!CClusterDoc::InitResources+0xa2

    0006f3a8 01016647 01078468 00000000 01078a48 cluadmin!CClusterDoc::CollectClusterItems+0xe9

    0006f3c8 010167a3 002667b8 00268508 01078468 cluadmin!CClusterDoc::OnOpenDocumentWorker+0x90

    0006f400 7f062793 002667b8 00268508 002667b8 cluadmin!CClusterDoc::OnOpenDocument+0x9c

    0006f428 01010d97 002667b8 00000001 00264da0 MFC42u!CMultiDocTemplate::OpenDocumentFile+0x103

    0006fc50 01010a9b 002667b8 00264da0 00268680 cluadmin!CClusterAdminApp::OpenDocumentFile+0xdc

    0006fc64 7f03babe 00099c48 00264da0 00264da0 cluadmin!CClusterAdminApp::OnClusterConnectionOpened+0x27

    0006fce8 7f038edc 0000040e 00099c48 01006d70 MFC42u!CWnd::OnWndMsg+0x62e

    0006fd10 7f03af27 0000040e 00099c48 00264da0 MFC42u!CWnd::WindowProc+0x2c

    0006fd70 7f03b06e 00268680 00000000 0000040e MFC42u!AfxCallWndProc+0xa7

    0006fd94 7f0e6d8d 000501c6 0000040e 00099c48 MFC42u!AfxWndProc+0x3e

    0006fdc4 7739b6e3 000501c6 0000040e 00099c48 MFC42u!AfxWndProcBase+0x4d

    0006fdf0 7739b874 7f0e6d40 000501c6 0000040e USER32!InternalCallWinProc+0x28

    0006fe68 7739ba92 00000000 7f0e6d40 000501c6 USER32!UserCallWinProcCheckWow+0x151

    0006fed0 7739bad0 01049b24 00000000 0006ff08 USER32!DispatchMessageWorker+0x327

    0006fee0 7f073000 01049b24 01049b24 01049af0 USER32!DispatchMessageW+0xf

    0006fef0 7f072dda 01049af0 01049af0 ffffffff MFC42u!CWinThread::PumpMessage+0x40

     

    0:000> dc 000a03c8+b8

    000a0480  00650052 006f0073 00720075 00650063  R.e.s.o.u.r.c.e.

    000a0490  005c0073 00330031 00350034 00310030  s.\.1.3.4.5.0.1.

    000a04a0  00370030 0031002d 00340065 002d0034  0.7.-.1.e.4.4.-.

    000a04b0  00660034 00310035 0038002d 00370061  4.f.5.1.-.8.a.7.

    000a04c0  002d0030 00610035 00300034 00630034  0.-.5.a.4.0.4.c.

    000a04d0  00350032 00620031 00650036 abab0000  2.5.1.b.6.e.....

    000a04e0  abababab feeeabab 00000000 00000000  ................

    000a04f0  000f0007 001c0731 000a0530 fedcba98  ....1...0.......

     

    0:000> du 000a03c8+b8

    000a0480  "Resources\13450107-1e44-4f51-8a7"

    000a04c0  "0-5a404c251b6e"

     

     

    You will also notice that Windows 2003 shows the resource guid for the resource.

     

    NOTE: You can speculate that the problem resource is the one in the bottom view of the Cluadmin GUI, however using that method can lead to an improper conclusion because the next resource may actually be causing the problem. The resource may have hung the GUI before getting displayed in Cluadmin.

     

    Ok now I know the resource because I dumped it out so what can I do with it now? You can take the resource offline to see if Cluster exhibits an issue bringing that resource off or online.  Additionally, the cluster log may show that this resource is having an issue. Your best bet from here would be to review the cluster log for any additional information about the specific resource or get a Resource Monitor (Resrcmon.exe) dump to investigate any further issues. I also did not show all the threads of the Cluadmin.exe dump since the rest are normally RPC calls and irrelevant to this discussion. If the connection was to another machine we would do a network trace to reveal additional problems.

     

  • Ntdebugging Blog

    NTFS Misreports Free Space?

    • 14 Comments

    I have recently seen a number of issues where customers called in to report a significant difference between the “Size on disk” for the root of a volume, and the reported amount of “Used space” in the volume properties.  While considering this, I noticed that my own C: drive had a mysterious 20GB difference.

     

    Below is the story of how I found the answer.

     

    Before we begin, there are two methods used for calculating disk usage…

     

    Method 1 – Volume Bitmap Analysis

    The % used and %free indication shown below with the pie chart is based on volume bitmap analysis.  The hidden $Bitmap:$Data:”” stream is read from the volume via the FSCTL_GET_VOLUME_BITMAP IO control code, and the ones and zeroes are counted.  A binary “1” represents an allocated cluster, while a “0” indicates a free cluster.  The total number of set bits indicates the total number of allocated clusters on the volume.  This is the fastest and most reliable method of calculating allocated and free space.

     

    Method 2 – File Scanning with FindFirstFileEx  / FindNextFile

    Navigating to the “C:\” folder, we select all files/folders, then right click, and then select properties.  A dialog will appear as the shell enumerates all files that it can find in the root and subfolders.  The “Size” and “Size on disk” fields will eventually display the total size of all primary data streams for all files that are found by Explorer.  In my case, this method only finds 80.25 GB of 102 GB (leaving a 21.75 GB difference).

     

    Why such a big difference?

     

    The short answer is - “If you can’t see it with Explorer, it doesn’t get added to the total size.”

     

    How to find the hidden disk usage…

     

    Step 1:

     

    First, check to see if your files are all actually visible with the current user account and Explorer settings.  Uncheck the “Hide protected operating system files” and select “Show hidden files and folders”.  When this is completed, look at the size on disk again. 

    Does it match the pie chart within a few percent?  If not, continue to Step 2.

     

    Step 2:

    Run CHKDSK and save the output.  The summary information at the end is what we want to see.

    Based on CHKDSK output, we can calculate the total metadata usage by adding up the following…

     

    KB

    GB

    Description

    57036

    .05 GB

    Space used by 22677 indexes.

    0

    0 GB

    Space used by $Badclus file.

    1352680

    1.29 GB

    Space used by $MFT.

    65536

    .06 GB

    Space used by $Loffile.

    1475252

    1.4 GB

     Metadata Total

     

    In this example, metadata usage accounts for only 1.4 GB.  If the metadata number were a high percentage of the usage, then we need to take closer a look at how the volume is being used instead of simply looking for hidden files.  High metadata usage will be the subject for part 2 of this blog.

     

    We can see from the CHKDSK output that the total space occupied by our user files is 106,095,080 KB (101.18 GB).  This is a large percentage of the total usage, so we should be looking at the user files to see why we can’t see these files with Explorer.

     

    Step 3:

    Do you have permissions to see all files/folders the volume? 

     

    a.     If this is a system drive, look through the “\Documents and Settings”, or “\users” folder and see if you can browse all of the profile folders.  If not, you may need contact the owner(s) to check their folder size and see what they have saved in their user folder.  If they are like me, they will have 5 years worth of memory dumps, MP3’s, VHD’s, PDF’s, etc.

    b.    Try “CACLS.EXE x:\ /T”, or just browse through the folders on the drive looking for an “Access Denied” error.  Once this happens, give yourself permissions to that part of the subtree (if permitted by your administrative privileges) or have someone who has access to that folder enumerate the contents for you.  Check the number of files and the size on disk for all files in that folder and add it to the total.

    c.     Continue this process until you have a total of all files in all folders.

    d.    Does the total make sense?  If not, then continue to the next step.

     

    In the case of my mysterious 20 GB difference, I found an old user profile from a previous installation.  Sure enough, I received an access denied error when browsing to it.  To access the folder, I acquired ownership.  This user was in the habit of collecting memory dumps and virtual machines in a subfolder on his desktop.  I recovered quite a bit of free space by using the delete key.  I rescanned the volume, but to my disappointment, there was still a significant difference.

     

    Step 4.

    Are there alternate streams?  Try using STREAMS.EXE from (http://technet.microsoft.com/en-us/sysinternals/default.aspx).  Streams will recurse subdirectories and report space usage by alternate named streams.  Pipe this to a text file and analyze it with a text editor or spreadsheet program.

     

    Step 5.

    Hard links can be problematic when calculating the total usage via the file scan method.  Hard links are not extensively used, but this is worth mentioning.  A hard link is an extra index entry that points to an existing file.  These are created via the CreateHardLink function.  Although the file actually exists in one location, each hard link has its own size field.  Because of this, a single file can be added twice to the running total.  Another thing to know is that hard links are not synchronized, so it is possible that only one link will show the actual file size (see the example below).

    Unfortunately, there are few options available for detecting hard link paradoxes, but it is something to consider when the file scan shows more usage than the bitmap.  Since we have the opposite situation here, hard links are not a significant factor.

     

    Step 6.

    Is Volume Shadow Copy Service maintaining diff area files for snapshots?  Use VSSADMIN LIST SHADOWSTORAGE to find out.  Add shadow storage to the running total.

     

    VSSVC pre-allocates space for making volume snapshots.  In order to support this feature, diff area files are kept in the “\System Volume Information” folder.  This pre-allocated space is used to maintain point-in-time information for the “Previous Versions” feature and for the “System Restore” application.  If you are the type of user who prefers to minimize the impact of such features, then you can resize your shadow storage with VSSADMIN so it has less impact on disk usage.  I prefer to leave these features at their default settings (and just make a note of how much disk space it is using).

    Step 7.

     

    If your numbers still don’t make sense, then it’s time to check for actively hidden files / folders.  There are many different rootkit scanners that can help you identify the presence of actively hidden files & folders.   Try using a scanner like Rootkit Revealer.  If you suspect that your machine has been compromised by a rootkit, refer to http://www.microsoft.com/uk/business/security/security-malware.mspx.

     

    Seeing that I did not have a significant amount of mysterious usage, I was satisfied that there was no malicious logic hiding my files.  After taking into account all of the factors, there was only a 1.57GB difference remaining (which was accounted for by the contents of a user’s recycle bin).

     

    Whenever you see a difference between the “Used space”, and “Size on disk”, always ask yourself “What am I not seeing?”

     

     

    Best regards,

     

    Dennis Middleton “The NTFS Doctor”

     

Page 18 of 25 (242 items) «1617181920»