• Ntdebugging Blog

    Understanding ARM Assembly Part 1

    • 10 Comments

    My name is Marion Cole, and I am a Sr. EE in Microsoft Platforms Serviceability group.  You may be wondering why Microsoft support would need to know ARM assembly.  Doesn’t Windows only run on x86 and x64 machines?  No.  Windows has ran on a variety of processors in the past.  Those include i860, Alpha, MIPS, Fairchild Clipper, PowerPC, Itanium, SPARC, 286, 386, IA-32, x86, x64, and the newest one is ARM.  Most of these processors are antiquated now.  The common ones now are IA-32, x86, x64.  However Windows has started supporting ARM processors in order to jump into the portable devices arena.  You will find them in the Microsoft Surface RT, Windows Phones, and other things in the future I am sure.  So you may be saying that these devices are locked, and cannot be debugged.  That is true from a live debug perspective, but you can get memory dumps and application dumps from them and those can be debugged.

     

    Processor

     

    There are limitations on ARM processors that Windows supports.  There are 3 System on Chip (SOC) vendors that are supported.  nVidia, Texas-Instruments, and Qualcomm. Windows only supports the ARMv7 (Cortex, Scorpion) architecture in ARMv7-A in (Application Profile) mode.  This implements a traditional ARM architecture with multiple modes and supporting a Virtual Memory System Architecture (VMSA) based on an MMU.  It supports the ARM and Thumb-2 instruction sets which allows for a mixture of 16 (Thumb) and 32 (ARM) bit opcodes.  So it will look strange in the assembly.  Luckily the debuggers know this and handle it for you.  This also helps to shrink the size of the assembly code in memory.  The processor also has to have the Optional ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture).

     

    In order to understand the assembly that you will see you need to understand the processor internals.

     

    ARM is a Reduced Instruction Set Computer (RISC) much like some of the previous processors that Windows ran on.  It is a 32 bit load/store style processor.  It has a “Weakly-ordered” memory model: similar to Alpha and IA64, and it requires specific memory barriers to enforce ordering.  In ARM devices these as ISB, DSB, and DMB instructions.

     

    Registers

     

    The processor has 16 available registers r0 – r15.

    0: kd> r

    r0=00000001  r1=00000000  r2=00000000  r3=00000000  r4=e1820044  r5=e17d0580

    r6=00000001  r7=e17f89b9  r8=00000002  r9=00000000 r10=1afc38ec r11=e1263b78

    r12=e127813c  sp=e1263b20  lr=e16c12c3  pc=e178b6d0 psr=00000173 ----- Thumb

     

    r0, r1, r2, r3, and r12 are volatile registers.  Volatile registers are scratch registers presumed by the caller to be destroyed across a call.  Nonvolatile registers are required to retain their values across a function call and must be saved by the callee if used. 

     

    On Windows four of these registers have a designated purpose.  Those are:

    • PC (r15) – Program Counter (EIP on x86)
    • LR (r14) – Link Register.  Used as a return address to the caller.
    • SP (r13) – Stack Pointer (ESP on x86).
    • R11 – Frame Pointer (EBP on x86).
    • CPSR – Current Program Status Register (Flags on x86).

     

    In Windbg all but r11 will be labeled appropriately for you.  So you may be asking why r11 is not labeled “fp” in the debugger.  That is because r11 is only used as a frame pointer when you are calling a non-leaf subroutine.  The way it works is this: when a call to a non-leaf subroutine is made, the called subroutine pushes the value of the previous frame pointer (in r11) to the stack (right after the lr) and then r11 is set to point to this location in the stack, so eventually we end up with a linked list of frame pointers in the stack that easily enables the construction of the call stack. The frame pointer is not pushed to the stack in leaf functions.  Will discuss leaf functions later.

     

    CPSR (Current Program Status Register)

     

    Now we need to understand some about the CPSR register.  Here is the bit breakdown:

    31

    30

    29

    28

    27

    26

    25

    24

    23

    22

    21

    20

    19

    18

    17

    16

    15

    14

    13

    12

    11

    10

    9

    8

    7

    6

    5

    4

    3

    2

    1

    0

    N

    Z

    C

    V

    Q

    IT

    J

    Reserved

    GE

    IT

    E

    A

    I

    F

    T

    M

     

    • Bits [31:28] – Condition Code Flags
      • N – bit 31 – If this bit is set, the result was negative.  If bit is cleared the result was positive or zero.
      • Z – bit 30 – If set this bit indicates the result was zero or values compared were equal.  If it is cleared, the value is non-zero or the compared values are not equal.
      • C – bit 29 – If this bit is set the instruction resulted in a carry condition.  E.g. Adding two unsigned values resulted in a value too large to be strored.
      • V – bit 28 – If this bit is set then the instruction resulted in an overflow condition.  E.g. An overflow of adding two signed values.
    • Instructions variants ending with ‘s’ set the condition codes (mov/movs)
    • E – bit 9 – Endianness (big = 1/Little = 0)
    • T – bit 5 – Set if executing Thumb instructions
    • M – bits [4:0] – CPU Mode (User 10000/Supervisor 10011)

     

    So why do I need to know about the CPSR (Current Program Status Register)?  You will need to know where some of these bits are due to how some of the assembly instruction affect these flags.  Example of this is:

     

    ADD will add two registers together, or add an immediate value to a register.  However it will not affect the flags.

     

    ADDS will do the same as ADD, but it does affect the flags.

     

    MOV will allow you to move a value into a register, and a value between registers.  This is not like the x86/x64.  MOV will not let you read or write to memory.  This does not affect the flags.

     

    MOVS does the same thing as MOV, but it does affect the flags.

     

    I hope you are seeing a trend here.  There are instructions that will look the same.  However if they end in “S” then you need to know that this will affect the flags.  I am not going to list all of those assembly instructions here.  Those are already listed in the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition at http://infocenter.arm.com/help/topic/com.arm.doc.ddi0406b/index.html.

     

    So now we have an idea of what can set the flags.  Now we need to understand what the flags are used for.  They are mainly used for branching instructions.  Here is an example:

    003a11d2 429a     cmp         r2,r3

    003a11d4 d104     bne         |MyApp!FirstFunc+0x28 (003a11e0)|

     

    The first instruction in this code (cmp) compares the value stored in register r2 to the value stored in register r3. This comparison instruction sets or resets the Z flag in the CPSR register. The second instruction is a branch instruction (b) with the condition code ne which means that if the result of the previous comparison was that the values are not equal (the CPSR flag Z is zero) then branch to the address MyApp!FirstFunc+0x28 (003a11e0). Otherwise the execution continues.

     

    There are a few compare instructions.  “cmp” subtracts two register values, sets the flags, and discards the result.  “cmn” adds two register values, sets the flags, and discards the results.  “tst” does a bit wise AND of two register values, sets the flags, and discards the results.  There is even an If Then (it) instruction.  I am not going to discuss that one here as I have never seen it in any of the Windows code.

     

    So is “bne” the only branch instruction?  No.  There is a lot of them.  Here is a table of things that can be seen beside “b”, and what they check the CPSR register:

    Mnemonic

    Extension

    Meaning (Integer)

    Condition Flags (in CPSR)

    EQ

    Equal

    Z==1

    NE

    Not Equal

    Z==0

    MI

    Negative (Minus)

    N==1

    PL

    Positive or Zero (Plus)

    N==0

    HI

    Unsigned higher

    C==1 and Z==0

    LS

    Unsigned lower or same

    C==0 or Z==1

    GE

    Signed greater than or equal

    N==V

    LT

    Signed less than

    N!=V

    GT

    Signed greater than

    Z==0 and N==V

    LE

    Signed less than or equal

    Z==1 or N!=V

    VS

    Overflow

    V==1

    VC

    No overflow

    V==0

    CS

    Carry set

    C==1

    CC

    Carry clear

    C==0

    None (AL)

    Execute always

     

     

    Floating Point Registers

     

    As mentioned earlier the processor also has to have the ISA extensions of VFP (Hardware Floating Point) and NEON (128-bit SIMD Architecture).  Here is what they are.

    Floating Point

     

    As you can see this is 16 – 64bit regiters (d0-d15) that is overlaid with 32 – 32bit registers (s0-s31).  There are varieties of the ARM processor that has 32 – 64bit registers and 64 – 32bit registers.  Windows 8 will support both 16 and 32 register variants.  You have to be careful when using these, because if you access unaligned floats you may cause an exception.

     

    SIMD (NEON)

     

    As you can see here the SIMD (NEON) extension adds 16 – 128 bit registers (q0-q15) onto the floating point registers.  So if you reference Q0 it is the same as referencing D0-D1 or S0-S1-S2-S3.

     

    In part 2 we will discuss how Windows utilizes this processor.

  • Ntdebugging Blog

    The Compiler Did What?

    • 0 Comments

    I was recently investigating a crash in an application.  As I researched the issue I found a very old defect in the code that was only recently being exposed by the compiler.

     

    The crash occurred at the below instruction because the ebx register does not hold a valid pointer.

     

    0:001> r

    eax=d9050cf7 ebx=003078c0 ecx=6e2e0000 edx=00000000 esi=00000001 edi=0c334468

    eip=65637fbe esp=010eb408 ebp=010eb878 iopl=0         nv up ei pl nz na po nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202

    riched20!CTxtSelection::CreateCaret+0x429:

    65637fbe 8b4b1c          mov     ecx,dword ptr [ebx+1Ch] ds:002b:003078dc=????????

    0:001> dd 003078c0

    003078c0  ???????? ???????? ???????? ????????

    003078d0  ???????? ???????? ???????? ????????

    003078e0  ???????? ???????? ???????? ????????

    003078f0  ???????? ???????? ???????? ????????

    00307900  ???????? ???????? ???????? ????????

    00307910  ???????? ???????? ???????? ????????

    00307920  ???????? ???????? ???????? ????????

    00307930  ???????? ???????? ???????? ????????

     

    Examining the assembly leading up to the crash, ebx came from [ebp-40c].

     

    0:001> ub .

    riched20!CTxtSelection::CreateCaret+0x408:

    65637f9d 6a08            push    8

    65637f9f ff156cf06465    call    dword ptr [riched20!_imp__CreateBitmap (6564f06c)]

    65637fa5 898784000000    mov     dword ptr [edi+84h],eax

    65637fab eb06            jmp    riched20!CTxtSelection::CreateCaret+0x41e (65637fb3)

    65637fad 8bb5e4fbffff    mov     esi,dword ptr [ebp-41Ch]

    65637fb3 8b9df4fbffff    mov    ebx,dword ptr [ebp-40Ch]

    65637fb9 ff775c          push    dword ptr [edi+5Ch]

    65637fbc 6a01            push    1

    0:001> dd @ebp-40c l1

    010eb46c  003078c0

     

    Looking at the whole function, [ebp-40c] was populated at the beginning of the function as the contents of edi+1C. The contents of edi+1Ch were first moved into ecx and later the value of ecx was moved into [ebp-40Ch].    Further examination of the whole function showed the edi register is unchanged at the time of the crash, so I can use its current value to determine what [ebp-40c] should contain.

     

    0:001> uf riched20!CTxtSelection::CreateCaret

    riched20!CTxtSelection::CreateCaret:

    65637b95 8bff            mov     edi,edi

    65637b97 55              push    ebp

    65637b98 8bec            mov     ebp,esp

    65637b9a 81ec5c040000    sub     esp,45Ch

    65637ba0 a100e06465      mov     eax,dword ptr [riched20!__security_cookie (6564e000)]

    65637ba5 33c5            xor     eax,ebp

    65637ba7 8945fc          mov     dword ptr [ebp-4],eax

    65637baa 53              push    ebx

    65637bab 56              push    esi

    65637bac 57              push    edi

    65637bad 8bf9            mov     edi,ecx

    65637baf 8b4f1c          mov    ecx,dword ptr [edi+1Ch] <<< The value originates from [edi+1Ch]

    65637bb2 0fbf4740        movsx   eax,word ptr [edi+40h]

    65637bb6 898df4fbffff    mov    dword ptr [ebp-40Ch],ecx <<< Store the value on the stack

    <snip>

    65637fb3 8b9df4fbffff    mov    ebx,dword ptr [ebp-40Ch] <<< Read the value from the stack

    <snip>

    65637fbe 8b4b1c          mov    ecx,dword ptr [ebx+1Ch] <<< Crash here because ebx is invalid

    <snip>

     

    The expected value of [ebp-40C], and thus the expected value of the ebx register, is 091978c0 based on the value in [edi+1Ch] at the time of the crash.  This would be a valid pointer and is not what is currently in [ebp-40C] or ebx.  It is noteworthy that at the time of the crash, ebx is similar to what should be there, it differs only by the high word of the dword.

     

    0:001> r ebx

    ebx=003078c0

     

    0:001> dd @edi+1c l1

    0c334484  091978c0

     

    The expected value, 091978c0, is a valid pointer.

     

    0:001> dd 091978c0

    091978c0  091978c8 00000000 00000501 05000000

    091978d0  00000015 076c1a27 2a372f35 0c2e3998

    091978e0  000049aa 00000000 00000000 00000000

    091978f0  00000000 00000000 00000000 00000000

    09197900  00000000 00000000 00000000 00000000

    09197910  00000000 00000000 00000000 00000000

    09197920  1a3098a8 00000000 00000000 00000000

    09197930  00000000 00000000 00000000 00000000

     

    Somehow the value at ebp-40C was changed between instruction 65637bb6, where [ebp-40C] was set, and instruction 65637fb3 where [ebp-40C] was read.  Fortunately I had a mechanism to reproduce this crash so I was able to set a breakpoint and trace through how this happened.

     

    First I set a breakpoint on the instruction that populates [ebp-40C].

     

    0:003> bp 65637bb6

    0:003> g

    Breakpoint 0 hit

    eax=ffffffff ebx=0c334468 ecx=091978c0 edx=00000060 esi=091978c0 edi=0c334468

    eip=65637bb6 esp=010eb410 ebp=010eb878 iopl=0         nv up ei pl nz na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000206

    riched20!CTxtSelection::CreateCaret+0x21:

    65637bb6 898df4fbffff    mov     dword ptr [ebp-40Ch],ecx ss:002b:010eb46c=00000000

     

    Next I calculated ebp-40C and set a break on write access breakpoint.

     

    0:001> ?@ebp-40c

    Evaluate expression: 17740908 = 010eb46c

    0:001> ba w4 010eb46c

    0:001> g

    Breakpoint 1 hit

    eax=00000030 ebx=00000000 ecx=00000000 edx=00000020 esi=00000001 edi=0c334468

    eip=65637f67 esp=010eb40c ebp=010eb878 iopl=0         nv up ei pl zr na pe nc

    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    riched20!CTxtSelection::CreateCaret+0x3d2:

    65637f67 66898475f4fbffff mov     word ptr [ebp+esi*2-40Ch],ax ss:002b:010eb46e=0919

     

    The write breakpoint hit at a location I was not expecting.  The instruction where the breakpoint hit is not modifying the variable that was stored at [ebp-40C].

     

    Although I cannot share the Windows source code on this blog, the code in question roughly resembles the below example.  Note that a proficient assembly language reader could figure out the code flow, this example is not sharing any magic.

     

    Struct1*    p1;

    WORD        array[512];

    p1 = GetStruct1();

    array[i-2] = 0x30;

    p1->p = variable2; // Crash here because p1 is not a valid pointer

     

    We are crashing because p1 is not a valid pointer.  The high word of p1 is being overwritten as 0030 by the line “array[i-2] = 0x30;” because i is 1, leading to an underflow of the array.  This underflow is corrupting the pointer in p1.

     

    0:001> r ebx

    ebx=003078c0

     

    Clearly there is a defect in the above code.  If it is legitimate for i to be 1 (and it is), then a check must be made to prevent an underflow of the array.  However further research found that this code has been consistent for many years and many releases of the product.  Why is this suddenly crashing now?  As the bank robber in Dirty Harry said, “I gots to know."

     

    In the above assembly we calculate that “array” starts at ebp-408 (assuming i is always 2 or greater, 2*2-40c is -408).  In the earlier assembly we see that p1 is placed at ebp-40c.  In this configuration an underflow of “array” will always corrupt p1.

     

    Examining the assembly on a system that does not crash, I found that the local variables are stored differently in a different version of this binary.  In the beginning of the function we see that p1 is stored in ebx.  In this version of the binary ebx is never stored on the stack, so it cannot be corrupted by an underflow.

     

    0:000> uf riched20!CTxtSelection::CreateCaret

    riched20!CTxtSelection::CreateCaret:

    74e75c53 8bff            mov     edi,edi

    74e75c55 55              push    ebp

    74e75c56 8bec            mov     ebp,esp

    74e75c58 81ec58040000    sub     esp,458h

    74e75c5e a19010e974      mov     eax,dword ptr [riched20!__security_cookie (74e91090)]

    74e75c63 53              push    ebx

    74e75c64 56              push    esi

    74e75c65 8bf1            mov     esi,ecx

    74e75c67 8b5e1c          mov    ebx,dword ptr [esi+1Ch]

     

    The code that populates array[i-2] with 0x30 is later in the function.  In this version, array is stored at ebp-404.  If there is an underflow it will corrupt ebp-408.

     

    riched20!CTxtSelection::CreateCaret+0x3e1:

    74e76034 66c7847df8fbffff3000 mov word ptr [ebp+edi*2-408h],30h

     

    The value stored at ebp-408 is used in several places in this function, however it is never used after instruction 74e76034 executes.  This means any underflow in the array only corrupts memory that is not used after the corruption, and as a result the corruption never results in a crash.  Although this defect has existed for a long time, the compiler has protected us until now.

     

    74e75d3f 0b85f8fbffff    or      eax,dword ptr [ebp-408h]

    74e75e51 ffb5f8fbffff    push    dword ptr [ebp-408h]

    74e75e8a 8b8df8fbffff    mov     ecx,dword ptr [ebp-408h]

    74e75f20 398df8fbffff    cmp     dword ptr [ebp-408h],ecx

    74e75fec 8b85f8fbffff    mov     eax,dword ptr [ebp-408h]

     

    The issue discussed in this article was addressed as part of KB2883200.

  • Ntdebugging Blog

    We Are Hiring Windows Escalation Engineers in Charlotte and Issaquah

    • 1 Comments

    Would you like to join the world’s best and most elite debuggers to enable the success of Microsoft solutions?

     

    As a trusted advisor to our top customers you will be working with to the most experienced IT professionals and developers in the industry. You will influence our product teams in sustained engineering efforts to drive improvements in our products.

     

    This role involves deep analysis of product source code and debugging to solve problems in multi-million dollar configurations and will give you an opportunity to stretch your critical thinking skills. During the course of debugging, you will uncover opportunities to improve the customer experience while influencing the current and future design of our products.

     

    In addition to providing support to customers while being the primary interface to our sustained engineering teams, you will also have the opportunity to work with new technologies and unreleased software. Through our continuous investment in depth training and hands-on experience with tough customer challenges you will become the world’s best in this area. Expect to partner with many various roles at Microsoft launching a very successful career!

     

    We have positions open at our sites in Charlotte, NC USA; and Issaquah, WA USA.

     

    Learn more about what an Escalation Engineer does at:

    Profile: Ron Stock, CTS Escalation Engineer - Microsoft Customer Service & Support - What is CSS?

    Microsoft JobsBlog JobCast with Escalation Engineer Jeff Dailey

    Microsoft JobsBlog JobCast with Escalation Engineer Scott Oseychik

     

    Apply here:

    Charlotte: http://www.microsoft-careers.com/job/Charlotte-Escalation-Engineer-Job-NC-28201/23321500/

    Issaquah: https://careers.microsoft.com/jobdetails.aspx?ss=&pg=0&so=&rw=8&jid=122974&jlang=EN&pp=SS

  • Ntdebugging Blog

    Great power. Great responsibility.

    • 1 Comments

    When it comes to the registry, administrators are given great power to manually configure Windows to suit their needs, but even slight, seemingly innocuous changes to a particular key or value can have a drastic impact on basic operations of the system, even affecting its ability to boot properly.

     

    I recently had the pleasure of the debugging a black-screen system hang that occurred after applying security updates and rebooting.  After ruling out any “low-hanging fruit” such as deadlocks on executive resources, resource depletion, etc., I decided to survey how far along the boot had gotten.  In the output below we can tell that it’s fairly early in the boot process and that session zero is currently being setup.

     

    When a new session is created, the “Session Leader”(i.e. smss.exe instance not associated with a particular session) launches a new instance of smss.exe, who is then tasked with ensuring that the Windows subsystem gets setup properly, which includes loading and initializing win32k.sys and launching csrss.exe.

     

    3: kd> !process 0 0

    **** NT ACTIVE PROCESS DUMP ****

    PROCESS 8e282840  SessionId: none  Cid: 0004    Peb: 00000000  ParentCid: 0000

        DirBase: 00122000  ObjectTable: 97801e18  HandleCount: 554.

        Image: System

     

    PROCESS 94153ad8  SessionId: none  Cid: 0390    Peb: 7ffdf000  ParentCid: 0004

        DirBase: 03368020  ObjectTable: a13564f0  HandleCount:  19.

        Image: smss.exe

     

    PROCESS 92a55d90  SessionId: 0  Cid: 03c8    Peb: 7ffd9000  ParentCid: 0390

        DirBase: 03368040  ObjectTable: a95606e0  HandleCount:  10.

        Image: smss.exe

     

    PROCESS 92a56c48  SessionId: 0  Cid: 03d4    Peb: 7ffd9000  ParentCid: 03c8

        DirBase: 03368060  ObjectTable: a959ea28  HandleCount:  30.

        Image: csrss.exe

     

    So let’s dump out the threads for these session zero processes and see what they’re doing:

     

    1.  Notice how the Session Manager thread has been waiting for more than fifteen minutes for the Windows subsystem to load and initialize.

     

    3: kd> !process /s 0 0 0x17

    Searching processes with session id 0

    **** NT ACTIVE PROCESS DUMP ****

    PROCESS 92a55d90  SessionId: 0  Cid: 03c8    Peb: 7ffd9000  ParentCid: 0390

        DirBase: 03368040  ObjectTable: a95606e0  HandleCount:  10.

        Image: smss.exe

        VadRoot 941e0578 Vads 8 Clone 0 Private 21. Modified 535. Locked 0.

        DeviceMap 97808b98

        Token                             a95767b8

        ElapsedTime                       00:15:32.714

        UserTime                          00:00:00.000

        KernelTime                        00:00:00.000

        QuotaPoolUsage[PagedPool]         6952

        QuotaPoolUsage[NonPagedPool]      384

        Working Set Sizes (now,min,max)  (125, 50, 345) (500KB, 200KB, 1380KB)

        PeakWorkingSetSize                125

        VirtualSize                       2 Mb

        PeakVirtualSize                   4 Mb

        PageFaultCount                    120

        MemoryPriority                    BACKGROUND

        BasePriority                      8

        CommitCharge                      29

     

            THREAD 92a5d030  Cid 03c8.03cc  Teb: 7ffdf000 Win32Thread: 00000000 WAIT: (UserRequest) UserMode Non-Alertable

                941df930  SynchronizationEvent

                92a56c48  ProcessObject

            Not impersonating

            DeviceMap                 97808b98

            Owning Process            92a55d90       Image:         smss.exe

            Attached Process          N/A            Image:         N/A

            Wait Start TickCount      1896           Ticks: 59778 (0:00:15:32.542)

            Context Switch Count      94             IdealProcessor: 0            

            UserTime                  00:00:00.000

            KernelTime                00:00:00.046

            Win32 Start Address smss!NtProcessStartupW (0x4857d9a2)

            Stack Init 9dc89000 Current 9dc888c0 Base 9dc89000 Limit 9dc86000 Call 0

            Priority 9 BasePriority 8 PriorityDecrement 0 IoPriority 2 PagePriority 5

            Kernel stack not resident.

            ChildEBP RetAddr  Args to Child             

            9dc888d8 81eb923a 92a5d030 97cb5120 92a5d0b8 nt!KiSwapContext+0x26

            9dc8891c 81eb4bca 92a5d030 00000000 00000002 nt!KiSwapThread+0x44f

            9dc88970 82040e83 00000002 9dc88aa8 00000001 nt!KeWaitForMultipleObjects+0x53d

            9dc88bfc 82040bf2 00000002 00000001 00000000 nt!ObpWaitForMultipleObjects+0x256

            9dc88d48 81e57c96 00000002 0008fb38 00000001 nt!NtWaitForMultipleObjects+0xcc

            9dc88d48 778d5d14 00000002 0008fb38 00000001 nt!KiSystemServicePostCall

            0008fac4 778d54a0 4857cc7e 00000002 0008fb38 ntdll!KiFastSystemCallRet

            0008fac8 4857cc7e 00000002 0008fb38 00000001 ntdll!NtWaitForMultipleObjects+0xc

            0008fb40 48579296 0008fb78 0008fb68 0008fbb0 smss!SmscpLoadSubSystem+0x9b

            0008fb80 4857ca8a 0008fbb0 00000000 00000000 smss!SmpExecuteCommand+0x8d

            0008fbc4 4857d0bc 00000000 00000000 00000000 smss!SmscpLoadSubSystemsForMuSession+0x182

            0008fbe8 4857b678 00000003 002417d8 00000000 smss!SmscMain+0xc2

            0008fc7c 4857d988 00000003 002417d8 002417e8 smss!wmain+0x50

            0008fcc0 77886885 00241898 779bde2d 00000000 smss!NtProcessStartupW_AfterSecurityCookieInitialized+0x221

            0008fd00 778b15d6 4857d9a2 7ffd9000 ffffffff ntdll!__RtlUserThreadStart+0x35

            0008fd18 00000000 4857d9a2 7ffd9000 00000000 ntdll!_RtlUserThreadStart+0x1b

     

    2.  Also, we can see that there’s a single active thread within the csrss.exe process, which is a red flag because we know that csrss.exe hosts the Desktop Thread and Raw Input Thread, among others.

     

    The user-mode portion of the Windows subsystem is implemented in csrss.exe and associated “ServerDlls” such as csrsrv.dll, winsrv.dll, basesrv.dll and, on Windows 7 and later, sxssrv.dll.  Also, csrss.exe hosts the Desktop thread and Raw Input thread, whose primary functions include handling inputs from the various input devices.

     

    PROCESS 92a56c48  SessionId: 0  Cid: 03d4    Peb: 7ffd9000  ParentCid: 03c8

        DirBase: 03368060  ObjectTable: a959ea28  HandleCount:  30.

        Image: csrss.exe

        VadRoot 9391c128 Vads 33 Clone 0 Private 193. Modified 60. Locked 0.

        DeviceMap 97808b98

        Token                             a9598b30

        ElapsedTime                       00:15:32.558

        UserTime                          00:00:00.000

        KernelTime                        00:00:03.182

        QuotaPoolUsage[PagedPool]         48312

        QuotaPoolUsage[NonPagedPool]      1584

        Working Set Sizes (now,min,max)  (582, 50, 345) (2328KB, 200KB, 1380KB)

        PeakWorkingSetSize                7285

        VirtualSize                       23 Mb

        PeakVirtualSize                   48 Mb

        PageFaultCount                    49628

        MemoryPriority                    BACKGROUND

        BasePriority                      13

        CommitCharge                      248

     

            THREAD 942c5590  Cid 03d4.03e4  Teb: 00000000 Win32Thread: 00000000 WAIT: (Executive) KernelMode Non-Alertable

                915e4078  NotificationEvent

            Not impersonating

            DeviceMap                 97808b98

            Owning Process            92a56c48       Image:         csrss.exe

            Attached Process          N/A            Image:         N/A

            Wait Start TickCount      2516           Ticks: 59158 (0:00:15:22.870)

            Context Switch Count      1              IdealProcessor: 0            

            UserTime                  00:00:00.000

            KernelTime                00:00:00.000

            Win32 Start Address ati2mtag!IRQMGR_WorkerThreadRoutine (0xa1ccf340)

            Stack Init 9dcfd000 Current 9dcfcc30 Base 9dcfd000 Limit 9dcfa000 Call 0

            Priority 13 BasePriority 13 PriorityDecrement 0 IoPriority 2 PagePriority 5

            ChildEBP RetAddr  Args to Child             

            9dcfcc48 81eb923a 942c5590 942c5618 00000000 nt!KiSwapContext+0x26

            9dcfcc8c 81e54f38 942c5590 00000000 942c5590 nt!KiSwapThread+0x44f

            9dcfcce4 a1d78724 915e4078 00000000 00000000 nt!KeWaitForSingleObject+0x492

            9dcfcd00 a1c13340 908be398 915e4070 00000000 VIDEOPRT!VideoPortWaitForSingleObject+0x53

            9dcfcd14 a1cce17f 908be398 915e4070 00000000 ati2mtag!IRQMgrMP_WaitForSingleObject+0x20

            9dcfcd6c a1ccf355 93f45000 93f45000 9dcfcdc0 ati2mtag!PassiveRing_WorkerThreadRoutine+0x6f

            9dcfcd7c 81fe301c 93f45000 ad8fc28d 00000000 ati2mtag!IRQMGR_WorkerThreadRoutine+0x15

            9dcfcdc0 81e4beee a1ccf340 93f45000 00000000 nt!PspSystemThreadStartup+0x9d

            00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

     

    3.  Having seen this, we now know why the system is perpetually hung:  Csrss.exe is not running properly.  Because there is a video driver worker thread running, but the Desktop Thread and Raw Input Thread are not running, it appears that csrss has attempted to terminate.  The termination has not completed because of the video-driver worker thread performing a non-alertable wait.

     

    4.  Next, we need to check for any state in the dump that might tell us why csrss.exe attempted to terminate:

     

    3: kd> dt nt!eprocess 92a56c48 LastThreadExitStatus

       +0x184 LastThreadExitStatus : 0n-1073741619

     

    3: kd> !error 0n-1073741619

    Error code: (NTSTATUS) 0xc00000cd (3221225677) - The name limit for the local computer network adapter card was exceeded.

     

    After a quick search for STATUS_TOO_MANY_NAMES (0xc00000cd) through the source code, I was able to theorize that csrss.exe may have attempted the termination due to invalid command-line parameters.

     

    3: kd> vertarget

    Windows Server 2008/Windows Vista Kernel Version 6002 (Service Pack 2) MP (16 procs) Free x86 compatible

    Product: Server, suite: Enterprise TerminalServer SingleUserTS

    Built by: 6002.18881.x86fre.vistasp2_gdr.130707-1535

    Machine Name:

    Kernel base = 0x81e0d000 PsLoadedModuleList = 0x81f24c70

    Debug session time: Fri Oct 25 05:10:34.030 2013 (UTC - 5:00)

    System Uptime: 0 days 0:16:02.134

     

    3: kd> .process /p /r 92a56c48

    Implicit process is now 92a56c48

    Loading User Symbols

    .............

     

    3: kd> !peb

    PEB at 7ffd9000

    CommandLine:  'C:\Windows\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,1024 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=winsrv:ConServerDllInitialization,2 ServerDll=sxssrv,4 ProfileControl=Off MaxRequestThreads=16'

     

    Sure enough, there was additional command-line parameter that was not recognized on Vista/Windows Server 2008 SP2 (supported only on Windows 7 and later).  Once the invalid command-line parameter was removed, the server was able to boot normally again.

     

    So how did the invalid value get there?  It turns out that a logon script was setting the following registry value using an export from a Windows 7/Windows 2008 R2 machine where ServerDll=sxssrv,4 is a valid value.

     

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems

    Name:  Windows

    Type:  REG_EXPAND_SZ

     

    Well, that concludes today’s segment, but in the timeless words of Uncle Ben remember “with great power comes great responsibility.”  As we just saw, this applies not only to those possessing a spider-sense, but also to Windows administrators.  J

     

    Until next time, happy debugging!

  • Ntdebugging Blog

    Debugging a Generation 2 Virtual Machine

    • 1 Comments

    Hyper-V is based on the 440BX (PCI) chipset for emulation. The decision to use this chipset started years ago with Connectix Virtual PC.  The advantage of using an emulated chipset based on a popular motherboard like the 440BX, along with associated peripherals, is the compatibility with a large number of operating systems.

     

    Windows Server 2012 R2 introduced the Generation 2 Virtual Machine. It is a UEFI based design, removing emulated devices and replacing them with synthetic devices. Generation 2 VMs no longer support the following devices:

    • Legacy BIOS
    • COM Ports
    • Floppy Controller
    • DMA Controller
    • i8042 keyboard controller
    • PS/2 devices
    • Legacy NIC
    • IDE Controller
    • S3 video
    • PCI BUS
    • Programmable Interrupt Controller
    • Programmable Interrupt Timer
    • Super I/O Device

     

    After reading this list you might ask the question – how do I debug a Generation 2 VM?

     

    The COM port is not actually removed from a Generation 2 VM. The port is turned off by default and not present in the user interface. To enable it for debugging use the following steps.

     

    1.  Shutdown the VM.  You can verify the VM is off using the below PowerShell command.

     

    2.  Turn off secure boot using the following PowerShell Command.

    set-vmfirmware

    image002

     

    3.  Set a COM port path using the following PowerShell command where the path is equal the named pipe.

    set-vmcomport

    image003

     

    4.  To confirm the COM port settings after making the change, use the following command.

    get-vmcomport

    image004

     

    5.  Restart the Virtual Machine using the following command.

    Start-VM –Name VM2

    image005

     

    6.  Inside the guest VM, you can confirm that UEFI has been disabled with the following command. The results are False if UEFI was successfully disabled in step 2 above.

    Confirm-SecureBootUEFI

    image006

     

    7.  Enable Kernel Debugging using BCDEdit.

    BCDEdit /debug ON

    image007

    image008

     

    8.  Configure the debugger to connect to the pipe:

    KernelDebug1

     

    9.  Connect the debugger and break in with Ctrl+Break:

    KernelDebug2

  • Ntdebugging Blog

    Performance Monitor Averages, the Right Way and the Wrong Way

    • 4 Comments

    Performance Monitor (perfmon) is the preferred tool to measure the performance of Windows systems.  The perfmon tool provides an analysis view with a chart and metrics of the Last, Average, Minimum, and Maximum values.

     

    There are scenarios where the line in the chart is the most valuable piece of information, such as a memory leak.  Other times we may not be looking for a trend, the Last, Average, Minimum, and Maximum metrics may be valuable.  One example where the metrics are valuable is when evaluating average disk latency over a period of time.  In this article we are going to use disk latency counters to illustrate how metrics are calculated for performance counters.  The concepts we will illustrate with disk latency apply to all performance counters.  This article will not be a deep dive into understanding disk latency, there are already many sources of information on that topic.

     

    Most performance counter metrics are pretty straightforward.  The minimum and maximum metrics are self-explanatory.  The last metric is the last entry in the data.  The metric that is confusing is the average.  When calculating averages it is important to consider the cardinality of the data.  This is especially important when working with data that is already an average, such as the Avg. Disk sec/Read counter which displays the average time per each read from a disk.

     

    Perfmon logs are gathered at a specific time interval, such as every 15 seconds.  At every interval the counters are read and an entry is written to the log.  In this interval there may have been many reads from the disk, a few reads, or there may have been none.  The number of reads performed is a critical aspect of the average calculation, this is the cardinality of the data.

     

    Consider the following 10 entries in a perfmon log:

    1 reads took 150ms

    0 reads took 0 ms

    0 reads took 0 ms

    0 reads took 0 ms

    0 reads took 0 ms

    0 reads took 0 ms

    0 reads took 0 ms

    0 reads took 0 ms

    0 reads took 0 ms

    0 reads took 0 ms

     

    Often, averages are calculated by adding a column of numbers and dividing by the number of entries.  However this calculation does not work for the above data.  If we simply add and divide we get an average latency of 15ms (150 / 10) per read, but this is clearly incorrect.  There has been 1 read performed and it took 150ms, therefore the average latency is 150ms per read.  Depending on the system configuration, an average read latency of less than 20ms may be considered fast and more than 20ms may be considered slow.  If we perform the calculation incorrectly we may believe the disk is performing adequately while the correct calculation shows the disk is actually very slow.

     

    What data is used to calculate averages?

    Let’s take a look at the data perfmon is working with.  Perfmon stores data in two different structures.  Formatted values are stored as PDH_FMT_COUNTERVALUE.  Raw values are stored as PDH_RAW_COUNTER.

     

    Formatted values are just plain numbers.  They contain only the result of calculating the average of one or more raw values, but not the raw data used to obtain that calculation.  Data stored in a perfmon CSV or TSV file is already formatted, which means they contain a column of floating point numbers.  If our previous example was stored in a CSV or TSV we would have the following data:

    0.15

    0.00

    0.00

    0.00

    0.00

    0.00

    0.00

    0.00

    0.00

    0.00

     

    The above numbers contain no information about how many reads were performed over the course of this log.  Therefore it is impossible to calculate an accurate average from these numbers.  That is not to say CSV and TSV files are worthless, there are many performance scenarios (such as memory leaks) where the average is not important.

     

    Raw counters contain the raw performance information, as delivered by the performance counter to pdh.dll.  In the case of Avg. Disk sec/Read the FirstValue contains the total time for all reads and the SecondValue contains the total number of reads performed.  This information can be used to calculate the average while taking into consideration the cardinality of the data.

     

    Again using the above example, the raw data would look like this:

    FirstValue: 0

    SecondValue: 0

    FirstValue: 2147727

    SecondValue: 1

    FirstValue: 2147727

    SecondValue: 1

     

    On first look the above raw data does not resemble our formatted data at all.  In order to calculate the average we need to know what the correct algorithm is.  The Avg. Disk sec/Read counter is of type PERF_AVERAGE_TIMER and the average calculation is ((Nx - N0) / F) / (Dx - D0).  N refers to FirstValue in the raw counter data, F refers to the number of ticks per second, and D refers to SecondValue.  Ticks per second can be obtained from the PerformanceFrequency parameter of KeQueryPerformanceCounter, in my example it is 14318180.

     

    Using the algorithm for PERF_AVERAGE_TIMER the calculation for the formatted values would be:

    ((2147727 - 0) / 14318180) / (1 - 0) = 0.15

    ((2147727 - 2147727) / 14318180) / (1 - 1) = 0*

    *If the denominator is 0 there is no new data and the result is 0.

     

    Because the raw counter contains both the number of reads performed during each interval and the time it took for these reads to complete, we can accurately calculate the average for many entries.

     

    If you’ve taken the time to read this far you may be wondering why I have taken the time to explain such a mundane topic.  It is important to explain how this works because many performance tools are not using the correct average calculation and many users are trying to calculate averages using data that is not appropriate for such calculations (such as CSV and TSV files).  Programmers should use PdhComputeCounterStatistics to calculate averages and should not sum and divide by the count or duplicate the calculations described in MSDN.

     

    Recently we have found that under some conditions perfmon will use the incorrect algorithm to calculate averages.  When reading from log files perfmon has been formatting the values, summing them, and dividing by the number of entries.  This issue has been corrected in perfmon for Windows 8/Server 2012 with KB2877211 and for Windows 8.1/Server 2012 R2 as part of KB2883200.  We recommend using these fixes when analyzing perfmon logs to determine the average of a performance counter.  Note that KB2877211/KB2883200 only change the behavior when analyzing logs, there is no change when the data is collected.  This means you can collect performance logs from any version of Windows and analyze them on a system with these fixes installed.

  • Ntdebugging Blog

    ResAvail Pages and Working Sets

    • 0 Comments

    Hello everyone, I'm Ray and I'm here to talk a bit about a dump I recently looked at and a little-referenced memory counter called ResAvail Pages (resident available pages).

     

    The problem statement was:  The server hangs after a while.

     

    Not terribly informative, but that's where we start with many cases. First some good housekeeping:

     

    0: kd> vertarget

    Windows 7 Kernel Version 7601 (Service Pack 1) MP (2 procs) Free x64

    Product: Server, suite: TerminalServer SingleUserTS

    Built by: 7601.18113.amd64fre.win7sp1_gdr.130318-1533

    Machine Name: "ASDFASDF1234"

    Kernel base = 0xfffff800`01665000 PsLoadedModuleList = 0xfffff800`018a8670

    Debug session time: Thu Aug  8 09:39:26.992 2013 (UTC - 4:00)

    System Uptime: 9 days 1:08:39.307

     

    Of course Windows 7 Server == Server 2008 R2.

     

    One of the basic things I check at the beginning of these hang dumps with vague problem statements is the memory information.

     

    0: kd> !vm 21

     

    *** Virtual Memory Usage ***

    Physical Memory:     2097038 (   8388152 Kb)

    Page File: \??\C:\pagefile.sys

      Current:  12582912 Kb  Free Space:  12539700 Kb

      Minimum:  12582912 Kb  Maximum:     12582912 Kb

    Available Pages:      286693 (   1146772 Kb)

    ResAvail Pages:          135 (       540 Kb)

     

    ********** Running out of physical memory **********

     

    Locked IO Pages:           0 (         0 Kb)

    Free System PTEs:   33526408 ( 134105632 Kb)

     

    ******* 12 system cache map requests have failed ******

     

    Modified Pages:         4017 (     16068 Kb)

    Modified PF Pages:      4017 (     16068 Kb)

    NonPagedPool Usage:   113241 (    452964 Kb)

    NonPagedPool Max:    1561592 (   6246368 Kb)

    PagedPool 0 Usage:     35325 (    141300 Kb)

    PagedPool 1 Usage:     28162 (    112648 Kb)

    PagedPool 2 Usage:     24351 (     97404 Kb)

    PagedPool 3 Usage:     24350 (     97400 Kb)

    PagedPool 4 Usage:     24516 (     98064 Kb)

    PagedPool Usage:      136704 (    546816 Kb)

    PagedPool Maximum:  33554432 ( 134217728 Kb)

     

    ********** 222 pool allocations have failed **********

     

    Session Commit:         6013 (     24052 Kb)

    Shared Commit:          6150 (     24600 Kb)

    Special Pool:              0 (         0 Kb)

    Shared Process:      1214088 (   4856352 Kb)

    Pages For MDLs:           67 (       268 Kb)

    PagedPool Commit:     136768 (    547072 Kb)

    Driver Commit:         15548 (     62192 Kb)

    Committed pages:     1648790 (   6595160 Kb)

    Commit limit:        5242301 (  20969204 Kb)

     

    So we're failing to allocate pool, but we aren't out of virtual memory for paged pool or nonpaged pool.  Let's look at the breakdown:

     

    0: kd> dd nt!MmPoolFailures l?9

    fffff800`01892160  000001be 00000000 00000000 00000002

    fffff800`01892170  00000000 00000000 00000000 00000000

    fffff800`01892180  00000000

     

    Where:

        yellow   = Nonpaged high/medium/low priority failures

        green    = Paged high/medium/low priority failures

        cyan      = Session paged high/medium/low priority failures

     

    So we actually failed both nonpaged AND paged pool allocations in this case.  Why?  We're "Running out of physical memory", obviously.  So where does this running out of physical memory message come from?  In the above example this is from the ResAvail Pages counter.

     

    ResAvail Pages is the amount of physical memory there would be if every working set was at its minimum size and only what needs to be resident in RAM was present (e.g. PFN database, system PTEs, driver images, kernel thread stacks, nonpaged pool, etc).

     

    Where did this memory go then?  We have plenty of Available Pages (Free + Zero + Standby) for use.  So something is claiming memory it isn't actually using.  In this type of situation one of the things I immediately suspect is process working set minimums. Working set basically means the physical memory used by a process.

     

    So let's check.

     

    0: kd> !process 0 1

     

    <a lot of processes in this output>.

     

    PROCESS fffffa8008f76060

        SessionId: 0  Cid: 0adc    Peb: 7fffffda000  ParentCid: 0678

        DirBase: 204ac9000  ObjectTable: 00000000  HandleCount:   0.

        Image: cscript.exe

        VadRoot 0000000000000000 Vads 0 Clone 0 Private 1. Modified 3. Locked 0.

        DeviceMap fffff8a000008a70

        Token                             fffff8a0046f9c50

        ElapsedTime                       9 Days 01:08:00.134

        UserTime                          00:00:00.000

        KernelTime                        00:00:00.015

        QuotaPoolUsage[PagedPool]         0

        QuotaPoolUsage[NonPagedPool]      0

        Working Set Sizes (now,min,max)  (5, 50, 345) (20KB, 200KB, 1380KB)

        PeakWorkingSetSize                1454

        VirtualSize                       65 Mb

        PeakVirtualSize                   84 Mb

        PageFaultCount                    1628

        MemoryPriority                    BACKGROUND

        BasePriority                      8

        CommitCharge                      0

     

    I have only shown one example process above for brevity's sake, but there were thousands returned.  241,423 to be precise.  None had abnormally high process working set minimums, but cumulatively their usage adds up.

     

    The “now” process working set is lower than the minimum working set.  How is that possible?  Well, the minimum and maximum are not hard limits, but suggested limits.  For example, the minimum working set is honored unless there is memory pressure, in which case it can be trimmed below this value.  There is a way to set the min and/or max as hard limits on specific processes by using the QUOTA_LIMITS_HARDWS_MIN_ENABLE flag via SetProcessWorkingSetSize.

     

    You can view if the minimum and maximum working set values are configured in the _EPROCESS->Vm->Flags structure.  Note these numbers are from another system as this structure was already torn down for the processes we were looking at.

     

    0: kd> dt _EPROCESS fffffa8008f76060 Vm

    nt!_EPROCESS

       +0x398 Vm : _MMSUPPORT

    0: kd> dt _MMSUPPORT fffffa8008f76060+0x398

    nt!_MMSUPPORT

       +0x000 WorkingSetMutex  : _EX_PUSH_LOCK

       +0x008 ExitGate         : 0xfffff880`00961000 _KGATE

       +0x010 AccessLog        : (null)

       +0x018 WorkingSetExpansionLinks : _LIST_ENTRY [ 0x00000000`00000000 - 0xfffffa80`08f3c410 ]

       +0x028 AgeDistribution  : [7] 0

       +0x044 MinimumWorkingSetSize : 0x32

       +0x048 WorkingSetSize   : 5

       +0x04c WorkingSetPrivateSize : 5

       +0x050 MaximumWorkingSetSize : 0x159

       +0x054 ChargedWslePages : 0

       +0x058 ActualWslePages  : 0

       +0x05c WorkingSetSizeOverhead : 0

       +0x060 PeakWorkingSetSize : 0x5ae

       +0x064 HardFaultCount   : 0x41

       +0x068 VmWorkingSetList : 0xfffff700`01080000 _MMWSL

       +0x070 NextPageColor    : 0x2dac

       +0x072 LastTrimStamp    : 0

       +0x074 PageFaultCount   : 0x65c

       +0x078 RepurposeCount   : 0x1e1

       +0x07c Spare            : [2] 0

       +0x084 Flags            : _MMSUPPORT_FLAGS

    0: kd> dt _MMSUPPORT_FLAGS fffffa8008f76060+0x398+0x84

    nt!_MMSUPPORT_FLAGS

       +0x000 WorkingSetType   : 0y000

       +0x000 ModwriterAttached : 0y0

       +0x000 TrimHard         : 0y0

       +0x000 MaximumWorkingSetHard : 0y0

       +0x000 ForceTrim        : 0y0

       +0x000 MinimumWorkingSetHard : 0y0

       +0x001 SessionMaster    : 0y0

       +0x001 TrimmerState     : 0y00

       +0x001 Reserved         : 0y0

       +0x001 PageStealers     : 0y0000

       +0x002 MemoryPriority   : 0y00000000 (0)

       +0x003 WsleDeleted      : 0y1

       +0x003 VmExiting        : 0y1

       +0x003 ExpansionFailed  : 0y0

       +0x003 Available        : 0y00000 (0)

     

    How about some more detail?

     

    0: kd> !process fffffa8008f76060

    PROCESS fffffa8008f76060

        SessionId: 0  Cid: 0adc    Peb: 7fffffda000  ParentCid: 0678

        DirBase: 204ac9000  ObjectTable: 00000000  HandleCount:  0.

        Image: cscript.exe

        VadRoot 0000000000000000 Vads 0 Clone 0 Private 1. Modified 3. Locked 0.

        DeviceMap fffff8a000008a70

        Token                             fffff8a0046f9c50

        ElapsedTime                       9 Days 01:08:00.134

        UserTime                          00:00:00.000

        KernelTime                        00:00:00.015

        QuotaPoolUsage[PagedPool]         0

        QuotaPoolUsage[NonPagedPool]      0

        Working Set Sizes (now,min,max)  (5, 50, 345) (20KB, 200KB, 1380KB)

        PeakWorkingSetSize                1454

        VirtualSize                       65 Mb

        PeakVirtualSize                   84 Mb

        PageFaultCount                    1628

        MemoryPriority                    BACKGROUND

        BasePriority                      8

        CommitCharge                      0

     

    No active threads

     

    0: kd> !object fffffa8008f76060

    Object: fffffa8008f76060  Type: (fffffa8006cccc90) Process

        ObjectHeader: fffffa8008f76030 (new version)

        HandleCount: 0  PointerCount: 1

     

    The highlighted information shows us that this process has no active threads left but the process object itself (and its 20KB working set use) were still hanging around because a kernel driver had a reference to the object that it never released.  Sampling other entries shows the server had been leaking process objects since it was booted.

     

    Unfortunately trying to directly track down pointer leaks on process objects is difficult and requires an instrumented kernel, so we tried to check the easy stuff first before going that route.  We know it has to be a kernel driver doing this (since it is a pointer and not a handle leak) so we looked at the list of 3rd party drivers installed.  Note: The driver names have been redacted.

     

    0: kd> lm

    start             end                 module name

    <snip>

    fffff880`04112000 fffff880`04121e00   driver1    (no symbols)   <-- no symbols usually means 3rd party       

    fffff880`04158000 fffff880`041a4c00   driver2    (no symbols)          

    <snip>

     

    0: kd> lmvm driver1   

    Browse full module list

    start             end                 module name

    fffff880`04112000 fffff880`04121e00   driver1    (no symbols)          

        Loaded symbol image file: driver1.sys

        Image path: \SystemRoot\system32\DRIVERS\driver1.sys

        Image name: driver1.sys

        Browse all global symbols  functions  data

        Timestamp:        Wed Dec 13 12:09:32 2006 (458033CC)

        CheckSum:         0001669E

        ImageSize:        0000FE00

        Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

    0: kd> lmvm driver2

    Browse full module list

    start             end                 module name

    fffff880`04158000 fffff880`041a4c00   driver2    (no symbols)          

        Loaded symbol image file: driver2.sys

        Image path: \??\C:\Windows\system32\drivers\driver2.sys

        Image name: driver2.sys

        Browse all global symbols  functions  data

        Timestamp:        Thu Nov 30 12:12:07 2006 (456F10E7)

        CheckSum:         0004FE8E

        ImageSize:        0004CC00

        Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4

     

    Fortunately for both the customer and us we turned up a pair of drivers that predated Windows Vista (meaning they were designed for XP/2003) that raised an eyebrow.  Of course we need a more solid evidence link than just "it's an old driver", so I did a quick search of our internal KB.  This turned up several other customers who had these same drivers installed, experienced the same problem, then removed them and the problem went away.  That sounds like a pretty good evidence link. We implemented the same plan for this customer successfully.

  • Ntdebugging Blog

    Missing System Writer Case Explained

    • 3 Comments

    I worked on a case the other day where all I had was a procmon log and event logs to troubleshoot a problem where the System Writer did not appear in the VSSADMIN LIST WRITERS output. This might be review for the folks that know this component pretty well but I figured I would share how I did it for those that are not so familiar with the System Writer.

     

    WHAT WE KNOW:

    1. System State Backups fail
    2. Running a VSS List Writers does not list the system writer

     

    Looking at the event logs I found the error shown below. This error indicates there was a failure while “Writer Exposing its Metadata Context”. Each writer is responsible for providing a list of files, volumes, and other resources it is designed to protect. This list is called metadata and is formatted as XML. In the example we are working with the error is “Unexpected error calling routine XML document is too long”.  While helpful, this message alone does not provide us with a clear reason why the XML document is too long.

     

    Event Type: Error

    Event Source: VSS

    Event ID: 8193

    Description: Volume Shadow Copy Service error: Unexpected error calling routine XML document is too long. hr = 0x80070018, The program issued a command but the command length is incorrect. . Operation: Writer Exposing its Metadata Context: Execution Context: Requestor Writer Instance ID: {636923A0-89C2-4823-ADEF-023A739B2515} Writer Class Id: {E8132975-6F93-4464-A53E-1050253AE220} Writer Name: System Writer

     

    The second event that was logged was also not very helpful as it only indicates that the writer did have a failure. It looks like we are going to need to collect more data to figure this out.

     

    Event Type: Error

    Event Source: VSS

    Event ID: 8193

    Description: Volume Shadow Copy Service error: Unexpected error calling routine CreateVssExamineWriterMetadata. hr = 0x80042302, A Volume Shadow Copy Service component encountered an unexpected error. Check the Application event log for more information. . Operation: Writer Exposing its Metadata Context: Execution Context: Requestor Writer Instance ID: {636923A0-89C2-4823-ADEF-023A739B2515} Writer Class Id: {E8132975-6F93-4464-A53E-1050253AE220} Writer Name: System Writer

     

    From the error above we learned that there was an issue with the metadata file for the System Writer. These errors are among some of the most common issues seen with this writer. There are some not so well documented limitations within the writer due to some hard set limits on path depth and the number of files in a given path. These limitations are frequently exposed by the C:\Windows\Microsoft.Net\ path. Often, this path is used by development software like Visual Studio as well as server applications like IIS. Below I have listed a few known issues that should help provide some scope when troubleshooting System Writer issues.

     

    Known limitations and common points of failure:

    • More than 1,000 folders in a folder causes writer to fail during OnIdentify
    • More than 10,000 files in a folder causes writer to fail during OnIdentify (frequently C:\Windows\Microsoft.Net)
    • Permissions issues (frequently in C:\Windows\WinSXS and C:\Windows\Microsoft.Net)
    • Permissions issues with COM+ Event System Service
      • This service needs to be running and needs to have Network Service with Service User Rights

     

    What data can I capture to help me find where the issue is?

     

    The best place to start is with a Process Monitor (Procmon) capture. To prepare for this capture you will need to download Process Monitor, open the Services MMC snap-in, as well as open an administrative command prompt which will be used in a later step of the process.

     

    You should first stop the Cryptographic Services service using the Services MMC.

     

     

    Once stopped you will want to open Procmon, note that by default Procmon will start capturing when opened. Now that you have Procmon open and capturing data you will start the cryptographic service. This will allow you to capture any errors during service initialization. Once the service is started you will use the command prompt opened earlier to run “vssadmin.exe list writers”. This will signal the writers on the system to capture their metadata, which is a common place we see failures with the System Writer. When the vssadmin command completes, stop the Procmon capture and save this data to disk.

     

    Now that we have data how do we find the errors?

     

    Open your newly created Procmon file. First, add a new filter for the path field that contains “VSS\Diag”.

     

     

    We do this because this is the registry path that all writers will log to when entering and leaving major functions. Now that we have our filtered view we need to look for an entry from the System Writer. You can see the highlighted line below shows the “IDENTIFY” entry for the System Writer. From here we can ascertain the PID of the svchost.exe that the system writer is running in. We now want to include only this svchost. To accomplish this you can right click on the PID for the svchost.exe and select “Include ‘(PID)’”. 

     

     

    Now that we have found our System Writers svchost we will want to remove the filter for “VSS\Diag”; to do that you can return to the filter tool in Procmon and uncheck the box next to the entry.

     

     

    We now have a complete view of what this service was doing at the time it started and completed the OnIdentify. Our next step is to locate the IDENTIFY (Leave) entry as this is often a great marker for where your next clue will be. While in most cases we can’t directly see the error the writer hit we can make some educated connections based on the common issues we spoke about above. If we take a look at the events that took place just before the IDENTIFY (Leave) we can see that we were working in the C:\Windows\Microsoft.NET\assembly\ directory. This is one of the paths that the System Writer is responsible for protecting. As mentioned above, there are some known limitations to the depth of paths and number of files in the “C:\Windows\Microsoft.NET” folder. This is a great example of that limitation as seen in our procmon capture. The example below shows the IDENTIFY (Leave) with the line before that being where the last work was taking place. Meaning this is what the writer was touching when it failed.

     

     

    What does this tell us and what should we do next?

     

    Given the known path limitations, we need to check out the number of files and folders in the C:\Windows\Microsoft.Net\ path and see where the bloat is. Some of these files can be safely removed, however only files located in the Temp locations (Temporary ASP.NET Files) are safe to delete.

     

    Recently we released KB2807849 which addresses the issue shown above.

     

    There are other possible causes of the event log errors mentioned above, such as issues with file permissions. For those problems follow the same steps as above and you are likely to see the IDENTIFY (Leave) just after file access error is displayed in your procmon log. For these failures you will need to investigate the permissions on the file we failed on. Likely the file is missing permissions for the writer’s service account Network Service or Local System. All that is needed here is to add back the missing permissions for your failed file.

     

    While these issues can halt your nightly backups, it is often fairly easy to find the root cause. It just takes time and a little bit of experience with Process Monitor. 

     

    Good luck and successful backups!

  • Ntdebugging Blog

    Understanding Pool Corruption Part 2 – Special Pool for Buffer Overruns

    • 1 Comments

    In our previous article we discussed pool corruption that occurs when a driver writes too much data in a buffer.  In this article we will discuss how special pool can help identify the driver that writes too much data.

     

    Pool is typically organized to allow multiple drivers to store data in the same page of memory, as shown in Figure 1.  By allowing multiple drivers to share the same page, pool provides for an efficient use of the available kernel memory space.  However this sharing requires that each driver be careful in how it uses pool, any bugs where the driver uses pool improperly may corrupt the pool of other drivers and cause a crash.

     

    Figure 1 – Uncorrupted Pool

     

    With pool organized as shown in Figure 1, if DriverA allocates 100 bytes but writes 120 bytes it will overwrite the pool header and data stored by DriverB.  In Part 1 we demonstrated this type of buffer overflow using NotMyFault, but we were not able to identify which code had corrupted the pool.

     

    Figure 2 – Corrupted Pool

     

    To catch the driver that corrupted pool we can use special pool.  Special pool changes the organization of the pool so that each driver’s allocation is in a separate page of memory.  This helps prevent drivers from accidentally writing to another driver’s memory.  Special pool also configures the driver’s allocation at the end of the page and sets the next virtual page as a guard page by marking it as invalid.  The guard page causes an attempt to write past the end of the allocation to result in an immediate bugcheck.

     

    Special pool also fills the unused portion of the page with a repeating pattern, referred to as “slop bytes”.  These slop bytes will be checked when the page is freed, if any errors are found in the pattern a bugcheck will be generated to indicate that the memory was corrupted.  This type of corruption is not a buffer overflow, it may be an underflow or some other form of corruption.

     

    Figure 3 – Special Pool

     

    Because special pool stores each pool allocation in its own 4KB page, it causes an increase in memory usage.  When special pool is enabled the memory manager will configure a limit of how much special pool may be allocated on the system, when this limit is reached the normal pools will be used instead.  This limitation may be especially pronounced on 32-bit systems which have less kernel space than 64-bit systems.

     

    Now that we have explained how special pool works, we should use it.

     

    There are two methods to enable special pool.  Driver verifier allows special pool to be enabled on specific drivers.  The PoolTag registry value described in KB188831 allows special pool to be enabled for a particular pool tag.  Starting in Windows Vista and Windows Server 2008, driver verifier captures additional information for special pool allocations so this is typically the recommended method.

     

    To enable special pool using driver verifier use the following command line, or choose the option from the verifier GUI.  Use the /driver flag to specify drivers you want to verify, this is the place to list drivers you suspect as the cause of the problem.  You may want to verify drivers you have written and want to test or drivers you have recently updated on the system.  In the command line below I am only verifying myfault.sys.  A reboot is required to enable special pool.

     

    verifier /flags 1 /driver myfault.sys

     

    After enabling verifier and rebooting the system, repeat the activity that causes the crash.  For some problems the activity may just be to wait for a period of time.  For our demonstration we are running NotMyFault (see Part 1 for details).

     

    The crash resulting from a buffer overflow in special pool will be a stop 0xD6, DRIVER_PAGE_FAULT_BEYOND_END_OF_ALLOCATION.

     

    kd> !analyze -v

    *******************************************************************************

    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *

    *******************************************************************************

     

    DRIVER_PAGE_FAULT_BEYOND_END_OF_ALLOCATION (d6)

    N bytes of memory was allocated and more than N bytes are being referenced.

    This cannot be protected by try-except.

    When possible, the guilty driver's name (Unicode string) is printed on

    the bugcheck screen and saved in KiBugCheckDriver.

    Arguments:

    Arg1: fffff9800b5ff000, memory referenced

    Arg2: 0000000000000001, value 0 = read operation, 1 = write operation

    Arg3: fffff88004f834eb, if non-zero, the address which referenced memory.

    Arg4: 0000000000000000, (reserved)

     

    We can debug this crash and determine that notmyfault.sys wrote beyond its pool buffer.

     

    The call stack shows that myfault.sys accessed invalid memory and this generated a page fault.

     

    kd> k

    Child-SP          RetAddr           Call Site

    fffff880`04822658 fffff803`721333f1 nt!KeBugCheckEx

    fffff880`04822660 fffff803`720acacb nt! ?? ::FNODOBFM::`string'+0x33c2b

    fffff880`04822700 fffff803`7206feee nt!MmAccessFault+0x55b

    fffff880`04822840 fffff880`04f834eb nt!KiPageFault+0x16e

    fffff880`048229d0 fffff880`04f83727 myfault+0x14eb

    fffff880`04822b20 fffff803`72658a4a myfault+0x1727

    fffff880`04822b80 fffff803`724476c7 nt!IovCallDriver+0xba

    fffff880`04822bd0 fffff803`7245c8a6 nt!IopXxxControlFile+0x7e5

    fffff880`04822d60 fffff803`72071453 nt!NtDeviceIoControlFile+0x56

    fffff880`04822dd0 000007fc`4fe22c5a nt!KiSystemServiceCopyEnd+0x13

    00000000`004debb8 00000000`00000000 0x000007fc`4fe22c5a

     

    The !pool command shows that the address being referenced by myfault.sys is special pool.

     

    kd> !pool fffff9800b5ff000

    Pool page fffff9800b5ff000 region is Special pool

    fffff9800b5ff000: Unable to get contents of special pool block

     

    The page table entry shows that the address is not valid.  This is the guard page used by special pool to catch overruns.

     

    kd> !pte fffff9800b5ff000

                                               VA fffff9800b5ff000

    PXE at FFFFF6FB7DBEDF98    PPE at FFFFF6FB7DBF3000    PDE at FFFFF6FB7E6002D0    PTE at FFFFF6FCC005AFF8

    contains 0000000001B8F863  contains 000000000138E863  contains 000000001A6A1863  contains 0000000000000000

    pfn 1b8f      ---DA--KWEV  pfn 138e      ---DA--KWEV  pfn 1a6a1     ---DA--KWEV  not valid

     

    The allocation prior to this memory is an 800 byte block of non paged pool tagged as “Wrap”.  “Wrap” is the tag used by verifier when pool is allocated without a tag, it is the equivalent to the “None” tag we saw in Part 1.

     

    kd> !pool fffff9800b5ff000-1000

    Pool page fffff9800b5fe000 region is Special pool

    *fffff9800b5fe000 size:  800 data: fffff9800b5fe800 (NonPaged) *Wrap

                Owning component : Unknown (update pooltag.txt)

     

    Special pool is an effective mechanism to track down buffer overflow pool corruption.  It can also be used to catch other types of pool corruption which we will discuss in future articles.

  • Ntdebugging Blog

    Understanding Pool Corruption Part 1 – Buffer Overflows

    • 6 Comments

    Before we can discuss pool corruption we must understand what pool is.  Pool is kernel mode memory used as a storage space for drivers.  Pool is organized in a similar way to how you might use a notepad when taking notes from a lecture or a book.  Some notes may be 1 line, others may be many lines.  Many different notes are on the same page.

     

    Memory is also organized into pages, typically a page of memory is 4KB.  The Windows memory manager breaks up this 4KB page into smaller blocks.  One block may be as small as 8 bytes or possibly much larger.  Each of these blocks exists side by side with other blocks.

     

    The !pool command can be used to see the pool blocks stored in a page.

     

    kd> !pool fffffa8003f42000

    Pool page fffffa8003f42000 region is Nonpaged pool

    *fffffa8003f42000 size:  410 previous size:    0  (Free)      *Irp

                Pooltag Irp  : Io, IRP packets

    fffffa8003f42410 size:   40 previous size:  410  (Allocated)  MmSe

    fffffa8003f42450 size:  150 previous size:   40  (Allocated)  File

    fffffa8003f425a0 size:   80 previous size:  150  (Allocated)  Even

    fffffa8003f42620 size:   c0 previous size:   80  (Allocated)  EtwR

    fffffa8003f426e0 size:   d0 previous size:   c0  (Allocated)  CcBc

    fffffa8003f427b0 size:   d0 previous size:   d0  (Allocated)  CcBc

    fffffa8003f42880 size:   20 previous size:   d0  (Free)       Free

    fffffa8003f428a0 size:   d0 previous size:   20  (Allocated)  Wait

    fffffa8003f42970 size:   80 previous size:   d0  (Allocated)  CM44

    fffffa8003f429f0 size:   80 previous size:   80  (Allocated)  Even

    fffffa8003f42a70 size:   80 previous size:   80  (Allocated)  Even

    fffffa8003f42af0 size:   d0 previous size:   80  (Allocated)  Wait

    fffffa8003f42bc0 size:   80 previous size:   d0  (Allocated)  CM44

    fffffa8003f42c40 size:   d0 previous size:   80  (Allocated)  Wait

    fffffa8003f42d10 size:  230 previous size:   d0  (Allocated)  ALPC

    fffffa8003f42f40 size:   c0 previous size:  230  (Allocated)  EtwR

     

    Because many pool allocations are stored in the same page, it is critical that every driver only use the space they have allocated.  If DriverA uses more space than it allocated they will write into the next driver’s space (DriverB) and corrupt DriverB’s data.  This overwrite into the next driver’s space is called a buffer overflow.  Later either the memory manager or DriverB will attempt to use this corrupted memory and will encounter unexpected information.  This unexpected information typically results in a blue screen.

     

    The NotMyFault application from Sysinternals has an option to force a buffer overflow.  This can be used to demonstrate pool corruption.  Choosing the “Buffer overflow” option and clicking “Crash” will cause a buffer overflow in pool.  The system may not immediately blue screen after clicking the Crash button.  The system will remain stable until something attempts to use the corrupted memory.  Using the system will often eventually result in a blue screen.

     

    NotMyFault

     

    Often pool corruption appears as a stop 0x19 BAD_POOL_HEADER or stop 0xC2 BAD_POOL_CALLER.  These stop codes make it easy to determine that pool corruption is involved in the crash.  However, the results of accessing unexpected memory can vary widely, as a result pool corruption can result in many different types of bugchecks.

     

    As with any blue screen dump analysis the best place to start is with !analyze -v.  This command will display the stop code and parameters, and do some basic interpretation of the crash.

     

    kd> !analyze -v

    *******************************************************************************

    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *

    *******************************************************************************

     

    SYSTEM_SERVICE_EXCEPTION (3b)

    An exception happened while executing a system service routine.

    Arguments:

    Arg1: 00000000c0000005, Exception code that caused the bugcheck

    Arg2: fffff8009267244a, Address of the instruction which caused the bugcheck

    Arg3: fffff88004763560, Address of the context record for the exception that caused the bugcheck

    Arg4: 0000000000000000, zero.

     

    In my example the bugcheck was a stop 0x3B SYSTEM_SERVICE_EXCEPTION.  The first parameter of this stop code is c0000005, which is a status code for an access violation.  An access violation is an attempt to access invalid memory (this error is not related to permissions).  Status codes can be looked up in the WDK header ntstatus.h.

     

    The !analyze -v command also provides a helpful shortcut to get into the context of the failure.

     

    CONTEXT:  fffff88004763560 -- (.cxr 0xfffff88004763560;r)

     

    Running this command shows us the registers at the time of the crash.

     

    kd> .cxr 0xfffff88004763560

    rax=4f4f4f4f4f4f4f4f rbx=fffff80092690460 rcx=fffff800926fbc60

    rdx=0000000000000000 rsi=0000000000001000 rdi=0000000000000000

    rip=fffff8009267244a rsp=fffff88004763f60 rbp=fffff8009268fb40

    r8=fffffa8001a1b820  r9=0000000000000001 r10=fffff800926fbc60

    r11=0000000000000011 r12=0000000000000000 r13=fffff8009268fb48

    r14=0000000000000012 r15=000000006374504d

    iopl=0         nv up ei pl nz na po nc

    cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00010206

    nt!ExAllocatePoolWithTag+0x442:

    fffff800`9267244a 4c8b4808        mov     r9,qword ptr [rax+8] ds:002b:4f4f4f4f`4f4f4f57=????????????????

     

    From the above output we can see that the crash occurred in ExAllocatePoolWithTag, which is a good indication that the crash is due to pool corruption.  Often an engineer looking at a dump will stop at this point and conclude that a crash was caused by corruption, however we can go further.

     

    The instruction that we failed on was dereferencing rax+8.  The rax register contains 4f4f4f4f4f4f4f4f, which does not fit with the canonical form required for pointers on x64 systems.  This tells us that the system crashed because the data in rax is expected to be a pointer but it is not one.

     

    To determine why rax does not contain the expected data we must examine the instructions prior to where the failure occurred.

     

    kd> ub .

    nt!KzAcquireQueuedSpinLock [inlined in nt!ExAllocatePoolWithTag+0x421]:

    fffff800`92672429 488d542440      lea     rdx,[rsp+40h]

    fffff800`9267242e 49875500        xchg    rdx,qword ptr [r13]

    fffff800`92672432 4885d2          test    rdx,rdx

    fffff800`92672435 0f85c3030000    jne     nt!ExAllocatePoolWithTag+0x7ec (fffff800`926727fe)

    fffff800`9267243b 48391b          cmp     qword ptr [rbx],rbx

    fffff800`9267243e 0f8464060000    je      nt!ExAllocatePoolWithTag+0xa94 (fffff800`92672aa8)

    fffff800`92672444 4c8b03          mov     r8,qword ptr [rbx]

    fffff800`92672447 498b00          mov     rax,qword ptr [r8]

     

    The assembly shows that rax originated from the data pointed to by r8.  The .cxr command we ran earlier shows that r8 is fffffa8001a1b820.  If we examine the data at fffffa8001a1b820 we see that it matches the contents of rax, which confirms this memory is the source of the unexpected data in rax.

     

    kd> dq fffffa8001a1b820 l1

    fffffa80`01a1b820  4f4f4f4f`4f4f4f4f

     

    To determine if this unexpected data is caused by pool corruption we can use the !pool command.

     

    kd> !pool fffffa8001a1b820

    Pool page fffffa8001a1b820 region is Nonpaged pool

     fffffa8001a1b000 size:  810 previous size:    0  (Allocated)  None

     

    fffffa8001a1b810 doesn't look like a valid small pool allocation, checking to see

    if the entire page is actually part of a large page allocation...

     

    fffffa8001a1b810 is not a valid large pool allocation, checking large session pool...

    fffffa8001a1b810 is freed (or corrupt) pool

    Bad previous allocation size @fffffa8001a1b810, last size was 81

     

    ***

    *** An error (or corruption) in the pool was detected;

    *** Attempting to diagnose the problem.

    ***

    *** Use !poolval fffffa8001a1b000 for more details.

     

     

    Pool page [ fffffa8001a1b000 ] is __inVALID.

     

    Analyzing linked list...

    [ fffffa8001a1b000 --> fffffa8001a1b010 (size = 0x10 bytes)]: Corrupt region

     

     

    Scanning for single bit errors...

     

    None found

     

    The above output does not look like the !pool command we used earlier.   This output shows corruption to the pool header which prevented the command from walking the chain of allocations.

     

    The above output shows that there is an allocation at fffffa8001a1b000 of size 810.  If we look at this memory we should see a pool header.  Instead what we see is a pattern of 4f4f4f4f`4f4f4f4f.

     

    kd> dq fffffa8001a1b000 + 810

    fffffa80`01a1b810  4f4f4f4f`4f4f4f4f 4f4f4f4f`4f4f4f4f

    fffffa80`01a1b820  4f4f4f4f`4f4f4f4f 4f4f4f4f`4f4f4f4f

    fffffa80`01a1b830  4f4f4f4f`4f4f4f4f 00574f4c`46524556

    fffffa80`01a1b840  00000000`00000000 00000000`00000000

    fffffa80`01a1b850  00000000`00000000 00000000`00000000

    fffffa80`01a1b860  00000000`00000000 00000000`00000000

    fffffa80`01a1b870  00000000`00000000 00000000`00000000

    fffffa80`01a1b880  00000000`00000000 00000000`00000000

     

    At this point we can be confident that the system crashed because of pool corruption.

     

    Because the corruption occurred in the past, and a dump is a snapshot of the current state of the system, there is no concrete evidence to indicate how the memory came to be corrupted.  It is possible the driver that allocated the pool block immediately preceding the corruption is the one that wrote to the wrong location and caused this corruption.  This pool block is marked with the tag “None”, we can search for this tag in memory to determine which drivers use it.

     

    kd> !for_each_module s -a @#Base @#End "None"

    fffff800`92411bc2  4e 6f 6e 65 e9 45 04 26-00 90 90 90 90 90 90 90  None.E.&........

    kd> u fffff800`92411bc2-1

    nt!ExAllocatePool+0x1:

    fffff800`92411bc1 b84e6f6e65      mov     eax,656E6F4Eh

    fffff800`92411bc6 e945042600      jmp     nt!ExAllocatePoolWithTag (fffff800`92672010)

    fffff800`92411bcb 90              nop

     

    The file Pooltag.txt lists the pool tags used for pool allocations by kernel-mode components and drivers supplied with Windows, the associated file or component (if known), and the name of the component. Pooltag.txt is installed with Debugging Tools for Windows (in the triage folder) and with the Windows WDK (in \tools\other\platform\poolmon).  Pooltag.txt shows the following for this tag:

     

    None - <unknown>    - call to ExAllocatePool

     

    Unfortunately what we find is that this tag is used when a driver calls ExAllocatePool, which does not specify a tag.  This does not allow us to determine what driver allocated the block prior to the corruption.  Even if we could tie the tag back to a driver it may not be sufficient to conclude that the driver using this tag is the one that corrupted the memory.

     

    The next step should be to enable special pool and hope to catch the corruptor in the act.  We will discuss special pool in our next article.

Page 2 of 24 (237 items) 12345»