• Ntdebugging Blog

    Understanding ARM Assembly Part 3


    My name is Marion Cole, and I am a Sr. Escalation Engineer in Microsoft Platforms Serviceability group.  This is Part 3 of my series of articles about ARM assembly.  In part 1 we talked about the processor that is supported.  In part 2 we talked about how Windows utilizes that ARM processor.  In this part we will cover Calling Conventions, Prolog/Epilog, and Rebuilding the stack.

     

    Calling Conventions

    In ARM there is only one calling convention, and it is simple.  The first four 32-bit or smaller variables are passed in R0-R3.  The remaining values go onto the stack.  If any of the first four variables are 8 or 16 bits in size, they are extended to fill the 32-bit register (zero-extended if unsigned, sign-extended if signed).  If any of the first four variables are 64 bits in size, they have to be 64-bit aligned, which means the variable is placed in an even/odd register pair, for example R0/R1 or R2/R3.  Here are some examples:

      Example                                    R0    R1        R2    R3    Stack
      1. Foo (int I0, int I1, int I2, int I3)    I0    I1        I2    I3    (empty)
      2. Foo (int I0, double D, int I1)          I0    unused    D     D     I1
      3. Foo (int I0, int I1, double D)          I0    I1        D     D     (empty)

     

    In the first example the function Foo takes four integer values.  All of these are passed in the registers R0 - R3.  This one is pretty simple.

     

    In the second example the function Foo takes an integer, a double, and another integer.  The first integer is put into R0.  The double has to be in an even/odd register pair, so R1 is left unused and the double goes into R2/R3.  The last integer is pushed onto the stack.  This ordering is discouraged because it wastes a register; instead, order your parameters so that they pack tightly, as in the third example.  Also in this example the stack has to stay 8-byte aligned, so an additional unused word is pushed and popped along with the integer.  Note that on ARM a Byte is 8 bits, a Halfword is 16 bits, and a Word is 32 bits.

     

    In the third example the function Foo takes two integers and a double.  As you can see the first two variables are integers and they go in R0 and R1 respectively.  The last variable the double will then be aligned to go into R2/R3.
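
    The register assignment rule described above can be sketched in a few lines of Python.  This is only an illustration of the three Foo examples, not the compiler's actual algorithm; the function name assign_core_args is made up for this sketch, and it assumes the argument area of the stack starts 8-byte aligned.

```python
def assign_core_args(params):
    """Place arguments in R0-R3 or on the stack.

    params is a list of (name, size_in_bits) tuples with size 32 or 64.
    Returns (regs, stack): regs has four entries for R0-R3 (None means the
    register carries no argument), stack lists 32-bit slots, lowest address
    first.  Hypothetical helper, for illustration only.
    """
    regs = [None] * 4
    stack = []
    next_reg = 0  # next candidate register; 4 means the registers are used up
    for name, bits in params:
        words = 2 if bits == 64 else 1
        if words == 2 and next_reg % 2 == 1:
            # A 64-bit value must start in an even register, so skip the odd one.
            regs[next_reg] = "unused"
            next_reg += 1
        if next_reg + words <= 4:
            for _ in range(words):
                regs[next_reg] = name
                next_reg += 1
        else:
            next_reg = 4  # after the first spill, everything goes on the stack
            if words == 2 and len(stack) % 2 == 1:
                stack.append("unused")  # keep 64-bit values 8-byte aligned
            stack.extend([name] * words)
    return regs, stack


for signature in (
    [("I0", 32), ("I1", 32), ("I2", 32), ("I3", 32)],
    [("I0", 32), ("D", 64), ("I1", 32)],
    [("I0", 32), ("I1", 32), ("D", 64)],
):
    print(assign_core_args(signature))
```

    Running it on the second signature shows R1 marked as unused and I1 landing on the stack, which is the waste the third ordering avoids.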

     

    The registers R4-R11 are used to hold the values of the local variables of a subroutine.  A subroutine is required to preserve on the stack the contents of the registers R4-R8, R10, R11, and SP.

     

    Return values are always in R0, unless they are 64 bits in size, in which case the R0/R1 pair is used.

     

    The calling convention for floating-point arguments is pretty much the same.  A function can take up to 16 single-precision values in S0-S15, or 8 double-precision values in D0-D7, or 4 SIMD vectors in Q0-Q3.  For example, suppose a function takes the following combination:

    Float, double, double, float

     

    They will go into S0, D1, D2, S1 respectively.  These are aggressively back-filled.
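
    Back-filling is easier to see in code.  The sketch below is a simplified model that only handles float and double arguments (no SIMD vectors); the name assign_fp_args is invented for this illustration.  It relies on the fact that each Dn register overlaps S(2n) and S(2n+1).

```python
def assign_fp_args(types):
    """Map 'float' / 'double' arguments to S and D registers with back-filling.

    Simplified model: anything that is not 'float' is treated as a double,
    and once one argument spills to the stack the rest follow it.
    """
    free = [True] * 16  # S0-S15; Dn overlaps S(2n) and S(2n+1)
    spilled = False
    placed = []
    for kind in types:
        slot = None
        if not spilled:
            if kind == "float":
                # Lowest free single-precision register, even one skipped earlier.
                slot = next((i for i in range(16) if free[i]), None)
            else:
                # Lowest free even/odd pair of S registers.
                slot = next((i for i in range(0, 16, 2) if free[i] and free[i + 1]), None)
        if slot is None:
            spilled = True
            placed.append("stack")
        elif kind == "float":
            free[slot] = False
            placed.append(f"S{slot}")
        else:
            free[slot] = free[slot + 1] = False
            placed.append(f"D{slot // 2}")
    return placed


print(assign_fp_args(["float", "double", "double", "float"]))
```

    The last float drops back into S1 because the first double could not use it.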

     

    Floating point return values are in S0/D0/Q0 as appropriate by size.

     

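    This means that S0-S15 (D0-D7, Q0-Q3) are volatile.  S16-S31 (D8-D15, Q4-Q7) are non-volatile and must be preserved by the callee, while D16-D31 (Q8-Q15) are volatile.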

     

    Prolog and Epilog

    The Prolog on an ARM processor does the same thing as on an x86 processor: it stores registers on the stack and adjusts the frame pointer.  Let's look at a simple example from hal!KfLowerIrql.

     

    Prolog:

    push        {r3,r4,r11,lr}  ; save non-volatiles regs used, r11, lr
    addw        r11,sp,#8       ; new frame pointer value in r11...

    ...                         ; stack used in prolog is multiple of 8

     

    As you can see the push instruction is different than x86.  On x86 we would have four push instructions to do the same thing that ARM is doing in one instruction.  This stores the registers in consecutive memory locations ending just below the address in SP, and updates SP to point to the start of the stored location.  The lowest numbered register is stored in the lowest memory address, through to the highest numbered register to the highest memory address.  We can see that here:

     

    1: kd> r

    r0=0000000f  r1=e1070180  r2=00000000  r3=e0eb3675  r4=e1048cc8  r5=e10651fc

    r6=00001000  r7=0000006a  r8=c5561d10  r9=0000000f r10=e10acc80 r11=c5561d08

    r12=ef890f1c  sp=c5561cc8  lr=e1298a0f  pc=e0eb3678 psr=400001b3 -Z--- Thumb

    hal!KfLowerIrql+0x4:

    1: kd> dds c5561cc8 c5561d08

    c5561cc8  e0eb3675   <-- r3

    c5561ccc  e1048cc8   <-- r4

    c5561cd0  c5561d08   <-- r11

    c5561cd4  e1298a0f   <-- lr
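
    To make the ordering concrete, here is a small Python model of what the push does to memory.  It is a sketch, not an emulator: the helper name push is made up, and the SP value before the push (c5561cd8) is inferred from the dump above as the post-push SP plus 16 bytes.

```python
def push(sp, registers):
    """Model PUSH {...}.

    registers is a list of (name, value) pairs in ascending register-number
    order.  Returns the new SP and a dict mapping address -> (name, value).
    """
    new_sp = sp - 4 * len(registers)
    layout = {}
    for index, (name, value) in enumerate(registers):
        # Lowest-numbered register lands at the lowest address.
        layout[new_sp + 4 * index] = (name, value)
    return new_sp, layout


sp_after, layout = push(0xC5561CD8, [
    ("r3", 0xE0EB3675),
    ("r4", 0xE1048CC8),
    ("r11", 0xC5561D08),
    ("lr", 0xE1298A0F),
])
for address in sorted(layout):
    name, value = layout[address]
    print(f"{address:08x}  {value:08x}  {name}")
```

    The printed addresses line up with the dds output: r3 at the new SP and lr in the highest slot.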

     

    The addw instruction is setting up the new frame pointer.  This will add 8 to the value in sp, and store that in r11 which is the frame pointer.  Here is what that looks like in the debugger:

     

    kd> r

    r0=0000000f  r1=00000002  r2=00000002  r3=e133b675  r4=77e31f15  r5=02cc9ad5

    r6=00000000  r7=e1035580  r8=0000000f  r9=00000000 r10=e22cb710 r11=e22cb5b8

    r12=26ebcf96  sp=e22cb5b0  lr=e0f2560b  pc=e133b67c psr=400000b3 -Z--- Thumb

    hal!KfLowerIrql+0x8:

     

    As you can see r11 is now 8 higher than sp.

     

    Now let's look at the Epilog for hal!KfLowerIrql.  It is pretty simple, as it is a single instruction.

     

    Epilog:

    pop         {r3,r4,r11,pc}  ; restore non-volatile regs, r11, return

     

    This pops the first three values from the stack back into their original registers.  However, the last one pops what was the link register (lr) into the program counter (pc).  This acts as a return, performing a similar function as what the RET instruction does on x86 but without using a unique instruction.  Program flow is controlled by manipulating the pc register.  Here is what this looks like in the debugger.

     

    The registers before the pop instruction runs:

    kd> r

    r0=0000000f  r1=00000006  r2=00000000  r3=e1035000  r4=0000000f  r5=306f0a07

    r6=00000000  r7=e1035580  r8=0000000f  r9=00000000 r10=e22c9260 r11=e22c9108

    r12=26ebaae6  sp=e22c9100  lr=e0f2560b  pc=e133b6b4 psr=200000b3 --C-- Thumb

    hal!KfLowerIrql+0x40:

    e133b6b4 e8bd8818 pop         {r3,r4,r11,pc}

     

    The registers after the pop instruction runs:

    kd> r

    r0=0000000f  r1=00000006  r2=00000000  r3=e133b675  r4=51cae4a2  r5=2aede545

    r6=00000000  r7=e1035580  r8=0000000f  r9=00000000 r10=e22c8d20 r11=e22c8c10

    r12=26eba5a6  sp=e22c8bd0  lr=e0f2560b  pc=e0f2560a psr=200000b3 --C-- Thumb

     

    Now we are going to complicate this a bit by showing a function that has local variables, NtCreateFile.

     

    Prolog:

    push        {r4,r5,r11,lr}  ; save non-volatiles regs used, r11, lr    

    addw        r11,sp,#8       ; new frame pointer value in r11
    sub         sp,sp,#0x30     ; local variables

    ...                         ; stack used in prolog is multiple of 8

     

    Notice that this looks the same as the previous prolog, but one line is added.  The sub sp,sp,#0x30 is used to make stack space available for local variables.  This adds one instruction to the Epilog as well.

     

    Epilog :

    add         sp,sp,#0x30     ; cleanup local variables
    pop         {r4,r5,r11,pc}  ; restore non-volatile regs, r11, return

     

    The add sp,sp,#0x30 is used to clean up the stack of the local variables.

     

    One more prolog/epilog example.  This one is of IopCreateFile.  It saves the arguments that come in to the stack first.

     

    Prolog :

    push        {r0-r3}         ; save r0-r3
    push        {r4-r11,lr}     ; save non-volatiles r4-r10, r11, lr
    addw        r11,sp,#0x1c    ; new frame pointer value in r11
    sub         sp,sp,#0x3c     ; local variables

    ...                           ; stack used in prolog is multiple of 8

     

    As you can see this prolog is mostly the same, there is just one additional line for pushing the r0-r3 argument registers to the stack.
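
    The immediate used by addw differs between the prologs (#8 and #0x1c).  Judging from the dds output earlier, where the saved r11 sits at sp+8, the value looks like the offset of the saved r11 slot within the push.  The Python sketch below checks that reading against the prologs shown here; the helper name frame_pointer_offset is hypothetical.

```python
def frame_pointer_offset(pushed):
    """Byte offset from SP (right after the push) to the saved r11 slot.

    pushed is the register list of the push that contains r11, in ascending
    register-number order, as the instruction stores them.
    """
    return 4 * pushed.index("r11")


kf_lower_irql = ["r3", "r4", "r11", "lr"]
nt_create_file = ["r4", "r5", "r11", "lr"]
iop_create_file = [f"r{n}" for n in range(4, 12)] + ["lr"]

for name, pushed in (
    ("KfLowerIrql", kf_lower_irql),
    ("NtCreateFile", nt_create_file),
    ("IopCreateFile", iop_create_file),
):
    print(name, hex(frame_pointer_offset(pushed)))
```

    With the frame pointer set this way, the word at r11 is the caller's r11 and the word at r11+4 is the saved lr, which is what makes the stack walkable later on.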

     

    The epilog for this one is a little different.

     

    Epilog:

    add         sp,sp,#0x4c     ; cleanup local variables from stack
    pop         {r4-r11}        ; restore non-volatiles, frame pointer r11
    ldr         pc,[sp],#0x14   ; return and cleanup 0x14 bytes (lr,r0-r3)

     

    Notice that the pop is not putting lr into pc for a return.  Instead the last instruction takes care of the pc register.  It is a post-indexed load: it loads pc from the address in sp, which holds the saved lr, and then adds 0x14 to sp.  This returns and cleans up lr and the saved r0-r3 arguments from the stack at the same time.  This ldr instruction is similar to the ret instruction on x86.
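
    The post-indexed form of ldr is easy to misread, so here is a tiny Python model of it.  The addresses and stack contents are hypothetical; only the return address value is borrowed from the earlier dumps.

```python
def ldr_post_indexed(memory, sp, offset):
    """Model LDR Rt,[SP],#offset: load from [SP] first, then add offset to SP."""
    loaded = memory[sp]
    return loaded, sp + offset


# Hypothetical stack just before the final instruction: the saved lr is on
# top, followed by the four words that hold the pushed r0-r3.
sp = 0x1000
memory = {0x1000: 0xE0F2560B, 0x1004: 0, 0x1008: 1, 0x100C: 2, 0x1010: 3}

pc, sp = ldr_post_indexed(memory, sp, 0x14)
print(hex(pc), hex(sp))
```

    The load uses the old SP, and SP only moves past lr and the four argument words afterwards.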

     

    The last thing we are going to cover is the "Leaf function".  A Leaf function executes in the context of the caller.  It does not have a prolog and does not use the stack.  It only uses the volatile registers r0-r3 and r12.  It returns via the "bx lr" instruction.  An example of this is KeGetCurrentIrql.  Here is what it looks like in the debugger.

     

    kd> uf hal!KeGetCurrentIrql

    hal!KeGetCurrentIrql

      211 e132b650 f3ef8300 mrs         r3,cpsr

      216 e132b654 f0130f80 tst         r3,#0x80

      216 e132b658 d103     bne         hal!KeGetCurrentIrql+0x12 (e132b662)

    hal!KeGetCurrentIrql+0xa

      216 e132b65a b672     cpsid       i

      216 e132b65c 0000     movs        r0,r0

      216 e132b65e 2201     movs        r2,#1

      216 e132b660 e000     b           hal!KeGetCurrentIrql+0x14 (e132b664)

    hal!KeGetCurrentIrql+0x12

      216 e132b662 2200     movs        r2,#0

    hal!KeGetCurrentIrql+0x14

      217 e132b664 ee1d3f90 mrc         p15,#0,r3,c13,c0,#4

      217 e132b668 7f18     ldrb        r0,[r3,#0x1C]

      218 e132b66a b10a     cbz         r2,hal!KeGetCurrentIrql+0x20 (e132b670)

    hal!KeGetCurrentIrql+0x1c

      218 e132b66c b662     cpsie       i

      218 e132b66e 0000     movs        r0,r0

    hal!KeGetCurrentIrql+0x20

      220 e132b670 4770     bx          lr

     

    The stack must remain 4-byte aligned at all times, and must be 8-byte aligned at any function boundary.  This is due to the frequent use of interlocked operations on 64-bit stack variables.
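
    The "stack used in prolog is multiple of 8" comments in the three prologs above can be checked with simple arithmetic.  The following Python sketch adds up the pushes and the local variable space for each one; the helper name prolog_stack_bytes is made up for this illustration.

```python
def prolog_stack_bytes(pushed_register_counts, locals_bytes=0):
    """Bytes of stack consumed by a prolog: 4 per pushed register plus locals."""
    return 4 * sum(pushed_register_counts) + locals_bytes


prologs = {
    "KfLowerIrql": prolog_stack_bytes([4]),             # push {r3,r4,r11,lr}
    "NtCreateFile": prolog_stack_bytes([4], 0x30),      # push of 4 + sub sp,#0x30
    "IopCreateFile": prolog_stack_bytes([4, 9], 0x3C),  # two pushes + sub sp,#0x3c
}
for name, used in prologs.items():
    print(f"{name}: {used:#x} bytes, multiple of 8: {used % 8 == 0}")
```

    Each prolog moves SP by a multiple of 8, so a stack that was 8-byte aligned on entry is still 8-byte aligned when the function body makes its own calls.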

     

    Functions which need to use a frame pointer (for example, if alloca is used) or which dynamically change the stack pointer within their body, must set up the frame pointer in the function prologue and leave it unchanged until the epilog. Functions which do not need a frame pointer must perform all stack updating in the prolog and leave the SP unchanged until the epilog.

     

    Rebuilding the Stack

    Here we are going to discuss how to rebuild the stack from the frame pointer.

     

    The frame pointer points to the top of the stack area for the current function, or it is zero if not being used.  By using the frame pointer and storing it at the same offset for every function call, it creates a singly linked list of activation records.

     

    The frame pointer register points to the stack backtrace structure for the currently executing function. 

     

    The saved frame pointer value is (zero or) a pointer to the stack backtrace structure created by the function which called the current function. 

     

    The saved frame pointer in this structure is a pointer to the stack backtrace structure for the function that called the function that called the current function; and so on back until the first function. 

     

     

    In the below diagram Main calls Foo which calls Bar

    [Figure image002: stack frames for Main, Foo, and Bar]
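
    Walking that linked list by hand is mechanical, so here is a minimal Python sketch of it.  It assumes the layout produced by the prologs shown earlier (the word at the frame pointer is the caller's frame pointer, and the next word is the saved lr).  The memory contents and addresses are invented for a Main, Foo, Bar call chain; in a real dump you would read them with the debugger.

```python
def walk_frames(memory, fp, max_frames=64):
    """Follow saved frame pointers until a zero frame pointer is reached.

    memory maps address -> 32-bit word.  Returns a list of
    (frame_pointer, return_address) pairs, innermost frame first.
    """
    frames = []
    while fp != 0 and len(frames) < max_frames:
        frames.append((fp, memory[fp + 4]))  # [fp + 4] holds the saved lr
        fp = memory[fp]                      # [fp] holds the caller's fp
    return frames


# Hypothetical stack for Main -> Foo -> Bar, stopped inside Bar.
memory = {
    0x2F00: 0x2F40, 0x2F04: 0x00010234,  # Bar's record: Foo's fp, return into Foo
    0x2F40: 0x2F80, 0x2F44: 0x00010120,  # Foo's record: Main's fp, return into Main
    0x2F80: 0x0000, 0x2F84: 0x00010010,  # Main's record: end of the chain
}
for fp, return_address in walk_frames(memory, 0x2F00):
    print(f"fp={fp:08x} return={return_address:08x}")
```

    The max_frames limit is only a guard against a corrupted chain that loops back on itself.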

     

    For more information about ARM Debugging check out this article from T.Roy at Code Machine:

    http://codemachine.com/article_armasm.html


    Debugging a Windows 8.1 Store App Crash Dump (Part 2)


    In Part 1, we covered the debugging of a Windows Store Application crash dump that contains a Stowed Exceptions Version 1 (SE01) structure.

     

    This post continues on from Part 1, covering the changes introduced in March 2014. These Windows Updates changed the way language exceptions (RoOriginateLanguageException) are recorded in Windows Store Application crash dump files. The new Stowed Exception Version 2 (SE02) structure adds additional fields that directly associate the exception with a language exception object.

     

    You’ll recall from Part 1 that the CLR Exception is loosely associated with the Stowed Exception v1 structure by comparing the HRESULT of the Stowed Exception with the HRESULT of the last CLR Exception on the default thread (the exception record thread). V2 makes this relationship direct. You’ll discover that the Last CLR Exception no longer exists in the v2 dump and that it must be referenced directly by the address stored in the Stowed Exception.

     

    The direct association was added to v2 to also aid triage dump carving (done by Windows Error Reporting). It allows WER to explicitly add the memory associated with the relevant Language (CLR) Exception. This eliminates the risk of the garbage collector freeing the memory associated with the last CLR Exception before the dump is taken.  This also helps identify which exception is related to the final crash, which can be difficult when there are multiple exceptions in the dump.

     

    Debug Steps

    The steps to debug a v2 structure are similar to v1. You first determine the number of stowed exception entries (.exr -1), look at the header to determine the version, display the array of stowed exceptions cast to the correct type (dt -aN …), and then extract the native stack (dpS) or text (du) for each entry.

     

    Instead of then comparing the HRESULT to the last CLR Exception (!sos.pe), you use the Nested Exception member to get to the innermost CLR Exception. Due to the way object pointers are handled by the CLR, the address is a CCW (COM Callable Wrapper) address, not a CLR object address. To get the CLR object’s address, you use the !sos.dumpccw command. This provides the CLR object address, which can be passed to the !sos.pe command to display the exception.

     

    OK, let’s do all of that, showing the commands and data fields of note along the way. (A lot of this is similar to the previous post.)

     

    If not done already, set your symbol path to the Microsoft Public Symbol server:

    0:003> .sympath SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\Symbols*http://msdl.microsoft.com/download/symbols

    ************* Symbol Path validation summary **************

    Response                         Time (ms)     Location

    Deferred                                       SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols

     

    Force the load of the symbols using the .reload /f command:

    0:003> .reload /f

    ...

     

    The next step is to display the pointer array as the original structure type. First, we need to know what structure to cast the pointer array to. Using the Parameter[0] value from .exr -1, we will generate a dt command that will display the header of the first record. We use Parameter[0] as the address in this command.

    dt  <Parameter[0]> combase!STOWED_EXCEPTION_INFORMATION_HEADER*

     

    Here’s an example:

    0:003> .exr -1

    ExceptionAddress: 7575b152 (combase!RoFailFastWithErrorContextInternal+0x0000010b)

       ExceptionCode: c000027b

      ExceptionFlags: 00000001

    NumberParameters: 2

       Parameter[0]: 00c6d3d0

       Parameter[1]: 00000002

     

    0:003> dt 00c6d3d0 combase!_STOWED_EXCEPTION_INFORMATION_HEADER*

    0x07a690dc

       +0x000 Size             : 0x28

       +0x004 Signature        : 0x53453032

     

    The value of the Signature member (0x53453032) is converted to a string using .formats <value>.

    0:003> .formats 0x53453032

    Evaluate expression:

      Hex:     53453032

      Decimal: 1397043250

      Octal:   12321230062

      Binary:  01010011 01000101 00110000 00110010

      Chars:   SE02

      Time:    Wed Apr 09 04:34:10 2014

      Float:   low 8.46917e+011 high 0

      Double:  6.90231e-315

    • “SE01” maps to combase!STOWED_EXCEPTION_INFORMATION_V1
    • “SE02” maps to combase!STOWED_EXCEPTION_INFORMATION_V2
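
    If you are triaging many dumps, the same conversion is easy to script outside the debugger.  The Python sketch below mirrors what the Chars line of .formats shows for the Signature value and looks up the matching type; the names STOWED_TYPES and signature_chars are made up for this example.

```python
STOWED_TYPES = {
    "SE01": "combase!STOWED_EXCEPTION_INFORMATION_V1",
    "SE02": "combase!STOWED_EXCEPTION_INFORMATION_V2",
}


def signature_chars(signature):
    """Return the four ASCII characters of a 32-bit value, most significant byte first."""
    return signature.to_bytes(4, "big").decode("ascii")


tag = signature_chars(0x53453032)
print(tag, "->", STOWED_TYPES[tag])
```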

     

    Now that we know the type, we can again use the values from .exr -1 to generate a dt command that will display each record. We use the Parameter[0] as the address, and Parameter[1] as the count in the command. We add a “P” to the start of the type as this is an array of pointers to the type, not structures packed next to each other.

     

    In this example, there are 2 pointers, so 2 records are displayed:

    dt -a<Parameter[1]> <Parameter[0]> combase!PSTOWED_EXCEPTION_INFORMATION_V2

     

    Note, there is no space between the -a and <Parameter[1]>.

    0:003> dt -a2 00c6d3d0 combase!PSTOWED_EXCEPTION_INFORMATION_V2

    [0] @ 00c6d3d0

    ---------------------------------------------

    0x07a690dc

       +0x000 Header           : _STOWED_EXCEPTION_INFORMATION_HEADER

       +0x008 ResultCode       : 80004001

       +0x00c ExceptionForm    : 0y01

       +0x00c ThreadId         : 0y000000000000000000100000001111 (0x80f)

       +0x010 ExceptionAddress : 0x756b3bff Void

       +0x014 StackTraceWordSize : 4

       +0x018 StackTraceWords  : 3

       +0x01c StackTrace       : 0x0619a368 Void

       +0x010 ErrorText        : 0x756b3bff  "???"

       +0x020 NestedExceptionType : 0x314f454c

       +0x024 NestedException  : 0x063a95d4 Void

     

    [1] @ 00c6d3d4

    ---------------------------------------------

    0x0619b6a8

       +0x000 Header           : _STOWED_EXCEPTION_INFORMATION_HEADER

       +0x008 ResultCode       : 80004001

       +0x00c ExceptionForm    : 0y01

       +0x00c ThreadId         : 0y000000000000000000000000000000 (0)

       +0x010 ExceptionAddress : (null)

       +0x014 StackTraceWordSize : 4

       +0x018 StackTraceWords  : 0x3f

       +0x01c StackTrace       : 0x0639bf4c Void

       +0x010 ErrorText        : (null)

       +0x020 NestedExceptionType : 0

       +0x024 NestedException  : (null)

     

    Native Call Stack

    Regardless of whether the error code (ResultCode) is known or unknown, it is useful to determine the location of the (native) issue by viewing the (native) call stack.

     

    Symbol Pointers

    If the ExceptionForm member has a value of 0y01, the structure’s union represents a call stack.

     

    Unlike call stacks associated with threads, where the symbol pointers are placed throughout the stack next to local variables, these symbol pointers are packed tightly at the address specified in the StackTrace member. Think of it as an array of EBP addresses. The dpS command is used to display the call stack.

    • It is important to include a limit (L) as the call stack is regularly longer than the default 10 rows displayed by dpS. The limit’s value is in the StackTraceWords member.
    • Note that capital S is used (dps vs dpS) because we want to omit the first column normally displayed by dps; the location of the symbol pointer is irrelevant.
    • If you aren’t using the same bitness debugger as the target’s bitness, use ddS for StackTraceWordSize = 4 (32-bit), and dqS for StackTraceWordSize = 8 (64-bit).

    0:003> dt -a2 00c6d3d0 combase!PSTOWED_EXCEPTION_INFORMATION_V2

    [0] @ 00c6d3d0

    ---------------------------------------------

    0x07a690dc

       +0x000 Header           : _STOWED_EXCEPTION_INFORMATION_HEADER

       +0x008 ResultCode       : 80004001

       +0x00c ExceptionForm    : 0y01

       +0x00c ThreadId         : 0y000000000000000000100000001111 (0x80f)

       +0x010 ExceptionAddress : 0x756b3bff Void

       +0x014 StackTraceWordSize : 4

       +0x018 StackTraceWords  : 3

       +0x01c StackTrace       : 0x0619a368 Void

       +0x010 ErrorText        : 0x756b3bff  "???"

       +0x020 NestedExceptionType : 0x314f454c

       +0x024 NestedException  : 0x063a95d4 Void

    ...

    0:003> dpS 0x619a368 L3

    756ea9f1 combase!RoOriginateLanguageException+0x3b

    63b2b04d clr!SetupErrorInfo+0x1e1

    63bf4511 clr!MarshalNative::GetHRForException_WinRT+0x7d
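
    The three rules above for picking the command and the limit can be captured in a small helper.  This is a Python sketch with a made-up function name; it writes the limit in hexadecimal without a prefix, which assumes the debugger is using its default radix of 16.

```python
def stack_dump_command(word_size, word_count, stack_trace, same_bitness=True):
    """Build the dump command for a stowed exception's packed call stack.

    word_size, word_count and stack_trace come from the StackTraceWordSize,
    StackTraceWords and StackTrace members.  Set same_bitness to False when
    the debugger's bitness differs from the target's.
    """
    if same_bitness:
        command = "dpS"
    elif word_size == 4:
        command = "ddS"
    elif word_size == 8:
        command = "dqS"
    else:
        raise ValueError(f"unexpected StackTraceWordSize: {word_size}")
    # The limit is written in hex on the assumption that the radix is 16.
    return f"{command} {stack_trace:#x} L{word_count:x}"


print(stack_dump_command(4, 3, 0x0619A368))
print(stack_dump_command(4, 0x3F, 0x0639BF4C, same_bitness=False))
```

    The first call reproduces the command used above for record [0]; the second builds the equivalent for record [1] as it would be typed in a mismatched-bitness debugger.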

     

    Unicode String Pointer

    If the ExceptionForm member has a value of 0y10, the structure’s union represents an error message.

     

    The call stack is (hopefully) contained within the Unicode string pointed at by the ErrorText member. As the text is defined by the caller, the existence of a call stack text isn’t guaranteed.

    0:003> dt -a1 13f117e0 combase!PSTOWED_EXCEPTION_INFORMATION_V1

    [0] @ 13f117e0

    ---------------------------------------------

    0x0471f3c0

       +0x000 Header           : _STOWED_EXCEPTION_INFORMATION_HEADER

       +0x008 ResultCode       : 8000ffff

       +0x00c ExceptionForm    : 0y10

       +0x00c ThreadId         : 0y000000000000000000010101110100 (0x574)

       +0x010 ExceptionAddress : 0x0de38f7c Void

       +0x014 StackTraceWordSize : 0

       +0x018 StackTraceWords  : 0

       +0x01c StackTrace       : (null)

       +0x010 ErrorText        : 0x0de38f7c  "System.Exception..   at Windows.UI.Xaml.VisualStateManager.GoToState(Control control, String stateName, Boolean useTransitions)..   at MyBadApp.Common.LayoutAwarePage.InvalidateVisualState()..   at MyBadApp.Common.LayoutAwarePage.WindowSizeChanged(Object sender, WindowSizeChangedEventArgs e)"

     

    Note - These records aren’t used with v2 language exceptions (or if they are, they are extremely rare based on the Windows Error Reporting telemetry).

     

    Nested Exceptions

    The new fields in the v2 structure are the NestedExceptionType and NestedException members. The NestedExceptionType member is one of the following values. Much like the Signature field, you can use .formats <value> to see the characters each code represents. The possible values and their associated meaning are:

    • W32E – Win32 Exception – points to an EXCEPTION_RECORD structure
    • STOW – Stowed Exception – points to a STOWED_EXCEPTION_INFORMATION_* structure
    • CLR1 – CLR Object – points (directly) to a CLR Object
    • LEO1 – Language Exception Object – points indirectly to a CLR Exception object

     

    LEO1 is the only style being generated by Windows Error Reporting for CLR Exceptions raised in Windows Store Applications.

     

    Looking at the example dump file we have been using, it can be seen that the first Stowed Exception has values for the NestedException and NestedExceptionType fields, and they are NULL in the second. Using .formats tells us that the NestedExceptionType member is of type “LEO1”. Note that this is displayed backwards in the output below, in accordance with little-endian order of Intel memory layout.

    0:003> dt -a2 00c6d3d0 combase!PSTOWED_EXCEPTION_INFORMATION_V2

    [0] @ 00c6d3d0

    ---------------------------------------------

    0x07a690dc

    ...

       +0x020 NestedExceptionType : 0x314f454c

       +0x024 NestedException  : 0x063a95d4 Void

    ...

    0:003> .formats 0x314f454c

    Evaluate expression:

      Hex:     314f454c

      Decimal: 827278668

      Octal:   06123642514

      Binary:  00110001 01001111 01000101 01001100

      Chars:   1OEL

      Time:    Tue Mar 19 16:37:48 1996

      Float:   low 3.01619e-009 high 0

      Double:  4.0873e-315
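
    Because .formats prints the characters from the most significant byte down, the tag reads backwards.  Reading the same value in little-endian (memory) order gives the tag as listed above.  A minimal Python sketch, with the dictionary and function names invented for illustration:

```python
NESTED_TYPES = {
    "W32E": "Win32 Exception (EXCEPTION_RECORD)",
    "STOW": "Stowed Exception (STOWED_EXCEPTION_INFORMATION_*)",
    "CLR1": "CLR Object",
    "LEO1": "Language Exception Object",
}


def nested_type_tag(value):
    """Return the four ASCII characters of a 32-bit value in memory (little-endian) order."""
    return value.to_bytes(4, "little").decode("ascii")


tag = nested_type_tag(0x314F454C)
print(tag, "->", NESTED_TYPES[tag])
```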

     

    Passing the address to !sos.dumpccw provides the CLR Exception object’s address.

    0:003> !sos.dumpccw 0x063a95d4

    CCW:               0499f880

    Managed object:    02517288

    Outer IUnknown:    00000000

    Ref count:         1

    Flags:            

    RefCounted Handle: 00a31478 (STRONG)

    COM interface pointers:

          IP       MT Type

     

    The address can be used with !sos.pe to display the CLR Exception object. The call stack that the failure investigation should focus on is in this output.

    0:003> !sos.pe 02517288

    Exception object: 02517288

    Exception type:   System.NotImplementedException

    Message:          The method or operation is not implemented.

    InnerException:   <none>

    StackTrace (generated):

        SP       IP       Function

        04F2E38C 00B81382 CrashStore!CrashStore.MainPage.Load_Click_1(System.Object, Windows.UI.Xaml.RoutedEventArgs)+0x62

     

    StackTraceString: <none>

    HResult: 80004001

     

    There you have it. This is the CLR Exception that you need to find to start your code analysis or to point you in the right direction when beginning tracing.

     

    But what if SOS is not available?

    What do you do if SOS isn’t available? You can check if it is loaded by running the .chain command, and you can check if it is functional by running the !sos.dumpccw command (without a parameter).

     

    Firstly, make sure you are using the same bitness of the debugger as the bitness of the target.

     

    If the dump says “x86” or “ARM (Thumb2)” in the version command or the initial debug spew, use the 32-bit debugger.

    Windows 8 Version 9600 MP (4 procs) Free x86 compatible

     

    If the dump says “x64” in the version command or the initial debug spew, use the 64-bit debugger.

    Windows 8 Version 9200 MP (4 procs) Free x64

     

    If you still don’t have SOS loaded (or working) after matching the bitness, or you get one of the following errors, you’ll have to debug the dump on a system with the same version of the CLR installed. Some CLR versions weren’t indexed and this causes the automatic download of sos.dll and mscordacwks.dll to fail.

    0:003> !sos.dumpccw

    Failed to load data access DLL, 0x80004005

    Verify that 1) you have a recent build of the debugger (6.2.14 or newer)

                2) the file mscordacwks.dll that matches your version of clr.dll is

                    in the version directory or on the symbol path

                3) or, if you are debugging a dump file, verify that the file

                    mscordacwks_<arch>_<arch>_<version>.dll is on your symbol path.

                4) you are debugging on supported cross platform architecture as

                    the dump file. For example, an ARM dump file must be debugged

                    on an X86 or an ARM machine; an AMD64 dump file must be

                    debugged on an AMD64 machine.

     

    You can also run the debugger command .cordll to control the debugger's

    load of mscordacwks.dll.  .cordll -ve -u -l will do a verbose reload.

    If that succeeds, the SOS command should work on retry.

     

    If you are debugging a minidump, you need to make sure that your executable

    path is pointing to clr.dll as well.

     

    0:003> .cordll -ve -u -l

    CLRDLL: C:\Windows\Microsoft.NET\Framework\v4.0.30319\mscordacwks.dll:4.0.30319.18444 f:8

    doesn't match desired version 4.0.30319.34011 f:8

    CLRDLL: Unable to find mscordacwks_x86_x86_4.0.30319.34011.dll by mscorwks search

    CLRDLL: Unable to find 'mscordacwks_x86_x86_4.0.30319.34011.dll' on the path

    CLRDLL: Unable to get version info for 'c:\my\sym\cl\clr.dll\52968A96698000\mscordacwks_x86_x86_4.0.30319.34011.dll', Win32 error 0n87

    Cannot Automatically load SOS

    CLRDLL: ERROR: Unable to load DLL mscordacwks_x86_x86_4.0.30319.34011.dll, Win32 error 0n87

    CLR DLL status: ERROR: Unable to load DLL mscordacwks_x86_x86_4.0.30319.34011.dll, Win32 error 0n87

     

    0:003> .chain

    Extension DLL search Path:

        ...

    Extension DLL chain:

        C:\Windows\Microsoft.NET\Framework\v4.0.30319\sos: image 4.0.30319.18444, API 1.0.0, built Wed Oct 30 14:40:34 2013

            [path: C:\Windows\Microsoft.NET\Framework\v4.0.30319\sos.dll]

        pde.dll: image 9, 4, 0, 0, API 9.4.0, built Thu May 08 20:03:58 2014

            [path: c:\debuggers_x86\winext\pde.dll]

        dbghelp: image 6.3.9600.16384, API 6.3.6, built Wed Aug 21 20:59:03 2013

            [path: c:\debuggers_x86\dbghelp.dll]

        ext: image 6.3.9600.16384, API 1.0.0, built Wed Aug 21 21:11:11 2013

            [path: c:\debuggers_x86\winext\ext.dll]

        exts: image 6.3.9600.16384, API 1.0.0, built Wed Aug 21 21:04:14 2013

            [path: c:\debuggers_x86\WINXP\exts.dll]

        uext: image 6.3.9600.16384, API 1.0.0, built Wed Aug 21 21:04:09 2013

            [path: c:\debuggers_x86\winext\uext.dll]

        ntsdexts: image 6.3.9600.16384, API 1.0.0, built Wed Aug 21 21:04:34 2013

            [path: c:\debuggers_x86\WINXP\ntsdexts.dll]

     

    Summary

    As discussed in the previous article, the asynchronous and projected nature of Windows Store applications makes them significantly harder to debug than desktop applications. Stowed Exceptions v2 helps definitively determine the error code and call stack of the exception that caused the crash.

     

    Solutions to some of the more common issues have been talked about on episodes of Channel 9 Defrag Tools, and also in Avoiding Windows Store App Failures talk at //build/ 2014 and the Hardcore Debugging talk at TechEd 2014.

     

    If you have any questions, please feel free to email us at DefragTools@microsoft.com; we’ll be happy to help you.


    Understanding ARM Assembly Part 2


    My name is Marion Cole, and I am a Sr. Escalation Engineer in Microsoft Platforms Serviceability group.  This is Part 2 of my series of articles about ARM assembly.  In part 1 we talked about the processor that is supported.  Here we are going to talk about how Windows utilizes that ARM processor.

     

    As we discussed in part 1 Windows runs on the ARMV7-A with NEON.  We discussed the CPSR register in part 1.  There are a few bits that are important in the CPSR.  The first one is the Endian State bit:

    Bits     31   30   29   28   27   26-25   24   23-20      19-16   15-10   9   8   7   6   5   4-0
    Field    N    Z    C    V    Q    IT      J    Reserved   GE      IT      E   A   I   F   T   M

     

    Bit 9 (the E bit) indicates the EndianState.  This bit should always be a 0 because Windows only runs in Little-Endian state.  So if you get a dump, and see the CPSR bit 9 is set then you have a problem.  Here is an example from the debugger:

    1: kd> r

    r0=00000001  r1=00000001  r2=00000000  r3=00000000  r4=e1074044  r5=c555b580

    r6=00000001  r7=e104ca39  r8=00000001  r9=00000000 r10=e9bf06c7 r11=d5f1ea08

    r12=e16b213c  sp=d5f1e9b0  lr=e0f0fe2f  pc=e0fdebd0 psr=00000133 ----- Thumb

    nt!DbgBreakPointWithStatus:

    e0fdebd0 defe     __debugbreak

     

    1: kd> .formats 00000133

    Evaluate expression:

      Hex:     00000133

      Decimal: 307

      Octal:   00000000463

      Binary:  00000000 00000000 00000001 00110011  <-- Bit 9 is 0.  Note that the rightmost bit is Bit 0.

      Chars:   ...3

      Time:    Wed Dec 31 18:05:07 1969

      Float:   low 4.30199e-043 high 0

      Double:  1.51678e-321

     

    So how could Bit 9 ever be a 1?  The SETEND instruction in the ARM ISA allows even user mode code to change the current endianness.  Doing so is dangerous for an application and is discouraged.  If an exception is generated while in big-endian mode the behavior is unpredictable, but it may lead to an application fault (user mode) or a bugcheck (kernel mode).

     

    The next bit we are going to discuss is bit 5, the Thumb bit (the T bit).  This should be a 1 if executing Thumb instructions.  So let’s discuss the different instruction sets the ARM processor has.

     

    ARMv7 has five different ISAs for programming:

    • ARM - basic ARM instruction set including conditional execution.
    • Thumb - This mode uses a 16 bit instruction encoding to reduce code footprint.  It has limitations with respect to register access and some system instructions aren't implemented for Thumb.
    • Thumb2 - This extension of the Thumb instruction set adds 32 bit opcode encodings and adds enough facilities to author an entire OS.  Support for Thumb2 is guaranteed in the ARMv7 architecture revision.
    • Jazelle - Java code interpretation.
    • ThumbEE - a limited version of Thumb2 intended as a code generation target for JIT scenarios.

     

    Windows requires Thumb2 support.  The advantage of using Thumb2 is that the combination of 16 and 32 bit opcodes along with some other ISA improvements allows for saving 20-30% code footprint at a 1-2% performance loss.  In addition the cache hit rate is improved due to increased density of the code.

     

    CPSR Bit 5 should always be 1 as Windows only runs in Thumb2 mode.  Also note that this bit is combined with bit 24, the Java state bit (the J bit).  Bit 24 should always be 0 when running Windows.

     

    The next bits to discuss are the CPU Mode bits 4-0 (M).  Windows only runs in two modes: User Mode (10000) and Supervisor Mode (10011).  If bits 4-0 hold anything other than these two values, an exception will be raised.  The kernel runs in Supervisor Mode, and applications run in User Mode.
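
    As a minimal sketch of the checks described so far, the psr value from the register dump above (00000133) can be decoded in a few lines of Python.  The helper name decode_psr and the dictionary layout are illustrative assumptions, not part of any debugger API; the bit positions are the ones given in the text.

```python
# Bit positions as described in the text above.
E_BIT = 9          # endian state: 0 = little-endian
T_BIT = 5          # Thumb state: 1 = executing Thumb instructions
J_BIT = 24         # Java (Jazelle) state
MODE_MASK = 0x1F   # CPU mode, bits 4-0

# The two modes the article says Windows uses.
MODES = {0b10000: "User", 0b10011: "Supervisor"}


def decode_psr(psr):
    """Hypothetical helper: pull out the CPSR fields discussed in the article."""
    mode = psr & MODE_MASK
    return {
        "E": (psr >> E_BIT) & 1,
        "T": (psr >> T_BIT) & 1,
        "J": (psr >> J_BIT) & 1,
        "mode": MODES.get(mode, "unexpected (0b{:05b})".format(mode)),
    }


fields = decode_psr(0x00000133)  # psr from the "r" output shown earlier
print(fields)
```

    For this dump the E and J bits are clear, the T bit is set, and the mode bits are 10011, which is what you would expect for kernel code on Windows.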

     

    That brings up another point.  How does the processor switch between Supervisor Mode and User Mode?  On x86 processors this is done via SYSENTER/SYSEXIT, and on x64 processors via SYSCALL/SYSRET.  On ARM it is done via the SVC (Supervisor Call) instruction, which asks the kernel to provide a service.  When SVC is invoked in ntdll.dll, the service number is in r12.  Here is an example:

    1: kd> u ntdll!ZwQueryVolumeInformationFile

    771e8674    f04f0c8d    mov   r12,#0x8D
    771e8678    df01        svc   #1
    771e867a    4770        bx    lr

     

    When SVC is executed, the previous CPSR is saved in the SPSR (the Saved Program Status Register), and the pc register is saved in the lr register (the Link Register).  The processor then changes to kernel mode (Supervisor, 0x13) with interrupts disabled.  The lr and SPSR values are used to return from the SVC call.  More generally, when an exception is taken the stack is untouched, the previous mode's SP and LR are left alone, the new mode's SP becomes active, the return address is stored in the new mode's LR, and the previous CPSR is copied into the new mode's SPSR.  When returning from the exception, the SPSR is copied back into the CPSR and execution resumes at the address in LR.
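
    The sequence above can be sketched as a toy state machine.  This is only an illustration of the bookkeeping described in the text, not a model of real hardware: the Cpu class, the stack addresses, and the handler address are all made up, and only the SVC case is modelled.  The return address is the bx lr that follows svc #1 in the disassembly above.

```python
USER = 0b10000        # User mode
SUPERVISOR = 0b10011  # Supervisor mode (0x13)
MODE_MASK = 0x1F      # CPSR bits 4-0
I_BIT = 1 << 7        # CPSR bit 7: IRQs disabled when set


class Cpu:
    """Toy model: only the state mentioned in the text is tracked."""

    def __init__(self, cpsr):
        self.cpsr = cpsr
        self.pc = 0
        # SP and LR are banked per mode; these addresses are invented.
        self.sp = {USER: 0x00100000, SUPERVISOR: 0xD0000000}
        self.lr = {USER: 0, SUPERVISOR: 0}
        self.spsr_svc = 0

    def mode(self):
        return self.cpsr & MODE_MASK

    def take_svc(self, return_address, handler_address):
        self.spsr_svc = self.cpsr             # previous CPSR -> SPSR
        self.lr[SUPERVISOR] = return_address  # return address -> new mode's LR
        # Switch to Supervisor mode with interrupts disabled.
        self.cpsr = (self.cpsr & ~MODE_MASK) | SUPERVISOR | I_BIT
        self.pc = handler_address

    def exception_return(self):
        self.pc = self.lr[SUPERVISOR]         # resume at LR
        self.cpsr = self.spsr_svc             # SPSR -> CPSR


cpu = Cpu(cpsr=0x30)                # T bit set, User mode
user_sp_before = cpu.sp[USER]
cpu.take_svc(return_address=0x771E867A, handler_address=0xE0F00000)
mode_in_kernel = cpu.mode()
irqs_masked = bool(cpu.cpsr & I_BIT)
cpu.exception_return()
print(hex(mode_in_kernel), irqs_masked, hex(cpu.pc), hex(cpu.cpsr))
```

    Note that the user-mode SP and LR are never written: each mode has its own banked copies, which is why the stack is untouched across the transition.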

     

    Data Types

    ARMv7 processors support four data types from 8 bits to 64 bits, but the definitions differ from the ones used in Windows.  In Windows a word is defined as 16 bits; on ARM a word is 32 bits.

    Type          Size
    Byte          8 bits
    HalfWord      16 bits
    Word          32 bits
    DoubleWord    64 bits

     

    These can be signed or unsigned.

    • Unsigned 32 bit integer
    • Signed 32 bit integer
    • Unsigned 16 bit integer (zero extended)
    • Signed 16 bit integer (sign extended)
    • Unsigned 8 bit integer (zero extended)
    • Signed 8 bit integer (sign extended)
    • Two 16 bit integers
    • Four 8 bit integers
    • The upper or lower 32 bits of a 64 bit signed value whose other half is in another register
    • The upper or lower 32 bits of a 64 bit unsigned value whose other half is in another register
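
    To make the zero-extension, sign-extension, and 64 bit split cases in the list above concrete, here is a small sketch.  The function names are illustrative only, and the 32 bit register is modelled as a Python integer masked to 32 bits.

```python
MASK32 = 0xFFFFFFFF  # a register holds 32 bits


def zero_extend(value, bits):
    """Keep the low `bits` bits; the upper register bits become 0."""
    return value & ((1 << bits) - 1)


def sign_extend(value, bits):
    """Copy the sign bit of a `bits`-wide value into the upper register bits."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return ((value ^ sign) - sign) & MASK32


def split64(value):
    """Return (low 32 bits, high 32 bits) of a 64 bit value, one per register."""
    return value & MASK32, (value >> 32) & MASK32


print(hex(zero_extend(0x80, 8)))      # unsigned 8 bit integer
print(hex(sign_extend(0x80, 8)))      # signed 8 bit integer (-128)
print(hex(sign_extend(0xFFFE, 16)))   # signed 16 bit integer (-2)
low, high = split64(0x1122334455667788)
print(hex(low), hex(high))
```

    The same byte 0x80 ends up as 0x00000080 or 0xFFFFFF80 in the register depending on whether it is treated as unsigned or signed.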

     

    Memory Model

    The ARM memory model is much like other architectures that we have supported.  ARM has a "weak ordering" memory model.  This means that two memory operations that occur in program order may be observed from another processor or DMA controller in any order.  When an instruction stalls because it is waiting for the result of a preceding instruction, the core can continue executing subsequent instructions that do not depend on the unmet result.  There are three memory barrier instructions:

    • ISB - Instruction Synchronization Barrier
    • DMB - Data Memory Barrier
    • DSB - Data Synchronization Barrier

     

    An excellent blog article on this topic with an explanation of these three instructions is available at:

    http://blogs.arm.com/software-enablement/594-memory-access-ordering-part-3-memory-access-ordering-in-the-arm-architecture/

     

    Alignment and Atomicity

    Windows enables the ARM hardware to handle misaligned integer accesses transparently; however, there are still several situations where alignment faults may be generated on misaligned accesses. Follow the rules below:

    • Halfword and word-sized integer loads and stores do NOT need to be aligned (hardware will handle them efficiently and transparently)
    • Floating-point loads and stores SHOULD be aligned (the kernel will handle them transparently, but with significant overhead)
    • Load/store double (LDRD/STRD) and multiple (LDM/STM) operations SHOULD be aligned (the kernel will handle most of them transparently, but with significant overhead)
    • All uncached memory accesses MUST be aligned, even for integer accesses (you will get an alignment fault)

     

    Note that the memcpy() implementation provided by the Windows CRT presumes the copies are to/from cached memory, and thus leverages the hardware’s support for transparently handling misaligned integer reads and writes with little penalty. This means that memcpy() CANNOT be used when the source or destination is uncached memory. Instead, use the separate function _memcpy_strict_align(), which only performs aligned accesses.
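
    The alignment rules above can be summarized as a small decision function.  This is a sketch of the rules as stated in this article, not of any Windows API: the function name, the kind strings, and the result strings are hypothetical, and the caller is assumed to supply the alignment the access requires.

```python
def classify_access(kind, address, alignment, cached=True):
    """Classify one memory access according to the rules listed above.

    kind      -- "integer" (halfword/word), "float", or "multi" (LDRD/STRD/LDM/STM)
    alignment -- alignment in bytes that the access would need (caller-supplied)
    cached    -- False for uncached memory
    """
    if kind not in ("integer", "float", "multi"):
        raise ValueError("unknown access kind: " + kind)
    if address % alignment == 0:
        return "ok"                         # aligned accesses are always fine
    if not cached:
        return "alignment fault"            # uncached accesses MUST be aligned
    if kind == "integer":
        return "ok (handled by hardware)"   # transparent and efficient
    return "slow (kernel fixup)"            # works in most cases, with overhead


results = [
    classify_access("integer", 0x1002, 4),
    classify_access("float", 0x1002, 4),
    classify_access("multi", 0x1006, 4),
    classify_access("integer", 0x1002, 4, cached=False),
    classify_access("integer", 0x1004, 4, cached=False),
]
for r in results:
    print(r)
```

    The fourth case is the one that catches memcpy() users: a misaligned integer access that is harmless on cached memory faults on uncached memory.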

     

    Two types of atomicity are supported: single-copy and multi-copy.

     

    Single-copy atomicity

    There are rules around atomicity that are intended to specify the cases where memory access behavior in relation to program order can be guaranteed.  Certain accesses (aligned word accesses, for example) are guaranteed by the architecture to return sensible results even if other threads are accessing the same memory.  These rules are necessary so that the programmer (and compiler) can rely on correct behavior of memory in the majority of cases.

     

    Multi-copy atomicity

    These rules are similar, but relate specifically to multi-processing environments in which several observers may be using a particular item in memory.  To be able to guarantee correct behavior you need to be able to assume that memory behaves in a consistent way.

     

    More on single-copy and multi-copy atomicity can be found in the ARM Architecture Reference Manual, available from http://infocenter.arm.com/help/index.jsp.

     

    Common Assembly Instructions

    We are going to cover some common Thumb2 instructions.

    • ldr           r0, [r4]                  (ldrex, ldrh, ldrb, ldrd, ldrexd, etc.)

      This is the Load Register instruction.  In the above example r0 is the destination register, and r4 is the base register.  This will take the address that is in r4, go to that memory location and copy the contents of that memory location into r0.

    • str           r2, [r4, #0x08]                    (strex, strh, strexh, strd, etc.)

      This is the Store Register instruction.  In the above example r2 is the source register, and r4 is the base register.  This will take the address in r4 and add 8 to that address.  It will take the value that is in r2, and store it at the address pointed to by r4 plus 8.

    • mov       r1, r4                                      (movs – sets the condition codes)

      This is the Move instruction.  In the above example r1 is the destination register, and r4 is the source register.  It does the same thing as on x86 in that it just copies what is in r4 to r1.  It can optionally update the condition flags based on the value.

    • adds      r1, r5, #0                              (add)

      This is the Add instruction.  In the above example r1 is the destination register.  This will take the value that is in r5 and add 0 to it.  It will store the result in r1.  Because this has an (s) at the end of add it will update the flags.

    • sub         sp, sp, #0x14                      (subs)

      This is the Subtract instruction.  In the above example sp is the destination.  This will take the value that is in sp, subtract 14h from it, and store the result in sp. Because this does not have an (s) at the end it will not update the flags.

    • push      {r4-r9, r11, lr}

      This is the Push instruction.  It can push multiple registers to the stack in one instruction.  A contiguous range of registers can be written as the first register, a "-", and the last register, as seen above.  You can also list registers individually, separated by ",".  As on an x86 processor, the stack pointer is decremented by 4 for each register pushed.

    • pop        {r4-r9, r11, lr}

      This is the Pop instruction.  It pulls values from the stack back into the registers you list.  The register list is written the same way as for push.  As on an x86 processor, the stack pointer is incremented by 4 for each register popped.

    • b??         |MyApp!main+0x60 (00b81348)|

      This is the Branch instruction.  It is equivalent to the jmp instruction in x86.  However, it has several conditional variants, such as beq and bge.

    • bx           r3

      This is the Branch and Exchange instruction.  This causes a branch to an address and instruction set specified by a register (r3 here).  This can do a long branch anywhere in the 32-bit address range.

    • bl            |MyApp!Function (00b815c4)|

      This is the Branch with Link instruction.  This calls a subroutine at a PC-relative address.  This will update the lr register.

    • blx          r3

      This is the Branch with Link and Exchange.  This calls a subroutine at an address and instruction set specified by a register (r3 here).  This will do a long branch anywhere in the 32-bit address range, and update the lr register.

    • dmb      

      This is the Data Memory Barrier instruction.  It is a memory barrier that ensures the ordering of observations of memory accesses.

    • cmp       r3, #0

      This is the Compare instruction.  It will subtract 0 from the value in r3, and set the flags accordingly. 
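
    As a rough illustration of the push and pop descriptions above, the following sketch expands a register list and computes the resulting stack pointer.  The helper names are hypothetical, and the starting sp is simply the value from the register dump earlier in this article, used here only as a sample number.

```python
def expand_register_list(text):
    """Expand 'r4-r9, r11, lr' into ['r4', 'r5', ..., 'r9', 'r11', 'lr']."""
    regs = []
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as r4-r9: first register, "-", last register.
            first, last = part.split("-")
            regs.extend("r{}".format(n)
                        for n in range(int(first[1:]), int(last[1:]) + 1))
        else:
            regs.append(part)
    return regs


def sp_after_push(sp, reg_list):
    """Each register pushed moves sp down by 4 bytes."""
    return sp - 4 * len(expand_register_list(reg_list))


def sp_after_pop(sp, reg_list):
    """Each register popped moves sp up by 4 bytes."""
    return sp + 4 * len(expand_register_list(reg_list))


regs = expand_register_list("r4-r9, r11, lr")
sp = 0xD5F1E9B0
pushed = sp_after_push(sp, "r4-r9, r11, lr")
popped = sp_after_pop(pushed, "r4-r9, r11, lr")
print(regs)
print(hex(pushed), hex(popped))
```

    Eight registers are pushed, so sp moves by 0x20 bytes and the matching pop restores it.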

     

    In ARM addressing the base register points to the memory being referenced.  The offset can be an immediate or an index register.  The memory at the base register's address plus the offset is accessed.  The base register remains unchanged.  Example:

    ldr r5,[r9,#0x1c]

     

    This will take the value that is in r9 and add 0x1C to it, go to that memory location, and retrieve the value there and store it in r5.  R9 will remain the same value.

     

    ARM also has some interesting indexing options.  There are three forms: Pre-Indexed addressing, Offset addressing, and Post-Indexed addressing.

     

    In Pre-Indexed addressing, the value of the base register is first modified by the offset, and then the memory pointed to by the modified base register is accessed.  Example:

    str r2,[r4,#0x4]!

     

    The "!" at the end of the instruction is not a mistake.  This is how you tell it is a Pre-Indexed address. 

     

    In Offset addressing, the offset is added to the base register and the result is used as the address for the memory access; the base register itself is not modified.  This is the same syntax as Pre-Indexed addressing without the trailing "!".  Example:

    str r2,[r4,#0x4]

     

    In Post-Indexed addressing, the memory at the address in the base register is accessed first, and afterwards the base register is modified by the offset value.  Example:

    ldr pc,[sp],#0x1c

     

    Notice the "!" is missing here.  Also notice the offset is outside the "[ ]".  That is how you can find a Post-Index.
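
    The three addressing forms can be compared side by side with a small simulation.  This is a sketch under simple assumptions: registers and memory are plain Python dictionaries, the function names are made up for illustration, and all register and memory values are invented except the instruction operands, which come from the examples above.

```python
def ldr_offset(regs, mem, dst, base, offset):
    """ldr dst,[base,#offset] -- base register is left unchanged."""
    regs[dst] = mem[regs[base] + offset]


def str_pre_indexed(regs, mem, src, base, offset):
    """str src,[base,#offset]! -- update base first, then store there."""
    regs[base] += offset
    mem[regs[base]] = regs[src]


def ldr_post_indexed(regs, mem, dst, base, offset):
    """ldr dst,[base],#offset -- load from base, then update base."""
    regs[dst] = mem[regs[base]]
    regs[base] += offset


regs = {"r2": 0xAB, "r4": 0x1000, "r5": 0, "r9": 0x2000, "sp": 0x3000, "pc": 0}
mem = {0x201C: 0x1234, 0x3000: 0x771E867A}

ldr_offset(regs, mem, "r5", "r9", 0x1C)        # ldr r5,[r9,#0x1c]
str_pre_indexed(regs, mem, "r2", "r4", 0x4)    # str r2,[r4,#0x4]!
ldr_post_indexed(regs, mem, "pc", "sp", 0x1C)  # ldr pc,[sp],#0x1c

print({name: hex(value) for name, value in regs.items()})
```

    After these three instructions r9 is unchanged, r4 has advanced to 0x1004 (where r2 was stored), and sp has advanced by 0x1c after pc was loaded from the old top of the stack.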

     

    Part 3 of this series will cover Calling Conventions, Prolog/Epilog, and Rebuilding the stack.

    NTFS Misreports Free Space (Part 3)

    It’s been a while since my last post on this topic, and I wanted to take some time to update everyone on a cool new feature in Windows Server 2012 R2 and Windows 8.1.  Today we declare part 1 and part 2 of this blog as obsolete - at least for Windows Server 2012 R2 and Windows 8.1 users.

     

    The latest fsutil.exe now allows for the creation of an allocation report which summarizes how all of your disk space is being used by NTFS.  This new fsutil.exe functionality is implemented through some new file system controls that only exist on Windows Server 2012 R2 and Windows 8.1, so the binary is not portable to previous versions of Windows.

     

    USAGE: fsutil volume allocationreport X:

    X: is the drive letter of an NTFS volume on your system.

     

    Allocation Report

    The allocation report gives a summary of total reserved, free, and allocated clusters.  Reserved clusters are clusters that NTFS reserves just in case it needs to allocate space for a critical operation (like expanding a compressed file or extending the $MFT).  If you’re experiencing insufficient disk space errors on a volume that has plenty of free space, the issue could be caused by opening many compressed NTFS files at the same time.  Please refer to Understanding Ntfs Compression for more information on how to troubleshoot this.

     

    Allocation report:
    Total clusters              : 244100351 (999835037696 bytes)
    Free clusters               : 232507563 (952350978048 bytes)
    Reserved clusters           : 18352 (75169792 bytes)
    Total allocated             : 47484059648 bytes
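
    The numbers in this summary can be cross-checked with a little arithmetic.  The sketch below uses only the values from the sample report above; the variable names are illustrative.  It derives the cluster size from the total line and then recomputes the other byte counts from it.

```python
# Values copied from the sample allocation report above.
total_clusters, total_bytes = 244100351, 999835037696
free_clusters, free_bytes = 232507563, 952350978048
reserved_clusters, reserved_bytes = 18352, 75169792
reported_allocated = 47484059648

# Bytes per cluster, derived from the "Total clusters" line.
cluster_size = total_bytes // total_clusters

# In this sample, the allocated figure equals total bytes minus free bytes.
computed_allocated = total_bytes - free_bytes

print("cluster size:", cluster_size)
print("free bytes match:", free_clusters * cluster_size == free_bytes)
print("reserved bytes match:", reserved_clusters * cluster_size == reserved_bytes)
print("allocated match:", computed_allocated == reported_allocated)
```

    For this volume the cluster size works out to 4096 bytes, and every byte count in the summary is consistent with it.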

     

    System Files

    If you suspect that there’s something you can’t see that’s taking up disk space, check the System Files section to see how much disk space is used by the system.  In this example, I have 884,703,232 bytes in use by NTFS metadata, and the breakdown of each system file’s usage is outlined below.  For details on each system file type, refer to http://blogs.technet.com/b/askcore/archive/2009/12/30/ntfs-metafiles.aspx.

     

    System files                : Count: 29. Total allocated: 884703232 bytes.
        $Mft                    : File ID 0x0001000000000000. Total allocated: 238063616 bytes.
        $MftMirr                : File ID 0x0001000000000001. Total allocated: 4096 bytes.
        $LogFile                : File ID 0x0002000000000002. Total allocated: 67108864 bytes.
        $Volume                 : File ID 0x0003000000000003. Total allocated: 0 bytes.
        $AttrDef                : File ID 0x0004000000000004. Total allocated: 4096 bytes.
        Root folder             : File ID 0x0005000000000005. Total allocated: 8192 bytes.
        $Bitmap                 : File ID 0x0006000000000006. Total allocated: 30515200 bytes.
        $Boot                   : File ID 0x0007000000000007. Total allocated: 8192 bytes.
        $BadClus                : File ID 0x0008000000000008. Total allocated: 0 bytes.
        $Secure                 : File ID 0x0009000000000009. Total allocated: 1855488 bytes.
        $UpCase                 : File ID 0x000a00000000000a. Total allocated: 131072 bytes.
        $Extend                 : File ID 0x000b00000000000b. Total allocated: 0 bytes.
        $ObjId                  : File ID 0x0001000000000019. Total allocated: 24576 bytes.
        $Quota                  : File ID 0x0001000000000018. Total allocated: 0 bytes.
        $Reparse                : File ID 0x000100000000001a. Total allocated: 786432 bytes.
        $UsnJrnl                : File ID 0x0002000000012f66. Total allocated: 34144256 bytes.
        $RmMetadata             : File ID 0x000100000000001b. Total allocated: 0 bytes.
        $Repair                 : File ID 0x000100000000001c. Total allocated: 94371840 bytes.
        $Txf                    : File ID 0x000100000000001e. Total allocated: 4096 bytes.
        $TxfLog                 : File ID 0x000100000000001d. Total allocated: 4096 bytes.
        $Tops                   : File ID 0x000100000000001f. Total allocated: 396623872 bytes.
        $TxfLog.blf             : File ID 0x0001000000000020. Total allocated: 65536 bytes.
        Other system files      : Count: 4. Total allocated: 0 bytes.
        Other system files under $Txf folder:
            Count               : 1
            Total allocated     : 8192 bytes.
        Other system files under $TxfLog folder:
            Count               : 2
            Total allocated     : 20971520 bytes.

     

    System Volume Information 

    If the usage in System Volume Information is higher than expected, the issue is likely to be caused by storage of diff areas for VSS volume shadow copies.  Deleting the volume shadow copies with VSSAdmin or Diskshadow will return the free space.  System Volume Information is also the home of the chunk store used by NTFS deduplication.

     

    System Volume Information   : Total allocated: 5366915072 bytes.
        Files                   : Count: 18. Total allocated: 5366882304 bytes.
        Folders                 : Count: 7. Total allocated: 32768 bytes.

     

    User Folders

    It costs something to maintain the folder structure of a volume, and the user folders section summarizes the overall cost.  Within this section is also a summary of how many NTFS compressed folders exist.  As you can see below, I have 145 folders with a compressed attribute flag but the total number of compressed bytes is zero.  I puzzled over the idea of zero compressed bytes until I discovered that this measurement is of how many compressed bytes exist in the context of folder indexes, and indexes are never compressed.  Only user data streams are compressed natively by NTFS.

     

    User folders                : Count: 23101. Total allocated: 77889536 bytes.
        Default streams         : 4689
            Allocated           : 4689
            Total allocated     : 77885440 bytes.
        Named streams           : 7
            Allocated           : 0
            Total allocated     : 0 bytes.
        Local metadata streams  : 95566
            Allocated           : 1
            Total allocated     : 4096 bytes.
    Within these folders there are:
        Compressed              : 145
            Total allocated     : 0 bytes
            Total size          : 0 bytes.
            Savings             : 0.00 %
        Sparse                  : 0
            Total allocated     : 0 bytes
            Total size          : 0 bytes.
            Savings             : 0.00 %
        Encrypted               : 0
            Total allocated     : 0 bytes

        With named streams      : 7
            Compressed          : 0
            Sparse              : 0
            Encrypted           : 0
        With no allocation      : 18412

     

    User Files

    In the user files section, we have a total of all user files and the compression statistics to show how much space is being saved by native NTFS compression.  There is also a nice summary of alternate named stream (ANS) usage.  ANS allocations do not show up in DIR or Explorer, so this is a quick and easy way to see exactly how your named streams are affecting overall disk usage.  On my volume, I had 3115 files with named streams and zero bytes were allocated.  This seems to be another paradox, but there’s a logical explanation for what’s happening.  If a file has a named stream and the stream size is small enough for it to be resident, then the stream lives in the file’s MFT record, which this report accounts for as part of the $Mft entry ($Mft: File ID 0x0001000000000000, Total allocated: 238063616 bytes).

     

    User files                  : Count: 94128. Total allocated: 41154551808 bytes.
        Default streams         : 94128
            Allocated           : 72123
            Total allocated     : 41087229952 bytes.
        Named streams           : 4637
            Allocated           : 4562
            Total allocated     : 66740224 bytes.
        Local metadata streams  : 333248
            Allocated           : 142
            Total allocated     : 581632 bytes.
    Within these files there are:
        Compressed              : 2006
            Total allocated     : 374972416 bytes
            Total size          : 816416626 bytes.
            Savings             : 54.07 %
        Sparse                  : 1519
            Total allocated     : 1572864 bytes
            Total size          : 273374082 bytes.
            Savings             : 99.42 %
        Encrypted               : 0
            Total allocated     : 0 bytes

     

        With named streams      : 3115
            Compressed          : 0
            Sparse              : 0
            Encrypted           : 0
        With no allocation      : 20485
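
    As a sanity check on how the report fits together, the sketch below recomputes the Savings percentages and adds up the four category totals.  All figures are copied from the sample reports in this article.  The savings formula (one minus allocated over size) is an assumption on my part; it is not documented here, but it reproduces the reported percentages for this sample.

```python
def savings_percent(allocated_bytes, size_bytes):
    """Assumed formula: share of the logical size that needs no allocation."""
    return (1 - allocated_bytes / size_bytes) * 100


compressed = "{:.2f}".format(savings_percent(374972416, 816416626))
sparse = "{:.2f}".format(savings_percent(1572864, 273374082))

# Category totals from the sample report sections above.
system_files = 884703232
system_volume_information = 5366915072
user_folders = 77889536
user_files = 41154551808
category_sum = (system_files + system_volume_information
                + user_folders + user_files)

# User files total is the sum of its three stream types.
user_files_streams = 41087229952 + 66740224 + 581632

print("compressed savings:", compressed, "%")
print("sparse savings:", sparse, "%")
print("sum of categories:", category_sum)
```

    In this sample the four categories add up exactly to the "Total allocated" figure in the summary at the top of the report, so nothing on the volume is unaccounted for.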

     

    As you can see, this new functionality in fsutil makes it easier and faster to determine what is using space on an NTFS volume.
