• Ntdebugging Blog

    Disk Performance Internals



    My name is Ran Jiang. I am from the Platforms Global Escalation Services team in China. Storage is the slowest component of most computer systems. As such, storage is often a performance bottleneck. Today I want to discuss the disk performance kernel provider, partition manager.  By understanding how the disk performance provider works we can understand how disk performance is tracked internally in Windows and how disk related counters are calculated, which will be helpful for diagnosing storage performance issues.


    Disk Performance Architecture

    There are two sets of public interfaces to query performance counter data – PDH (Performance Data Helper) or the registry interface. The registry interface to the performance data is older than the PDH interface and has more extensive functionality. However, the PDH interface is easier to use for most performance data collection tasks. The PDH interface is essentially a higher-level abstraction of the functionality that the registry interface provides.

    [Figure 1: Windows Disk Performance Architecture]

    Windows performance monitor leverages the PDH interface to get performance data. The performance data helper (PDH) interface calls the registry interface.


    Perflib is one key component integrated in the registry interface, which is responsible for translating the request from the application and calling the collect procedure exported by a performance extension DLL. The extension DLL does the real work of data collection and returns a standard data format to perflib.


    Extension DLLs should expose Open, Collect, and Close functions to be called by perflib. We can find the names of these functions by checking the registry:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\<Service Name>\Performance

    Value Name: Close, Collect, Open


    [Figure: Performance Registry]


    Here is a user mode call stack when an application uses the registry API RegQueryValueEx() to collect performance data:

    0017f7ec 004e0000 perfdisk!CollectDiskObjectData+0xf8

    0017f964 7702eaa9 advapi32!QueryExtensibleData+0x577

    0017fd48 7702e962 advapi32!PerfRegQueryValue+0x5d8

    0017fe38 770576f5 advapi32!LocalBaseRegQueryValue+0x313

    0017fe9c 004011fc advapi32!RegQueryValueExW+0xa2

    0017fec8 00401153 getperfdata!GetPerformanceData+0x3c

    0017ff38 0040322a getperfdata!wmain+0x93

    0017ff88 773deccb getperfdata!__tmainCRTStartup+0x15e

    0017ff94 7798d24d kernel32!BaseThreadInitThunk+0xe

    0017ffd4 7798d45f ntdll!__RtlUserThreadStart+0x23

    0017ffec 00000000 ntdll!_RtlUserThreadStart+0x1b


    Disk performance kernel device stack

    Figure 2 shows the I/O manager stack to gather disk performance statistics. The volume manager underneath the file system driver gathers Logical Disk statistics. On Windows 2008 or above, volmgr.sys handles Logical Disk statistics for both dynamic and basic disks. The partition manager, partmgr.sys, gathers physical disk statistics.  These statistics are measured and collected for each request that passes through the I/O manager stack.

    [Figure 2: Windows IO Manager]

    Physical Disk Statistics

    Partition manager (partmgr.sys) saves performance information in the device extension’s counter context.


    Logical Disk statistics

    Volume manager (volmgr.sys) also saves performance metrics in its device extension.


    How to track disk performance?

    Performance information is tracked in the read and write dispatch routines and in the IO completion routines. There are 5 kinds of counter data tracked by partition manager:


    1. Queue depth - Total concurrent IOs still in process and not yet completed.

    2. Total counts of read and write requests.

    3. Total read and write time for all IO requests.  For example: if a total of 2 write IOs have completed since the disk counter was enabled, one taking 1 sec and the other taking 2 sec, then the write time counter will be 1 sec + 2 sec = 3 sec.

    4. Total Idle time.

    5. Total split IO (fragmented IO).


    Let’s talk about them separately.


    Queue depth:

    When a new IRP is sent to partition manager it will increment the queue depth. Partition manager will decrement the queue depth when completing an IRP. Therefore, the value indicates how many concurrent IOs are still in process.


    Total read and write count:

    When any read or write IO has been completed the partition manager IRP completion routine will get called. Then the read or write counter will be incremented. Note we only track completed IOs here.


    Total read and write time for IOs:

    When any read or write IO starts, partition manager’s dispatch routine will record the current time stamp in the IO stack location of the IRP. When an IRP is completed the completion routine will use this time stamp and the current system time to calculate the time taken to complete this IO.  Partition manager will then add this value to the appropriate counter in the device extension.


    Total Idle time:

    When completing an IRP and decrementing the queue depth partition manager will check if the queue depth reaches 0. If yes, it indicates the disk state has been transitioned from busy to idle. Then it will save the time stamp to Last Idle Clock in the counter context.


    When a new IRP is sent to partition manager it will increment the queue depth and will check if queue depth reaches 1. If yes, it indicates the disk state has been transitioned from idle to busy. Then Idle time counter will be increased by (current time stamp – Last Idle Clock).


    Total split IO (fragmented IO):

    When completing an IRP, partition manager will check if the IRP is marked as IRP_ASSOCIATED_IRP. This flag is usually set by the file system driver when a large IO is split into multiple smaller IOs. Typically, when an IO contains several runs and each run will contain continuous block of data, NTFS will create an associated IRP for each run and send this IRP to the lower level driver. Therefore, this counter usually can be used to track fragmented IOs.


    Note: Disk performance statistics are saved to an array whose index corresponds to each processor. Most of the counters are saved to the index corresponding to the processor the IRP was completed on.
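    The bookkeeping described above can be sketched in a few lines of Python. This is a simplified model, not the partmgr.sys implementation: the class name CounterContext, its field names, and the integer "ticks" are all hypothetical, and the per-processor array mentioned in the note is collapsed into a single set of counters.

```python
from dataclasses import dataclass


@dataclass
class CounterContext:
    """Hypothetical stand-in for the counter context in the device extension."""
    queue_depth: int = 0
    read_count: int = 0
    write_count: int = 0
    read_time: int = 0        # accumulated ticks spent in reads
    write_time: int = 0       # accumulated ticks spent in writes
    idle_time: int = 0        # accumulated ticks with no outstanding IO
    split_count: int = 0      # completed IRPs that were associated IRPs
    last_idle_clock: int = 0  # tick at which the queue last drained

    def dispatch(self, now: int) -> int:
        """Model of the dispatch routine; returns the start time stamp."""
        self.queue_depth += 1
        if self.queue_depth == 1:  # idle -> busy transition
            self.idle_time += now - self.last_idle_clock
        return now

    def complete(self, start: int, now: int, is_write: bool,
                 associated: bool = False) -> None:
        """Model of the completion routine."""
        elapsed = now - start
        if is_write:
            self.write_count += 1
            self.write_time += elapsed
        else:
            self.read_count += 1
            self.read_time += elapsed
        if associated:
            self.split_count += 1
        self.queue_depth -= 1
        if self.queue_depth == 0:  # busy -> idle transition
            self.last_idle_clock = now


ctx = CounterContext()
w1 = ctx.dispatch(now=10)   # disk was idle for ticks 0..10
w2 = ctx.dispatch(now=12)   # second write overlaps the first
ctx.complete(w1, now=20, is_write=True)
ctx.complete(w2, now=32, is_write=True)
r1 = ctx.dispatch(now=40)   # idle again from tick 32 to tick 40
ctx.complete(r1, now=45, is_write=False)
print(ctx)
```

    Note that only completed IOs reach the count and time counters, while the queue depth changes on both dispatch and completion.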


    How to convert to performance counter?

    Now we understand how the kernel keeps track of these metrics. We need to map those kernel metrics to the performance counters shown in performance monitor. The counters visible in performance monitor are calculated from the kernel metrics. Each counter has a counter type and each counter type has a different calculation. The counter type determines how the counter data is calculated, averaged, and displayed.


    For example, Avg. Disk sec/Transfer has counter type of PERF_AVERAGE_TIMER. The formula of PERF_AVERAGE_TIMER is: ((N1 - N0) / F) / (D1 - D0), where the numerator (N) represents the number of ticks counted during the last sample interval, F represents the frequency of the ticks, and the denominator (D) represents the number of reads and writes completed during the last sample interval. N1 - N0 are returned from kernel as ReadTime + WriteTime in ticks. D1 and D0 are returned from partition manager or volume manager as read counts + write counts.
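    As a worked example of the PERF_AVERAGE_TIMER formula, here is a minimal Python sketch. The function name and the tick frequency of 10,000,000 per second are assumptions chosen for illustration; the sample values reuse the earlier case of two writes taking 1 sec and 2 sec.

```python
def perf_average_timer(n0: int, n1: int, d0: int, d1: int, freq: int) -> float:
    """((N1 - N0) / F) / (D1 - D0): average seconds per completed operation."""
    operations = d1 - d0
    if operations == 0:
        return 0.0  # nothing completed in the interval
    return ((n1 - n0) / freq) / operations


FREQ = 10_000_000  # assumed ticks per second

# Two writes completed in the interval, taking 1 s and 2 s (3 s of ticks total).
avg_sec_per_transfer = perf_average_timer(
    n0=0, n1=3 * FREQ, d0=0, d1=2, freq=FREQ
)
print(avg_sec_per_transfer)
```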


    Disk Transfers/sec:

    Counter type: PERF_COUNTER_COUNTER

    Formula: (N1 - N0) / ((D1 - D0) / F), where N1 - N0 are returned from partition manager or volume manager as read counts + write counts. D1 - D0 are the number of ticks counted during the last sample interval. F represents the frequency of the ticks.
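    The rate formula can be sketched the same way. The function name, the tick frequency, and the sample values are assumptions for illustration only.

```python
def perf_counter_counter(n0: int, n1: int, d0: int, d1: int, freq: int) -> float:
    """(N1 - N0) / ((D1 - D0) / F): events per second over the sample interval."""
    elapsed_seconds = (d1 - d0) / freq
    if elapsed_seconds <= 0:
        return 0.0  # no time elapsed between samples
    return (n1 - n0) / elapsed_seconds


FREQ = 10_000_000  # assumed ticks per second

# 500 reads + writes completed during a 5 second sample interval.
transfers_per_sec = perf_counter_counter(
    n0=1_000, n1=1_500, d0=0, d1=5 * FREQ, freq=FREQ
)
print(transfers_per_sec)
```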


    Avg. Disk Queue Length:


    Formula: (N1 - N0) / (D1 - D0), where the numerator (N) represents queue depth and the denominator (D) represents the time elapsed during the sample interval.


    Current Disk Queue Length:


    Formula: None. Shows raw data as collected. It is the instantaneous value of the queue depth.


    Disk Bytes/sec:


    Formula: (N1 - N0) / ((D1 - D0) / F), where the numerator (N) represents the total ReadBytes + WriteBytes, the denominator (D) represents the number of ticks elapsed during the last sample interval, and F is the frequency of the ticks.


    % Idle Time

    Counter type: PERF_PRECISION_100NS_TIMER

    Formula: (N1 - N0) / (D1 - D0), where the numerator (N) represents the Total IdleTime and the denominator (D) is the value of the private timer. The private timer has the same frequency as the 100 ns timer.
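    A minimal sketch of the % Idle Time calculation follows. The function name and sample values are hypothetical, and the multiplication by 100 is an assumption made here so that the ratio reads as a percentage.

```python
def percent_idle_time(n0: int, n1: int, d0: int, d1: int) -> float:
    """Idle ticks divided by private-timer ticks, both in 100 ns units."""
    interval = d1 - d0
    if interval <= 0:
        return 0.0  # no time elapsed between samples
    return 100.0 * (n1 - n0) / interval  # x100 is an assumption for display


# The disk was idle for 0.25 s of a 1 s sample interval (100 ns ticks).
idle_pct = percent_idle_time(n0=0, n1=2_500_000, d0=0, d1=10_000_000)
print(idle_pct)
```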


    Note: Programmers should avoid calculating counters manually and should instead use pdh.dll.  An example of what can go wrong when calculating this data manually is described in Performance Monitor Averages, the Right Way and the Wrong Way.


    How to measure disk performance?

    In this section we are going to discuss which counters are the key to measuring disk performance. Generally, 4 counters are used for performance measurement: Disk Bytes/sec, % Idle Time, Avg. Disk sec/Transfer and Avg. Disk Queue Length.


    Disk Bytes/sec

    From the formula, Disk Bytes/sec is the number of bytes completed every second. Two things can affect this counter value:


    1. How much stress is generated to the disk or volume?

    Let’s assume there are no problems with disk performance and the load has not reached the storage bottleneck. Then this counter value is determined by the IO load generated by the application, such as a stress tool.


    2. Disk performance

    If the IO load has exceeded the storage bottleneck, this counter value will stop increasing as the load increases.


    Conclusion: Since this counter value could be affected by IO load from an application we cannot use it as the key to determine disk performance.


    % Idle Time

    This counter value indicates how long the disk is in idle status without outstanding IO. It can help to determine how busy the disk is. However, even if the disk is busy with 0% Idle Time, we cannot say it suffers from a performance issue as it may still be able to complete all IOs in time.


    Avg. Disk Queue Length

    This counter indicates on average how many IOs are outstanding. If the disk can always complete IO immediately, the value should be 0. Therefore, it’s also a value to determine how busy the storage is. But it does not impact the application directly as the application does not care how many total IOs are outstanding. The application is concerned with how fast every IO can be completed. In practice, if we see the queue depth is more than 10 we may say the storage is busy and could delay the IO in the queue. However if every IO can be completed fast there will be no impact to the application, which means the delay is still acceptable.


    Avg. Disk sec/Transfer

    This counter indicates how fast the IO is completed on average. This is one of the keys to an application’s performance as discussed above.


    Dynamic counter loading feature

    On Windows 2008 or above the disk counter in the kernel provider can be dynamically enabled or disabled. If there is no open handle to HKEY_PERFORMANCE_DATA, the kernel provider will disable IO performance tracing by setting a flag in the device extension. Here is the call stack when the counter is being dynamically disabled:


    8f14d970 volmgr!VmWmiFunctionControl

    8f14d9e0 WMILIB!WmiSystemControl+0x3b9

    8f14da00 volmgr!VmWmi+0x8d

    8f14da18 nt!IofCallDriver+0x63

    8f14da20 volsnap!VolSnapDefaultDispatch+0x2b

    8f14da38 nt!IofCallDriver+0x63

    8f14da60 nt!WmipForwardWmiIrp+0x18b

    8f14da8c nt!WmipSendWmiIrp+0x56

    8f14dabc nt!WmipDeliverWnodeToDS+0x22

    8f14dc28 nt!WmipSendEnableDisableRequest+0x10e

    8f14dc4c nt!WmipDoDisableRequest+0x26

    8f14dc64 nt!WmipDisableCollectOrEvent+0x35

    8f14dc8c nt!WmipDeleteMethod+0x25

    8f14dca8 nt!ObpRemoveObjectRoutine+0x13d

    8f14dcd0 nt!ObfDereferenceObject+0xa1

    8f14dd14 nt!ObpCloseHandleTableEntry+0x24e

    8f14dd44 nt!ObpCloseHandle+0x73

    8f14dd58 nt!NtClose+0x20

    8f14dd58 nt!KiFastCallEntry+0x12a

    0012fda4 ntdll!KiFastSystemCallRet

    0012fda8 ntdll!NtClose+0xc

    0012fde0 ADVAPI32!WmiCloseBlock+0x33

    0012fe58 ADVAPI32!PerfRegCloseKey+0x175

    0012fe68 ADVAPI32!BaseRegCloseKeyInternal+0x81

    0012fe7c ADVAPI32!ClosePredefinedHandle+0x7c

    0012feb8 ADVAPI32!RegCloseKey+0x67

    0012fed0 ReadTest!GetPerformanceData+0xe5

    0012ff38 ReadTest!wmain+0xae

    0012ff88 ReadTest!__tmainCRTStartup+0x15e

    0012ff94 kernel32!BaseThreadInitThunk+0xe

    0012ffd4 ntdll!__RtlUserThreadStart+0x23

    0012ffec ntdll!_RtlUserThreadStart+0x1b


    Since the sample app from MSDN closes the handle every time after calling RegQueryValueEx(), it disables and enables the disk counter intermittently. The impact on any app using the registry API is that some IOs are started with the counter disabled, so no time stamp is recorded, and are later completed with the counter enabled. Such an IO generates a huge time difference that is charged to sec/Transfer. KB 2470949 was released to address this issue on Windows 2008 R2.
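    The effect can be illustrated with a toy model. Everything here is hypothetical: the function names, the 3 day uptime, and in particular the assumption that a missing time stamp reads back as zero. It is only meant to show why one such IO can dominate the accumulated time.

```python
TICKS_PER_SECOND = 10_000_000  # assumed 100 ns ticks


def dispatch(now: int, counters_enabled: bool) -> int:
    """Record a start stamp only when counters are enabled (assumed 0 otherwise)."""
    return now if counters_enabled else 0


def complete(start_stamp: int, now: int, counters_enabled: bool):
    """Return the ticks charged to the time counter, or None if disabled."""
    if not counters_enabled:
        return None
    return now - start_stamp


uptime = 3 * 24 * 3600 * TICKS_PER_SECOND  # hypothetical: system up for 3 days
real_latency = 50_000                       # 5 ms in 100 ns ticks

# IO starts while the counter is disabled and completes after it is re-enabled.
stamp = dispatch(uptime, counters_enabled=False)
charged = complete(stamp, uptime + real_latency, counters_enabled=True)

print(f"real latency:    {real_latency / TICKS_PER_SECOND:.3f} s")
print(f"charged latency: {charged / TICKS_PER_SECOND:.3f} s")
```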





  • Ntdebugging Blog

    Driver Object Corruption Triggers Bugcheck 109


    My name is Victor Mei. I am an Escalation Engineer in Platforms Global Escalation Services in GCR.  Some customers I have worked with have a strong interest in debugging, but they usually get frustrated when I tell them “To find the cause from this dump, you have to get the code and understand the design behind it”.


    This time I am going to talk about one crash dump, on which we can use basic debugging commands and knowledge of the Windows kernel to find out the root cause:


    1: kd> !analyze -v


    *                                                                             *

    *                        Bugcheck Analysis                                    *

    *                                                                             *



    This bugcheck is generated when the kernel detects that critical kernel code or

    data have been corrupted. There are generally three causes for a corruption:

    1) A driver has inadvertently or deliberately modified critical kernel code

    or data. See http://www.microsoft.com/whdc/driver/kernel/64bitPatching.mspx

    2) A developer attempted to set a normal kernel breakpoint using a kernel

    debugger that was not attached when the system was booted. Normal breakpoints,

    "bp", can only be set if the debugger is attached at boot time. Hardware

    breakpoints, "ba", can be set at any time.

    3) A hardware corruption occurred, e.g. failing RAM holding kernel code or data.


    Arg1: a3a01f5a3763f650, Reserved

    Arg2: b3b72be089e32ceb, Reserved

    Arg3: ffffe001a2894a20, Failure type dependent information

    Arg4: 000000000000001c, Type of corrupted region, can be

    0   : A generic data region

    1   : Modification of a function or .pdata

    2   : A processor IDT

    3   : A processor GDT

    4   : Type 1 process list corruption

    5   : Type 2 process list corruption

    6   : Debug routine modification

    7   : Critical MSR modification

    8   : Object type

    9   : A processor IVT

    a   : Modification of a system service function

    b   : A generic session data region

    c   : Modification of a session function or .pdata

    d   : Modification of an import table

    e   : Modification of a session import table

    f   : Ps Win32 callout modification

    10  : Debug switch routine modification

    11  : IRP allocator modification

    12  : Driver call dispatcher modification

    13  : IRP completion dispatcher modification

    14  : IRP deallocator modification

    15  : A processor control register

    16  : Critical floating point control register modification

    17  : Local APIC modification

    18  : Kernel notification callout modification

    19  : Loaded module list modification

    1a  : Type 3 process list corruption

    1b  : Type 4 process list corruption

    1c  : Driver object corruption

    1d  : Executive callback object modification

    1e  : Modification of module padding

    1f  : Modification of a protected process

    20  : A generic data region

    21  : A page hash mismatch

    22  : A session page hash mismatch

    102 : Modification of win32k.sys


    The stack only contains one frame:


    # Child-SP          RetAddr           Call Site

    00 ffffd000`223721c8 00000000`00000000 nt!KeBugCheckEx


    You will be disappointed if you attempt to find out who called KeBugCheckEx from the stack, because KeBugCheckEx is the only function address on it.


    Since there is nothing more on the stack, let’s take a close look at what WinDBG tells us about the bugcheck parameters:


    Arg3: ffffe001a2894a20, Failure type dependent information

    Arg4: 000000000000001c, Type of corrupted region, can be

    1c  : Driver object corruption
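    If the debugger does not print the region name, the value of Arg4 can be looked up by hand. A small Python sketch of that lookup is shown here; the dictionary holds only a few of the entries listed in the !analyze output above, and the names REGION_TYPES and describe_region are hypothetical.

```python
# Subset of the "Type of corrupted region" table printed by !analyze -v.
REGION_TYPES = {
    0x00: "A generic data region",
    0x01: "Modification of a function or .pdata",
    0x19: "Loaded module list modification",
    0x1C: "Driver object corruption",
    0x1D: "Executive callback object modification",
    0x102: "Modification of win32k.sys",
}


def describe_region(arg4: str) -> str:
    """Translate the hexadecimal Arg4 string into a region description."""
    code = int(arg4, 16)
    return REGION_TYPES.get(code, f"unknown region type {code:#x}")


print(describe_region("000000000000001c"))
```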


    Tip: Always use the latest version of WinDBG; older versions may not tell you that 1c is for driver object corruption.


    Arg4 indicates this is driver object corruption, so the type dependent information provided by Arg3 should be the Driver object, right? Let’s check the object:


    1: kd> !drvobj ffffe001a2894a20

    Driver object (ffffe001a2894a20) is for:

    ffffe001a2894a20: is not a driver object


    Let’s try !pool


    1: kd> !pool ffffe001a2894a20

    Pool page ffffe001a2894a20 region is Nonpaged pool

    ffffe001a2894000 size:  510 previous size:    0  (Allocated)  FMcr

    ffffe001a2894510 size:   50 previous size:  510  (Allocated)  Wmip

    ffffe001a2894560 size:   60 previous size:   50  (Allocated)  NtfJ

    ffffe001a28945c0 size:   60 previous size:   60  (Allocated)  EtwR

    ffffe001a2894620 size:   60 previous size:   60  (Allocated)  EtwR

    ffffe001a2894680 size:   60 previous size:   60  (Allocated)  EtwR

    ffffe001a28946e0 size:   60 previous size:   60  (Allocated)  EtwR

    ffffe001a2894740 size:  210 previous size:   60  (Allocated)  Devi

    *ffffe001a2894950 size:  200 previous size:  210  (Allocated) *Driv

         Pooltag Driv : Driver objects

    ffffe001a2894b50 size:  2b0 previous size:  200  (Allocated)  Devi

    ffffe001a2894e00 size:  200 previous size:  2b0  (Allocated)  Driv


    So the address does belong to a driver object, but what is the base address of the nt!_DRIVER_OBJECT? If you don’t have experience with this, a quick method is to refer to a known driver object, for example:


    1: kd> !drvobj \driver\acpi

    Driver object (ffffe001a14df060) is for:


    1: kd> !pool ffffe001a14df060

    Pool page ffffe001a14df060 region is Nonpaged pool

    *ffffe001a14df000 size:  200 previous size:    0  (Allocated) *Driv

         Pooltag Driv : Driver objects

    ffffe001a14df200 size:   10 previous size:  200  (Free)       Free

    1: kd> ?ffffe001a14df060-ffffe001a14df000

    Evaluate expression: 96 = 00000000`00000060


    So it looks like the offset is 0x60. Let’s try again:


    1: kd> !drvobj ffffe001a2894950+0x60

    Driver object (ffffe001a28949b0) is for:



    Great, we got the object.


    Arg3 is ffffe001a2894a20, which is at offset 0x70 into the driver object.


    1: kd> ?ffffe001a2894a20-ffffe001a28949b0

    Evaluate expression: 112 = 00000000`00000070
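    The address arithmetic from the last few debugger commands can be double-checked outside the debugger. This Python sketch uses only the addresses shown above; the variable names are hypothetical, and it assumes (as the walkthrough does) that the distance from the start of the pool block to the driver object is the same for both drivers.

```python
# Known-good reference: \Driver\ACPI
ACPI_POOL_BLOCK = 0xFFFFE001A14DF000
ACPI_DRIVER_OBJECT = 0xFFFFE001A14DF060

# Pool block that contains the address from Arg3
SUSPECT_POOL_BLOCK = 0xFFFFE001A2894950
ARG3 = 0xFFFFE001A2894A20

# Distance from the start of the pool block to the driver object.
object_offset = ACPI_DRIVER_OBJECT - ACPI_POOL_BLOCK

# Apply the same distance to the suspect pool block.
suspect_driver_object = SUSPECT_POOL_BLOCK + object_offset

# Where inside the driver object does Arg3 point?
field_offset = ARG3 - suspect_driver_object

print(f"object offset:  {object_offset:#x}")
print(f"driver object:  {suspect_driver_object:#x}")
print(f"field offset:   {field_offset:#x}")
```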


    1: kd> dt nt!_DRIVER_OBJECT ffffe001a28949b0

       +0x000 Type             : 0n4

       +0x002 Size             : 0n336

       +0x008 DeviceObject     : 0xffffe001`a144c030 _DEVICE_OBJECT

       +0x010 Flags            : 0x92

       +0x018 DriverStart      : 0xfffff800`0d044000 Void

       +0x020 DriverSize       : 0x1f6000

       +0x028 DriverSection    : 0xffffe001`a142e2c0 Void

       +0x030 DriverExtension  : 0xffffe001`a2894b00 _DRIVER_EXTENSION

       +0x038 DriverName       : _UNICODE_STRING "\FileSystem\Ntfs"

       +0x048 HardwareDatabase : 0xfffff802`64b31580 _UNICODE_STRING "\REGISTRY\MACHINE\HARDWARE\DESCRIPTION\SYSTEM"

       +0x050 FastIoDispatch   : 0xfffff800`0d0ae640 _FAST_IO_DISPATCH

       +0x058 DriverInit       : 0xfffff800`0d06e280     long  Ntfs!GsDriverEntry+0

       +0x060 DriverStartIo    : (null)

       +0x068 DriverUnload     : 0xfffff800`0c8d5d24     void  +0

       +0x070 MajorFunction    : [28] 0xfffff800`0d126a10     long  Ntfs!NtfsFsdCreate+0


    The bugcheck parameters seem to indicate that the MajorFunction table is corrupted. Let’s look at the details:


    1: kd> !drvobj ffffe001a2894950+0x60 f

    Driver object (ffffe001a28949b0) is for:


    Driver Extension List: (id , addr)


    Device Object list:

    ffffe001a144c030  ffffe001a1449030  ffffe001a144f030  ffffe001a28947a0


    DriverEntry:   fffff8000d06e280  Ntfs!GsDriverEntry

    DriverStartIo: 00000000  

    DriverUnload:  fffff8000c8d5d24  vicm

    AddDevice:     00000000  


    Dispatch routines:

    [00] IRP_MJ_CREATE                      fffff8000d126a10 Ntfs!NtfsFsdCreate

    [01] IRP_MJ_CREATE_NAMED_PIPE           fffff802645809ac nt!IopInvalidDeviceRequest

    [02] IRP_MJ_CLOSE                       fffff8000d10b390 Ntfs!NtfsFsdClose

    [03] IRP_MJ_READ                        fffff8000d061590 Ntfs!NtfsFsdRead

    [04] IRP_MJ_WRITE                       fffff8000d05c3d0 Ntfs!NtfsFsdWrite

    [05] IRP_MJ_QUERY_INFORMATION           fffff8000d133ca4 Ntfs!NtfsFsdDispatchWait

    [06] IRP_MJ_SET_INFORMATION             fffff8000d130290 Ntfs!NtfsFsdSetInformation

    [07] IRP_MJ_QUERY_EA                    fffff8000d133ca4 Ntfs!NtfsFsdDispatchWait

    [08] IRP_MJ_SET_EA                      fffff8000d133ca4 Ntfs!NtfsFsdDispatchWait

    [09] IRP_MJ_FLUSH_BUFFERS               fffff8000d0e9e94 Ntfs!NtfsFsdFlushBuffers

    [0a] IRP_MJ_QUERY_VOLUME_INFORMATION    fffff8000d1356b0 Ntfs!NtfsFsdDispatch

    [0b] IRP_MJ_SET_VOLUME_INFORMATION      fffff8000d1356b0 Ntfs!NtfsFsdDispatch

    [0c] IRP_MJ_DIRECTORY_CONTROL           fffff8000d12d2f0 Ntfs!NtfsFsdDirectoryControl

    [0d] IRP_MJ_FILE_SYSTEM_CONTROL         fffff8000d131898 Ntfs!NtfsFsdFileSystemControl

    [0e] IRP_MJ_DEVICE_CONTROL              fffff8000d0ed194 Ntfs!NtfsFsdDeviceControl

    [0f] IRP_MJ_INTERNAL_DEVICE_CONTROL     fffff802645809ac nt!IopInvalidDeviceRequest

    [10] IRP_MJ_SHUTDOWN                    fffff8000d1eb730 Ntfs!NtfsFsdShutdown

    [11] IRP_MJ_LOCK_CONTROL                fffff8000d046230 Ntfs!NtfsFsdLockControl

    [12] IRP_MJ_CLEANUP                     fffff8000d12bde0 Ntfs!NtfsFsdCleanup

    [13] IRP_MJ_CREATE_MAILSLOT             fffff802645809ac nt!IopInvalidDeviceRequest

    [14] IRP_MJ_QUERY_SECURITY              fffff8000d1356b0 Ntfs!NtfsFsdDispatch

    [15] IRP_MJ_SET_SECURITY                fffff8000d1356b0 Ntfs!NtfsFsdDispatch

    [16] IRP_MJ_POWER                       fffff802645809ac nt!IopInvalidDeviceRequest

    [17] IRP_MJ_SYSTEM_CONTROL              fffff802645809ac nt!IopInvalidDeviceRequest

    [18] IRP_MJ_DEVICE_CHANGE               fffff802645809ac nt!IopInvalidDeviceRequest

    [19] IRP_MJ_QUERY_QUOTA                 fffff8000d133ca4 Ntfs!NtfsFsdDispatchWait

    [1a] IRP_MJ_SET_QUOTA                   fffff8000d133ca4 Ntfs!NtfsFsdDispatchWait

    [1b] IRP_MJ_PNP                         fffff8000d158bac Ntfs!NtfsFsdPnp


    Fast I/O routines:

    FastIoCheckIfPossible                   fffff8000d1d4090 Ntfs!NtfsFastIoCheckIfPossible

    FastIoRead                              fffff8000d0f98e0 Ntfs!NtfsCopyReadA

    FastIoWrite                             fffff8000d12f160 Ntfs!NtfsCopyWriteA

    FastIoQueryBasicInfo                    fffff8000d1390c0 Ntfs!NtfsFastQueryBasicInfo

    FastIoQueryStandardInfo                 fffff8000d123bb0 Ntfs!NtfsFastQueryStdInfo

    FastIoLock                              fffff8000d0dd54c Ntfs!NtfsFastLock

    FastIoUnlockSingle                      fffff8000d0dd848 Ntfs!NtfsFastUnlockSingle

    FastIoUnlockAll                         fffff8000d1d3330 Ntfs!NtfsFastUnlockAll

    FastIoUnlockAllByKey                    fffff8000d1d35ac Ntfs!NtfsFastUnlockAllByKey

    ReleaseFileForNtCreateSection           fffff8000d062814 Ntfs!NtfsReleaseForCreateSection

    FastIoQueryNetworkOpenInfo              fffff8000d0f051c Ntfs!NtfsFastQueryNetworkOpenInfo

    AcquireForModWrite                      fffff8000d04b6d8 Ntfs!NtfsAcquireFileForModWrite

    MdlRead                                 fffff8000d0eb2c0 Ntfs!NtfsMdlReadA

    MdlReadComplete                         fffff80264588594 nt!FsRtlMdlReadCompleteDev

    PrepareMdlWrite                         fffff8000d0eb574 Ntfs!NtfsPrepareMdlWriteA

    MdlWriteComplete                        fffff802649289c8 nt!FsRtlMdlWriteCompleteDev

    FastIoQueryOpen                         ffffe001a17d4540 +0xffffe001a17d4540

    ReleaseForModWrite                      fffff8000d04b4d4 Ntfs!NtfsReleaseFileForModWrite

    AcquireForCcFlush                       fffff8000d06656c Ntfs!NtfsAcquireFileForCcFlush

    ReleaseForCcFlush                       fffff8000d066524 Ntfs!NtfsReleaseFileForCcFlush


    We found two potential issues here: DriverUnload and FastIoQueryOpen.
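    One way to spot such entries systematically is to check whether each routine address falls inside the driver's own image, using DriverStart and DriverSize from the dt output. The Python sketch below is a heuristic under stated assumptions: it lists only a handful of the entries shown above, the names are hypothetical, and it treats addresses that resolve to nt! symbols as legitimate defaults.

```python
DRIVER_START = 0xFFFFF8000D044000  # DriverStart from dt nt!_DRIVER_OBJECT
DRIVER_SIZE = 0x1F6000             # DriverSize

# (address, symbol as resolved by the debugger; empty if none)
ENTRIES = {
    "DriverUnload": (0xFFFFF8000C8D5D24, "vicm"),
    "IRP_MJ_CREATE": (0xFFFFF8000D126A10, "Ntfs!NtfsFsdCreate"),
    "IRP_MJ_POWER": (0xFFFFF802645809AC, "nt!IopInvalidDeviceRequest"),
    "MdlReadComplete": (0xFFFFF80264588594, "nt!FsRtlMdlReadCompleteDev"),
    "FastIoQueryOpen": (0xFFFFE001A17D4540, ""),
}


def in_driver_image(address: int) -> bool:
    """True if the address lies within [DriverStart, DriverStart + DriverSize)."""
    return DRIVER_START <= address < DRIVER_START + DRIVER_SIZE


suspects = [
    name
    for name, (address, symbol) in ENTRIES.items()
    if not in_driver_image(address) and not symbol.startswith("nt!")
]
print(suspects)
```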


    Use FastIoQueryOpen as an example:


    1: kd> u ffffe001a17d4540

    ffffe001`a17d4540 4d8bc8          mov     r9,r8

    ffffe001`a17d4543 4c8bc2          mov     r8,rdx

    ffffe001`a17d4546 488bd1          mov     rdx,rcx

    ffffe001`a17d4549 48b900407da101e0ffff mov rcx,0FFFFE001A17D4000h

    ffffe001`a17d4553 48b83c57910c00f8ffff mov rax,offset vicm+0x6973c (fffff800`0c91573c)

    ffffe001`a17d455d ffe0            jmp     rax


    1: kd> u fffff800`0c91573c


    fffff800`0c91573c 48895c2408      mov     qword ptr [rsp+8],rbx

    fffff800`0c915741 48896c2410      mov     qword ptr [rsp+10h],rbp

    fffff800`0c915746 4889742418      mov     qword ptr [rsp+18h],rsi

    fffff800`0c91574b 57              push    rdi

    fffff800`0c91574c 4883ec20        sub     rsp,20h


    Obviously, FastIoQueryOpen has been modified to execute code in the module vicm.sys.  DriverUnload has been modified in a similar manner.
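    The jump target can also be read straight from the instruction bytes in the first disassembly. On x64, "mov r64, imm64" is encoded as a REX.W prefix (48), an opcode byte (B8 plus the register number), and an 8-byte little-endian immediate. The following Python sketch decodes the two mov instructions shown above; the helper name imm64 is hypothetical.

```python
import struct

MOV_RCX = bytes.fromhex("48b900407da101e0ffff")  # mov rcx, imm64
MOV_RAX = bytes.fromhex("48b83c57910c00f8ffff")  # mov rax, imm64


def imm64(instruction: bytes) -> int:
    """Extract the 64-bit immediate from a 'mov r64, imm64' encoding."""
    if len(instruction) != 10 or instruction[0] != 0x48:
        raise ValueError("expected REX.W prefix and a 10-byte instruction")
    if not 0xB8 <= instruction[1] <= 0xBF:
        raise ValueError("expected opcode B8+r")
    return struct.unpack("<Q", instruction[2:])[0]


context = imm64(MOV_RCX)  # value loaded into rcx before the jump
target = imm64(MOV_RAX)   # address reached by 'jmp rax'

print(f"rcx = {context:#x}")
print(f"rax = {target:#x}")
```

    The decoded rax value matches the address the debugger resolved to vicm+0x6973c.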


    This matches the description from !analyze: “1) A driver has inadvertently or deliberately modified critical kernel code or data. See http://www.microsoft.com/whdc/driver/kernel/64bitPatching.mspx”.  Kernel patch protection does not allow the MajorFunction table of certain drivers to be modified; if this data is modified, the system will bugcheck as seen here.  The next step was to remove the vicm.sys driver. The result was positive: the machine no longer crashed.

  • Ntdebugging Blog

    How to identify a driver that calls a Windows API leading to a pool leak on behalf of NT Kernel?


    Hello, my name is Gurpreet Singh Jutla and I would like to share information on how we can trace the caller that ends up allocating pool with the “Se  ” pool tag. When we use the Windows debugger to investigate the pool allocation and the binary associated with this pool tag, we see that the NT Kernel is responsible for the allocations. But is the NT Kernel really responsible for a pool leak associated with this pool tag?


    Issue at hand

    • On Windows 2003 x86 we see that the paged pool is depleted and we are running into event ID 333.
    • We can see the same behavior on later versions of the OS and even on x64.
    • We see that the leaking pool tag is “Se  ”, which is the pool tag for security objects. Is a Microsoft component at fault, or is something calling an API and using security objects on a large scale?
    • This is Windows 2003 x86 and we have limited options to root cause the issue. We really need to find out why we end up with so many allocations for this tag.
    • We could enable Pool Tracking on the NT Kernel, but would that help?


    Step 1

    !vm 1  -> This tells us if there were any paged pool allocation failures


    0: kd> !vm 1


    *** Virtual Memory Usage ***

          Physical Memory:     2096922 (   8387688 Kb)

          Page File: \??\D:\pagefile.sys

            Current:  16779264 Kb  Free Space:  16552492 Kb

            Minimum:  16779264 Kb  Maximum:     16779264 Kb

          Available Pages:     1607242 (   6428968 Kb)

          ResAvail Pages:      1991659 (   7966636 Kb)

          Locked IO Pages:         656 (      2624 Kb)

          Free System PTEs:     163671 (    654684 Kb)

          Free NP PTEs:          32766 (    131064 Kb)

          Free Special NP:           0 (         0 Kb)

          Modified Pages:        10775 (     43100 Kb)

          Modified PF Pages:     10728 (     42912 Kb)

          NonPagedPool Usage:     7881 (     31524 Kb)

          NonPagedPool Max:      65279 (    261116 Kb)

          PagedPool 0 Usage:     67074 (    268296 Kb)

          PagedPool 1 Usage:      3266 (     13064 Kb)

          PagedPool 2 Usage:      3282 (     13128 Kb)

          PagedPool 3 Usage:      3268 (     13072 Kb)

          PagedPool 4 Usage:      3214 (     12856 Kb)

          PagedPool Usage:       80104 (    320416 Kb)

          PagedPool Maximum:    134144 (    536576 Kb)

          Session Commit:        14832 (     59328 Kb)

          Shared Commit:         19969 (     79876 Kb)

          Special Pool:              0 (         0 Kb)

          Shared Process:        19362 (     77448 Kb)

          Pages For MDLs:          146 (       584 Kb)

          PagedPool Commit:      80140 (    320560 Kb)

          Driver Commit:          1602 (      6408 Kb)

          Committed pages:      520485 (   2081940 Kb)

          Commit limit:        6241313 (  24965252 Kb)


    !poolused /t10 4  -> This will list the top consumers of paged pool. It is “Se  “ in our case.


    0: kd> !poolused /t10 4


    Sorting by Paged Pool Consumed


                   NonPaged                  Paged

    Tag     Allocs         Used     Allocs         Used


    Se           0            0     172204    232720312  General security allocations , Binary: nt!se

    MmSt         0            0      15231     31835696  Mm section object prototype ptes , Binary: nt!mm

    Ntff         9         1872      10434      8514144  FCB_DATA , Binary: ntfs.sys

    WD         384      1251376         27      5591040  UNKNOWN pooltag 'WD  ', please update pooltag.txt

    UlHT         0            0          1      4198400  Hash Table , Binary: http.sys

    NtfF         0            0       3259      3050424  FCB_INDEX , Binary: ntfs.sys

    Toke         0            0        966      3034216  Token objects , Binary: nt!se

    NtFs     13717       551176      19678      1758008  StrucSup.c , Binary: ntfs.sys

    IoNm         0            0      12034      1737360  Io parsing names , Binary: nt!io

    FSim         0            0      11336      1451008  File System Run Time Mcb Initial Mapping

    CM16         0            0        293      1437696  Internal Configuration manager allocations

    Wmit         6        11392         23      1376912  Wmi Trace

    NtFU         0            0       8719      1237232  usnsup.c , Binary: ntfs.sys

    Obtb         0            0        397       995600  object tables via EX handle.c , Binary: nt!ob

    Key          0            0       7245       753432  Key objects

    MmSm         0            0      11393       729152  segments used to map data files , Binary: nt!mm


    TOTAL    192495     33751032     347461    310172416
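    To quantify how dominant the tag is, the numbers in the !poolused output can be processed with a short script. This Python sketch copies the first columns of a few rows from the output above; the parsing assumes whitespace-separated columns and tags without embedded spaces, and the function name is hypothetical.

```python
ROWS = """\
Se           0            0     172204    232720312
MmSt         0            0      15231     31835696
Ntff         9         1872      10434      8514144
Toke         0            0        966      3034216
"""
TOTAL_PAGED_BYTES = 310172416  # paged "Used" column of the TOTAL line


def parse_poolused(text: str) -> dict:
    """Map each tag to (paged allocation count, paged bytes used)."""
    result = {}
    for line in text.splitlines():
        tag, _np_allocs, _np_used, paged_allocs, paged_used = line.split()[:5]
        result[tag] = (int(paged_allocs), int(paged_used))
    return result


usage = parse_poolused(ROWS)
se_allocs, se_bytes = usage["Se"]
se_share = se_bytes / TOTAL_PAGED_BYTES
se_avg_size = se_bytes / se_allocs

print(f"Se share of paged pool in use: {se_share:.1%}")
print(f"Se average allocation size:    {se_avg_size:.0f} bytes")
```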


    Step 2

    Once we know the pool tag, we need to find out which modules use it. Remember that a pool tag is a 4-character, case-sensitive string, so I search for “Se  ” with two trailing spaces, because !poolused shows that the Se tag is the one causing the issue. Omitting the spaces from the search string will give different results. The following command searches each module that is loaded on the system.


    !for_each_module s -a @#Base @#End "Se  "


    The result should be something like:


    0: kd> !for_each_module s -a @#Base @#End "Se  "

    8096a1ae  53 65 20 20 6a 0c 6a 01-e8 bd a9 f2 ff 3b c7 89  Se  j.j......;..

    8096a5e3  53 65 20 20 6a 0c 6a 01-e8 88 a5 f2 ff 3b c7 89  Se  j.j......;..

    8096af9e  53 65 20 20 50 6a 01 89-45 fc e8 cb 9b f2 ff 8b  Se  Pj..E.......

    8096c909  53 65 20 20 8d 1c 9d 10-00 00 00 53 6a 01 e8 5c  Se  .......Sj..\

    8096c9e3  53 65 20 20 53 6a 01 e8-89 81 f2 ff 8b f8 85 ff  Se  Sj..........

    8096ca3e  53 65 20 20 8d 04 85 10-00 00 00 50 6a 01 e8 27  Se  .......Pj..'

    8096caf2  53 65 20 20 8d 1c 9d 0c-00 00 00 53 6a 01 e8 73  Se  .......Sj..s

    8096cb3e  53 65 20 20 8d 1c 9d 0c-00 00 00 53 6a 01 e8 27  Se  .......Sj..'

    8096cb9d  53 65 20 20 56 6a 01 e8-cf 7f f2 ff 8b f0 3b f7  Se  Vj........;.

    8096cc44  53 65 20 20 6a 10 6a 01-e8 27 7f f2 ff 85 c0 0f  Se  j.j..'......

    8096cc65  53 65 20 20 6a 04 6a 01-e8 06 7f f2 ff 85 c0 0f  Se  j.j.........

    8096cc96  53 65 20 20 6a 04 6a 01-e8 d5 7e f2 ff 85 c0 0f  Se  j.j...~.....

    8096ccbe  53 65 20 20 6a 38 6a 01-e8 ad 7e f2 ff 8b f8 85  Se  j8j...~.....

    809718ac  53 65 20 20 74 42 8b 46-68 3b c3 74 0f 53 50 e8  Se  tB.Fh;.t.SP.

    80971b48  53 65 20 20 74 42 8b 46-68 3b c3 74 0f 53 50 e8  Se  tB.Fh;.t.SP.

    8097372b  53 65 20 20 6a 0c 6a 01-e8 40 14 f2 ff 85 c0 75  Se  j.j..@.....u

    80976c38  53 65 20 20 6a 0c 6a 01-e8 33 df f1 ff 3b c6 89  Se  j.j..3...;..

    80a20d1b  53 65 20 20 bf 00 01 00-00 57 56 e8 4d 3e e7 ff  Se  .....WV.M>..

    80a22698  53 65 20 20 c7 45 80 4f-62 43 6c c7 45 84 43 63  Se  .E.ObCl.E.Cc


    The above steps are also explained in detail at http://blogs.msdn.com/b/ntdebugging/archive/2012/08/31/troubleshooting-pool-leaks-part-3-debugging.aspx
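
    To make the search in Step 2 less magical, here is a minimal sketch of what a byte-pattern scan like "s -a" does conceptually. The function name find_tag, the fake module bytes, and the base address are all made up for illustration; this is not how the debugger is implemented.

```python
def find_tag(image, base, tag=b"Se  "):
    """Yield the virtual address of every occurrence of tag in image."""
    offset = image.find(tag)
    while offset != -1:
        yield base + offset
        offset = image.find(tag, offset + 1)

# Hypothetical module contents; some bytes are copied from the first hit above.
fake_module = b"\x68" + bytes.fromhex("536520206a0c6a01") + b"\x90\x68Se  \x50"
hits = [hex(address) for address in find_tag(fake_module, base=0x80800000)]
print(hits)
```

    Note that the tag must be passed with its two trailing spaces, exactly as in the debugger command.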


    Step 3

    Now the tough part begins. We need to run the “ln” command on each of the addresses shown in the output above. For example:


    0: kd> ln 8096c909 

    (8096c8cc)   nt!SeQueryInformationToken+0x3d   |  (8096cdc0)   nt!SeCaptureObjectTypeList


    The first symbol in the output (highlighted in the original post), nt!SeQueryInformationToken+0x3d, is the nearest function at or below that address. Search MSDN for each function found this way. The ones that do not have an MSDN article can likely be ignored. SeQueryInformationToken is publicly documented and hence callable by any loaded driver. For example: http://msdn.microsoft.com/en-in/library/windows/hardware/ff556690(v=vs.85).aspx
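
    The lookup that “ln” performs can be thought of as a nearest-symbol search. The sketch below assumes a tiny, hypothetical symbol table holding only the two symbols printed above; the helper name nearest_symbol is mine, not a debugger API.

```python
import bisect

# Hypothetical symbol table: (start address, name), sorted by address.
# The two entries are the ones printed by the ln output above.
SYMBOLS = [
    (0x8096C8CC, "nt!SeQueryInformationToken"),
    (0x8096CDC0, "nt!SeCaptureObjectTypeList"),
]

def nearest_symbol(address):
    """Return 'name+0xoffset' for the closest symbol at or below address."""
    starts = [start for start, _ in SYMBOLS]
    index = bisect.bisect_right(starts, address) - 1
    if index < 0:
        return None  # address is below every known symbol
    start, name = SYMBOLS[index]
    return f"{name}+{address - start:#x}"

print(nearest_symbol(0x8096C909))
```

    For the address used in the example this yields the same symbol and offset as the debugger output.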


    At this point we will make the assumption that it is a 3rd party driver making the calls to this Windows API.  It is, of course, possible that a Microsoft module is doing this, but if such a problem exists in our code it usually becomes apparent very quickly when we get flooded by support calls.


    Step 4

    Method 1

    We can use the “!for_each_module” debugger command to search the address space of each loaded module for the entry point of the function found in step 3.


    0: kd> !for_each_module s-d @#Base @#End 8096c8cc

    b94b5938  8096c8cc 8096000a 80960182 80894b78  ............xK..

    ba7d50f4  8096c8cc 8081e5e0 8096b42e 8096b40c  ................

    ba83624c  8096c8cc 8081757e 8082b4f2 8094d438  ....~u......8...

    bab8103c  8096c8cc 80960122 80960158 8096010a  ...."...X.......

    f74e2950  8096c8cc 8089e180 8096091a 80960638  ............8...

    f74f0b68  8096c8cc 8089e180 80959546 80884150  ........F...PA..

    f75cf094  8096c8cc 80960122 80960158 8096010a  ...."...X.......

    f797c110  8096c8cc 808847f0 80966c60 80894b78  .....G..`l..xK..

    f7b643c4  8096c8cc 80960054 809e437c 808eadc6  ....T...|C......


    The first column in the above output can then be passed to the “ln” command to get the nearest function to the address listed.


    0: kd> ln b94b5938

    *** ERROR: Module load completed but symbols could not be loaded for ABCMiniFilter.sys

    0: kd> ln f797c110

    *** ERROR: Module load completed but symbols could not be loaded for XVhdBusxxx.sys


    Method 2

    This is a bit more complicated. Run the lm command to dump all modules. You will get the output as follows


    0: kd> lm

    start    end        module name

    80800000 80a5d000   nt        

    80a5d000 80a89000   hal       

    b476c000 b4775000   asyncmac   (deferred)            

    b7f55000 b7fb6000   ABCD   (deferred)            

    b817b000 b81da000   eeCtrl     (deferred)            

    b82b6000 b82b9300   xyz   (deferred)            

    b87a2000 b87cc000   Fastfat   (deferred)

    b8f04000 b8f0de80   ABCMiniFilter   (deferred)                       



    Ignore the unloaded modules and the modules you know are Microsoft binaries. For each of the remaining modules, dump the image header with the !dh command, passing the start address of the module. Remember the first column is the start of the module and the second column is the end of the module.


    I picked one module for reference; you will need to run the command for each third party module.


    0: kd> !dh b8f04000 -f




         14C machine (i386)

           7 number of sections

    51F7D252 time date stamp Tue Jul 30 20:18:50 2013


           0 file pointer to symbol table

           0 number of symbols

          E0 size of optional header

         102 characteristics


               32 bit word machine



         10B magic #

        9.00 linker version

        8800 size of code

        1200 size of initialized data

           0 size of uninitialized data

        86B2 address of entry point

         480 base of code

             ----- new -----

    00010000 image base

          80 section alignment

          80 file alignment

           1 subsystem (Native)

        6.01 operating system version

        6.01 image version

        5.01 subsystem version

        9E80 size of image

         480 size of headers

       11E27 checksum

    00040000 size of stack reserve

    00001000 size of stack commit

    00100000 size of heap reserve

    00001000 size of heap commit

           0  DLL characteristics

           0 [       0] address [size] of Export Directory

        8A28 [      50] address [size] of Import Directory

        9180 [     498] address [size] of Resource Directory

           0 [       0] address [size] of Exception Directory

        9E80 [    1FD0] address [size] of Security Directory

        9680 [     6C0] address [size] of Base Relocation Directory

        7990 [      1C] address [size] of Debug Directory

           0 [       0] address [size] of Description Directory

           0 [       0] address [size] of Special Directory

           0 [       0] address [size] of Thread Storage Directory

        7A68 [      40] address [size] of Load Configuration Directory

           0 [       0] address [size] of Bound Import Directory

        7880 [     108] address [size] of Import Address Table Directory

           0 [       0] address [size] of Delay Import Directory

           0 [       0] address [size] of COR20 Header Directory

           0 [       0] address [size] of Reserved Directory


    The Import Address Table Directory line (highlighted in the original post) is the one that matters here. Its two values are the offset of the Import Address Table from the start of the module and the size of the table in bytes. In our case the table starts at offset 7880 and is 108 bytes long (both values are hexadecimal). Running the dps command from the start of the table to its end (start offset plus size) lists all the functions the module imports:


    0: kd> dps b8f04000+7880 b8f04000+7880+108

    b8f0b880  f7873892 fltMgr!FltGetStreamHandleContext

    b8f0b884  f7873988 fltMgr!FltAllocateContext

    b8f0b888  f787459c fltMgr!FltSetStreamHandleContext


    b8f0b938  8096c8cc nt!SeQueryInformationToken

    b8f0b93c  8096000a nt!RtlValidSid
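
    The address arithmetic behind the dps command can be sketched as follows. The helper name iat_range is hypothetical, and the 4-byte pointer size is an assumption based on the "14C machine (i386)" line in the header dump.

```python
POINTER_SIZE = 4  # assumed: 32-bit (i386) module, per the header dump above

def iat_range(module_base, iat_rva, iat_size):
    """Return (start, end, slot_count) for a module's Import Address Table."""
    start = module_base + iat_rva
    end = start + iat_size
    return start, end, iat_size // POINTER_SIZE

start, end, slots = iat_range(0xB8F04000, 0x7880, 0x108)
print(f"dps {start:x} {end:x}   ; {slots} pointer-sized slots")
```

    The computed start matches the first address in the dps output above.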



    As you can see, the module we searched has the API in its import table. Running the above command on each module and dumping the import tables will identify which modules import the API. However, this does not confirm that the driver actually calls the API and causes the leak. This is tedious and time consuming, but it identifies each binary that can call the above API, which leads to pool tag “Se  ” allocations on behalf of the NT kernel.



    We can disable the binaries that call the SeQueryInformationToken API one by one and see if the issue still persists. Please note that the boot/System drivers should not be disabled, as that can lead to a no boot situation. You can use Autoruns from Sysinternals to disable these drivers, and if you run into a no boot situation, boot back into the “last known good configuration” to help recover the box.

  • Ntdebugging Blog

    We Are Hiring Windows Escalation Engineers in Charlotte, Dallas, and Redmond


    Would you like to join the world’s best and most elite debuggers to enable the success of Microsoft solutions?


    As a trusted advisor to our top customers you will be working with the most experienced IT professionals and developers in the industry. You will influence our product teams in sustained engineering efforts to drive improvements in our products.


    This role involves deep analysis of product source code and debugging to solve problems in multi-million dollar configurations and will give you an opportunity to stretch your critical thinking skills. During the course of debugging, you will uncover opportunities to improve the customer experience while influencing the current and future design of our products.


    In addition to providing support to customers while being the primary interface to our sustained engineering teams, you will also have the opportunity to work with new technologies and unreleased software. Through our continuous investment in in-depth training and hands-on experience with tough customer challenges you will become the world’s best in this area. Expect to partner with many different roles at Microsoft, launching a very successful career!


    We have positions open at our sites in Charlotte, NC; Las Colinas, TX; and Redmond, WA.


    Learn more about what an Escalation Engineer does at:

    Profile: Ron Stock, CTS Escalation Engineer - Microsoft Customer Service & Support - What is CSS?

    Microsoft JobsBlog JobCast with Escalation Engineer Jeff Dailey

    Microsoft JobsBlog JobCast with Escalation Engineer Scott Oseychik


    Apply here:


  • Ntdebugging Blog

    Windows Troubleshooting – Stop 9E Explained


    What to do if a stop 9E occurs.  How you can solve the issue yourself.

  • Ntdebugging Blog

    Windows Troubleshooting – Special Pool


    The Windows Support team has a new YouTube channel, “Windows Troubleshooting”.  The first set of videos cover debugging blue screens.

    In this video, Bob Golding, Senior Escalation Engineer, describes how the Special Pool Windows diagnostics tool catches drivers that corrupt memory. Bob also introduces how memory is organized in the system for allocating memory for drivers.

  • Ntdebugging Blog

    Bugchecking a Computer on A Usermode Application Crash


    Hello, my name is Gurpreet Singh Jutla and I would like to share information on how we can bugcheck a box on any usermode application crash. When the application crash is reproducible, we can set the application as a critical process. We may sometimes need a complete memory dump to investigate the information from kernel mode on a usermode application crash or closure.


    We will use the operating system’s ability to mark a process as critical and cause the system to bugcheck when the critical process closes unexpectedly. This will generate either a CRITICAL_PROCESS_DIED or a CRITICAL_OBJECT_TERMINATION bugcheck.


    For this demonstration I will use the following code sample which waits for the user input and then causes an Access Violation. You can use the following steps to collect a complete memory dump for any application crash that launches fine but crashes under known repro conditions.


    Code Sample

    #include <conio.h>   //Needed for _getch
    int main()
    {
          _getch();      //Wait for a key press
          *(char*)0xdeaddead ='B';      //Causes the Access Violation
    }


    Please follow the steps below

    1. Set the system for a complete memory dump by opening the “Advanced System settings” under System properties in control panel and then setting the value of “Write debugging information” under “Startup and recovery” options on the advanced tab.

    2. Also enable the debug mode by running the following command from a command prompt
      bcdedit -debug on
    3. Restart the box so that the “Complete memory dump” and debug mode settings take effect.
    4. Run the application you want to setup as critical process but do not run the repro steps. I have compiled my test application as test.exe
    5. Download and install the Debugging Tools for Windows, part of SDK which you can download from http://msdn.microsoft.com/en-us/windows/desktop/bg162891.aspx. Note, when the installer launches you can uncheck every feature except Debugging Tools for Windows.
    6. We need to setup the debugger to use the public symbols. Create a folder c:\symbols. Run Windbg with admin privileges, choose “File” menu and then “Symbol file path”. Type SRV*c:\symbols*http://msdl.microsoft.com/download/symbols
      For more details check http://support.microsoft.com/kb/311503/en-us
    7. Assuming you have the debugger installed and setup with the public symbols, launch the debugger with admin privileges.
    8. From the file menu select kernel debug and then choose the “Local” tab and hit Ok button. This will connect the windbg to the local kernel. You should see an “lkd>” prompt.
    9. Run the following command to get the process information in windbg. The examples below show the output on x64 first and then on x86.

      0: kd> !process 0 0 test.exe

      PROCESS fffffa82fa924b30

          SessionId: 0  Cid: 036c    Peb: 7fffffda000  ParentCid: 02e4

         DirBase: 1085d76000  ObjectTable: fffff8a0042d7970  HandleCount: 11.

          Image: test.exe

      0: kd> !process 0 0 test.exe

      PROCESS 89038a08  SessionId: 0  Cid: 10f0    Peb: 7ffde000  ParentCid: 0f10

          DirBase: bfa19900  ObjectTable: e669b630  HandleCount: 11.

          Image: test.exe


    11. Take the process address (the value after PROCESS) from the output and run the following command, which shows the process flags. In the examples the flags are 0x144d0801 for x64 and 0x450801 for x86.

      0: kd> dt nt!_eprocess fffffa82fa924b30 flags

         +0x440 Flags : 0x144d0801

      0: kd> dt 89038a08 nt!_eprocess flags

         +0x240 Flags : 0x450801


    13. Run the ed command to edit the memory and set the process flags to mark the process critical. Adding the value 0x2000 marks the process critical. Use the Flags offset that the dt command reported for your system.

      0: kd> ed fffffa82fa924b30+0x440 0x144d0801+0x2000

      0: kd> ed 89038a08+0x240 0x450801+0x2000


    15. Now close the debugger and proceed with the repro steps to crash or close the application.
    16. In our case the test application with the code mentioned above should cause the machine to bugcheck as soon as any key is pressed.


    The complete memory dump will contain the process information as well as kernel data for investigation.
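
    As a quick check of the value written by the ed command, the sketch below computes the new flags for both example processes. It uses a bitwise OR rather than the addition shown in the steps; that is my substitution, chosen because OR stays correct even if the bit is already set, whereas addition is only equivalent when the 0x2000 bit is clear (as it is in both examples).

```python
CRITICAL_FLAG = 0x2000  # value the steps above add to mark a process critical

def mark_critical(flags):
    """Return flags with the critical-process bit set."""
    return flags | CRITICAL_FLAG

for flags in (0x144D0801, 0x450801):
    print(f"{flags:#x} -> {mark_critical(flags):#x}")
```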

  • Ntdebugging Blog

    Understanding ARM Assembly Part 3


    My name is Marion Cole, and I am a Sr. Escalation Engineer in Microsoft Platforms Serviceability group.  This is Part 3 of my series of articles about ARM assembly.  In part 1 we talked about the processor that is supported.  In part 2 we talked about how Windows utilizes that ARM processor.  In this part we will cover Calling Conventions, Prolog/Epilog, and Rebuilding the stack.


    Calling Conventions

    In ARM there is only one calling convention, and it is simple.  The first four 32 bit or smaller variables are passed in R0-R3.  The remaining values go onto the stack.  If any of the first four variables are 8 or 16 bit in size then they will be padded with zeros to fill the 32-bit register.  If any of the first four variables are 64 bit in size then they have to be 64 bit aligned.  That means that the variable will be split across an even/odd register pair, for example R0/R1 or R2/R3.  Here are three examples:

    [Diagrams: register and stack layout for each of the following three calls]

    1. Foo (int I0, int I1, int I2, int I3)
    2. Foo (int I0, double D, int I1)
    3. Foo (int I0, int I1, double D)

    In the first example the function Foo takes four integer values.  All of these are passed in the registers R0 - R3.  This one is pretty simple.


    In the second example the function Foo takes an integer, a double, and another integer.  The first integer is put into R0.  However, the double has to be in an even/odd pair, so R1 is left unused and the double gets put into R2/R3.  The last integer is pushed onto the stack.  Programmers are advised to avoid this ordering and instead arrange parameters so that they fill the registers, as in the third example.  Also in this example the stack has to stay 8 byte aligned, so an additional unused word is pushed and popped in order to keep the alignment.  Also note that on ARM a Byte is 8 bits, a Halfword is 16 bits, and a Word is 32 bits.


    In the third example the function Foo takes two integers and a double.  As you can see the first two variables are integers and they go in R0 and R1 respectively.  The last variable the double will then be aligned to go into R2/R3.
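
    The three examples can be reproduced with a small model of the rules described above. This is only a sketch of those rules (32-bit values take the next register, 64-bit values take an even/odd pair, everything else goes to the stack); the function name assign_core_registers is hypothetical and the model ignores structs and floating point registers.

```python
def assign_core_registers(params):
    """params: list of (name, size_in_bits). Returns {name: location}."""
    locations = {}
    next_reg = 0                          # next free register among R0-R3
    for name, bits in params:
        if bits == 64:
            next_reg += next_reg % 2      # skip ahead to an even register
            if next_reg + 1 <= 3:
                locations[name] = f"R{next_reg}/R{next_reg + 1}"
                next_reg += 2
                continue
        elif next_reg <= 3:
            locations[name] = f"R{next_reg}"
            next_reg += 1
            continue
        next_reg = 4                      # registers exhausted; rest use the stack
        locations[name] = "stack"
    return locations

print(assign_core_registers([("I0", 32), ("D", 64), ("I1", 32)]))
```

    Reordering the parameters as in the third example keeps every argument in a register.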


    The registers R4-R11 are used to hold the values of the local variables of a subroutine.  A subroutine is required to preserve on the stack the contents of the registers R4-R8, R10, R11, and SP.


    Return values are always in R0 unless they are 64 bits in size then a combination of R0 and R1 is used.


    The calling convention for floating point values is pretty much the same.  A function can have up to 16 single-precision values in S0-S15, or 8 double-precision values in D0-D7, or 4 SIMD vectors in Q0-Q3.  For example, suppose you have a function that takes the following combination:

    Float, double, double, float


    They will go into S0, D1, D2, S1 respectively.  These are aggressively back-filled.
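
    Back-filling can be illustrated with a simplified model in which each D register overlays a pair of S registers (Dn covers S2n and S2n+1). The function name is hypothetical, and the sketch leaves out SIMD vectors and the exact rules that apply once arguments start spilling to the stack.

```python
def assign_vfp_registers(kinds):
    """kinds: list of 'float' or 'double'. Returns register names in order."""
    free = [True] * 16            # S0-S15; Dn overlays S(2n) and S(2n+1)
    result = []
    for kind in kinds:
        if kind == "float":
            # A float takes the lowest free S register, which may back-fill a gap.
            slot = next((i for i in range(16) if free[i]), None)
            if slot is None:
                result.append("stack")
                continue
            free[slot] = False
            result.append(f"S{slot}")
        else:
            # A double takes the lowest free even/odd pair of S registers.
            slot = next((i for i in range(0, 16, 2) if free[i] and free[i + 1]), None)
            if slot is None:
                result.append("stack")
                continue
            free[slot] = free[slot + 1] = False
            result.append(f"D{slot // 2}")
    return result

print(assign_vfp_registers(["float", "double", "double", "float"]))
```

    The second float lands in S1 because the first double could not use the S0/S1 pair, which left S1 free.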


    Floating point return values are in S0/D0/Q0 as appropriate by size.


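    The remaining registers are not used for arguments: S16-S31/D8-D15/Q4-Q7 are non-volatile and must be preserved by the called function, while D16-D31/Q8-Q15 are volatile.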


    Prolog and Epilog

    The Prolog on an ARM processor does the same thing as on an x86 processor: it stores registers on the stack and adjusts the frame pointer.  Let's look at a simple example from hal!KfLowerIrql.



    push        {r3,r4,r11,lr}  ; save non-volatiles regs used, r11, lr
    addw        r11,sp,#8       ; new frame pointer value in r11...

    ...                         ; stack used in prolog is multiple of 8


    As you can see the push instruction is different than x86.  On x86 we would have four push instructions to do the same thing that ARM is doing in one instruction.  This stores the registers in consecutive memory locations ending just below the address in SP, and updates SP to point to the start of the stored location.  The lowest numbered register is stored in the lowest memory address, through to the highest numbered register to the highest memory address.  We can see that here:


    1: kd> r

    r0=0000000f  r1=e1070180  r2=00000000  r3=e0eb3675  r4=e1048cc8  r5=e10651fc

    r6=00001000  r7=0000006a  r8=c5561d10  r9=0000000f r10=e10acc80 r11=c5561d08

    r12=ef890f1c  sp=c5561cc8  lr=e1298a0f  pc=e0eb3678 psr=400001b3 -Z--- Thumb


    1: kd> dds c5561cc8 c5561d08

    c5561cc8  e0eb3675   <-- r3

    c5561ccc  e1048cc8   <-- r4

    c5561cd0  c5561d08   <-- r11

    c5561cd4  e1298a0f   <-- lr


    The addw instruction is setting up the new frame pointer.  This will add 8 to the value in sp, and store that in r11 which is the frame pointer.  Here is what that looks like in the debugger:


    kd> r

    r0=0000000f  r1=00000002  r2=00000002  r3=e133b675  r4=77e31f15  r5=02cc9ad5

    r6=00000000  r7=e1035580  r8=0000000f  r9=00000000 r10=e22cb710 r11=e22cb5b8

    r12=26ebcf96  sp=e22cb5b0  lr=e0f2560b  pc=e133b67c psr=400000b3 -Z--- Thumb



    As you can see r11 is now 8 higher than sp.
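
    The arithmetic of this prolog can be checked with the values from the first register dump above. The sketch below is only a model of the push and addw instructions, not debugger output; it shows that r11 ends up pointing at the stack slot that holds the saved r11, with the saved lr in the next slot.

```python
WORD = 4
pushed = ["r3", "r4", "r11", "lr"]        # push {r3,r4,r11,lr}
sp_after_push = 0xC5561CC8                # sp from the first register dump above

# The lowest-numbered register is stored at the lowest address.
slots = {reg: sp_after_push + i * WORD for i, reg in enumerate(pushed)}
frame_pointer = sp_after_push + 8         # addw r11,sp,#8

print({reg: hex(address) for reg, address in slots.items()})
print(hex(frame_pointer))
```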


    Now let's look at the Epilog for hal!KfLowerIrql.  It is pretty simple as it is one command.



    pop         {r3,r4,r11,pc}  ; restore non-volatile regs, r11, return


    This is going to pop the first three registers from the stack back into their original registers.  However the last one is popping what was the link register (lr) into the program counter (pc).  This acts as a return, performing a similar function as what the RET instruction does on x86 but without using a unique instruction.  Program flow is controlled by manipulating the pc register.  Here is what this looks like in the debugger.


    The registers before the pop instruction runs:

    kd> r

    r0=0000000f  r1=00000006  r2=00000000  r3=e1035000  r4=0000000f  r5=306f0a07

    r6=00000000  r7=e1035580  r8=0000000f  r9=00000000 r10=e22c9260 r11=e22c9108

    r12=26ebaae6  sp=e22c9100  lr=e0f2560b  pc=e133b6b4 psr=200000b3 --C-- Thumb


    e133b6b4 e8bd8818 pop         {r3,r4,r11,pc}


    The registers after the pop instruction runs:

    kd> r

    r0=0000000f  r1=00000006  r2=00000000  r3=e133b675  r4=51cae4a2  r5=2aede545

    r6=00000000  r7=e1035580  r8=0000000f  r9=00000000 r10=e22c8d20 r11=e22c8c10

    r12=26eba5a6  sp=e22c8bd0  lr=e0f2560b  pc=e0f2560a psr=200000b3 --C-- Thumb


    Now we are going to complicate this a bit by showing a function that has local variables, NtCreateFile.



    push        {r4,r5,r11,lr}  ; save non-volatiles regs used, r11, lr    

    addw        r11,sp,#8       ; new frame pointer value in r11
    sub         sp,sp,#0x30     ; local variables

    ...                         ; stack used in prolog is multiple of 8


    Notice that this looks the same as the previous prolog, but one line is added.  The sub sp,sp,#0x30 is used to make stack space available for local variables.  This adds one instruction to the Epilog as well.


    Epilog :

    add          sp,sp,#0x30     ; cleanup local variables
    pop         {r4,r5,r11,pc}   ; restore non-volatile regs, r11, return


    The add sp,sp,#0x30 is used to clean up the stack of the local variables.


    One more prolog/epilog example.  This one is of IopCreateFile.  It saves the arguments that come in to the stack first.


    Prolog :

    push        {r0-r3}           ; save r0-r3
    push        {r4-r11,lr}       ; save non-volatiles r4-r10, r11, lr
    addw       r11,sp,#0x1c       ; new frame pointer value in r11
    sub          sp,sp,#0x3c      ; local variables

    ...                           ; stack used in prolog is multiple of 8


    As you can see this prolog is mostly the same, there is just one additional line for pushing the r0-r3 argument registers to the stack.


    The epilog for this one is a little different.



    add         sp,sp,#0x4c        ; cleanup local variables from stack
    pop         {r4-r11}           ; restore non-volatiles, frame pointer r11
    ldr          pc,[sp],#0x14     ; return and cleanup 0x14 bytes (lr,r0-r3)


    Notice that the pop is not putting lr into pc for a return.  Instead the last statement takes care of the pc register.  This ldr instruction uses post-indexed addressing: it loads pc from the address in sp (where the saved lr is), and then adds 0x14 to sp.  This returns and cleans up lr and the saved arguments r0-r3 (0x14 bytes in total) from the stack at the same time.  This ldr instruction is similar to the ret instruction on x86.
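
    Post-indexed addressing is easy to misread, so here is a minimal model of the final ldr instruction. The stack address and the argument values are made up for illustration; only the return address is borrowed from the lr value in the earlier register dumps.

```python
def ldr_pc_post_indexed(memory, sp, increment):
    """Model of 'ldr pc,[sp],#increment': load from [sp] first, then adjust sp."""
    pc = memory[sp]
    return pc, sp + increment

# Hypothetical stack contents after 'pop {r4-r11}': saved lr, then r0-r3.
sp = 0x1000
memory = {sp: 0xE0F2560B, sp + 4: 0xA0, sp + 8: 0xA1, sp + 12: 0xA2, sp + 16: 0xA3}
pc, sp_after = ldr_pc_post_indexed(memory, sp, 0x14)

print(hex(pc), hex(sp_after))
```

    The increment of 0x14 is one word for lr plus four words for r0-r3.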


    The last thing we are going to cover is called a "Leaf function".  A Leaf function executes in the context of the caller.  It does not have a prolog and does not use the stack.  It only uses volatile registers r0-r3, and r12.  It returns via the "bx lr" command.  Example of this is KeGetCurrentIrql.  Here is what it looks like in the debugger.


    kd> uf hal!KeGetCurrentIrql

    hal!KeGetCurrentIrql:

      211 e132b650 f3ef8300 mrs         r3,cpsr

      216 e132b654 f0130f80 tst         r3,#0x80

      216 e132b658 d103     bne         hal!KeGetCurrentIrql+0x12 (e132b662)


      216 e132b65a b672     cpsid       i

      216 e132b65c 0000     movs        r0,r0

      216 e132b65e 2201     movs        r2,#1

      216 e132b660 e000     b           hal!KeGetCurrentIrql+0x14 (e132b664)


      216 e132b662 2200     movs        r2,#0


      217 e132b664 ee1d3f90 mrc         p15,#0,r3,c13,c0,#4

      217 e132b668 7f18     ldrb        r0,[r3,#0x1C]

      218 e132b66a b10a     cbz         r2,hal!KeGetCurrentIrql+0x20 (e132b670)


      218 e132b66c b662     cpsie       i

      218 e132b66e 0000     movs        r0,r0


      220 e132b670 4770     bx          lr


    The stack must remain 4 byte aligned at all times, and must be 8 byte aligned at any function boundary.  This is due to the frequent use of interlocked operations on 64-bit stack variables.


    Functions which need to use a frame pointer (for example, if alloca is used) or which dynamically change the stack pointer within their body, must set up the frame pointer in the function prologue and leave it unchanged until the epilog. Functions which do not need a frame pointer must perform all stack updating in the prolog and leave the SP unchanged until the epilog.


    Rebuilding the Stack

    Here we are going to discuss how to rebuild the stack from the frame pointer.


    The frame pointer points to the top of the stack area for the current function, or it is zero if not being used.  By using the frame pointer and storing it at the same offset for every function call, it creates a singly linked list of activation records.


    The frame pointer register points to the stack backtrace structure for the currently executing function. 


    The saved frame pointer value is (zero or) a pointer to the stack backtrace structure created by the function which called the current function. 


    The saved frame pointer in this structure is a pointer to the stack backtrace structure for the function that called the function that called the current function; and so on back until the first function. 



    In the below diagram Main calls Foo which calls Bar
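
    Walking the linked list of frames can be sketched as below. The memory contents are entirely hypothetical. The layout assumption (the saved return address sits one word above the saved frame pointer) follows from the prologs shown earlier, where r11 and lr are pushed together and r11 is set to the address of the saved r11.

```python
def walk_frames(memory, frame_pointer, max_frames=64):
    """Follow saved frame pointers; return the saved return addresses in order."""
    return_addresses = []
    while frame_pointer and len(return_addresses) < max_frames:
        saved_fp = memory[frame_pointer]          # frame pointer of the caller
        saved_lr = memory[frame_pointer + 4]      # return address into the caller
        return_addresses.append(saved_lr)
        frame_pointer = saved_fp
    return return_addresses

# Hypothetical stack: frame of Bar -> frame of Foo -> frame of Main -> 0 (end).
memory = {
    0x2000: 0x2020, 0x2004: 0x00401234,   # Bar: saved r11, return into Foo
    0x2020: 0x2040, 0x2024: 0x00401100,   # Foo: saved r11, return into Main
    0x2040: 0x0000, 0x2044: 0x00401000,   # Main: end of chain, return to startup
}
print([hex(address) for address in walk_frames(memory, 0x2000)])
```

    Each return address can then be resolved to a function name, for example with the ln command.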



    For more information about ARM Debugging check out this article from T.Roy at Code Machine:


  • Ntdebugging Blog

    Debugging a Windows 8.1 Store App Crash Dump (Part 2)


    In Part 1, we covered the debugging of a Windows Store Application crash dump that contains a Stowed Exceptions Version 1 (SE01) structure.


    This post continues on from Part 1, covering the changes introduced in March 2014. These Windows Updates changed the way language exceptions (RoOriginateLanguageException) are recorded in Windows Store Application crash dump files. The new Stowed Exception Version 2 (SE02) structure adds additional fields that directly associate the exception with a language exception object.


    You’ll recall from Part 1 that the CLR Exception is loosely associated with the Stowed Exception v1 structure by comparing the HRESULT of the Stowed Exception with the HRESULT of the last CLR Exception on the default thread (the exception record thread). V2 makes this relationship direct. You’ll discover that the Last CLR Exception no longer exists in the v2 dump and that it must be referenced directly by the address stored in the Stowed Exception.


    The direct association was added to v2 to also aid triage dump carving (done by Windows Error Reporting). It allows WER to explicitly add the memory associated with the relevant Language (CLR) Exception. This eliminates the risk of the garbage collector freeing the memory associated with the last CLR Exception before the dump is taken.  This also helps identify which exception is related to the final crash, which can be difficult when there are multiple exceptions in the dump.


    Debug Steps

    The steps to debug a v2 structure are similar to v1. You first determine the number of stowed exception entries (.exr -1), look at the header to determine the version, display the array of stowed exceptions cast to the correct type (dt -aN …), and then extract the native stack (dpS) or text (du) for each entry.


    Instead of then comparing the HRESULT to the last CLR Exception (!sos.pe), you use the Nested Exception member to get to the innermost CLR Exception. Due to the way object pointers are handled by the CLR, the address is a CCW (COM Callable Wrapper) address, not a CLR object address. To get the CLR object’s address, you use the !sos.dumpccw command. This provides the CLR object address, which can be passed to the !sos.pe command to display the exception.


    OK, let’s do all of that, showing the commands and data fields of note along the way. (A lot of this is similar to the previous post.)


    If not done already, set your symbol path to the Microsoft Public Symbol server:

    0:003> .sympath SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols

    Symbol search path is: SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols

    Expanded Symbol search path is: srv*c:\Symbols*http://msdl.microsoft.com/download/symbols

    ************* Symbol Path validation summary **************

    Response                         Time (ms)     Location

    Deferred                                       SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols


    Force the load of the symbols using the .reload /f command:

    0:003> .reload /f



    The next step is to display the pointer array as the original structure type. First, we need to know what structure to cast the pointer array to. Using the Parameter[0] value from .exr -1, we will generate a dt command that will display the header of the first record. We use Parameter[0] as the address in this command.

    dt  <Parameter[0]> combase!STOWED_EXCEPTION_INFORMATION_HEADER*


    Here’s an example:

    0:003> .exr -1

    ExceptionAddress: 7575b152 (combase!RoFailFastWithErrorContextInternal+0x0000010b)

       ExceptionCode: c000027b

      ExceptionFlags: 00000001

    NumberParameters: 2

       Parameter[0]: 00c6d3d0

       Parameter[1]: 00000002


    0:003> dt 00c6d3d0 combase!_STOWED_EXCEPTION_INFORMATION_HEADER*


       +0x000 Size             : 0x28

       +0x004 Signature        : 0x53453032


    The value of the Signature member (0x53453032) is converted to a string using .formats <value>.

    0:003> .formats 0x53453032

    Evaluate expression:

      Hex:     53453032

      Decimal: 1397043250

      Octal:   12321230062

      Binary:  01010011 01000101 00110000 00110010

      Chars:   SE02

      Time:    Wed Apr 09 04:34:10 2014

      Float:   low 8.46917e+011 high 0

      Double:  6.90231e-315

    • “SE01” maps to combase!STOWED_EXCEPTION_INFORMATION_V1
    • “SE02” maps to combase!STOWED_EXCEPTION_INFORMATION_V2
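
    The same conversion can be done outside the debugger. The sketch below decodes the Signature value to its four characters and looks up the matching type name; the dictionary only contains the two mappings listed above, and the function name signature_to_type is hypothetical.

```python
SIGNATURE_TYPES = {
    "SE01": "combase!STOWED_EXCEPTION_INFORMATION_V1",
    "SE02": "combase!STOWED_EXCEPTION_INFORMATION_V2",
}

def signature_to_type(signature):
    """Decode a 32-bit signature to text and return (chars, type name or None)."""
    chars = signature.to_bytes(4, "big").decode("ascii")
    return chars, SIGNATURE_TYPES.get(chars)

print(signature_to_type(0x53453032))
```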


    Now that we know the type, we can again use the values from .exr -1 to generate a dt command that will display each record. We use the Parameter[0] as the address, and Parameter[1] as the count in the command. We add a “P” to the start of the type as this is an array of pointers to the type, not structures packed next to each other.


    In this example, there are 2 pointers, so 2 records are displayed:

    dt -a<Parameter[1]> <Parameter[0]> combase!PSTOWED_EXCEPTION_INFORMATION_V2


    Note, there is no space between the -a and <Parameter[1]>.

    0:003> dt -a2 00c6d3d0 combase!PSTOWED_EXCEPTION_INFORMATION_V2

    [0] @ 00c6d3d0



       +0x000 Header           : _STOWED_EXCEPTION_INFORMATION_HEADER

       +0x008 ResultCode       : 80004001

       +0x00c ExceptionForm    : 0y01

       +0x00c ThreadId         : 0y000000000000000000100000001111 (0x80f)

       +0x010 ExceptionAddress : 0x756b3bff Void

       +0x014 StackTraceWordSize : 4

       +0x018 StackTraceWords  : 3

       +0x01c StackTrace       : 0x0619a368 Void

       +0x010 ErrorText        : 0x756b3bff  "???"

       +0x020 NestedExceptionType : 0x314f454c

       +0x024 NestedException  : 0x063a95d4 Void


    [1] @ 00c6d3d4



       +0x000 Header           : _STOWED_EXCEPTION_INFORMATION_HEADER

       +0x008 ResultCode       : 80004001

       +0x00c ExceptionForm    : 0y01

       +0x00c ThreadId         : 0y000000000000000000000000000000 (0)

       +0x010 ExceptionAddress : (null)

       +0x014 StackTraceWordSize : 4

       +0x018 StackTraceWords  : 0x3f

       +0x01c StackTrace       : 0x0639bf4c Void

       +0x010 ErrorText        : (null)

       +0x020 NestedExceptionType : 0

       +0x024 NestedException  : (null)
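
    To tie the steps above together, here is a minimal Python sketch that builds the dt command from the two .exr -1 parameters and the header's Signature value. The helper name stowed_array_command and its lookup table are hypothetical and only mirror the SE01/SE02 mapping listed earlier; this is not part of any debugger API.

```python
# Hypothetical helper: turns the values printed by ".exr -1" and the header's
# Signature member into the dt command described above.
SIGNATURE_VERSIONS = {"SE01": "V1", "SE02": "V2"}  # mapping listed in the article


def stowed_array_command(parameter0, parameter1, signature):
    """Build the dt command string for the array of stowed exception pointers.

    parameter0 / parameter1 are the hex strings shown by .exr -1;
    signature is the integer value of the Signature member.
    """
    # Same character order as the Chars line of .formats (high byte first).
    chars = signature.to_bytes(4, "big").decode("ascii")
    version = SIGNATURE_VERSIONS[chars]
    # Drop leading zeros from the count; there is no space after -a.
    count = parameter1.lstrip("0") or "0"
    return "dt -a{} {} combase!PSTOWED_EXCEPTION_INFORMATION_{}".format(
        count, parameter0, version
    )


print(stowed_array_command("00c6d3d0", "00000002", 0x53453032))
# dt -a2 00c6d3d0 combase!PSTOWED_EXCEPTION_INFORMATION_V2
```

    The printed string matches the command used in the example above, so the sketch can be used as a quick sanity check when typing the command by hand.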


    Native Call Stack

    Regardless of whether the error code (ResultCode) is known or unknown, it is useful to determine the location of the (native) issue by viewing the (native) call stack.


    Symbol Pointers

    If the ExceptionForm member has a value of 0y01, the structure’s union represents a call stack.


    Unlike call stacks associated with threads, where the symbol pointers are placed throughout the stack next to local variables, these symbols pointers are packed tightly at the address specified in the StackTrace member. Think of it as an array of EBP addresses. The dpS command is used to display the call stack.

    • It is important to include a limit (L) as the call stack is regularly longer than the default 10 rows displayed by dpS. The limit’s value is in the StackTraceWords member.
    • Note that capital S is used (dps vs dpS) because we want to omit the first column normally displayed by dps; the location of the symbol pointer is irrelevant.
    • If you aren’t using a debugger with the same bitness as the target, use ddS for StackTraceWordSize = 4 (32-bit), and dqS for StackTraceWordSize = 8 (64-bit).

    0:003> dt -a2 00c6d3d0 combase!PSTOWED_EXCEPTION_INFORMATION_V2

    [0] @ 00c6d3d0



       +0x000 Header           : _STOWED_EXCEPTION_INFORMATION_HEADER

       +0x008 ResultCode       : 80004001

       +0x00c ExceptionForm    : 0y01

       +0x00c ThreadId         : 0y000000000000000000100000001111 (0x80f)

       +0x010 ExceptionAddress : 0x756b3bff Void

       +0x014 StackTraceWordSize : 4

       +0x018 StackTraceWords  : 3

       +0x01c StackTrace       : 0x0619a368 Void

       +0x010 ErrorText        : 0x756b3bff  "???"

       +0x020 NestedExceptionType : 0x314f454c

       +0x024 NestedException  : 0x063a95d4 Void


    0:003> dpS 0x619a368 L3

    756ea9f1 combase!RoOriginateLanguageException+0x3b

    63b2b04d clr!SetupErrorInfo+0x1e1

    63bf4511 clr!MarshalNative::GetHRForException_WinRT+0x7d
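
    The choice between dpS, ddS and dqS can be summarized in a short Python sketch. The function name stack_dump_command is hypothetical, and the count is written in hex on the assumption that the debugger is using its default radix of 16.

```python
def stack_dump_command(stack_trace, words, word_size, same_bitness=True):
    """Return the command that dumps the packed symbol pointers.

    stack_trace: value of the StackTrace member
    words:       value of the StackTraceWords member (used as the L limit)
    word_size:   value of the StackTraceWordSize member (4 or 8)
    """
    if same_bitness:
        command = "dpS"   # pointer-sized words, debugger matches the target
    elif word_size == 4:
        command = "ddS"   # 32-bit words
    elif word_size == 8:
        command = "dqS"   # 64-bit words
    else:
        raise ValueError("unexpected StackTraceWordSize: %r" % word_size)
    # Assumption: the debugger's default radix (16) is in effect.
    return "{} 0x{:x} L{:x}".format(command, stack_trace, words)


print(stack_dump_command(0x0619A368, 3, 4))
# dpS 0x619a368 L3
```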


    Unicode String Pointer

    If the ExceptionForm member has a value of 0y10, the structure’s union represents an error message.


    The call stack is (hopefully) contained within the Unicode string pointed at by the ErrorText member. As the text is defined by the caller, the existence of a call stack text isn’t guaranteed.

    0:003> dt -a1 13f117e0 combase!PSTOWED_EXCEPTION_INFORMATION_V1

    [0] @ 13f117e0



       +0x000 Header           : _STOWED_EXCEPTION_INFORMATION_HEADER

       +0x008 ResultCode       : 8000ffff

       +0x00c ExceptionForm    : 0y10

       +0x00c ThreadId         : 0y000000000000000000010101110100 (0x574)

       +0x010 ExceptionAddress : 0x0de38f7c Void

       +0x014 StackTraceWordSize : 0

       +0x018 StackTraceWords  : 0

       +0x01c StackTrace       : (null)

       +0x010 ErrorText        : 0x0de38f7c  "System.Exception..   at Windows.UI.Xaml.VisualStateManager.GoToState(Control control, String stateName, Boolean useTransitions)..   at MyBadApp.Common.LayoutAwarePage.InvalidateVisualState()..   at MyBadApp.Common.LayoutAwarePage.WindowSizeChanged(Object sender, WindowSizeChangedEventArgs e)"


    Note - These records aren’t used with v2 language exceptions (or if they are, they are extremely rare based on the Windows Error Reporting telemetry).


    Nested Exceptions

    The new fields in the v2 structure are the NestedExceptionType and NestedException members. The NestedExceptionType member is one of the following values. Much like the Signature field, you can use .formats <value> to see the characters each code represents. The possible values and their associated meaning are:

    • W32E – Win32 Exception – points to an EXCEPTION_RECORD structure
    • STOW – Stowed Exception – points to a STOWED_EXCEPTION_INFORMATION_* structure
    • CLR1 – CLR Object – points (directly) to a CLR Object
    • LEO1 – Language Exception Object – points indirectly to a CLR Exception object


    LEO1 is the only style being generated by Windows Error Reporting for CLR Exceptions raised in Windows Store Applications.


    Looking at the example dump file we have been using, it can be seen that the first Stowed Exception has values for the NestedException and NestedExceptionType fields, and they are NULL in the second. Using .formats tells us that the NestedExceptionType member is of type “LEO1”. Note that this is displayed backwards in the output below, in accordance with little-endian order of Intel memory layout.

    0:003> dt -a2 00c6d3d0 combase!PSTOWED_EXCEPTION_INFORMATION_V2

    [0] @ 00c6d3d0




       +0x020 NestedExceptionType : 0x314f454c

       +0x024 NestedException  : 0x063a95d4 Void


    0:003> .formats 0x314f454c

    Evaluate expression:

      Hex:     314f454c

      Decimal: 827278668

      Octal:   06123642514

      Binary:  00110001 01001111 01000101 01001100

      Chars:   1OEL

      Time:    Tue Mar 19 16:37:48 1996

      Float:   low 3.01619e-009 high 0

      Double:  4.0873e-315
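
    The reversed "1OEL" text is purely a byte order effect. As a small illustration (plain Python, nothing debugger specific), packing the same 32-bit value both ways shows the order printed by .formats and the order in which the bytes sit in little-endian memory:

```python
import struct

value = 0x314F454C  # NestedExceptionType from the dt output above

# .formats prints the characters from the most significant byte down.
as_formats_shows = struct.pack(">I", value).decode("ascii")

# In little-endian memory the least significant byte comes first,
# which is the order the four-character code was written in.
as_stored_in_memory = struct.pack("<I", value).decode("ascii")

print(as_formats_shows)     # 1OEL
print(as_stored_in_memory)  # LEO1
```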


    Passing the address to !sos.dumpccw provides the CLR Exception object’s address.

    0:003> !sos.dumpccw 0x063a95d4

    CCW:               0499f880

    Managed object:    02517288

    Outer IUnknown:    00000000

    Ref count:         1


    RefCounted Handle: 00a31478 (STRONG)

    COM interface pointers:

          IP       MT Type


    The address can be used with !sos.pe to display the CLR Exception object. The call stack that the failure investigation should focus on is in this output.

    0:003> !sos.pe 02517288

    Exception object: 02517288

    Exception type:   System.NotImplementedException

    Message:          The method or operation is not implemented.

    InnerException:   <none>

    StackTrace (generated):

        SP       IP       Function

        04F2E38C 00B81382 CrashStore!CrashStore.MainPage.Load_Click_1(System.Object, Windows.UI.Xaml.RoutedEventArgs)+0x62


    StackTraceString: <none>

    HResult: 80004001


    There you have it. This is the CLR Exception that you need to find to start your code analysis or to point you in the right direction when beginning tracing.


    But what if SOS is not available?

    What do you do if SOS isn’t available? You can check whether it is loaded by running the .chain command, and you can check whether it is functional by running the !sos.dumpccw command (without a parameter).


    Firstly, make sure you are using the same bitness of the debugger as the bitness of the target.


    If the dump says “x86” or “ARM (Thumb2)” in the version command or the initial debug spew, use the 32-bit debugger.

    Windows 8 Version 9600 MP (4 procs) Free x86 compatible


    If the dump says “x64” in the version command or the initial debug spew, use the 64-bit debugger.

    Windows 8 Version 9200 MP (4 procs) Free x64


    If you still don’t have SOS loaded (or working) after matching the bitness, or you get one of the following errors, you’ll have to debug the dump on a system with the same version of the CLR installed. Some CLR versions weren’t indexed and this causes the automatic download of sos.dll and mscordacwks.dll to fail.

    0:003> !sos.dumpccw

    Failed to load data access DLL, 0x80004005

    Verify that 1) you have a recent build of the debugger (6.2.14 or newer)

                2) the file mscordacwks.dll that matches your version of clr.dll is

                    in the version directory or on the symbol path

                3) or, if you are debugging a dump file, verify that the file

                    mscordacwks_<arch>_<arch>_<version>.dll is on your symbol path.

                4) you are debugging on supported cross platform architecture as

                    the dump file. For example, an ARM dump file must be debugged

                    on an X86 or an ARM machine; an AMD64 dump file must be

                    debugged on an AMD64 machine.


    You can also run the debugger command .cordll to control the debugger's

    load of mscordacwks.dll.  .cordll -ve -u -l will do a verbose reload.

    If that succeeds, the SOS command should work on retry.


    If you are debugging a minidump, you need to make sure that your executable

    path is pointing to clr.dll as well.


    0:003> .cordll -ve -u -l

    CLRDLL: C:\Windows\Microsoft.NET\Framework\v4.0.30319\mscordacwks.dll:4.0.30319.18444 f:8

    doesn't match desired version 4.0.30319.34011 f:8

    CLRDLL: Unable to find mscordacwks_x86_x86_4.0.30319.34011.dll by mscorwks search

    CLRDLL: Unable to find 'mscordacwks_x86_x86_4.0.30319.34011.dll' on the path

    CLRDLL: Unable to get version info for 'c:\my\sym\cl\clr.dll\52968A96698000\mscordacwks_x86_x86_4.0.30319.34011.dll', Win32 error 0n87

    Cannot Automatically load SOS

    CLRDLL: ERROR: Unable to load DLL mscordacwks_x86_x86_4.0.30319.34011.dll, Win32 error 0n87

    CLR DLL status: ERROR: Unable to load DLL mscordacwks_x86_x86_4.0.30319.34011.dll, Win32 error 0n87


    0:003> .chain

    Extension DLL search Path:


    Extension DLL chain:

        C:\Windows\Microsoft.NET\Framework\v4.0.30319\sos: image 4.0.30319.18444, API 1.0.0, built Wed Oct 30 14:40:34 2013

            [path: C:\Windows\Microsoft.NET\Framework\v4.0.30319\sos.dll]

        pde.dll: image 9, 4, 0, 0, API 9.4.0, built Thu May 08 20:03:58 2014

            [path: c:\debuggers_x86\winext\pde.dll]

        dbghelp: image 6.3.9600.16384, API 6.3.6, built Wed Aug 21 20:59:03 2013

            [path: c:\debuggers_x86\dbghelp.dll]

        ext: image 6.3.9600.16384, API 1.0.0, built Wed Aug 21 21:11:11 2013

            [path: c:\debuggers_x86\winext\ext.dll]

        exts: image 6.3.9600.16384, API 1.0.0, built Wed Aug 21 21:04:14 2013

            [path: c:\debuggers_x86\WINXP\exts.dll]

        uext: image 6.3.9600.16384, API 1.0.0, built Wed Aug 21 21:04:09 2013

            [path: c:\debuggers_x86\winext\uext.dll]

        ntsdexts: image 6.3.9600.16384, API 1.0.0, built Wed Aug 21 21:04:34 2013

            [path: c:\debuggers_x86\WINXP\ntsdexts.dll]



    As discussed in the previous article, the asynchronous and projected nature of Windows Store applications makes them significantly harder to debug than desktop applications. Stowed Exceptions v2 helps definitively determine the error code and call stack of the exception that caused the crash.


    Solutions to some of the more common issues have been discussed on episodes of Channel 9 Defrag Tools, in the Avoiding Windows Store App Failures talk at //build/ 2014, and in the Hardcore Debugging talk at TechEd 2014.


    If you have any questions, please feel free to email us at DefragTools@microsoft.com. We’ll be happy to help you.

  • Ntdebugging Blog

    Understanding ARM Assembly Part 2


    My name is Marion Cole, and I am a Sr. Escalation Engineer in Microsoft Platforms Serviceability group.  This is Part 2 of my series of articles about ARM assembly.  In part 1 we talked about the processor that is supported.  Here we are going to talk about how Windows utilizes that ARM processor.


    As we discussed in part 1, Windows runs on ARMv7-A with NEON.  We also discussed the CPSR register in part 1.  There are a few bits that are important in the CPSR.  The first one is the Endian State bit:

    Bit 9 (the E bit) indicates the EndianState.  This bit should always be a 0 because Windows only runs in Little-Endian state.  So if you get a dump, and see the CPSR bit 9 is set then you have a problem.  Here is an example from the debugger:

    1: kd> r

    r0=00000001  r1=00000001  r2=00000000  r3=00000000  r4=e1074044  r5=c555b580

    r6=00000001  r7=e104ca39  r8=00000001  r9=00000000 r10=e9bf06c7 r11=d5f1ea08

    r12=e16b213c  sp=d5f1e9b0  lr=e0f0fe2f  pc=e0fdebd0 psr=00000133 ----- Thumb


    e0fdebd0 defe     __debugbreak


    1: kd> .formats 00000133

    Evaluate expression:

      Hex:     00000133

      Decimal: 307

      Octal:   00000000463

      Binary:  00000000 00000000 00000001 00110011  <- Bit 9 is 0.  Note that the first (rightmost) bit is Bit 0.

      Chars:   ...3

      Time:    Wed Dec 31 18:05:07 1969

      Float:   low 4.30199e-043 high 0

      Double:  1.51678e-321


    So how could Bit 9 ever be a 1?  The SETEND instruction in the ARM ISA allows even user mode code to change the current endianness.  Doing so is dangerous for an application and is discouraged.  If an exception is generated while in big-endian mode the behavior is unpredictable, but it may lead to an application fault (user mode) or bugcheck (kernel mode).


    The next bit we are going to discuss is bit 5, the Thumb bit (the T bit).  This should be a 1 if executing Thumb instructions.  So let’s discuss the different instruction sets the ARM processor has.


    ARMv7 has five different ISAs for programming.

    • ARM - basic ARM instruction set including conditional execution.
    • Thumb - This mode uses a 16 bit instruction encoding to reduce code footprint.  It has limitations with respect to register access and some system instructions aren't implemented for Thumb.
    • Thumb2 - This extension of the Thumb instruction set adds 32 bit opcode encodings and adds enough facilities to author an entire OS.  Support for Thumb2 is guaranteed in the ARMv7 architecture revision.
    • Jazelle - Java code interpretation.
    • ThumbEE - a limited version of Thumb2 intended as a code generation target for JIT scenarios.


    Windows requires Thumb2 support.  The advantage of using Thumb2 is that the combination of 16 and 32 bit opcodes along with some other ISA improvements allows for saving 20-30% code footprint at a 1-2% performance loss.  In addition the cache hit rate is improved due to increased density of the code.


    CPSR Bit 5 should always be 1 as Windows only runs in Thumb2 mode.  Also note that this bit is combined with bit 24, the Java state bit (the J bit).  Bit 24 should always be 0 when running Windows.


    The next bits to discuss are the CPU Mode bits 4-0 (M).  Windows only runs in two modes: User Mode (10000) and Supervisor Mode (10011).  If bits 4-0 hold any other value, an exception will be raised.  The kernel runs in Supervisor Mode, and applications run in User Mode.
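
    The bit positions discussed so far can be checked with a few shifts and masks. The following Python sketch (the function name decode_cpsr is ours, purely for illustration) pulls the E, T, J and mode fields out of the psr value from the register dump above:

```python
# Only the two modes Windows uses, as described above.
MODE_NAMES = {0b10000: "User", 0b10011: "Supervisor"}


def decode_cpsr(psr):
    """Extract the CPSR fields discussed in this article."""
    mode = psr & 0x1F  # bits 4-0
    return {
        "E": (psr >> 9) & 1,   # bit 9, endian state
        "T": (psr >> 5) & 1,   # bit 5, Thumb state
        "J": (psr >> 24) & 1,  # bit 24, Java state
        "mode": MODE_NAMES.get(mode, "unexpected (0b{:05b})".format(mode)),
    }


print(decode_cpsr(0x133))
# {'E': 0, 'T': 1, 'J': 0, 'mode': 'Supervisor'}
```

    For the kernel dump shown earlier this gives exactly what we expect: little-endian, Thumb, no Java state, and Supervisor mode.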


    That brings up another point.  How does the processor switch between Supervisor Mode and User Mode?  In the x86 processor this was done via SYSENTER/SYSEXIT.  In the x64 processor this was done via SYSCALL/SYSRET.  In ARM this is done via the SVC (Supervisor Call) instruction.  This call is made to have the kernel provide a service.  When invoked in ntdll.dll the service number is in r12.  Here is an example:

    1: kd> u ntdll!ZwQueryVolumeInformationFile

    771e8674    f04f0c8d    mov   r12,#0x8D
    771e8678    df01        svc   #1
    771e867a    4770        bx    lr


    When SVC is called the previous CPSR register is saved in the SPSR register (the Saved Program Status Register), and the pc register is saved in the lr register (the Link Register).  The processor then changes to kernel mode (0x13) with interrupts disabled.  The lr and SPSR values are used to generate a return from the SVC call.  When an exception is taken the stack is untouched, the previous mode's SP and LR are left alone, the new mode's SP becomes active, the exception address is stored in the new mode's LR, and the previous CPSR is copied into the new mode's SPSR.  When returning from the exception the SPSR is copied back into the CPSR, and execution returns to the address in LR.


    Data Types

    ARMv7 processors support four data types from 8 bits to 64 bits, but the definitions are different than the ones in Windows.  In Windows a word is defined as 16 bits; on ARM a word is 32 bits.

    • Byte – 8 bits
    • Halfword – 16 bits
    • Word – 32 bits
    • Doubleword – 64 bits


    These can be signed or unsigned.

    • Unsigned 32 bit integer
    • Signed 32 bit integer
    • Unsigned 16 bit integer (zero extended)
    • Signed 16 bit integer (sign extended)
    • Unsigned 8 bit integer (zero extended)
    • Signed 8 bit integer (sign extended)
    • Two 16 bit integers
    • Four 8 bit integers
    • The upper or lower 32 bits of a 64 bit signed value whose other half is in another register
    • The upper or lower 32 bits of a 64 bit unsigned value whose other half is in another register
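
    The "(zero extended)" and "(sign extended)" notes in the list above describe how a value narrower than 32 bits fills a 32-bit register. As a rough illustration, here is a Python sketch of both operations; the function names are ours and are not ARM mnemonics.

```python
def zero_extend(value, bits):
    """Keep the low 'bits' bits; the upper register bits become 0."""
    return value & ((1 << bits) - 1)


def sign_extend(value, bits):
    """Copy the top bit of the narrow value into the upper register bits."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    # Result is shown as the raw 32-bit register contents.
    return ((value ^ sign_bit) - sign_bit) & 0xFFFFFFFF


print(hex(zero_extend(0x80, 8)))     # 0x80
print(hex(sign_extend(0x80, 8)))     # 0xffffff80
print(hex(sign_extend(0xFFFE, 16)))  # 0xfffffffe
```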


    Memory Model

    The ARM memory model is much like other architectures that we have supported.  ARM has a "weak ordering" memory model.  This means that two memory operations that occur in program order may be observed from another processor or DMA controller in any order.  When an instruction stalls because it is waiting for the result of a preceding instruction, the core can continue executing subsequent instructions that do not need to wait for the unmet dependencies.  There are three instructions that allow you to configure memory barriers:

    • ISB - Instruction Synchronization Barrier
    • DMB - Data Memory Barrier
    • DSB - Data Synchronization Barrier


    An excellent blog article on this topic with an explanation of these three instructions is available at:



    Alignment and Atomicity

    Windows enables the ARM hardware to handle misaligned integer accesses transparently; however, there are still several situations where alignment faults may be generated on misaligned accesses. Follow the rules below:

    • Halfword and word-sized integer loads and stores do NOT need to be aligned (hardware will handle them efficiently and transparently)
    • Floating-point loads and stores SHOULD be aligned (the kernel will handle them transparently, but with significant overhead)
    • Load/store double (LDRD/STRD) and multiple (LDM/STM) operations SHOULD be aligned (the kernel will handle most of them transparently, but with significant overhead)
    • All uncached memory accesses MUST be aligned, even for integer accesses (you will get an alignment fault)


    Note that the memcpy() implementation provided by the Windows CRT presumes the copies are to/from cached memory, and thus leverages the hardware’s support for transparently handling misaligned integer reads and writes with little penalty. This means that memcpy() CANNOT be used when the source or destination is uncached memory. Instead, use the separate function _memcpy_strict_align(), which only performs aligned accesses.
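
    The alignment rules above amount to a small decision table. The following Python sketch simply restates them; the function name and the three access categories are labels we made up for illustration, and the function only describes what happens to a misaligned access.

```python
def misaligned_access_outcome(kind, cached):
    """Outcome of a MISALIGNED access, following the four rules above.

    kind:   "integer"  (halfword or word load/store),
            "float"    (floating-point load/store), or
            "multiple" (LDRD/STRD/LDM/STM)
    cached: False when the memory being accessed is uncached
    """
    if not cached:
        # Uncached accesses must always be aligned, even integer ones.
        return "alignment fault"
    outcomes = {
        "integer": "handled by hardware",
        "float": "handled by kernel (significant overhead)",
        "multiple": "usually handled by kernel (significant overhead)",
    }
    return outcomes[kind]


print(misaligned_access_outcome("integer", cached=True))   # handled by hardware
print(misaligned_access_outcome("integer", cached=False))  # alignment fault
```

    The second call is the memcpy() case described above: an integer copy that is harmless on cached memory faults on uncached memory.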


    There are two types of atomicity supported.  Single-copy and Multi-copy.


    Single-copy atomicity

    There are rules around atomicity that are intended to specify the cases where memory access behavior in relation to program order can be guaranteed.  Certain accesses (aligned word accesses) are guaranteed by the architecture to return sensible results even if other threads are accessing the same memory.  These rules are necessary in order to guarantee that the programmer (and compiler) can rely on correct behavior of memory in the majority of cases.


    Multi-copy atomicity

    These rules are similar, but relate specifically to multi-processing environments in which several observers may be using a particular item in memory.  To be able to guarantee correct behavior you need to be able to assume that memory behaves in a consistent way.


    More on Single-Copy and Multi-Copy atomicity in the ARM Architecture Reference Manual available from http://infocenter.arm.com/help/index.jsp.


    Common Assembly Instructions

    We are going to cover some common Thumb2 instructions.

    • ldr           r0, [r4]                  (ldrex, ldrh, ldrb, ldrd, ldrexd, etc.)

      This is the Load Register instruction.  In the above example r0 is the destination register, and r4 is the base register.  This will take the address that is in r4, go to that memory location and copy the contents of that memory location into r0.

    • str           r2, [r4, #0x08]                    (strex, strh, strexh, strd, etc.)

      This is the Store Register instruction.  In the above example r2 is the source register, and r4 is the base register.  This will take the address in r4 and add 8 to that address.  It will take the value that is in r2, and store it at the address pointed to by r4 plus 8.

    • mov       r1, r4                                      (movs – sets the condition codes)

      This is the Move instruction.  In the above example r1 is the destination register, and r4 is the source register.  It will do the same thing as x86 in that it just copies what is in r4 to r1.  It can optionally update the condition flags based on the value.

    • adds      r1, r5, #0                              (add)

      This is the Add instruction.  In the above example r1 is the destination register.  This will take the value that is in r5 and add 0 to it.  It will store the result in r1.  Because this has an (s) at the end of add it will update the flags.

    • sub         sp, sp, #0x14                      (subs)

      This is the Subtract instruction.  In the above example sp is the destination.  This will take the value that is in sp, subtract 14h from it, and store the result in sp. Because this does not have an (s) at the end it will not update the flags.

    • push      {r4-r9, r11, lr}

      This is the Push instruction.  It can push multiple registers to the stack in one instruction.  You can specify a contiguous range of registers by separating the first and last register with "-", as seen above.  You can also list registers individually, separated by ",".  As on an x86 processor, the stack pointer is decremented by 4 for each register pushed.

    • pop        {r4-r9, r11, lr}

      This is the Pop instruction.  It pulls values from the stack back into the registers you list.  The register list works just like the one for the push instruction.  As on an x86 processor, the stack pointer is incremented by 4 for each register popped.

    • b??         |MyApp!main+0x60 (00b81348)|

      This is the Branch instruction.  This is equivalent to the jmp instruction in x86.  However it has several conditional variants such as beq, bge, etc.

    • bx           r3

      This is the Branch and Exchange instruction.  This causes a branch to an address and instruction set specified by a register (r3 here).  This can do a long branch anywhere in the 32-bit address range.

    • bl            |MyApp!Function (00b815c4)|

      This is the Branch with Link instruction.  This calls a subroutine at a PC-relative address.  This will update the lr register.

    • blx          r3

      This is the Branch with Link and Exchange.  This calls a subroutine at an address and instruction set specified by a register (r3 here).  This will do a long branch anywhere in the 32-bit address range, and update the lr register.

    • dmb      

      This is the Data Memory Barrier instruction.  It is a memory barrier that ensures the ordering of observations of memory accesses.

    • cmp       r3, #0

      This is the Compare instruction.  It will subtract 0 from the value in r3, and set the flags accordingly. 
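
    To make the push and pop entries above concrete, here is a small Python model of the stack pointer arithmetic for push {r4-r9, r11, lr} and the matching pop. The dictionaries standing in for registers and memory, the starting sp value, and the dummy register contents are all assumptions for illustration. The layout with the lowest-numbered register at the lowest address is how ARM stores a register list.

```python
REGISTER_LIST = ["r4", "r5", "r6", "r7", "r8", "r9", "r11", "lr"]  # {r4-r9, r11, lr}


def push(regs, memory, names):
    """Model of push: sp drops by 4 per register, then the registers are stored."""
    regs["sp"] -= 4 * len(names)
    for i, name in enumerate(names):
        # Lowest-numbered register ends up at the lowest address.
        memory[regs["sp"] + 4 * i] = regs[name]


def pop(regs, memory, names):
    """Model of pop: registers are reloaded, then sp rises by 4 per register."""
    for i, name in enumerate(names):
        regs[name] = memory[regs["sp"] + 4 * i]
    regs["sp"] += 4 * len(names)


# Dummy register contents (0..7) and an arbitrary starting stack pointer.
regs = {name: index for index, name in enumerate(REGISTER_LIST)}
regs["sp"] = 0x1000
memory = {}

push(regs, memory, REGISTER_LIST)
print(hex(regs["sp"]))  # 0xfe0  (8 registers * 4 bytes = 0x20 lower)
pop(regs, memory, REGISTER_LIST)
print(hex(regs["sp"]))  # 0x1000
```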


    In ARM addressing the base register points to the memory being referenced.  The offset can be an immediate or an index register.  The memory stored at the base register's address plus the offset is accessed.  The base register remains unchanged.  Example:

    ldr r5,[r9,#0x1c]


    This will take the value that is in r9 and add 0x1C to it, go to that memory location, retrieve the value there, and store it in r5.  r9 will keep the same value.


    ARM also has some interesting features around indexing.  It has Pre-Indexed addressing, Offset addressing, and Post-Indexed addressing.


    Pre-Indexed addressing: the value of the base register is first modified by the offset, then the memory pointed to by the modified base register is accessed.  Example:

    str r2,[r4,#0x4]!


    The "!" at the end of the instruction is not a mistake.  This is how you tell it is a Pre-Indexed address. 


    Offset addressing: the offset is added to the base register value, and the result is used as the address for the memory access.  The base register itself is not modified.  If the "!" was not there then the previous example would just be Offset addressing.  Example:

    str r2,[r4,#0x4]


    Post-Indexed addressing: the memory address in the base register is accessed, and afterwards the base register is modified by the offset value.  Example:

    ldr pc,[sp],#0x1c


    Notice the "!" is missing here.  Also notice the offset is outside the "[ ]".  That is how you can find a Post-Index.
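
    The difference between the three addressing forms is easiest to see by watching the base register. Here is a minimal Python model of the three examples above; the dictionaries standing in for registers and memory, and all of the values in them, are made up for illustration.

```python
def ldr_offset(regs, memory, dst, base, offset):
    """ldr dst,[base,#offset] : access base+offset, base is unchanged."""
    regs[dst] = memory[regs[base] + offset]


def str_pre_indexed(regs, memory, src, base, offset):
    """str src,[base,#offset]! : update base first, then store at the new base."""
    regs[base] += offset
    memory[regs[base]] = regs[src]


def ldr_post_indexed(regs, memory, dst, base, offset):
    """ldr dst,[base],#offset : load from base, then update base."""
    regs[dst] = memory[regs[base]]
    regs[base] += offset


# Made-up register and memory contents.
regs = {"r2": 0xAB, "r4": 0x100, "r5": 0, "r9": 0x300, "sp": 0x200, "pc": 0}
memory = {0x31C: 0x55, 0x200: 0x1234}

ldr_offset(regs, memory, "r5", "r9", 0x1C)
print(hex(regs["r5"]), hex(regs["r9"]))      # 0x55 0x300   (r9 unchanged)

str_pre_indexed(regs, memory, "r2", "r4", 0x4)
print(hex(memory[0x104]), hex(regs["r4"]))   # 0xab 0x104   (r4 updated before the store)

ldr_post_indexed(regs, memory, "pc", "sp", 0x1C)
print(hex(regs["pc"]), hex(regs["sp"]))      # 0x1234 0x21c (sp updated after the load)
```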


    Part 3 of this series will cover Calling Conventions, Prolog/Epilog, and Rebuilding the stack.
