• Ntdebugging Blog

    MmCm – A Non Paged Pool Accounting Adventure

    • 1 Comments

    Here's one from the Rube Goldberg debug collection!

    The dripping sarcasm is because I'm about to show the reeeeally long way to figure out what's eating MmCm, skip down to the end if you are in a time crunch.

     

    Otherwise, do resist the temptation to skip ahead as some of the techniques can be used in other debugs.

     

    -Tate

     

     

    What is this Pool Tag?

     

    MmCm is a Non Paged Pool (NPP) tag allocated via a kernel mode call (a driver or the OS itself…applications cannot allocate this memory directly) to either MmAllocateContiguousMemory  or MmAllocateContiguousMemorySpecifyCache.

     

    Great comments from msdn…

     

    MmAllocateContiguousMemory can be called to allocate a contiguous block of physical memory for a long-term internal buffer, usually from the DriverEntry routine…

     

    A device driver that must use contiguous memory should allocate only what it needs during driver initialization because physical memory is likely to become fragmented as the operating system runs.

     

    Warning  If you use the MmAllocateContiguousMemory routine on computers with large amounts of memory, the operating system's performance might severely degrade when the system tries to create a contiguous chunk of memory. This degradation is reduced for Windows Server 2008. Memory that MmAllocateContiguousMemory allocates is uninitialized...

     

     

    First, how much is normal?

     

    As the comments indicate it is contingent mostly on the driver's needs, done at initialization time.  We have seen consumption in the single MB range up to tens of MB consumed depending on what drivers are loaded.

     

     

    What drivers are the most likely consumers of this memory and why would the amount vary?

     

    The most likely consumers on any machine would be the following types of drivers. Note that each will likely allocate a multiple of some amount per instance for each physical adapter exposed to Windows. (i.e. A single NIC driver used for 3 NICs in a system will likely allocate 3 times the MmCm vs. a single NIC.)

     

    Storage Adapters for your SAN...SCSI Controller, Fibre Channel Controller, etc.

    Video Adapters

    Network Adapters

     

    These drivers need to pre-allocate memory to service transfers and as mentioned need to do this right at boot time since memory can become fragmented later.  In other words, this allocation pattern attempts to "bake-in" a known set of scratch space if you will for the adapter to function properly given the features you wish to use.

     

     

    So what's the problem?

     

    Not everyone plays nice and the machine can run out of memory!  This is downright painful on x86 Exchange servers and their limited NPP ceiling.

     

    Recall from an earlier post and your likely experience, that running out of such memory can cause hangs or general system instability as the machine runs out of NPP.  The scary part is that this usually happens unexpectedly under incrementally higher than normal user demand. i.e when you most need the server it fails.  Not coincidently most often we see older  x86 machines, especially /3GB configured Exchange servers, have this memory consumption problem. I often wonder just how many machines are out there just waiting for a liiiiitle more user load and are hovering at the brink of NPP exhaustion…

     

    <Ok, queue the sad music for the heart breaking x86 Exchange server example>

     

    This server has the standard 128MB maximum for NPP because of /3GB

     

    0: kd> !vm

     

    *** Virtual Memory Usage ***

    Physical Memory:      851418 (   3405672 Kb)

    Page File: \??\C:\pagefile.sys

      Current:   4193280 Kb  Free Space:   4038364 Kb

      Minimum:   4193280 Kb  Maximum:      4193280 Kb

    Available Pages:      639032 (   2556128 Kb)

    ResAvail Pages:       796753 (   3187012 Kb)

    Locked IO Pages:         272 (      1088 Kb)

    Free System PTEs:      27083 (    108332 Kb)

    Free NP PTEs:           5778 (     23112 Kb)

    Free Special NP:           0 (         0 Kb)

    Modified Pages:         4138 (     16552 Kb)

    Modified PF Pages:      4116 (     16464 Kb)

    NonPagedPool Usage:    25151 (    100604 Kb)

    NonPagedPool Max:      32026 (    128104 Kb)

    PagedPool 0 Usage:     11497 (     45988 Kb)

    PagedPool 1 Usage:      1645 (      6580 Kb)

    PagedPool 2 Usage:      1667 (      6668 Kb)

    PagedPool 3 Usage:      1662 (      6648 Kb)

    PagedPool 4 Usage:      1679 (      6716 Kb)

    PagedPool Usage:       18150 (     72600 Kb)

    PagedPool Maximum:     63488 (    253952 Kb)

    Session Commit:          586 (      2344 Kb)

    Shared Commit:          4720 (     18880 Kb)

    Special Pool:              0 (         0 Kb)

    Shared Process:         7517 (     30068 Kb)

    PagedPool Commit:      18214 (     72856 Kb)

    Driver Commit:          8779 (     35116 Kb)

    Committed pages:      237424 (    949696 Kb)

    Commit limit:        1867524 (   7470096 Kb)

     

    Given 100Megs used, 28MB of that is MmCm

     

    0: kd> !poolused /t 10 2

       Sorting by  NonPaged Pool Consumed

     

      Pool Used:

                NonPaged            Paged

    Tag    Allocs     Used    Allocs     Used

    MmCm     3210 28779488         0        0        Calls made to MmAllocateContiguousMemory , Binary: nt!mm

    NDpp     1013  4076960         0        0        packet pool , Binary: ndis.sys

    MPIO   121985  3457752         0        0        UNKNOWN pooltag 'MPIO', please update pooltag.txt

    elxs        6  3299344         0        0        UNKNOWN pooltag 'elxs', please update pooltag.txt

    BCM0       24  3057232         0        0        UNKNOWN pooltag 'BCM0', please update pooltag.txt

    LSwi        1  2654208         0        0        initial work context

     RaME        3  2572288         0        0        RiAllocateMiniportDeviceExtension , Binary: storport.sys

    ElxA        5  2360208         3      160        UNKNOWN pooltag 'ElxA', please update pooltag.txt

    Io        223  2319712       123     5136        general IO allocations , Binary: nt!io

    TPLA      512  2097152         0        0        UNKNOWN pooltag 'TPLA', please update pooltag.txt

    TCPt       40  1662448         0        0        TCP/IP network protocol , Binary: TCP

    Mdl      7225  1435160         0        0        Io, Mdls

     Pool        3  1134592         0        0        Pool tables, etc.

     Devi      602  1124400         0        0        Device objects

     RcpI        1  1048576         0        0        Internal memory mgr initial heap block , Binary: sacdrv.sys

    brcm       38  1032528         0        0        UNKNOWN pooltag 'brcm', please update pooltag.txt

    TOTAL      219941 75555240     70157 56589336

     

    That's not a lot right?  Well, it may be more than you like.  Depending on load cycles on this server it could be too much, you'd have to know a bit more history of the maximum amount of NPP demanded at peak usage, etc. to know how close to the edge this server may be.  Let's go with "what's using the ~28MB?" as the question to answer...

     

     

    Who uses this memory?

     

    Here's where it gets interesting.  A useful heuristic here is to group the sizes of the allocations in Excel.

    You get this data from a !poolfind MmCm

     

    0: kd> !poolfind MmCm

     

    Scanning large pool allocation table for Tag: MmCm (f9a67000 : f9b67000)

     

    *f7fc7000 :free large page allocation, Tag was MmCm, size was 0x2000 bytes

    *f8298000 :free large page allocation, Tag was MmCm, size was 0x2000 bytes

    *faeae000 :large page allocation, Tag  is MmCm, size  is 0x4000 bytes

    *faeb2000 :large page allocation, Tag  is MmCm, size  is 0xd000 bytes

    *fae84000 :large page allocation, Tag  is MmCm, size  is 0x11000 bytes

    *fae95000 :large page allocation, Tag  is MmCm, size  is 0x11000 bytes

    *fae23000 :large page allocation, Tag  is MmCm, size  is 0x1000 bytes

    *fae24000 :large page allocation, Tag  is MmCm, size  is 0x3000 bytes

    *fae2d000 :large page allocation, Tag  is MmCm, size  is 0x1000 bytes

    .

    .

    .

    fdf2e000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf2f000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf30000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf31000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf32000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf33000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf34000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf35000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf36000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf37000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf38000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf39000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf3a000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf3b000 size:  f18 previous size:    0  (Allocated) MmCm

    fdf3c000 size:  f18 previous size:    0  (Allocated) MmCm

     

    Searching NonPaged pool (fe000000 : ffb7e000) for Tag: MmCm

     

     

    When I have to do this, usually some variation of following works.

     

    .logopen

    !poolfind

    .logclose

     

    Open the txt file up in notepad, clean it up a bit (remove the frees and extra lines)

    Open the txt in Excel and use the old standby Text Import Wizard via file open

    I usually also save the large pool allocs to one txt file and the regular to another and then cut paste to combine them in Excel.

     

    There's probably an easier way to do this import, but this works well enough.

    image

    You can really go crazy getting the data all pretty but don't bother as even something as ugly as this is useful because a quick sort on Column B shows the allocation size pattern.  Note the Text Import Wizard allows you to skip columns here, so you can import only the address column and the size column.  i.e. faeae000’s column and the 0x4000 column in the above.

     

    image

    Via Excel I just generated another table on the fly to keep track of our totals here.  Note, I’m just counting the number of each size of alloc by visual inspection.  i.e. There are 9 allocations of 0x1000 size in my list.

     

    Allocation Sizes in Bytes (h)  

    Number of Allocations

    Total MmCm in Bytes

    0xf18 bytes 

    2014

    7,782,096

    0x1000 bytes

    9

    36,864

    0x2000 bytes

    1017

    8,331,264

    0xf000 bytes

    23

    1,413,120

    0x11000 bytes

    92

    6,406,144

    0x156000 bytes

    2

    2,801,664

     

    Ahhh…..In total we've got 26 of 28 Megs sampled here, well representing the MmCm usage minus some insignificant allocs and rounding, great.

     

    (I love the new Win7 calc.exe)

    clip_image003

    So how does this help me again?  Here's the crucial step.  Look at samples of these allocations (which is why I kept the address of the alloc in the Excel import).  There are very often telltale tags or strings inside the alloc that give you a strong indication if not proof positive who allocated this memory.

     

    Here's a common allocation example:

     

    Dump several samples of the f18 sized allocations…

     

    0: kd> dc fdf18000 fdf18000+f18

    fdf18000  0be30000 6d436d4d fd1df008 00000000  ....MmCm........<--the pooltag, okay…knew that...

    fdf18010  1f2e3d4c 00000000 00000000 00000000  L=..............<-------hum, this 1f2e3d4c seems to be in every one of these puppies...

    fdf18020  00000000 00000000 00000000 00000000  ................

    fdf18030  00000000 00000000 00000000 00000000  ................

    fdf18040  00000000 00000000 00000000 00000000  ................

     

    Remember we learned that these are predominantly allocated on boot?  Turns out that Storport actually allocates 1000 of these(per adapter) based on the extension size of the adapter and guess what it does, it puts a handy little tag in here defined as 1f2e3d4c.    Here's the proof.

     

    storport!RaInitializeRaidResources+0x47:

    f5d6aeb3 8bd7            mov     edx,edi

    f5d6aeb5 8bce            mov     ecx,esi

    f5d6aeb7 ff1510d1d6f5    call    dword ptr [storport!_imp_InterlockedPushEntrySList (f5d6d110)]

    f5d6aebd ff4508          inc     dword ptr [ebp+8]

    f5d6aec0 8b4508          mov     eax,dword ptr [ebp+8]

    f5d6aec3 3b450c          cmp     eax,dword ptr [ebp+0Ch]

    f5d6aec6 c747084c3d2e1f  mov     dword ptr [edi+8],1F2E3D4Ch<------------------ah ha!

    f5d6aecd 72be            jb      storport!RaInitializeRaidResources+0x21 (f5d6ae8d)

     

     

    Cool! So now I've accounted for 7.8 Megs of the 28, and I know at this point that since I've got ~2000 allocs I've got at least two storage adapters on this machine.

    Actually, If you are suspicious about the roughly 1000 0x2000 sized allocations and you think it's another adapter with a larger extension size, you'd be right!

     

    0: kd> dc fcc20000

    fcc20000  fcc22000 00000000 1f2e3d4c 00000000  . ......L=......<-------again!

    fcc20010  f7b7e000 00000000 00000000 00000000  ................

    fcc20020  00000000 00000000 00000000 00000000  ................

    fcc20030  00000000 00000000 00000000 00000000  ................

    fcc20040  00000000 00000000 00000000 00000000  ................

    fcc20050  00000000 00000000 00000000 00000000  ................

    fcc20060  00000000 00000000 00000000 00000000  ................

    fcc20070  00000000 00000000 00000000 00000000  ................

     

     

    Now, I've accounted for likely all my storage adapters (7,782,096 + 8,331,264  =  16,113,360) which are between the three of them consuming 16 of 28 Megs!

    (I say three because of the ~3000 allocations in total, with 1000 per adapter as stated.  So two adapters of size 0xf18 and one of size 0x2000)

    Sure enough checking msinfo32 on this machine shows three adapters.

     

    So what about the rest of the odd ball sized allocations and 12 Megs?  Especially the 92 0x11000 and the two whopping 0x156000 bytes ones?

     

    First , the two whopper 0x156000 byte allocs.  Here's where you are limited usually only by your own creativity and patience...( this particular sample is courtesy of a fellow persistent debugger,  Pushkar)

     

    //Dump the beginning of one…

     

    0: kd> dc fdb24000

    fdb24000  00036c40 fdb24080 fdc43100 08b24080  @l...@...1...@..

    fdb24010  00000000 08c43100 00000000 00000001  .....1..........

    fdb24020  00000000 00000000 00000000 00000000  ................

    fdb24030  00000000 00000000 00000000 00000000  ................

    fdb24040  00000000 00000000 00000000 00000000  ................

    fdb24050  00000000 00000000 00000000 00000000  ................

    fdb24060  00000000 00000000 00000000 00000000  ................

    fdb24070  00000000 00000000 00000000 00000000  ................

     

    On the line below with the highlight we are looking for interesting pointers and !pool'ing them to figure out who's associated with this alloc.

     

    0: kd> dc

    fdb24080  00000003 00000940 08b24100 00000000  ....@....A......

    fdb24090  fdef0080 00000000 08b241d8 00000000  .........A......<-----This is the line

    fdb240a0  00000080 08b25080 00000000 00000080  .....P..........

    fdb240b0  08b35080 00000000 00000080 08b29080  .P..............

    fdb240c0  00000000 00000080 08b37080 00000000  .........p......

    fdb240d0  00000080 08b2d080 00000000 00000080  ................

    fdb240e0  08b39080 00000000 00000080 08b31080  ................

    fdb240f0  00000000 00000080 08b3b080 00000000  ................

     

    //Oh, it happens to be a Device object...

     

    0: kd> !pool fdef0080

    Pool page fdef0080 region is Nonpaged pool

    *fdec1000 : large page allocation, Tag is Dev., size is 0x33000 bytes

    Owning component : Unknown (update pooltag.txt)

     

    //Dump it out via devobj, adding the object offset and poolheader length…this one looks like it's going to belong to my Emulex adapters

     

    0: kd> !devobj fdec1000+38

    Device object (fdec1038) is for:

    ElxPlus*** ERROR: Module load completed but symbols could not be loaded for elxplus.sys

    \Driver\elxplus DriverObject fdef41c0

    Current Irp 00000000 RefCount 1 Type 0000002a Flags 00000048

    DevExt fdec10f0 DevObjExt fdef3ca0

    ExtensionFlags (0000000000) 

    AttachedTo (Lower) fdf68ba0 \Driver\PnpManager

    Device queue is not busy.

     

    On to the 0x11000 sized, first pass is to just dc out the memory like before and look for interesting tags or strings…At the end of Network card adapter driver allocations via their calls into the Ndis layer for instance, you can flag them by the END of the allocation may have an ND** string.  I don't see any of those here but it "looks like" the 0x11000 sized allocations have network related data and strings as a common factor throughout, so I'm guessing these are associated with one of the four network adapters on this machine.  Turns out if you sample more of these they may have pointers back to other interesting pool allocations that can flag the network miniport adapter, etc.

     

    You may be thinking, hum…I get the storport one but these last two sound like mere correlation?  Absolutely.  However, given the lack of randomness sampling multiple allocations it's often strong enough to at least inform your hypothesis and test, quickly.

     

     

    Yes, there is an easier way!

     

    If this memory is allocated at driver initialization and most drivers initialize at system boot time, it stands to reason that most of this memory will be consumed on boot.  Cool!  Then it also stands to reason we can use our friend Poolmon.exe to quickly check how much is being used shortly after boot and do some quick testing by excluding some of the usual suspect adapters.  First, you could just trust me and check for ancillary storage, NIC, and enhanced video adapters and disable/remove these first as a low risk test.

     

    In any case, the testing sequence is:

     

    1.      Promptly record the amount of MmCm after a reboot via Poolmon.exe

    (Sort by Non Paged Pool (toggle through by hitting 'P') and then descending by bytes (same, but 'B')

    2.      Disable the ancillary test adapter(s)

    3.      Reboot, check poolmon.exe again

    4.      What's the decrease?

     

    5.      Perhaps repeat testing with updated drivers or disabled features if you must have the additional hardware in place (hoping they use less MmCm).

     

    I hope this post saved you time by being wary of configuring more adapters than absolutely necessary, especially /3GB x86 Exchange servers and if you have to reactively triage this a quick and dirty way to determine the primary consumers of this memory.

     

    Enjoy!

     

    -Tate

     

    P.S. If you can't scale down the number or features of let's say one of the suspect adapters, do check with your driver vendor to see if there is a way to throttle back the consumption.  For instance, when we shipped the Scalable Networking Pack feature, we often saw higher consumption in the NIC drivers supporting this functionality, later driver releases from hardware vendors reportedly used much less.

     

     

     

     

    Bonus:

     

    Here's another handy debug trick to isolate who can call these two Memory manager functions and narrow down the driver population...

     

    //Find the address of MmAllocateContiguousMemorySpecifyCache

    0: kd> x nt!MmAllocateContiguousMemorySpecifyCache

    e080efac nt!MmAllocateContiguousMemorySpecifyCache = <no type information>

     

    //Use the handy !for_each_module to scan each module for the address (e080efac )

    0: kd> !for_each_module ".echo ${@#ModuleName} ;s -d ${@#ModuleName} L?${@#Size} e080efac"

     

    //Hits will look like this, with the matching address listed…

    ATMFD

    nt

    e0a81044  e080efac e0806d12 e082e237 e082ddc7  .....m..7.......

    hal

    e0a81044  e080efac e0806d12 e082e237 e082ddc7  .....m..7.......

    RDPWD

    .

    .

    .

    ati2mtag

    f5575234  e080efac e083b69c e08454c9 e082f651  .........T..Q...

    f5836084  e080efac e08329f9 e0839c13 e08e2d78  .....)......x-..

    .

    .

    .

    ql2300

    f5d6d174  e080efac e0806d12 e08412f9 e08eaeb1  .....m..........

    f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

    storport

    f5d6d174  e080efac e0806d12 e08412f9 e08eaeb1  .....m..........

    f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

    elxstor

    f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

    SCSIPORT

    f5e170cc  e080efac e08d15a3 e08e8543 e0815d48  ........C...H]..

     

     

    //Example match in the import table for ati2mtag...

    0: kd> dps f5575234

    f5575234  e080efac nt!MmAllocateContiguousMemorySpecifyCache

    f5575238  e083b69c nt!ZwQueryInformationProcess

    f557523c  e08454c9 nt!PsGetCurrentThreadProcessId

     

     

    //Repeat for the other call...

    0: kd> x nt!MmAllocateContiguousMemory

    e080be42 nt!MmAllocateContiguousMemory = <no type information>

     

     

     

     

    Share this post :

     

     

     

     

  • Ntdebugging Blog

    Bcdedit Tips and Tricks For Debugging Part 1

    • 6 Comments

    Hello everyone, my name is Sean Walker, and I am on the Platforms OEM team in Washington.  This article is for those people who have had a hard time switching from the old boot.ini configuration to the new BCD store (myself included). Doing the simple tasks such as enabling kernel debugging over com1 are easy to do with bcdedit.exe or the msconfig GUI, you just enable them and reboot the computer. However, if you need to do something more advanced such as break into the early boot process during resume from hibernation, things get a lot more complicated.

     

    This article has some samples for enabling and disabling debug settings that you may not be familiar with, and a list of bcdedit debug settings for Windows Vista/Server 2008 and Windows 7/Server 2008 R2.  This information has been helpful to me for quickly and accurately getting to the debug at hand rather than fumbling around with bcdedit.  Much of the following information has been taken from various sources, including the windbg help files, the OEM team blog, the MSDN bcdedit reference, and the WHDC debugger site.

     

    NOTE: For the examples below, you will need to run bcdedit.exe from an administrator (UAC-elevated) command prompt.  To output a summary view of the current state of the BCD store, just run "bcdedit.exe" from the command prompt.  To get detailed information about all of the store(s) that Windows knows about, use the following command:

    bcdedit /enum all

     

    What is a BCD store?

    A BCD store is a binary file that contains boot configuration data for Windows, basically it is a small registry file.  Boot applications use the system BCD store, located on the system partition, during the boot process.  You can also create additional BCD stores in separate files but only one store at a time can be designated as the system store.

     

    NOTE: The "/store" switch can be used to specify a particular BCD store for bcdedit commands (instead of the default store).  To enumerate all the settings in another BCD store, in this case e:\bcd_store\BCD, use the following command:

    bcdedit /store e:\bcd_store\BCD /enum all

     

    This will show you which options are currently set, and what their values are.  When /store switch is omitted, the system store is used.

     

    Using bootdebug

    To enable debugging for early boot problems, you may need to enable the bootdebug switch.  This is easy to do with bcdedit:

    bcdedit /set bootdebug on

     

    However, this only sets bootdebug for the current "boot application", which is generally winload.exe, so it does not break into the very early boot process.  There are multiple applications used for booting, hibernating, and resuming (bootmgr.exe, winload.exe and winresume.exe are examples of these).  Each application (called BCD Objects) has its own settings (called BCD Elements) in the BCD store and each can be modified globally and/or individually.

     

    So, to deal with different (or multiple) debug scenarios, you just enable boot debugging based on the boot application you are concerned with.  For early debugging, you can enable bootdebug for bootmgr:

    bcdedit /set {bootmgr} bootdebug on

     

    To set bootdebug for winload.exe (which will most often be your current, and default, boot object) all three of the following will give you the same result:

    bcdedit /set bootdebug on

    bcdedit /set {current} bootdebug on

    bcdedit /set {default} bootdebug on

     

    If you are modifying the settings in another store, or are booted into another OS on the same computer (such as WinPE), you need to specify the location of the BCD store:

    bcdedit /store d:\Boot\BCD /set {default} bootdebug on

     

    Not all of the boot objects have "friendly" names, so you may need to specify the full GUID (Globally Unique ID) to modify it.  As an example, if you wanted to enable bootdebug on resume from hibernation, you would include the identifier (see figure 1) for the "Resume from Hibernate" object:

    bcdedit /set {89a932d0-d5bc-11e0-a0af-00215add5ebc} bootdebug on

     

    image001

    Figure 1: Color coded bcdedit output

     

    Why won't my USB or 1394 debug work?

    When there are multiple debug ports of a certain type in a computer Windows may not default to the correct one for your situation.  This happens most commonly when there are either multiple 1394 host controllers or USB EHCI controllers.  When this occurs it can range from a slight inconvenience (different port is used so the cable needs to be plugged into another port), to complete failure (internal port is used, which is not accessible).  In the case of USB debugging the Intel USB 2.0 specification only provides one debug port, so debugging is not possible if the wrong host controller is used.

     

    There are several caveats with USB debugging, not the least of which is that you need to buy a separate, expensive, debug cable.  Some of the difficulties and implementation details necessary to get USB debugging to work are encompassed in the WHDC USB FAQ and in Setting Up Kernel Debugging with USB 2.0.

     

    NOTE: A correction to the WHDC USB documentation for Windows 7/Windows 2008 R2 is that the busparams switch now takes decimal rather than hexadecimal values, and the "loadoptions" parameter is no longer required.  So, to enable the busparams element (for USB or 1394 debugging) in Vista/2008, you would use something like this:

    bcdedit /set {current} loadoptions busparams=0.1D.7

     

    And the Win7/2008 R2 example would be:

    bcdedit /set {current} busparams 0.29.7

     

    In the case of loadoptions or busparams, deleting the setting is not as easy as changing a flag from yes to no. You must specifically delete the value to get rid of it, and one of the examples below can be used:

     

    For Vista/2008:

    bcdedit /deletevalue {current} loadoptions

     

    And Windows 7/2008 R2:

    bcdedit /deletevalue {current} busparams

     

    Bcdedit settings and examples

    This is just scratching the surface of using bcdedit for your troubleshooting and/or debugging needs, so there are more articles to follow. Part 2 will include some more detailed debugging scenarios, such as Hyper-V guest and host debugging.  Below is a consolidated table with many of the debugging switches/settings as well as a number of different usage examples.

     

    Table of debug-related bcdedit settings

    Option

    Description

    bootdebug

    Enables or disables the boot debugger for a specified boot entry. Although this command works for any boot entry, it is effective only for boot applications.

    Enable value(s): on, 1

    Disable value(s): off, 0

    Bcdedit /set bootdebug on

    debug

    Enables or disables the kernel debugger for a specified boot entry.

    Enable value(s): on, 1

    Disable value(s): off, 0

    /dbgsettings

    Used to modify the global settings for the debug connection (does not include hypervisor).  Values:

    Can change all settings at once instead of using the /set command to change them individually. Usage example:

    bcdedit /dbgsettings 1394 channel:30

    debugport

    Used to specify the debugger type.

    Values:

    Serial port – com1, com2, comx

    1394 port – 1394

    USB port - USB

    channel

    Specifies 1394 channel used.

    Values:

    Decimal integer between 0 and 62, inclusive.

    baudrate

    Used to specify the baud rate of a serial debug port.

    Values: 9600, 19200, 38400, 57600, 115200

    targetname

    Specifies a string to use as the identification for the USB 2.0 connection. This string can be any value.

    Usage example:

    bcdedit /dbgsettings usb targetname:usbdebug

    /hypervisorsettings

    Used the same way as /dbgsettings to configure all settings at once.

    Usage example:

    bcdedit /hypervisorsettings 1394 channel:10

    hypervisordebug

    Enables or disables hypervisor debug mode. This is for debugging a Hyper-V host system.

    Enable value(s): on, 1

    Disable value(s): off, 0

    Usage example:

    bcdedit /set {current} hypervisordebug on

    /noumex

    Specifies that the kernel debugger ignores user-mode exceptions. By default, the kernel debugger breaks for certain user-mode exceptions, such as STATUS_BREAKPOINT and STATUS_SINGLE_STEP. The /noumex parameter is effective only when there is no user-mode debugger attached to the process.

    /start

    This option specifies the debugger start policy. If a start policy is not specified, ACTIVE is the default.

    Values: active, disable, autoenable

    loadoptions

    Used to describe settings that are not covered by other types. One setting that is relevant here is busparams.

    Values: Any value followed by the setting.

    Usage example (Vista/2008):

    bcdedit /set {current} loadoptions busparams=0.1d.0

    busparams

    A boot setting (specified with loadoptions key word) used to point to the PCI address of the debugger in use. The PCI bus, device, and function are used, in the format bb.dd.ff. This is generally used to identify the location of a 1394 or USB debug port. In Vista/2008, hexadecimal values are used, whereas decimal values are used for Win7.

    Values: Decimal values between 0 and 255.

    Usage example:

    In Win7 - bcdedit /set busparams 0.29.0

    In Vista - bcdedit /set loadoptions busparams=0.1d.0

    kernel

    The loadoptions parameter used to point to a different kernel binary. This can be used to test with a checked or instrumented version of the kernel without replacing the existing one. The updated binary MUST be placed in the %windir%\system32 folder to be used

    Values: The 8.3 filename of the replacement kernel include the exe extension.

    Usage examples:

    In Win7 – bcdedit /set kernel kernchk.exe

    In Vista - bcdedit /set loadoptions kernel=kernchk.exe

    hal

    The loadoptions parameter used to point to a different hal binary. This can be used to test with a checked or instrumented version of the kernel without replacing the existing one. The updated binary MUST be placed in the %windir%\system32 folder to be used

    Values: the 8.3 filename of the replacement kernel include the .dll extension.

    Usage examples:

    In Win7 – bcdedit /set hal halchk.dll

    In Vista - bcdedit /set loadoptions hal=halchk.dll

    testsigning

    Controls whether Windows 7, Windows Server 2008, or Windows Vista will load any type of test-signed kernel-mode code. This option is not set by default, which means test-signed kernel-mode drivers on 64-bit versions of Windows 7, Windows Server 2008, and Windows Vista will not load without setting the testsigning switch

    Enable value(s): on, 1

    Disable value(s): off, 0

    Usage example:

    Bcdedit /set testsigning on

     

  • Ntdebugging Blog

    Server Hangs with Event ID: 2021 and 2022

    • 3 Comments

    Hi again!  This is Tate from the CPR team and I’m going to show you how to debug a Server Service hang and the sometimes dreaded Event ID:  2021 and Event ID:  2022.  There is much Voodoo about troubleshooting these two events but never fear, it’s possible to debug quickly given the right approach.

     

     

    Recall, I’ve talked about two other common Server Service events prior, 2019 and 2020.  (Event ID: 20xx usually denotes a srv.sys series event) Also, from now on I’m going to refer to the Server Service as srv. It’s in the driver srv.sys that the bulk of the work is happening and as such that’s just what we call it.

     

    Anyway, if you have seen these events before you likely have queried the web with the ID or description and have seen the big KB317249 article.

    This is an oldie but goodie, but the problem is that, well it’s big and it would help to have perspective on how srv is supposed to work because all the root causes for the event seem quite varied…but often there is only one of two things happening here.

     

    1.)  Blocking in the I/O stack preventing normal amounts of SMB requests from getting processed in a timely fashion. (this is the most common that we see, the SMB pump failure case)

     

    2.)  Abnormally large amount of requests directed at srv(this is usually a really chatty SMB application or usually accumulation of ChangeNotify requests).

     

    I’ll show you how to determine which of these two are happening but first let’s start with the event text as a refresher…

     

    Event ID: 2022
    Source: Srv
    Description: Server was unable to find a free connection n times in the last s seconds.

    Event ID: 2021
    Source: Srv
    Description: Server was unable to create a work item %2 times in the last %3 seconds.

     

     

    What does this mean?

     

    We really have to start with an explanation of Connections and Work Items.  So, internally to srv these are just structures used to represent a connection into srv and an item of work…couldn’t be that straightforward right?  It is, when a request comes into srv it has to use an Endpoint, Connection, a Work Item, and other structures to know about where the request came from, what the work is, and how to get the response back to what client.

     

    At its core, srv is just a Kernel Mode Driver that does SMB processing.  It takes in network SMB/CIFS requests, converts them into local I/O, and pumps the return back out to the client.  But the problem here is why would you be “unable to find” or “unable to create” one of these or what breaks this pump/processing?

     

     

    But first, let’s comment on storage.  Kernel memory is not free and as mentioned in prior blogs these structures take up space in the kernel pools and or may have preconfigured srv nominal limits (see MaxWorkItems in KB317249, etc.) if you hit either of these limits the event will be logged.  So as we learned prior running out of Paged or NonPaged pool can hang a machine no problem but can also log these events should Srv not be able to allocate memory on behalf of client requests.  (See the Event ID:  2019 2020 blog for that info and troubleshooting).  What usually happens here though are these WorkItems build up due to blocking or high load and cause the Event ID:  2019 or 2020…(with the top consumer usually LSwn – LanmanServer WorkItems) and then you will see these along with the 2021/2022s.  An event party if you will with the 2021/2022 issue in srv actually exhausting kernel memory pools.

     

    Could you bump up the limits of everything in \LanmanServer via KB317249 not read further and be okay?

     

    Maybe, but it depends how much blocking you may have or how high the load gets before the point of failure.  If you are logging this even after the server has been working fine under your normal ebb and flow of load in the past, it may be likely that this article will not completely fix your problem so further work is needed.  Keep in mind too that if you bump up the limits in the registry per the above explanation you could actually hit the pool limits sooner!  The values in KB317249 are pretty conservative so they shouldn’t hurt but again it might not completely help.  Also, if you don’t like changing things until you understand more…read on!

     

     

     

    REASON #1 for Event ID:  2021 and 2022

     

    The pump mentioned above is really the Srv!WorkerThread and KQUEUE architecture.  When requests come in they are put in KQUEUEs via KeInsertQueue and as you might imagine, the processing/draining of those queues is done by the Srv!WorkerThread(s) via a call to KeRemoveQueue.  These threads are the heart of Srv and are the key to determining what is usually happening to break the SMB processing here, so we usually look there for the answer.

     

    To check in on srv you gather a kernel memory.dmp via KB244139 during the problem or with livekd (Sysinternals) perhaps…but the best way is KB244139 or by attaching the kernel debugger.  The best time to gather this dump is when the server is somewhat unresponsive via the network (SMB) or you are constantly logging the 2021/2022 events…or if the machine is completely hung after hitting these Events.

     

    With memory.dmp in hand, just load it up and do the following to get the system process’ EPROCESS address via !process 0 0 system

     

    0: kd> !process 0 0 system

    PROCESS 89067520  SessionId: 0  Cid: 0008    Peb: 00000000  ParentCid: 0000

        DirBase: 00030000  ObjectTable: 89067008  TableSize: 490.

        Image: System

     

    Now change process context using that address with .process /r /p <system EPROCESS address>

     

    0: kd> .process /r /p 89067520

    Implicit process is now 89067520

    Loading User Symbols

     

    Now let’s dump out all the system threads using !process <system EPROCESS address>

     

    0: kd> !process 89067520

    PROCESS 89067520  SessionId: 0  Cid: 0008    Peb: 00000000  ParentCid: 0000

        DirBase: 00030000  ObjectTable: 89067008  TableSize: 490.

        Image: System

    .

    .

    Dumping out the threads…

    .

    .

     

    Once the output of threads is finished, I like to search up from the bottom for “Start Address srv!WorkerThread” looking at the call stacks…

     

    Note:  There’s a shortcut here for Windows 2003 to just do a “!stacks 2 Srv!WorkerThread” right away.  This output will give you a really quick look at the threads without having to .process into system, etc.

     

    Here’s an example of a “good thread” just waiting for work to come in…sitting in KeRemoveQueue

     

     THREAD 887c14e0  Cid 8.474  Teb: 00000000  Win32Thread: 00000000 WAIT: (WrEventPairLow) KernelMode Non-Alertable

                be9fd324  Unknown

            Not impersonating

            Owning Process 89067520

            Wait Start TickCount    160832116     Elapsed Ticks: 203335

            Context Switch Count    134294              

            UserTime                  0:00:00.0000

            KernelTime                0:00:09.0343

            Start Address srv!WorkerThread (0xbea06880)

            Stack Init beb11000 Current beb10d3c Base beb11000 Limit beb0e000 Call 0

            Priority 10 BasePriority 9 PriorityDecrement 0 DecrementCount 0

     

            ChildEBP RetAddr  Args to Child

            beb10d54 8042ef5f 88a10eb8 be9fd320 00000001 nt!KiSwapThread+0x1b1

            beb10d78 bea068ec 00000000 00000000 00000000 nt!KeRemoveQueue+0x197

            beb10da8 804578c2 be9fd300 00000000 00000000 srv!WorkerThread+0x6c

            beb10ddc 8046c966 bea06880 be9fd320 00000000 nt!PspSystemThreadStartup+0x54

            00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

     

     

    What does a “bad thread” look like?

    Here is a blocked Srv!WorkerThread that is trying to process a FindFirst SMB request…

     

    THREAD 887dd4e0  Cid 8.45c  Teb: 00000000  Win32Thread: 00000000 WAIT: (UserRequest) KernelMode Non-Alertable

                88255db8  SynchronizationEvent

            IRP List:

                86941288: (0006,01fc) Flags: 00000884  Mdl: 00000000

            Impersonation token:  e5d56990 (Level Impersonation)

            Owning Process 89067520

            Wait Start TickCount    160783273     Elapsed Ticks: 252178

    NOTE the Tick count here on this thread….it’s been waiting a while, not good.

            Context Switch Count    12993              

            UserTime                  0:00:00.0000

            KernelTime                0:00:00.0843

            Start Address srv!WorkerThread (0xbea06880)

            Stack Init bec29000 Current bec284f4 Base bec29000 Limit bec26000 Call 0

            Priority 10 BasePriority 9 PriorityDecrement 0 DecrementCount 0

     

            ChildEBP RetAddr  Args to Child

            bec2850c 8042d893 00000000 882549c8 00000000 nt!KiSwapThread+0x1b1

            bec28534 f5ae3408 88255db8 00000006 00000000 nt!KeWaitForSingleObject+0x1a3

            bec285bc f5ae4066 bec285e4 00000000 00000000 SomeFiltr+0x3408<<<<<<<making a blocking call here…

            bec285fc f5ae19fb 884ce3a0 01941288 8041eecb SomeFiltr+0x4066

            bec287a4 804c4994 884ce3a0 00000000 bec28894 SomeFiltr+0x19fb

            bec287dc 804531c2 884766c8 00000000 bec28894 nt!IopParseFile+0x44

            bec28854 804da4d8 0000051c bec28950 00000040 nt!ObpLookupObjectName+0xf8

            bec28964 804a4495 00000000 00000000 00000000 nt!ObOpenObjectByName+0xc8

            bec28a40 804a403a 8690d008 00000001 bec28b4c nt!IopCreateFile+0x407

            bec28a88 bea0282c 8690d008 00000001 bec28b4c nt!IoCreateFile+0x36

            bec28af8 bea050f6 88a0d2f8 8690d008 00000001 srv!SrvIoCreateFile+0x32a

            bec28b8c bea0acdb 88a0d2f8 00000001 8046b000 srv!SrvQueryDirectoryFile+0x1c8

            bec28c5c bea0936d 88a0d2f8 00000001 00000000 srv!SrvFind2Loop+0x171

            bec28cd0 bea07d2f 88a0d201 bea0a9c8 88a0d2f8 srv!DoFindFirst2+0x373

            bec28cd8 bea0a9c8 88a0d2f8 88a0d2f8 804636f0 srv!SrvSmbFindFirst2+0x31

            bec28cf0 bea089b7 88994f38 88a0d2f8 00000000 srv!ExecuteTransaction+0xdc

            bec28d68 be9f27ae 88a0d2f8 be9f2cde 88a0d300 srv!SrvSmbTransaction+0x5af

            bec28d70 be9f2cde 88a0d300 88a144a0 bec28da8 srv!SrvProcessSmb+0x46

            bec28d84 bea069a9 00000000 00000000 00000000 srv!SrvRestartReceive+0xa2

            bec28da8 804578c2 88a14400 00000000 00000000 srv!WorkerThread+0x129

            bec28ddc 8046c966 bea06880 88a144a0 00000000 nt!PspSystemThreadStartup+0x54

            00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

     

    Note:  The tick count (Elapsed Ticks) was not bad for the prior “good thread” because we’ve just been waiting in the call to KeRemoveQueue for that long for work to arrive.  No work to do in that queue, no problem.  However, in this “bad thread “case we’ve grabbed a WorkItem to process and been waiting for a long time (about 1hr 5 min.) stuck while processing it, yikes.

     

     

    Why is this bad?

     

    So if the Srv!WorkerThread is the worker thread that drains the WorkItems from this list of request in the KQUEUEs and number of these threads are finite(they are), then we have a problem!  As more requests come into srv they will continue to get placed in the KQUEUEs by different event handler threads at the transport layer but with no WorkerThread available to process the queue on the back end (and considering these Connections and WorkItems are stored in Pool and the number is finite as well) this is how and usually why the pump stops and we log these events!  So not only might we run out of the configured limits in the registry (KB317249) for srv, we might run out of Non Paged Pool, and we are definitely not giving clients a timely response if at all in this state.

     

    What if I see “good” and “bad” threads in my !process of the system process?

    This is likely because of the two queues that we have in service.  There are blocking and a non-blocking queues so you may see the threads that are waiting in KeRemoveQueue are actually for the blocking queue whereas your other bad threads are blocked on things like CreateFile, Find, etc. requests…supposed to be shorter time requests, appropriately so, on the the non-blocking work queues.  In general, if you see any Srv!WorkerThread blocked in anything but KeRemoveQueue for very long, it’s bad.

     

     

    So what is the solution?

     

    Referencing the above “bad thread” for SomeFiltr.sys note that the filter driver in question has chosen to make a blocking call to KeWaitForSingleObject.  This is blocking a FindFirst request from a client but more importantly it’s consuming an Srv!WorkerThread in the process as stated….so it’s actually the filter driver making the decision to stop the processing so the fault lies with it in this case.

     

    Now it’s worth mentioning that this is a fairly simple 1st tier blocking example but suffice to say any blocking call taking too long in a srv!WorkerThread can have bad consequences.  This could have easily been any filter driver or device driver in the I/O stack (between the network and srv and the disk) which is blocking doing some work to “help” process the I/O or more difficult to debug the filter or function driver could be waiting on some other queuing mechanism that it implements…so you see how srv can encounter a dependency failure here.

    Another trick that we use is looking at all other thread stacks containing the name of the filter driver to see if it’s doing something recognizable like waiting on another Kernel api, filter, etc…in other words guessing through the black box of it’s decision to wait on this thread and making an educated guess of what work it is trying to do on another…A common case is when the blocked thread may be waiting on a response from an LPC to LSASS which is performing some security check. Then it’s necessary to jump …..

     

    In summary, since we can’t see further into the blocking here through SomeFiltr.sys and there’s no other threads in process with this driver in the call stack we’d recommend testing updating, removing, or contact the vendor of the SomeFiltr.sys filter to see if this is fixed, etc.  You can do an “lmvm SomeFiltr”  to get the time/date information for the file.

     

     

    So, what if all the threads are just waiting like in the good thread example?

     

    This could be the unfortunate side of gathering a single memory.dmp to represent a problem that may be transient.  If a fraction of a second before you hit the Ctrl+Scroll+Scroll to issue the dump the Srv!WorkerThread unblocks, it may look completely normal.  What you can do for this class of issue that has temporary blocking is either try again to get it in state, or have a look at the device stack that srv has to work through to get things done(see devstack below).  Usually you will find out what filter drivers you are running, etc. and since that is the #1 reason for this event updating them is usually the solution because we or the vendor have likely already debugged this for you and with the filter vendor’s help have likely fixed it in a later version.

     

    This is why usually the first thing we target is filter drivers before debugging anything for hangs or if a dump is not available.  By frequency this is far and away the usual root cause for hangs of many many (did I say many?) symptom types.

     

    How do I check the members of device stack that Srv.sys has to work through to get I/O requests processed, filter drivers I need to update, etc?

     

    This is cool…so did you notice that there was an Irp listed in the prior “bad thread”?

     

    Dump it out…

     

    0: kd> !irp 86941288

    Irp is active with 11 stacks 11 is current (= 0x86941460)

     No Mdl Thread 887dd4e0:  Irp stack trace. 

         cmd  flg cl Device   File     Completion-Context

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

     [  0, 0]   0  0 00000000 00000000 00000000-00000000   

     

                                                    Args: 00000000 00000000 00000000 00000000

    >[  0, 0]   1  0 884ce3a0 878977a8 00000000-00000000   

                           \FileSystem\SomeFitr

                                                    Args: bec28778 01000000 00070000 00000000

     

    Now grab the Device Object for SomeFiltr…

     

    0: kd> !devstack 884ce3a0

      !DevObj   !DrvObj            !DevExt   ObjectName

    > 884ce3a0  \FileSystem\SomeFiltr884ce458 

      88ba5c80  \Driver\InnocentDriver       88ba5d38 

      88ba5460  \FileSystem\Ntfs   88ba5518 

     

    Now you can see how this thing works….ntfs talks to InnocentDriver and it to SomeFiltr, etc….all these guys have to work flawlessly to not kill the srv request dependent upon this device stack.

     

    What if you didn’t have an Irp, no problem…everthing hangs off Object Manager so start at Ntfs’s device objects…This is cooler…

     

     

    0: kd> !drvobj \FileSystem\Ntfs  <<<<note that’s !DRVobj for the Driver Object

    Driver object (88dd2710) is for:

     \FileSystem\Ntfs

    Driver Extension List: (id , addr)

     

    Device Object list:

    882b4120  8717a620  88b91800  88b981e0

    88b9e600  88ba5460  88f583e0  88dd25f0

     

    0: kd> !devstack 882b4120

      !DevObj   !DrvObj            !DevExt   ObjectName

      86576020  \FileSystem\SomeFiltr865760d8  <<<<<<there you are Mr. SomeFiltr

    > 882b4120  \FileSystem\Ntfs   882b41d8 

     

    0: kd> !devstack 8717a620

      !DevObj   !DrvObj            !DevExt   ObjectName

      8792ae00  \FileSystem\SomeFiltr8792aeb8  <<<<<<and here

    > 8717a620  \FileSystem\Ntfs   8717a6d8 

    .

    .etc…

    .

    0: kd> !devstack 88dd25f0

      !DevObj   !DrvObj            !DevExt   ObjectName

      886c9020  \FileSystem\SomeFiltr886c90d8  <<<<<<<and here

      88f74f00  \FileSystem\DfsDriver88f74fb8 

      88f74020  \Driver\InnocentDriver       88f740d8 

    > 88dd25f0  \FileSystem\Ntfs   00000000  Ntfs

     

    The prudent thing to do if you aren’t sure of root cause by debugging the Srv!WorkerThreads you have captured in a memory.dmp is at least make sure all the filter drivers in these stacks are updated and perhaps even your storage drivers that are being used under ntfs serving your file share(s).

     

     

    REASON #2 for Event ID:  2021 and 2022

     

    What we have just discussed was the most common reason for hitting the Event ID:  2021 and 2022 event in blocking by a filter or function driver on the I/O stack and how to identify.  There are of course other reasons like bursts of demand and a lower than the maximum configured Srv (the nominal limits mentioned above and discussed in KB317249) or just simply a lot of long term requests that are on the system such as ChangeNotify requests.  How would you find out about these?  Analyze the network traffic.

     

    There are tons of network sniffers which allow you to gauge what type of SMB traffic is going in between servers so I won’t get into that here.  I will only say that if you have the ability to generate Statistics of top SMB commands, that’s the way to go.  You could however just filter on SMB and ballpark what kind of request and from which clients are occurring most and given your client configuration if that’s normal.  What we see here usually is a misbehaving network application in a loop, etc. hitting the file server with repeated requests for the same file, etc.  This is also a great strategy for high system process CPU issues where when you look at the thread call stacks generating the CPU via Process Explorer (sysinternals) you find the base of the activity in Srv!WorkerThreads coming and going rapidly.

     

    I will mention however there are a few known ways to reduce the cumulative load of ChangeNotifications and QueryFile and QueryPathInfo requests that Windows Clients actually generate.  You can check out the following KBs for these configuration changes on the client side:

     

    For the client SMB side’s shell:  NoRemoteChangeNotify.

    http://support.microsoft.com/kb/812669/en-us

    “An update is available in Windows to turn off directory notification SMB requests”

     

     

    For the client SMB side (redirector):  InfoCacheLevel

    (also has shell change NoRemoteRecursiveEvents )

    http://support.microsoft.com/kb/816375/en-us

    “Windows XP Explorer Pane flickers on mapped network drives”

     

     

     

    I hope you’ve enjoyed this post and hopefully the next time you see these srv events you can nail the root cause with windbg no problem!

     

    Keep that device stack updated and until next time…

  • Ntdebugging Blog

    Windows Internals Beta Exam 71-660

    • 13 Comments

    Correction, Windows Internals Beta Exam 71-660

    This was previously listed incorrectly as 70-660, the actual exam ID is now 71-660

     

    Update 7-18-2008: The link to the exam preparation information has been updated below.

     

    Update 7-21-2008: When registering, there is no need to use a credit card.  Be sure to instead use the following promotional code listed below when registering for the exam: WINT.  If you have already been charged for this exam, please contact Prometric and request a refund, and reference ticket number 3711938.

     

    Hello Windows Debugging Community,  We are super excited to announce a new certification.   With the help of industry partners, including some of our very talented community members,  we have developed a new Windows Internals Certification.   This Certification targets developers and IT professionals that require knowledge of Windows Internals as part of their profession.   Chance are if you are reading this blog post you are part of the target audience.  I encourage you to register and please take the Beta exam.  Please help us to improve the exam by providing feedback during the beta exam.

     

     

    Register for Beta Exam 71-660: TS: Windows® Internals

     

    You are invited to take beta exam 70-660: TS: Windows® Internals. If you pass the beta exam, the exam credit will be added to your transcript and you will not need to take the exam in its released form.

    By participating in beta exams, you have the opportunity to provide the Microsoft Certification program with feedback about exam content, which is integral to development of exams in their released version. We depend on the contributions of experienced IT professionals and developers as we continually improve exam content and maintain the value of Microsoft certifications.

    Please remember that participation in the beta process is completely voluntary and Microsoft makes no promises or guarantees regarding the beta exam process. You can expect to receive your score on the beta exam within 12 weeks of taking the exam, although in some instances, beta exams may take longer to score and your results may be delayed.

    70-660: TS: Windows® Internals counts as credit towards the following certification.

    ·          Microsoft Certified Technology Specialist (MCTS): Windows® Internals  


    Availability

    Registration begins: July 17, 2008

    Beta exam period runs: July 21, 2008August 5, 2008

     

    Receiving this invitation does not guarantee you a seat in the beta; we recommend that you register immediately. Beta exams have limited availability and are operated under a first-come-first-served basis. Once all beta slots are filled, no additional seats will be offered.


    Testing is held at Prometric testing centers worldwide, although this exam may not be available in all countries (see Regional Restrictions).  All testing centers will have the capability to offer this exam in its live version.

     

    Regional Restrictions: India, Pakistan, China


    Registration Information

    You must register at least 24 hours prior to taking the exam.
    Please use the following promotional code when registering for the exam: WINT
    Receiving this invitation does not guarantee you a seat in the beta; we recommend that you register immediately.


    To register in North America, please call:

    Prometric: (800) 755-EXAM (800-755-3926)

     

     

    Outside the U.S./Canada, please contact:

    Prometric: http://www.register.prometric.com/ClientInformation.asp

     

     


    Test Information and Support

    You are invited to take this beta exam at no charge.
    You will be given four hours to complete the beta exam. Please plan accordingly.
    Find exam preparation information: http://blogs.msdn.com/ntdebugging/pages/exam-preparation-information-for-exam-71-660.aspx

     


    Frequently Asked Questions

    For Microsoft Certified Professional (MCP) help and information, you may log in to the MCP Web site at http://www.microsoft.com/learning/mcp/or contact your Regional Service Center: http://www.microsoft.com/learning/support/worldsites.asp.

    What is a beta exam?

    Where can I learn more about the registration process?

    Where can I learn more about the beta exam invitation process?

    Where can I learn more about the new structure of Microsoft Certification?

    Who do I contact for help with this beta exam or other MCP questions?

     

    Thank you.


    Jeff Dailey, Microsoft Global Escalation Services.

  • Ntdebugging Blog

    Part 1 - ETW Introduction and Overview

    • 2 Comments

    Introduction:

     

    Event Tracing for Windows (ETW) is a system and software diagnostic, troubleshooting and performance monitoring component of Windows that has been around since Windows 2000. However, it wasn't until Windows Vista that major components of the OS were updated to heavily use ETW tracing; making it much more practical and useful.

    ETW is useful from a variety of scenarios, including:

     -User & Admin: Being in control of your system and knowing what is going on.
     -User & Admin: Troubleshooting performance, hardware and OS issues.
     -Enthusiast: Learning further about the OS and the low level guts of the OS.
     -Software Developer/ISV/OEM: Investigating issues with your software's interaction with Microsoft OS & technologies
     -Hardware Developer/IHV/OEM: Investigating issues with hardware interaction with the OS, including kernel, driver subsystems, up to the user stack.

    ETW is a set of technologies and tools that can absolutely complement existing tools while providing a look into the guts of the OS at a very low level.

    A great article from the April 2007 edition of MSDN Magazine - http://msdn.microsoft.com/en-us/magazine/cc163437.aspx, covers ETW in great depth, and is recommended reading.

    Here is graphical overview of the ETW infrastructure that covers how provider(s) log to high performance in memory buffers, which can be kept in memory in a circular buffer, or written to disk in a sequential or circular fashion.

     

    image

    Some uses of ETW:

     

    Today on Windows Vista, Windows 7, and sparingly on earlier OSes; ETW is used by Microsoft Development and Microsoft Support, as well as others; to help troubleshoot issues, find root-cause of bugs, analyze performance; and a large variety of other tasks.

    As it stands today, ETW has two major thrusts:

    1. Exposing time-based event data similar in practicality to plain old text based logged (without the normal performance overhead). Troubleshooting logs have long been used across the industry by software developers and IT professionals to troubleshoot issues. Here are some examples of how ETW logging is used.

    First off, shown below is Windows Event Viewer enabling viewing of analytic and debug logs (which show you many more logs than the default view):

    image

    A variety of logs can be viewed graphically using the Event Viewer or via the built-in command line OS utility – wevtutil. There are a quite a variety of logs that are enabled by default and are available via the Event Viewer or in a kernel memory dump.

    Since ETW is also a foundational component, you might find it appearing in other troubleshooting utilities. For example, Office Communications Server 2007 uses ETW for troubleshooting.

    ETW is used to monitor for network events from the kernel which are then added to Process Monitor, a utility from Microsoft Sysinternals. You should also see tools such netsh and NetMon being able to capture and decode ETW traces.

     

    image

    There are many other methods and utilities to view the ETW tracing providers available, which may be found in the OS, via the Device Driver Kit (DDK), enabled with Microsoft products (such as Exchange or IIS), or used by tools, such as Office Communication System Logging.

    The second major thrust of ETW is exposing performance data.

    2. Performance data is exposed from the lowest levels of the kernel, driver subsystems, and every major user-mode component of Windows (such as Windows Media Player).

    Performance data can be used in a hybrid fashion both for troubleshooting and performance reasons. Take for example a built in OS tool that uses ETW tracing to provide rich data - Resource Monitor.

    Resource Monitor is available from the Windows Task Manager - Performance Tab. The cool thing about Resource Monitor is that it internally uses kernel ETW tracing to gather its data, providing rich data from the kernel itself; and thus providing a practical exposure of ETW technology for Administrators to troubleshoot their systems. This goes beyond the data that Task Manager provides, which often tends to be based on performance counters which you cannot drill down into further.

    image

    image

    Resource Monitor is useful from both a performance standpoint, as well as troubleshooting. For example, you can view top cpu users, top disk users (why your disk is thrashing), or search for a locked file handle that is open.

    For a deeper drill-down into performance data XPerf can be used to capture ETW traces and make them available for decoding offline (on another box).

    Here, Xperf (available with the Windows DDK and as a standalone download) is shown looking at Internet Explorer's file access time.

    image

    There have been several posts on the NTDebugging Blog regarding XPerf, other blog posts, as well as the Windows Performance Analysis Dev Center.

    Where to go from here:

     

    In Windows 7 and Windows Server 2008 R2, ETW tracing has even been further improved with many more providers including 600+ inbox providers’ registered on the box. This number increases when add-on features or products are installed (such as Hyper-V).

    The tracing provided by ETW, whether it is used directly in the OS, or via add-on tools, provides deep insight into Windows and how it works. We will explore ETW in depth in future blog posts.

     

     

    Share this post :

  • Ntdebugging Blog

    Event ID 157 "Disk # has been surprise removed"

    • 9 Comments

    Hello my name is Bob Golding and I would like to share information on a new error you may see in the system event log. It is Event ID 157 "Disk <n> has been surprise removed" with Source: disk.  This error indicates that the CLASSPNP driver has received a “surprise removal” request from the plug and play manager (PNP) for a non-removable disk.

     

    What does this error mean?

    The PNP manager does what is called enumerations.  An enumeration is a request sent to a driver that controls a bus, such as PCI, to take an inventory of devices on the bus and report back a list of the devices.  The SCSI bus is enumerated in a similar manner, as are devices on the IDE bus.

     

    These enumerations can happen for a number of reasons.  For example, hardware can request an enumeration when it detects a change in configuration.  Also a user can initiate an enumeration by selecting “scan for new devices” in device manager.  

     

    When an enumeration request is received, the bus driver will rescan the bus for all devices.  It will issue commands to the existing devices as though it was looking for new ones.  If these commands fail on an existing unit, the driver will mark the device as “missing”.  When the device is marked “missing”, it will not be reported back to PNP in the inventory.  When PNP determines that the device is not in the inventory it will send a surprise removal request to the bus driver so the bus driver can remove the device object.

     

    Since the CLASSPNP driver sits in the device stack and receives requests that are destined for disks, it sees the surprise removal request and logs an event if the disk is supposed to be non-removable.  An example of a non-removable disk is a hard drive on a SCSI or IDE bus.  An example of a removable disk is a USB thumb drive.

     

    Previously nothing was logged when a non-removable disk was removed, as a result disks would disappear from the system with no indication.  The event id 157 error was implemented in Windows 8.1 and Windows Server 2012 R2 to log a record of a disk disappearing.

     

    Why does this error happen?

    These errors are most often caused when something disrupts the system’s communication with a disk, such as a SAN fabric error or a SCSI bus problem.  The errors can also be caused by a disk that fails, or when a user unplugs a disk while the system is running.  An administrator that sees these errors needs to verify the heath of the disk subsystem.

     

    Event ID 157 Example:

     

  • Ntdebugging Blog

    What Is In A RHS Dump File Created By Windows Error Reporting

    • 1 Comments

    Hello all, East here.  I wanted to give you a hint on how to use a RHS dump to find what thread was part of the Windows Server 2008 R2 cluster RHS recovery deadlock.

     

    First let me start off with letting you know that Windows Server 2008 R2 will create two types of user-mode dumps:

    1 - A heap dump (.hdmp) - which is an extended mini-dump that contains additional data such as the process memory.

    2 - A mini dump (.mdmp) - which will only contain mainly stack information and not enough info to help you get further.

     

    When a Windows Server 2008 R2 cluster has a resource that does not respond to one of the main cluster queries in a timely fashion, RHS will try to recover that resource by restarting it in an isolated process as well as creating a WER report and a user-mode dump of the deadlock.

     

    NOTE: More info about RHS and WER reports can be seen in the following blog - http://blogs.msdn.com/b/clustering/archive/2009/06/27/9806160.aspx

     

    When a deadlock is detected the cluster log will show:

    "00000dd4.00000aac::2011/05/13-16:12:39.559 ERR   [RHS] WER report is submitted. Result : WerReportQueued."

     

    You should find the WER report file and dumps located at the default location:

    <SystemDrive>\Users\All Users\Microsoft\Windows\WER\ReportQueue

     

    Here is what a cluster created Windows Error Report looks like:

    Version=1

    EventType=WSFC Resource Deadlock

    EventTime=129497767591366327

    ReportType=1

    Consent=1

    ReportIdentifier=d2277837-7d7b-11e0-b5d0-00155d06402c

    Response.type=4

    Sig[0].Name=Resource Name

    Sig[0].Value=New flexdisk

    Sig[1].Name=Resource Type

    Sig[1].Value=flexdisk

    Sig[2].Name=Call Type

    Sig[2].Value=ISALIVE

    DynamicSig[1].Name=OS Version

    DynamicSig[1].Value=6.1.7601.2.1.0.274.10

    DynamicSig[2].Name=Locale ID

    DynamicSig[2].Value=1033

    File[0].CabName=memory.hdmp

    File[0].Path=WERC1F2.tmp.hdmp <----------------------------- Name of the dump file   

    File[0].Flags=2097152

    File[0].Type=3

    File[0].Original.Path=C:\Windows\Temp\WERC1F2.tmp.hdmp

    File[1].CabName=minidump.mdmp

    File[1].Path=WERC2CE.tmp.mdmp

    File[1].Flags=2162690

    File[1].Type=2

    File[1].Original.Path=C:\Windows\Temp\WERC2CE.tmp.mdmp

    FriendlyEventName=WSFC Resource Deadlock

    ConsentKey=WSFC Resource Host Monitor

    AppName=Failover Cluster Resource Host Subsystem <--------- The application that had an issue - RHS

    AppPath=C:\Windows\Cluster\rhs.exe

    ReportDescription=WSFC Resource Deadlock

     

    Now you have the area to look for in the cluster log, and an idea of which resource caused RHS to deadlock, and what it was doing at the time:

     

    The Resource Name = New flexdisk

    The Resource Type = flexdisk

    The type of call the resource was engaged in = ISALIVE

     

    You have all this information in the cluster log as well:

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] RhsCall::DeadlockMonitor: Call ISALIVE timed out for resource 'New flexdisk'.  RHS deadlock during an Isalive check for the New flexdisk resource.

    00000dd4.00000aac::2011/05/13-16:12:39.136 INFO  [RHS] Enabling RHS termination watchdog with timeout 1200000 and recovery action 3.

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] Resource New flexdisk handling deadlock. Cleaning current operation and terminating RHS process.

    00000dd4.00000aac::2011/05/13-16:12:39.136 ERR   [RHS] About to send WER report.

    000004dc.00000fb4::2011/05/13-16:12:39.136 WARN  [RCM] HandleMonitorReply: FAILURENOTIFICATION for 'New flexdisk', gen(3) result 4.

    000004dc.00000fb4::2011/05/13-16:12:39.136 INFO  [RCM] rcm::RcmResource::HandleMonitorReply: Resource 'New flexdisk' consecutive failure count 1.

    00000dd4.00000aac::2011/05/13-16:12:39.559 ERR   [RHS] WER report is submitted. Result : WerReportQueued.  WER report and user-mode dump created

    000004dc.00000fb4::2011/05/13-16:12:39.559 ERR   [RCM] rcm::RcmMonitor::RecoverProcess: Recovering monitor process 3540 / 0xdd4

    000004dc.00000fb4::2011/05/13-16:12:39.575 INFO  [RCM] Created monitor process 3280 / 0xcd0

    00000cd0.00000200::2011/05/13-16:12:39.575 INFO  [RHS] Initializing.

    000004dc.00000fb4::2011/05/13-16:12:39.591 INFO  [RCM] rcm::RcmResource::ReattachToMonitorProcess: (New flexdisk, OfflineCallIssued)

    000004dc.00000fb4::2011/05/13-16:12:39.591 WARN  [RCM] Canceling pending control GET_RO_PRIVATE_PROPERTIES for resource 'New flexdisk' due to monitor crash.

     

    Now if you want to get adventurous and review the dump file to see the call stacks of RHS and what the resource was doing, you will look at the WER dump with the extension “.hdmp”.  Since we are looking at a process with multiple threads, we need to find the one that is showing the WER reporting calls and the one showing the RHS deadlock.

     

    Run the following command after opening the dump under your debugger “~*k”.  In the below example, thread 3 is the one we are interested in. You can tell this because the process and thread ID matches the cluster log above.  If you don’t have a cluster log, this is the thread with wer.dll on the stack.

    0:000> ~*k

     

    .  0  Id: dd4.9d4 Suspend: 0 Teb: 000007ff`fffdd000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`001cf128 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`001cf130 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`001cf230 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0

    00000000`001cf2c0 00000000`ff74802c rhs!cxl::WaitHandle::WaitAny+0x3f

    00000000`001cf3b0 00000000`ff7458a1 rhs!Rhs::WaitForTermination+0x74

    00000000`001cf450 00000000`ff769099 rhs!wmain+0x4b1

    00000000`001cfbd0 00000000`772c652d rhs!__wmainCRTStartup+0x13d

    00000000`001cfc10 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`001cfc40 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       1  Id: dd4.d0c Suspend: 0 Teb: 000007ff`fffdb000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00c3fa98 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00c3faa0 00000000`772c1220 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`00c3fba0 00000000`ff763d53 kernel32!WaitForMultipleObjects+0xb0

    00000000`00c3fc30 00000000`ff75785c rhs!cxl::WaitHandle::WaitAny+0x3f

    00000000`00c3fd20 00000000`ff766988 rhs!PingAliveCallScheduler::Run+0x184

    00000000`00c3fda0 00000000`772c652d rhs!cxl::Thread::ThreadStartRoutine+0x18

    00000000`00c3fdf0 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00c3fe20 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       2  Id: dd4.e78 Suspend: 0 Teb: 000007ff`fffd9000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00baf998 00000000`774eb007 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00baf9a0 00000000`772c652d ntdll!TppWaiterpThread+0x14d

    00000000`00bafc40 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00bafc70 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       3  Id: dd4.aac Suspend: 0 Teb: 000007ff`fffd7000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00cbe5e8 000007fe`fd611420 ntdll!ZwWaitForMultipleObjects+0xa

    00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3

    00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a

    00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46

    00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20

    00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54

    00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4

    00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb

    00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386

    00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a

    00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f

    00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187

    00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

       4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa

    00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab

    00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9

    00000000`00e2f1a8 00000000`00000000 0x326aa0

     

    <snipped to save space>

     

    We will switch context to thread 3 by using the “ ~3s ” command, and follow that with “ kn “ to show the stack with frame numbers.  Debug Tip: Since we already know that its thread id is “aac” we can also switch directly to thread 3 by using its thread id in the following command syntax:  “ ~~[aac]s ”

    0:003> kn

    # Child-SP          RetAddr           Call Site

    00 00000000`00cbe5e8 000007fe`fd611420 ntdll!NtWaitForMultipleObjects+0xa

    01 00000000`00cbe5f0 00000000`772d2d53 KERNELBASE!WaitForMultipleObjectsEx+0xe8

    02 00000000`00cbe6f0 00000000`773e8f7d kernel32!WaitForMultipleObjectsExImplementation+0xb3

    03 00000000`00cbe780 00000000`773e62b2 user32!RealMsgWaitForMultipleObjectsEx+0x12a

    04 00000000`00cbe820 00000000`773e62e0 user32!MsgWaitForMultipleObjectsEx+0x46

    05 00000000`00cbe860 000007fe`f6f3c35c user32!MsgWaitForMultipleObjects+0x20

    06 00000000`00cbe8a0 000007fe`f6f6dbd0 wer!UtilMsgWaitForMultipleObjects+0x54

    07 00000000`00cbe930 000007fe`f6f54158 wer!CReportManager::ReportProblemOutOfProcess+0x5c4

    08 00000000`00cbecc0 000007fe`f6f3b99b wer!CReportManager::ReportProblem+0x1eb

    09 00000000`00cbede0 000007fe`f6f61cbe wer!CReportHandleInstance::SubmitReport+0x386

    0a 00000000`00cbee50 00000000`ff750dcf wer!WerReportSubmit+0x16a

    0b 00000000`00cbeef0 00000000`ff75117b rhs!RhsResource::WERReportDeadlock+0x43f

    0c 00000000`00cbfaf0 00000000`ff74b270 rhs!RhsResource::HandleDeadlock+0x187

    0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    0e 00000000`00cbfc60 00000000`774fc521 kernel32!BaseThreadInitThunk+0xd

    0f 00000000`00cbfc90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    The frame of our concern will be frame d. You can either go the context ( /c ) or just display the registers  ( /r )  for that frame context:

    0:003> .frame /r d

    0d 00000000`00cbfb60 00000000`772c652d rhs!RhsCall::DeadlockMonitor+0x324

    rax=000007fffffd7000 rbx=0000000000346920 rcx=00000000007316c0

    rdx=0000000000000000 rsi=000000000035f8b0 rdi=000000000035f8b0

    rip=00000000ff74b270 rsp=0000000000cbfb60 rbp=0000000000000000

    r8=00000000ffffffff  r9=00000000000004ff r10=0000000000000000

    r11=0000000000cbe810 r12=0000000000000000 r13=0000000000000000

    r14=0000000000000000 r15=000000004dcd5877

    iopl=0         nv up ei pl zr na po nc

    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246

    rhs!RhsCall::DeadlockMonitor+0x324:

    00000000`ff74b270 448b442430      mov     r8d,dword ptr [rsp+30h] ss:00000000`00cbfb90=ffffffff

     

    Here is where we get into tricky territory about where to look. I know from experience that the RDI register from above is going to end up containing my information for this resource. Another way to verify you are in the right area, dumping RDI will show the CallType.

     

    Dump that out to find the thread containing the stack with:

    0:003> dqs 000000000035f8b0

    00000000`0035f8b0  00000000`ff704eb8 rhs!IsAliveCall::`vftable' == CallType

    00000000`0035f8b8  00000000`00000004

    00000000`0035f8c0  00000000`001e7da0

    00000000`0035f8c8  00000000`00000e48 <-- Thread id of concerning resource

    00000000`0035f8d0  00000000`00346920

    00000000`0035f8d8  0000000c`00000001

    00000000`0035f8e0  00000000`00000003

    00000000`0035f8e8  00000000`4dcd5876

    00000000`0035f8f0  00000000`048d2dd3

    00000000`0035f8f8  00000000`00000100

    00000000`0035f900  00000001`00000100

    00000000`0035f908  00000000`00000000

    00000000`0035f910  00430000`00000031

    00000000`0035f918  8a0000a4`52e48988

    00000000`0035f920  004c0041`00530049

    00000000`0035f928  00450045`00560049

     

    If we scan our previous “~*k” output, we should find the thread id above to match our thread with the problem resource, FlexRes:

       4  Id: dd4.e48 Suspend: 0 Teb: 000007ff`fffae000 Unfrozen

    Child-SP          RetAddr           Call Site

    00000000`00e2f0f8 000007fe`fd611203 ntdll!ZwDelayExecution+0xa

    00000000`00e2f100 000007fe`f78f1ad9 KERNELBASE!SleepEx+0xab

    00000000`00e2f1a0 00000000`00326aa0 FlexRes+0x11ad9

    00000000`00e2f1a8 00000000`00000000 0x326aa0

     

    In this case I was using an in house test app that just went to sleep instead of responding to the LooksAlive() check, which caused IsAlive() to fail.

     

    Also if you check the cluster log 5 minutes prior to the RhsCall::DeadlockMonitor error you will see the resource and thread id as well, showing what was going on prior to the Deadlock:

    00000dd4.00000e48::2011/05/13-16:07:37.270 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:1 Error = 0.

    000004dc.00000fd4::2011/05/13-16:07:37.286 WARN  [RCM] ResourceControl(SET_PRIVATE_PROPERTIES) to New flexdisk returned 5024.

    00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Set new entry point  = ISALIVEERROR:1722,ISALIVEFAILTYPE:1,ISALIVEHANG:0 Error = 0.

    00000dd4.00000e48::2011/05/13-16:07:38.023 INFO  [RES] flexdisk <New flexdisk>: Going to Hang in IsAlive.

     

    In most cases the cluster log can provide enough information to help you narrow down possible cluster resource issues.  If not, the user-mode dump may be able to help verify additional data, especially if you are debugging a third party resource dll that is deadlocking RHS.

  • Ntdebugging Blog

    For testing purposes only, and other desktop watermarks

    • 3 Comments

    Hi all, Matt here again.  One of our team’s main functions is to work with our development teams to create hotfixes when customers run into issues that can only be resolved through a code change.  The developers will often prepare a private test fix that either tests the proposed change, or adds additional instrumentation to help pinpoint the issue. The private test fix is sent to the customer reporting the problem so they can confirm that it does indeed correct (or identify) the flaw.

     

    When testing a private test fix, customers frequently ask, why does my desktop now show a message on the lower right corner of the desktop, and what does it mean?  The message reads “For testing purposes only”, and looks like this:

    For testing purposes only

     

    Often, users are concerned that his message means that they aren’t allowed to use the server in production, or maybe that it is now “unsupported.”  These aren’t the case!  Since this message appears as a result of installing a fix during the course of a Microsoft Support case, the servers are, by definition, being supported. 

     

    The purpose of this message is simply to remind users that code that Microsoft Support has asked them to test has been installed on the system, and this code may not have yet undergone the full suite of quality assurance testing that fixes that are made public usually do.   

     

    For comparison, let’s look at some of the other watermarks you may find in the lower corner of the desktop – as these can often be confused for the above message, and may explain some of the customer concerns around these messages.

     

    First up is the old trusty text you see when a box is booted into ‘Safe Mode’. I’m sure every IT Pro has seen this at one time or another, so I won’t go into detail, but rest assured, the testing purposes text is completely unrelated to booting in safe mode or having a subset of services running.

    Safe Mode

     

    Next up is our ‘Evaluation copy’ watermark.  This message is shown on the desktops of copies of Windows that have a “time bomb” (ones that will cease to function after a certain date.)  This message is typically seen on beta versions of Windows which are designed to stop functioning sometime after the desired beta testing period ends. 

    Evaluation copy

     

    Third, we have our Windows is not genuine message.  This is shown if, for example, a copy of Windows is not activated during the grace period after the installation process, or if a number of hardware changes have been made and Windows needs to be re-activated.  This has nothing to do with the ‘testing purposes’ message.  See http://technet.microsoft.com/en-us/library/dd979803.aspx for more information about this message.

    This copy of Windows is not genuine

     

    Fourth, we have the general Windows build stamp.  This is enabled via the registry using the PaintDesktopVersion DWORD (http://technet.microsoft.com/en-us/library/cc782257(WS.10).aspx).  Some administrators like to enable this option so they always know what version of Windows they are using, sort of like a mini-bginfo.  Unlike the others, this message does not indicate anything else about a server’s state.

    Windows 7

     

    Finally, we have ‘Test Mode’.  This is actually somewhat related to the testing purposes message.  This ‘Test Mode’ text is shown when test signing is enabled on a PC.  This is done by running “bcdedit /set testsigning on” from an UAC-elevated command prompt.  Test signing is used to allow developers to load drivers they are still working on that have not yet been code signed with an official certificate.  This is actually one of the steps we need to do when loading our test fixes.  For more information on Test Signing, see http://msdn.microsoft.com/en-us/library/ff553484%28v=VS.85%29.aspx.

    Test Mode

     

    So now that you know what causes these various watermarks to appear, perhaps you’re wondering how to make the “For testing purposes only” message disappear.  This is a question we are frequently asked.  While you are running a private test fix, there is no way to disable this message.  Your only option is to remove the private test fix from your system.  This is something your engineer will ask you to do before you install the final, public version of a hotfix.  You can easily identify and uninstall private test fixes by going into Control Panel, Programs and Features, View Installed Updates, then look for hotfixes with the words FOR TESTING PURPOSES ONLY in their name, like the one shown in the image below.  You may also notice that the KB number listed for these fixes is often a place holder, and not a real KB article ID.

    Installed Updates

     

    If the ‘For testing purposes only’ message is still displayed even after uninstalling the private test fix, there is one more place to check.  If a system has the Microsoft Test Root Authority certificate installed into its Trusted Root Certification Authorities store, the text will be displayed.  We use this certificate to allow a PC to run test code that has been signed by our development team, but not yet fully tested and signed off with the official Microsoft certificate.  To remove this certificate from your system, go to Start -> Run, and enter certmgr.msc and hit enter.  In the Certificate Manager MMC, browse to Trusted Root Certification Authorities, then into Certificates.  You should see one labeled Microsoft Test Root Authority, as shown below.  This will need to be deleted and the system rebooted to remove the testing purposes message.  Do not do this if you still have a private test fix installed though, as it would prevent that binary from continuing to function and may mean you can no longer boot in to Normal or Safe mode.

    certmgr

     

    If you reboot and find that ‘Test Mode’ has replaced the ‘For testing purposes only’ text, you’ll need to launch a command prompt with administrative privileges, then run “bcdedit /set testsigning off” and reboot.  You can always check if test signing is enabled by running “bcdedit /enum” and looking for this line:

    bcdedit /enum

     

    That’s all for today.  Hopefully this post helped clear up any confusion about our different desktop watermarks.

  • Ntdebugging Blog

    Configuring a Hyper-V VM For Kernel Debugging

    • 2 Comments

    Yesterday's blog prompted some questions about how to set up a debugger for a Windows OS running in a Hyper-V VM.  I was surprised that I wasn't able to find good, publicly available, Microsoft issued documentation for this configuration.

     

    The first step is to configure the Windows OS in the VM to enable a kernel debugger on COM1.  One would use these same steps if you were preparing the OS to be debugged using a null modem cable.  Hyper-V will allow us to redirect the COM port so that we don't need such a cable.

    1. Start an administrative command prompt.
    2. Turn on debugging with this command:
      bcdedit /debug on
    3. Configure the debugger to use COM1 with this command:
      bcdedit /dbgsettings SERIAL DEBUGPORT:1 BAUDRATE:115200
      Note that these are the default settings and already exist in most bcd stores.  However setting them again won't damage anything, and guards against a situation where the dbgsettings have been previously modified.
    4. Reboot so that the boot loader can read the new settings and configure the OS for debugging.

    CommandPrompt

     

    Next, configure Hyper-V to redirect the COM1 port to a named pipe.  We will use this pipe in place of a traditional null modem cable.

    1. Open Hyper-V Manager and browse to the settings page of the VM you configured to debug.
    2. Under the Hardware list choose COM 1.
    3. Change the Attachment to 'Named pipe:' and provide a pipe name.
      1. Note that the Hyper-V Manager provides the complete path to your named pipe.  Make a note of this path as you will need it in the next step.

    Settings2

     

    After the OS and the VM are configured for debugging, we need to connect a debugger.

    1. On the Hyper-V parent partition download and install the Debugging Tools for Windows from http://msdn.microsoft.com/en-us/windows/hardware/gg463009.
    2. After installing the debugging tools you will have a ‘Debugging Tools for Windows’ entry in your start menu.
      1. From this folder right click ‘WinDbg’ and choose ‘Run as administrator’.  Windbg needs administrative rights to connect to the pipe.
    3. In windbg open the File menu and choose ‘Kernel Debug’.
    4. Enter a Baud Rate of 115200, to match the settings made in the VM.
    5. Enter the Port that you configured in the VM settings page.
      1. To connect to the pipe remotely, substitute the '.' in the path with the Hyper-V server name.
    6. Ensure that the Pipe and Reconnect boxes are checked.
    7. Set Resets to 0.
    8. Click OK to start debugging.
    9. Windbg should display the string ' Waiting to reconnect...'

    image004

     

    To test the debugger connection in windbg, from the ‘Debug’ menu choose ‘Break’.  This should cause the server to break into the debugger and display a kd> prompt.  Please note that breaking into the debugger will cause the OS running in the VM to halt until you tell the debugger to go, the OS will appear to be hung during this time.  The command 'g' followed by Enter will tell the debugger to ‘go’ causing the VM to resume operation.

     

    Windbg

  • Ntdebugging Blog

    How to debug WOW64 applications?

    • 3 Comments

     

    Hello, my name is Venkatesh Ganga, and I’m an Escalation Engineer on the Microsoft Platform Global Escalation Services team.  Recently, I worked on an issue where registry reflection was not happening while installing 32bit Office under the system account. This required looking into both the 32bit Office code and the Wow64 code where the registry reflection is implemented.  When attaching to the Wow64 process using the 32bit debugger it’s like debugging a 32bit process on the 32bit machine; there are no 64bit binaries in the process. However, we needed to debug Wow64 to debug the registry reflection code. To do this we attached to the Wow64 process using the 64bit debugger which allows you to see the Wow64 binaries.

    Ø  lm

                Base TimeStamp                     Module

              400000 42435b2a Mar 24 18:28:26 2005 C:\Program Files (x86)\Internet Explorer\IEXPLORE.EXE

            77ec0000 45d6cc72 Feb 17 03:35:46 2007 C:\WINDOWS\system32\ntdll.dll

            6b000000 45d6943d Feb 16 23:35:57 2007 C:\WINDOWS\system32\wow64.dll

            6b280000 45d695f3 Feb 16 23:43:15 2007 C:\WINDOWS\system32\wow64win.dll

            78b80000 42438b7a Mar 24 21:54:34 2005 C:\WINDOWS\system32\wow64cpu.dll

     

    There are 2 options for debugging Wow64 applications.

    1.       Using the 64bit debugger and Wow64 debugger extension (Wow64exts.dll)

    2.       Using the 32bit debugger

    Using 64bit debugger and Wow64 debugger extension (Wow64exts.dll)

    I ran the 32bit version of Internet Explorer on a 64bit machine and attached to it using the 64bit debugger. Here is the thread 0 call stack when viewed from 64bit debugger.

    0:000> kL

    Child-SP          RetAddr           Call Site

    00000000`0013edf8 00000000`78b8428e wow64cpu!CpupSyscallStub+0x9

    00000000`0013ee00 00000000`6b006a5a wow64cpu!Thunk0Arg+0x5

    00000000`0013ee70 00000000`6b005e0d wow64!RunCpuSimulation+0xa

    00000000`0013eea0 00000000`77ed8030 wow64!Wow64LdrpInitialize+0x2ed

    00000000`0013f6d0 00000000`77ed582f ntdll!LdrpInitializeProcess+0x1538

    00000000`0013f9d0 00000000`77ef30a5 ntdll!_LdrpInitialize+0x18f

    00000000`0013fab0 00000000`77d59620 ntdll!KiUserApcDispatch+0x15

    00000000`0013ffa8 00000000`00000000 0x77d59620

    00000000`0013ffb0 00000000`00000000 0x0

    00000000`0013ffb8 00000000`00000000 0x0

    00000000`0013ffc0 00000000`00000000 0x0

    00000000`0013ffc8 00000000`00000000 0x0

    00000000`0013ffd0 00000000`00000000 0x0

    00000000`0013ffd8 00000000`00000000 0x0

    00000000`0013ffe0 00000000`00000000 0x0

    00000000`0013ffe8 00000000`00000000 0x0

    00000000`0013fff0 00000000`00000000 0x0

    00000000`0013fff8 00000000`00000000 0x0

    00000000`00140000 00000020`78746341 0x0

    00000000`00140008 00005370`00000001 0x20`78746341

     

    The above stack only shows the 64 bit calls; we cannot see what the 32 bit calls are doing. To get the 32bit stack you must use one of the below methods.

    Ø  Option 1 : Run “!wow64exts.k”

    0:000> !wow64exts.k

    Walking 64bit Stack…

    Child-SP          RetAddr           Call Site

    00000000`0013edf8 00000000`78b8428e wow64cpu!CpupSyscallStub+0x9

    00000000`0013ee00 00000000`6b006a5a wow64cpu!Thunk0Arg+0x5

    00000000`0013ee70 00000000`6b005e0d wow64!RunCpuSimulation+0xa

    00000000`0013eea0 00000000`77ed8030 wow64!Wow64LdrpInitialize+0x2ed

    00000000`0013f6d0 00000000`77ed582f ntdll!LdrpInitializeProcess+0x1538

    00000000`0013f9d0 00000000`77ef30a5 ntdll!_LdrpInitialize+0x18f

    00000000`0013fab0 00000000`77d59620 ntdll!KiUserApcDispatch+0x15

    00000000`0013ffa8 00000000`00000000 0x77d59620

    00000000`0013ffb0 00000000`00000000 0x0

    00000000`0013ffb8 00000000`00000000 0x0

    00000000`0013ffc0 00000000`00000000 0x0

    00000000`0013ffc8 00000000`00000000 0x0

    00000000`0013ffd0 00000000`00000000 0x0

    00000000`0013ffd8 00000000`00000000 0x0

    00000000`0013ffe0 00000000`00000000 0x0

    00000000`0013ffe8 00000000`00000000 0x0

    00000000`0013fff0 00000000`00000000 0x0

    00000000`0013fff8 00000000`00000000 0x0

    00000000`00140000 00000020`78746341 0x0

    00000000`00140008 00005370`00000001 0x20`78746341

    Walking 32bit Stack...

    ChildEBP          RetAddr          

    002ded98 75ec1c83 USER32!NtUserWaitMessage+0x15

    002dee24 75ec61ef BROWSEUI!BrowserProtectedThreadProc+0x44

    002dfea8 779ba3a6 BROWSEUI!SHOpenFolderWindow+0x22c

    002dfec8 0040243d SHDOCVW!IEWinMain+0x129

    002dff1c 00402748 IEXPLORE!WinMain+0x316

    002dffc0 7d4e7d2a IEXPLORE!WinMainCRTStartup+0x186

    002dfff0 00000000 KERNEL32!BaseProcessStart+0x28

     

    Ø  Option 2 : Switch to x86 mode (using “!wow64exts.sw”) and do KB.

     

    0:000> !wow64exts.sw

    Switched to 32bit mode

    0:000:x86> kb

    ChildEBP          RetAddr           Args to Child                                        

    002ded98 75ec1c83 002f1be8 002dee50 002f1be8 USER32!NtUserWaitMessage+0x15

    002dee24 75ec61ef 002f1be8 002f1be8 00000000 BROWSEUI!BrowserProtectedThreadProc+0x44

    002dfea8 779ba3a6 002f1be8 00000001 00000000 BROWSEUI!SHOpenFolderWindow+0x22c

    002dfec8 0040243d 002e2508 00000001 ffffffff SHDOCVW!IEWinMain+0x129

    002dff1c 00402748 00400000 00000000 002e2508 IEXPLORE!WinMain+0x316

    002dffc0 7d4e7d2a 00000000 00000000 7efdf000 IEXPLORE!WinMainCRTStartup+0x186

    002dfff0 00000000 004025c2 00000000 000000c8 KERNEL32!BaseProcessStart+0x28

     

    The easiest way to see all of the 32bit call stacks is by switching to  32bit mode (!wow64exts.sw) and doing ~*k.  In addition, you can set breakpoints in 32bit or 64 bit binaries using the 64bit debugger.  Also note  “!peb” will show both the 64bit and 32bit PEB.

     

    Using the 32bit debugger

    As mentioned earlier there is nothing wrong with using the 32 bit debugger.  If you just need to debug the application’s 32bit code, using it is probably the simplest approach.  However, if you need to view Wow64 code or binaries, you must use the 64bit debugger.  Note that these techniques apply to debugging Wow64 dumps and live processes.

    You can find more information about WoW64 applications at http://msdn.microsoft.com/en-us/library/aa384249(VS.85).aspx 

     

Page 5 of 24 (235 items) «34567»