• Ntdebugging Blog

The Ironic Case of a Performance Monitoring Tool Causing a Performance Issue!


Hello folks, this is Pushkar. I recently worked an interesting case dealing with high CPU usage. What made the case particularly interesting was that it was not a business-critical application consuming the CPU: I was tasked with identifying why a third-party performance monitoring tool was itself causing a performance issue on the server by consuming 100% of the CPU cycles. Typically, issues with third-party products are addressed by the application vendor, but the irony of the issue made this case immensely interesting to me and I decided to investigate it further.

In this case the monitoring tool, Monitor.exe (disclaimer: the identity of the actual product has been removed), was consistently consuming 100% of the CPU cycles, and, as expected, when the tool was stopped the CPU usage returned to normal. As I didn't have information about how the tool worked, I decided to gather data that would show me which underlying APIs the tool was calling, which would give me an understanding of its behavior.

To begin the troubleshooting I gathered Kernrate.exe logs from the server along with a memory dump. On Windows Vista and later you can use the Windows Performance Toolkit (a.k.a. Xperf), a better alternative to Kernrate.

    Note: To learn more about kernrate.exe check here
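As an aside, on Windows Vista and later a comparable CPU sampling profile can be captured with Xperf. The commands below are a minimal sketch; the kernel flags shown are one common combination and the trace file name is just an example:

xperf -on Latency -stackwalk Profile
<reproduce the high CPU condition>
xperf -d highcpu.etl

The resulting trace can then be opened in the viewer that ships with the Windows Performance Toolkit to see which modules and functions are accumulating samples.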

A quick review of the Kernrate log showed a lot of CPU time in the kernel, and the function hit most often was NtQueryDirectoryObject():

     

     

    TOTAL K 0:07:35.343 (49.2%)  U 0:02:40.734 (17.4%)  I 0:05:09.171 (33.4%)  DPC 0:00:52.984 (5.7%)  Interrupt 0:00:20.312 (2.2%)

           Total Interrupts= 4208713, Total Interrupt Rate= 9098/sec.

     

    -----------------------------

     

    Results for Kernel Mode:

    -----------------------------

     

    OutputResults: KernelModuleCount = 106

    Percentage in the following table is based on the Total Hits for the Kernel

     

    ProfileTime   116300 hits, 65536 events per hit --------

     Module                                Hits   msec  %Total  Events/Sec

    PROCESSR                              45160     462654    38 %     6397017

    NTOSKRNL                              43573     462654    37 %     6172215

    SYMMPI                                18258     462654    15 %     2586287

     

     

    ----- Zoomed module NTOSKRNL.EXE (Bucket size = 16 bytes, Rounding Down) --------

    Percentage in the following table is based on the Total Hits for this Zoom Module

     

    ProfileTime   43573 hits, 65536 events per hit --------

     Module                                Hits   msec  %Total  Events/Sec

    NtQueryDirectoryObject                 8433     462654    19 %     1194553

    memmove                                6584     462654    14 %      932638

    KeInsertQueueDpc                       5593     462654    12 %      792261

    KeFlushMultipleTb                      3118     462654     7 %      441671

     

Further investigation of the dump indicated that the function listed above was invoked to parse the \BaseNamedObjects namespace in the kernel. BaseNamedObjects is a folder in the kernel object namespace where various named kernel objects (events, semaphores, mutexes, waitable timers, file-mapping objects, and job objects) are created. The purpose of such a global namespace is to enable processes in multiple client sessions to communicate with a service application. Another use of the namespace is for an application that uses named objects to detect that an instance of itself is already running in the system, across all sessions.
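To illustrate that second use, here is a minimal sketch of single-instance detection via a named mutex. The mutex name is made up for the example; the "Global\" prefix explicitly places it in the global namespace described above.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // The named mutex is created in the global \BaseNamedObjects directory,
    // so it is visible to every session on the machine.
    HANDLE mutex = CreateMutexW(NULL, FALSE, L"Global\\MyApp_SingleInstance");

    if (mutex != NULL && GetLastError() == ERROR_ALREADY_EXISTS)
    {
        // Some other process, in any session, already created the mutex.
        wprintf(L"Another instance is already running.\n");
        CloseHandle(mutex);
        return 1;
    }

    // ... application work goes here; the named object disappears when
    // the last handle to it is closed.
    CloseHandle(mutex);
    return 0;
}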

I started looking under the \BaseNamedObjects namespace in the kernel and found that there were numerous objects, over 2,900, under that namespace. You can use the Sysinternals tool WinObj.exe to browse the kernel object namespace and the list of objects within it. In my case I had no prior idea of the cause of the issue, so I had to look at the dump file. A snapshot of the WinObj.exe output is pasted below, along with the !object output from the kernel dump.

[WinObj screenshot]

    kd> !object \BaseNamedObjects

    Object: fffffa8000d72ec0  Type: (fffffadfe7acb1e0) Directory

        ObjectHeader: fffffa8000d72e90 (old version)

    HandleCount: 36  PointerCount: 10275        <- This indicates the number of objects under this namespace

        Directory Object: fffffa8000004060  Name: BaseNamedObjects

     

        Hash Address          Type          Name

        ---- -------          ----          ----

         00  fffffadfe5d79660 Job           TestJobObj_9920

             fffffadfe5d7a8e0 Job           TestJobObj_9913

             fffffadfe5d7b8e0 Job           TestJobObj_9907

             fffffadfe5d84060 Job           TestJobObj_9850

             fffffadfe5d863e0 Job           TestJobObj_9843

             fffffadfe5d873e0 Job           TestJobObj_9837

             fffffadfe5d8e060 Job           TestJobObj_9790

             fffffadfe5d903e0 Job           TestJobObj_9783

             fffffadfe5d913e0 Job           TestJobObj_9777

             fffffadfe5dad660 Job           TestJobObj_9611

             fffffadfe5dae660 Job           TestJobObj_9605

             fffffadfe5db7660 Job           TestJobObj_9551

             fffffadfe5db8660 Job           TestJobObj_9545

             fffffadfe5db98e0 Job           TestJobObj_9538

    <Snipped>

In my case the third-party performance monitoring tool was experiencing high CPU usage when it queried the “Job Objects” and “Job Object Details” performance counters.

Since this was happening with a third-party tool, I tried to recreate the issue in-house and wrote a small utility to create a lot of named job objects under the \BaseNamedObjects namespace using the CreateJobObject() API (a sketch of the idea follows). I then used the plain old Perfmon tool built into Windows. The moment I added the “Job Objects” and “Job Object Details” counters, the CPU utilization of MMC.EXE hit 90% (Perfmon runs as an MMC snap-in). Now that I could reproduce the issue in-house, I investigated what happens when we query those counters.
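A minimal sketch of such a utility is below. The object names and the count of 10,000 are illustrative; any sufficiently large number of named objects produces the effect.

#include <windows.h>
#include <stdio.h>

#define NUM_JOBS 10000

int main(void)
{
    static HANDLE jobs[NUM_JOBS];   // static keeps the array off the stack
    wchar_t name[64];
    int i;

    for (i = 0; i < NUM_JOBS; i++)
    {
        swprintf_s(name, 64, L"TestJobObj_%d", i);
        // Each named job object lands in the global \BaseNamedObjects
        // directory, where the performance library will later find it.
        jobs[i] = CreateJobObjectW(NULL, name);
        if (jobs[i] == NULL)
        {
            wprintf(L"CreateJobObject failed for %s: error %lu\n",
                    name, GetLastError());
            break;
        }
    }

    wprintf(L"Created %d job objects. Press Enter to close them and exit.\n", i);
    getchar();   // keep the objects alive while Perfmon queries them
    return 0;    // all handles are closed at process exit
}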

    Here is what I found:

When we query the Job Object or Job Object Details performance counter, it invokes the following functions (respectively) in Perfproc.dll, which in turn invoke NtQueryDirectoryObject() to locate the objects under the \BaseNamedObjects namespace and then call QueryInformationJobObject() to gather the performance data related to those objects. One of the thread stacks is given below to show the flow:

     

    CollectJobDetailData()

    CollectJobObjectData()

     

    kd> kb

    RetAddr           : Args to Child                                                           : Call Site

    fffff800`01040abd : 00000000`000ad490 fffff800`01385b47 00000000`00002990 00000000`00000000 : nt!NtQueryInformationJobObject

    00000000`77ef171a : 00000000`77d423cb 00000000`000ad498 00000000`00000000 00000000`c0000135 : nt!KiSystemServiceCopyEnd+0x3

    00000000`77d423cb : 00000000`000ad498 00000000`00000000 00000000`c0000135 00000000`00000000 : ntdll!ZwQueryInformationJobObject+0xa

    000007ff`5a794a16 : 00000000`000ad990 00000000`00000008 00000000`00000400 00000000`000addf8 : kernel32!QueryInformationJobObject+0x77

    000007ff`5a7932b9 : 00000000`0000dd20 000007ff`5a79aa84 00000000`00000001 00000000`00000040 : perfproc!CollectJobObjectData+0x356

    000007ff`7feeb497 : 00000000`00000000 00000000`000adee0 00000000`00000000 00000000`00000001 : perfproc!CollectSysProcessObjectData+0x1f9

    000007ff`7fef09d1 : 00000000`000ae1d0 00000000`00000000 00000000`0000007a 00000000`00000000 : ADVAPI32!QueryExtensibleData+0x951

    000007ff`7fef0655 : 0039002d`00350038 00000000`77c43cbd 00000000`00000000 00000000`00000000 : ADVAPI32!PerfRegQueryValue+0x66d

    000007ff`7ff0b787 : 00000000`00000000 00000000`00000000 00000000`000ae8a0 00000000`000ae888 : ADVAPI32!LocalBaseRegQueryValue+0x356

    000007ff`5b17ba27 : 00009ae8`b8ee62a9 00000000`000c46a0 00000000`00200000 00000000`00000000 : ADVAPI32!RegQueryValueExW+0xe9
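The per-object half of that flow is easy to picture with a small sketch: open one named job object and query its accounting data, which is the kind of information these counters surface. The job name matches my repro utility above; the information class shown is just one of several that the performance library gathers.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    JOBOBJECT_BASIC_ACCOUNTING_INFORMATION info;
    HANDLE job;

    // The performance library first finds the name via
    // NtQueryDirectoryObject(); here we simply open it directly.
    job = OpenJobObjectW(JOB_OBJECT_QUERY, FALSE, L"TestJobObj_1");
    if (job == NULL)
    {
        wprintf(L"OpenJobObject failed: error %lu\n", GetLastError());
        return 1;
    }

    if (QueryInformationJobObject(job, JobObjectBasicAccountingInformation,
                                  &info, sizeof(info), NULL))
    {
        wprintf(L"Active processes in job: %lu\n", info.ActiveProcesses);
    }

    CloseHandle(job);
    return 0;
}

Multiply this by thousands of objects, plus the directory enumeration needed to find each one, and the cost per collection interval adds up quickly.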

     

You can determine the type of performance data queried for each job object from the Win32_PerfFormattedData_PerfProc_JobObject class. Although that documentation describes a WMI interface for querying the performance data, it represents the same data set the native API queries.
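As a quick way to look at that data set without writing any code, the WMI class can be queried from a command prompt. The property list here is trimmed to just the instance name; this assumes at least one named job object exists:

wmic path Win32_PerfFormattedData_PerfProc_JobObject get Name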

Once I understood what was happening under the hood, it was easy to deduce that every iteration of the query would be a CPU-intensive operation: for each object we perform a nested query for the performance data, as described in the document above. If the number of objects is large and we are running a broad query of the nature “All Counters” and “All Instances”, then every single iteration consumes a lot of CPU. To validate this, I got help from one of my colleagues, Steve Heller, who modified the sample code received from the vendor of the monitoring tool (which demonstrated what the tool was doing). We noticed it was querying this information every 2.5 seconds. The sample tool demonstrated that with about 10,000 job objects, a single iteration of the query took roughly 12.5 seconds to complete. No wonder the CPU usage remained high: before a single iteration of the query for all the job objects could finish, four additional queries were already queued, and the CPU usage continued to remain at 90% or more.

The conclusion that can be drawn from this test is that querying the \BaseNamedObjects namespace with a large number of objects under it will invariably result in a fair amount of CPU usage. This gets worse if the query is performed for complex objects, e.g. job objects, for which we run nested queries for the individual performance data of each job object, and if the data is queried too frequently.

    The two aspects responsible for the high CPU usage are:

1. The frequency at which the performance data was queried.

2. The number of objects under the \BaseNamedObjects namespace.

     

Though there is no defined limit on the number of objects that can be created under the various kernel object namespaces, the absence of such a limit doesn’t mean that we should simply create a lot of objects. Applications should be designed carefully to use such kernel resources judiciously.

At the same time, the fact that we can query a lot of performance data through publicly available interfaces doesn’t mean we should query it at a very high frequency. Certain operations can be CPU-intensive, as we have seen here, so we should identify what we really need to query and at what frequency.

Quoting from the MSDN article Job Objects:

     “A job object allows groups of processes to be managed as a unit. Job objects are namable, securable, sharable objects that control attributes of the processes associated with them. Operations performed on the job object affect all processes associated with the job object.”  

The above passage clearly indicates that the job object framework is intended for processes participating in a job to monitor and control each other's resource usage.

There could be different reasons for which a system administrator needs to gather information about the resource usage of a particular process; one of them could be meeting a Service Level Agreement. In that case we can monitor the individual processes running on the server with the counters available under “Performance Object -> Process”, for example as shown below.
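For instance, the built-in typeperf.exe tool can sample the per-process processor time counters without touching the job object counters at all. The five-second interval is arbitrary:

typeperf "\Process(*)\% Processor Time" -si 5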

Equipped with the knowledge that querying performance data for job objects may result in excessive CPU usage on the server, an administrator should evaluate, or consult the application owner if needed, whether they really need to gather this performance data.

The following documents are available to assist application developers with the creation of application-specific namespaces. Creating a private namespace and creating objects under it ensures that application-specific objects are not exposed to the performance library, because they won’t appear under the \BaseNamedObjects namespace. A small sketch of the idea follows the links.

    Object Namespaces

    http://msdn.microsoft.com/en-us/library/ms684295(VS.85).aspx  

     

    CreatePrivateNamespace Function

    http://msdn.microsoft.com/en-us/library/ms682419(VS.85).aspx

     

    Job Objects

    http://msdn.microsoft.com/en-us/library/ms684161(VS.85).aspx

     

CreateJobObject Function

    http://msdn.microsoft.com/en-us/library/ms682409(VS.85).aspx
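Putting those two documents together, a hedged sketch of the private-namespace approach looks like this. The boundary and alias names ("MyBoundary", "MyNamespace") and the permissive security descriptor are purely illustrative; a real application would choose tighter security.

#include <windows.h>
#include <sddl.h>
#include <stdio.h>

int main(void)
{
    HANDLE boundary = CreateBoundaryDescriptorW(L"MyBoundary", 0);
    BYTE sidBuf[SECURITY_MAX_SID_SIZE];
    PSID worldSid = (PSID)sidBuf;
    DWORD sidSize = sizeof(sidBuf);
    SECURITY_ATTRIBUTES sa = { sizeof(sa), NULL, FALSE };
    HANDLE ns, job;

    // The boundary descriptor plus a SID controls who may open the namespace.
    if (boundary == NULL ||
        !CreateWellKnownSid(WinWorldSid, NULL, worldSid, &sidSize) ||
        !AddSIDToBoundaryDescriptor(&boundary, worldSid))
    {
        wprintf(L"Boundary setup failed: error %lu\n", GetLastError());
        return 1;
    }

    // Everyone/generic-all, for illustration only.
    if (!ConvertStringSecurityDescriptorToSecurityDescriptorW(
            L"D:(A;;GA;;;WD)", SDDL_REVISION_1,
            &sa.lpSecurityDescriptor, NULL))
    {
        wprintf(L"Security descriptor setup failed: error %lu\n", GetLastError());
        return 1;
    }

    ns = CreatePrivateNamespaceW(&sa, boundary, L"MyNamespace");
    if (ns == NULL)
    {
        wprintf(L"CreatePrivateNamespace failed: error %lu\n", GetLastError());
        return 1;
    }

    // Objects created with the "MyNamespace\" alias live in the private
    // namespace, not under \BaseNamedObjects, so the performance library
    // never enumerates them.
    job = CreateJobObjectW(NULL, L"MyNamespace\\PrivateJobObj_1");
    if (job != NULL)
        wprintf(L"Job created in the private namespace.\n");
    else
        wprintf(L"CreateJobObject failed: error %lu\n", GetLastError());
    return 0;
}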

     

This problem would predominantly be noticed on older systems. To mitigate it to a certain extent, Windows Vista (and all later operating systems) has a per-session BaseNamedObjects namespace; i.e. application-specific objects are created under a namespace of the form \Sessions\<SessionID>\BaseNamedObjects. Hence these objects are not exposed to the performance libraries and are not available to query via performance monitoring tools. Unfortunately the changes that were made to the kernel to make this work are too intricate to be back-ported to Windows 2003 SP2.

This brings me to the end of this post. I hope this information gives you better insight into how an oversight in an application's design can break other components at a later stage.

     

     

  • Ntdebugging Blog

    Case Study - Software Restriction Policies and Large EXE Files


    Recently I received a debugging request for a customer having problems running large executables. On their systems, they could run most EXEs without any problems, but they had two programs that were over 1.8 GB which, when run, would display the following error:

[Screenshot: error dialog displayed when launching the large executable]

    If they tried to run them in a command prompt, they received the message “Access is denied.” Both attempts were made with an administrator account and in neither case were the processes created. Through testing, they found that the programs worked if they were scheduled to run as System and also worked when run in safe mode as an administrator.

    When the case was brought to my attention, it was noted that when the failing executables were run, the following appeared in process monitor logs:

[Screenshot: Process Monitor log showing an INVALID PARAMETER result]

The engineer did not see this when one of the problematic EXEs was run (successfully) on his test machine. The customer provided a VM image of their system, which we set up in Hyper-V with a named pipe kernel debugger. I then started kernel debugging to find the cause of the INVALID PARAMETER error, hoping that resolving it would fix the issue.

    To start, I looked at the call stack within process monitor for the invalid parameter:

[Screenshot: Process Monitor call stack for the INVALID PARAMETER result]

The problem is that this isn’t exactly where we return invalid parameter. Looking at the source code for Fltmgr, it doesn’t return invalid parameter – this is just where the error was caught in procmon. This call stack did provide some good starting places for debugging, however. First, I looked up the hex value for STATUS_INVALID_PARAMETER in ntstatus.h: 0xC000000D. Knowing this, I decided to set a breakpoint on nt!IofCallDriver and run the program. Once the debugger broke in, I planned to use wt -oR, which traces through the calls, displaying the return value next to each call; from there, I would just need to find 0xC000000D in the return column. Unfortunately, what I had forgotten was that wt does not display return codes in kernel debugging, only when debugging user mode.

With wt not an option, I decided to use a combination of debugger commands to approximate the output of wt. I knew the return value I was looking for, and I was also confident that I would find that code in the EAX register after the problem occurred. As such, I needed to write a loop that would walk through the instructions until it found 0xC000000D in EAX. The debugger provides two main options for walking instructions: p and t. p (Step) executes a single instruction and displays the register values; if the instruction is a call, it does not enter that function, but just displays the results after the subroutine has executed. t (Trace) also executes a single instruction, but it enters the function and displays each instruction.

In this case I wanted to trace so I could see which function was returning the invalid parameter status. Tracing through that many instructions/functions would take a long time, but there are some variations on t (and p) that can help. tc (or pc) will execute instructions until a call statement is reached, where it will break and show the register values. tt (or pt) will execute instructions until a return instruction is reached. tct (or pct) will run until either a call or a return is reached. In this case, I opted for tct.

Knowing that I would use tct, I had to find a way to execute tct statements until EAX held the value I was looking for. This can be accomplished with the z (While) debugger command. The syntax is pretty easy – it’s just z(expression) – and it works just like a do-while loop. Putting it all together, I used this command in the debugger:

    tct; z(eax!=0xc000000d)

I then waited for the debugger to break in so I could see where this status was being thrown. Regrettably, the code ended up going into some recursion, which made my while loop take way too long. To resolve this, I set a new breakpoint just before we entered the recursion, reran the program, used p to step past the call, then ran the tct loop.

This time I was quickly brought to the code I was looking for. As soon as it broke in, I ran k to view the call stack:

    kd> k

    ChildEBP RetAddr
    b9541a3c f7b7fab9 Ntfs!NtfsCommonDeviceControl+0x40
    b9541aa0 f7b8b02f Ntfs!NtfsFsdDispatchSwitch+0xe4
    b9541bbc 8081df85 Ntfs!NtfsFsdDispatchWait+0x1c
    b9541bd0 f7876d28 nt!IofCallDriver+0x45
    b9541bfc 8081df85 fltmgr!FltpDispatch+0x152
    b9541c10 f7876d28 nt!IofCallDriver+0x45
    b9541c3c 8081df85 fltmgr!FltpDispatch+0x152
    b9541c50 808f5437 nt!IofCallDriver+0x45
    b9541c64 808f61bf nt!IopSynchronousServiceTail+0x10b
    b9541d00 808eed08 nt!IopXxxControlFile+0x5e5
    b9541d34 808897bc nt!NtDeviceIoControlFile+0x2a
    b9541d34 7c82860c nt!KiFastCallEntry+0xfc
    0012e960 7c826fe9 ntdll!KiFastSystemCallRet
    0012e964 77e416f9 ntdll!NtDeviceIoControlFile+0xc
    0012e9c8 77e6738d kernel32!DeviceIoControl+0x137
    0012ec44 77e67032 kernel32!GetVolumeNameForRoot+0x16d
    0012ec68 77e67782 kernel32!BasepGetVolumeNameForVolumeMountPoint+0x73
    0012ecd0 7d20b01d kernel32!GetVolumePathNameW+0x1c7
    0012ef18 7d20ae2c ADVAPI32!CodeAuthzFullyQualifyFilename+0xbc
    0012eff8 7d20b33f ADVAPI32!__CodeAuthzpIdentifyOneCodeAuthzLevel+0x19f
    0012f07c 77e6df9e ADVAPI32!SaferIdentifyLevel+0x163
    0012f278 77e6ce03 kernel32!BasepCheckWinSaferRestrictions+0x60c
    0012fa90 77e424b0 kernel32!CreateProcessInternalW+0xc0e
    0012fac8 4ad0256f kernel32!CreateProcessW+0x2c
    0012fc24 4ad01a2b cmd!ExecPgm+0x221
    0012fc58 4ad019b3 cmd!ECWork+0x84
    0012fc70 4ad03c58 cmd!ExtCom+0x40
    0012fe9c 4ad01447 cmd!FindFixAndRun+0xa9
    0012fee0 4ad0c30b cmd!Dispatch+0x137
    0012ff44 4ad07786 cmd!main+0x216
    0012ffc0 77e6f23b cmd!mainCRTStartup+0x12f
    0012fff0 00000000 kernel32!BaseProcessStart+0x23

If we look at the assembly around Ntfs!NtfsCommonDeviceControl+0x40, we see that if the return value from NtfsDecodeFileObject is not equal to 4, it moves 0xC000000D into esi and later moves esi into eax:

f7b7faf9 e8e904fdff call Ntfs!NtfsDecodeFileObject (f7b4ffe7)

Ntfs!NtfsCommonDeviceControl+0x40:
f7b7fafe 83f804 cmp eax,4
f7b7fb01 0f848873ffff je Ntfs!NtfsCommonDeviceControl+0x54 (f7b76e8f)
    f7b7fb07 be0d0000c0 mov esi,0C000000Dh
    f7b7fb0c 56 push esi
    f7b7fb0d 53 push ebx
    f7b7fb0e 57 push edi
    f7b7fb0f e83506fdff call Ntfs!NtfsCompleteRequest (f7b50149)
    f7b7fb14 8bc6 mov eax,esi

I looked at the source code for these functions, and it didn’t make sense that a failure here would cause the problems we were seeing, especially problems specific to large executables. Out of curiosity I ran notepad on the VM again with procmon and found that it too displayed INVALID PARAMETER, yet the program started and ran correctly:

[Screenshot: Process Monitor log for notepad.exe showing the same INVALID PARAMETER result]

Since this wasn’t the problem, I stopped reviewing the code and decided on a new approach. We knew that when running the EXE in a command prompt we received an “Access is denied” message. At that point it made sense to switch to user-mode debugging and take a look at the cmd.exe process that was trying to launch install.exe.

Doing user-mode debugging in a VM can be a bit of a challenge, especially if you are trying to minimize changes to the VM (and, in my case, avoid putting any symbols on the customer’s VM image). Since I already had a kernel debugger attached, one option would be to run ntsd.exe (a debugger provided in the Debugging Tools for Windows) on the VM with the -p switch specifying the PID of the cmd.exe process I wanted to debug, and the -d switch, which forwards the I/O of ntsd to the kernel debugger. The problem with this approach is that the kernel debugger just becomes a front end for issuing commands and seeing the output from ntsd; all symbol resolution is still done on the target system running ntsd.

I didn’t want to give the customer VM Internet or corporate network access, so I instead opted to run dbgsrv.exe on the VM. Running “dbgsrv -t tcp:port=9999” tells the debug server to listen on TCP port 9999 for debugger connections. Then, on the Hyper-V server computer, I could just run windbg -premote tcp:server=(IP of VM),port=9999 -p (PID of cmd on VM) to debug it.
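Concretely, the two halves look like this; the IP address and PID are placeholders:

On the VM:    dbgsrv -t tcp:port=9999
On the host:  windbg -premote tcp:server=192.168.0.10,port=9999 -p 1234

With this arrangement the debugger engine runs on the host, so symbol resolution happens there and nothing extra needs to be installed on the customer’s image.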

    I suspected that we may be calling CreateProcess but it was failing, so I set a breakpoint on kernel32!CreateProcessW. Sure enough, it hit when I tried to run install.exe in the command prompt:

    0:000> k
    ChildEBP RetAddr
    0012fac8 4ad0256f kernel32!CreateProcessW
    0012fc24 4ad01a2b cmd!ExecPgm+0x221
    0012fc58 4ad019b3 cmd!ECWork+0x84
    0012fc70 4ad03c58 cmd!ExtCom+0x40
    0012fe9c 4ad01447 cmd!FindFixAndRun+0xa9
    0012fee0 4ad0c30b cmd!Dispatch+0x137
    0012ff44 4ad07786 cmd!main+0x216
    0012ffc0 77e6f23b cmd!mainCRTStartup+0x12f
    0012fff0 00000000 kernel32!BaseProcessStart+0x23

This time I could use wt -oR since this was a user-mode debug. Looking in ntstatus.h again, the code for STATUS_ACCESS_DENIED is 0xC0000022. Running wt can take a very long time, so I used the -l switch, which limits the number of levels deep it will display. This is something like using tct as I did above until you are a few calls deep, then using pct. Using wt -l 3 -oR gave me the following:


    575      291       [ 1] kernel32!CreateProcessInternalW
    35        0           [ 2] kernel32!BasepCheckWinSaferRestrictions
    25        0           [ 3] ntdll!RtlEnterCriticalSection eax = 0
    48        25         [ 2] kernel32!BasepCheckWinSaferRestrictions
    1          0           [ 3] ntdll!NtOpenThreadToken
    3          0           [ 3] ntdll!ZwOpenThreadToken eax = ffffffff`c000007c
    57        29         [ 2] kernel32!BasepCheckWinSaferRestrictions
    1          0           [ 3] ntdll!ZwOpenProcessToken
    3          0           [ 3] ntdll!NtOpenProcessToken eax = 0
    116      33         [ 2] kernel32!BasepCheckWinSaferRestrictions
    113      0           [ 3] ADVAPI32!SaferIdentifyLevel eax = 0
    130      146       [ 2] kernel32!BasepCheckWinSaferRestrictions
    4           0          [ 3] ntdll!ZwClose eax = 0
    132      150       [ 2] kernel32!BasepCheckWinSaferRestrictions
    22        0           [ 3] ntdll!RtlLeaveCriticalSection eax = 0
    138      172       [ 2] kernel32!BasepCheckWinSaferRestrictions
    5           0          [ 3] kernel32!__security_check_cookie eax = ffffffff`c0000022

Now we are getting close! I set a new breakpoint on kernel32!BasepCheckWinSaferRestrictions and reran the test. This gave me the following line:

    63 0 [ 3] ADVAPI32!__CodeAuthzpCheckIdentityHashRules eax = ffffffff`c0000022

    One last run with a new breakpoint at ADVAPI32!__CodeAuthzpCheckIdentityHashRules and I found what I was looking for:

    58 218 [ 1] ADVAPI32!__CodeAuthzpEnsureMapped eax = ffffffff`c0000022

The depth is shown in brackets. As this call was 1 deep from __CodeAuthzpCheckIdentityHashRules, and I was using 3 as my maximum depth in wt, I knew this was where the STATUS_ACCESS_DENIED was coming from. I reviewed the source code and found that this is the code that performs software restriction policy checking. Specifically, we were attempting to map the executable into memory to perform hash checking on it; since there wasn’t 1.8 GB of contiguous available address space, the mapping failed. Looking at the VM, I discovered that the customer had implemented a number of software restriction policies. As a test, I removed their restrictions on the VM, and the programs ran successfully. A search of the KB revealed that a hotfix was published for this problem: 973825. Installing the hotfix in the article also resolved the issue with their policies intact.
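The failing step is easy to demonstrate in isolation. The sketch below only illustrates the mapping that the hash check performs, not the actual SaferIdentifyLevel code: a 32-bit process has at most 2 GB of user-mode address space by default, and mapping an entire ~1.8 GB file requires a single contiguous free range of that size, which address-space fragmentation usually makes impossible.

#include <windows.h>
#include <stdio.h>

int wmain(int argc, wchar_t **argv)
{
    HANDLE file, mapping;
    LPVOID view;

    if (argc < 2)
    {
        wprintf(L"Usage: mapwhole <file>\n");
        return 1;
    }

    file = CreateFileW(argv[1], GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return 1;

    mapping = CreateFileMappingW(file, NULL, PAGE_READONLY, 0, 0, NULL);
    if (mapping == NULL)
        return 1;

    // Mapping the whole file at once needs one contiguous run of free
    // virtual address space as large as the file. For a ~1.8 GB file in
    // a 32-bit process, this is the call that fails.
    view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if (view == NULL)
        wprintf(L"MapViewOfFile failed: error %lu\n", GetLastError());
    else
        wprintf(L"Mapped the whole file at %p\n", view);

    return 0;
}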

    -Matt Burrough
