Spat's WebLog (Steve Patrick)

When things go wrong...

What's bogging down your CPU?


This was inspired by a case where we had multiple perf issues ( CPU utilization ).

 

It is important to make sure you know where your problem really is, and 99% of the time I rely on perfmon to get accurate data related to this.

 

I generally think of these 4 components as potential bottlenecks: Disk, Network, Processor, and Memory.

 

Most of the items I work on are CPU\Memory issues, and in these cases it is important to know what the problem ( these are really symptoms from my POV ) really is.

 

·         Is it high CPU – pegged at 100% for prolonged periods?

·         Which process is it?

·         Is it CPU spikes? We hit 100% for a few minutes, then drop back down and spike again?

·         Is it abnormal CPU – but not 100%? Maybe 40-60%, either steady or spiking?

 

 

·         When it comes to memory – is it a memory leak, or do you just think the process should not be using "that much memory"?

·         This is important to understand, as a memory leak will continue to grow while the other will plateau, and the troubleshooting ( if even necessary ) would differ.

 

 

I can't stress enough how important it is to know exactly what is going on when it comes to the bullet points above. It is a complete waste to move forward based on assumptions or "we think it has high CPU at such and such a time..."

 

Perfmon can track all of the scenarios I listed above, and depending on how long the problem lasts, a good 1-3 day perfmon log will be my best source of data to really find out what's going on ( from a high level ).
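As a rough sketch of what goes in that log ( just the counters I usually start with – not an exhaustive list, adjust to taste ):

\Processor(_Total)\% Processor Time
\Process(*)\% Processor Time                    <- which process is actually burning the CPU
\Process(*)\Private Bytes                       <- leak vs. plateau over the 1-3 days
\Process(*)\Virtual Bytes
\System\Processor Queue Length
\Memory\Available Bytes
\PhysicalDisk(_Total)\Avg. Disk Queue Length
\Network Interface(*)\Bytes Total/sec

A 15-60 second sample interval is plenty for a multi-day log like this.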

 

In addition, you can enable extra heap counters to assist with memory issues ( like fragmentation ). These are really nice sometimes, BTW.

 

 

Scenario 1:

=======

Abnormal CPU from services.exe  - not hanging the machine at 100% but a continuous 40-60% usage.

Windows 2000 OS – SP4 with SRP

 

So, you know which process it is, but how do you find out what it is doing?

For some issues you can use something like kernrate.exe.
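I don't have the exact command line from this case handy, but a run looks roughly like this ( switches from memory, so double-check them against the kernrate version you have ):

kernrate.exe -z ntoskrnl

The -z switch "zooms" into a module so you get the per-function buckets shown below; let it sample for a minute or so while the CPU is busy, then stop it ( Ctrl-C ) and it prints the report.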

 

Output is something like:

<snipped>

----- Zoomed module ntoskrnl.exe (Bucket size = 16 bytes, Rounding Down) --------
Percentage in the following table is based on the Total Hits for this Zoom Module

Time   292 hits, 25000 events per hit --------
 Module                                Hits   msec  %Total  Events/Sec
CcUnpinDataForThread                     32       5796    10 %      138026
KiDispatchInterrupt                      27       5796     8 %      116459
ZwYieldExecution                         18       5796     5 %       77639
FsRtlIsNameInExpression                  14       5796     4 %       60386
KiIpiServiceRoutine                      13       5796     4 %       56073
SeUnlockSubjectContext                    9       5796     2 %       38819
NtAllocateVirtualMemory                   8       5796     2 %       34506
ObReferenceObjectByHandle                 8       5796     2 %       34506
ExAllocatePoolWithTag                     8       5796     2 %       34506
NtRequestWaitReplyPort                    7       5796     2 %       30193
ExInterlockedPopEntrySList                7       5796     2 %       30193
SeDeleteAccessState                       6       5796     1 %       25879

 

As you can see, we have a lot of detail from this log, but I don't get the whole picture. Many times I need to see WHY we end up in a particular call... it's nice to see a stack to really know what's going on.

 

I tend to obtain a userdump ( for user mode apps like AD ( lsass.exe ), obviously ) to find out what's going on, since it is not invasive ( most times ) and does not require me to do a live debug ( hard on a customer ).

 

1.        Use adplus -hang to dump services.exe ( see the example below )
2.        Wait for this to finish dumping services.exe
3.        When it has completed the dump, is services.exe CPU still high? Sometimes when you pause the process ( while it dumps ) the CPU usage will go down and not come back up.
4.        After the first userdump, if CPU is still high, get another one. And maybe another for good measure :)
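For what it's worth, the invocation I typically use looks something like this ( the output directory and -quiet are just my habits, adjust as needed ):

adplus -hang -pn services.exe -o c:\dumps -quiet

Run it again a few minutes later ( and maybe once more ) while the CPU is still high, so you end up with 2-3 dumps to compare.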

 

 

What does this do for me?

 

One thing adplus will do is run the !runaway command and log the output in the "PID-876__<process name>.EXE__Date_12-24-2005__Time_11-57-34AM.log" file.

This extension is a quick way to find out which threads are spinning out of control or consuming too much CPU time.
The display identifies each thread by the debugger's internal thread numbering and by the thread ID in hexadecimal.

 

0:000> * ------ Thread CPU usage information ------

0:000> !runaway

 User Mode Time

  Thread       Time

   5:388       0 days 0:00:00.050

   9:c14       0 days 0:00:00.010

   8:930       0 days 0:00:00.010

   7:5bc       0 days 0:00:00.010

   1:374       0 days 0:00:00.010

   6:38c       0 days 0:00:00.000

   4:380       0 days 0:00:00.000

   3:37c       0 days 0:00:00.000

   2:378       0 days 0:00:00.000

   0:370       0 days 0:00:00.000

 

And, as opposed to a single dump, which only gives me one specific point in time, multiple dumps let me see what the process is doing over a period of time and perhaps spot a common code loop.
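In practice that just means opening each dump and checking the same things – a minimal sketch ( thread 5 here is taken from the sample !runaway output above; yours will differ ):

0:000> !runaway
0:000> * thread 5 has the most user-mode time, so switch to it
0:000> ~5s
0:005> * dump its stack, then compare against the same thread in the other dumps
0:005> kb

If the same function is sitting at the top of the busiest thread's stack in every dump, that is usually your loop.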

 

In this specific case of high CPU in services.exe, I saw that we were processing a list of items for event logging – but the list was excessive: over 100,000 entries.

 

I don't have a copy of the stack to show here, but I could see a common thread in each dump – we were processing the exact same function in all 3 dumps.

 

Looking at the code I could see we were in a loop processing this list, and from the data in the dump I could see that the entries were all for the source "rasctrs".

 

In the end I tracked it back to a bug in Win2k:

Services.exe - high CPU due to bug in rasctrs     http://support.microsoft.com/kb/835446

 

 

 

One more note - sometimes ( especially if I can get a live debug ) I like to quickly use qslice to ID the process and thread right off the bat. It has low overhead, and with a pegged CPU that's a good thing.

 

 

Scenario #2

=========

High LSASS CPU utilization.  

 

 

Different folks like to tshoot this in different ways. Here are my preferred steps and reasoning.

 

  • First – get a perfmon to monitor the actual usage and pattern.
  • Does it follow the business day?
  • Does it only occur during certain hours of the day?

 

Sometimes, if you are really lucky, you can relate these trends to some known process ( "Hey that’s when we do <insert action> every day as well!" )

 

Next, determine whether the CPU utilization is caused by local or remote activity. In most cases I will ask them to simply unplug the NIC ( not always possible ).

If you unplug the NIC and the CPU drops – well, you know it's remote, and a net trace would be my next step.

 

Once you have a good idea of the behavior and patterns from a high level, you can zoom in for a short period of time ( 3-5 minutes ) with SPA ( Server Performance Advisor ). This will aid in finding the cause.

 

Some additional tools to use in these cases

 

 

Remote source:

 

Determined that the source was remote and took a net trace – I don’t like huge traces because they are too large to manage – just 2-3 minutes is all I ask for.

Generally the first thing I will do is run it through netmon's "experts". Just right click on the trace and select "launch experts". This is only available in the full version of netmon.

 

 

 

 

 

The Top Users expert will tell me who is taking the lion's share of traffic in\out of this DC.

 

Here are a few examples of this method and what it turned up:

 

Excessive Kerberos requests from client workstations.

They had disabled 1500 employee accounts, and the workstations of these users were hammering the DCs ( later found to be a specific app on the machines ).

 

SAMPLE:

10.20.47.81 is the client, with 681 frames in 3 minutes.

 

Traffic:

KERBEROS: Server name (sname[3]) =krbtgt/SomeCompany.NET

KERBEROS: Host address =NetBIOS: RU08-122-A     

KERBEROS: Error code (error-code[6]) = Clients credentials have been revoked

 

Now, based on our 3 minute test, we have ~700 request\response frames from this one client.

Assume half are requests and the other half the server-side responses.

 

 350(requests) * 3 (clients)  = 1050 (requests)

1050 requests per 3 minutes = 21,000 requests per hour from just 3 clients

 

--------------

 

Excessive replication from another DC due to ProxyAddresses - I've seen this more than once

 

 

You may also see entries like:

 

DC=corp,DC=internal,DC=mycompany,DC=com

    DCB00001-001\DCB00001227 via RPC

        objectGuid: 939ef2fe-62c0-4086-a8fa-61f4ee0f6f7e

        Last attempt @ 2005-04-27 09:03.28 failed, result 8461:

            The replication operation was preempted.

        Last success @ 2005-04-22 04:19.34.

        0 consecutive failure(s).

 

QUESTION - what SPECIFICALLY are we getting from these DCs' upstream partners? How much data, and what data?

 

repadmin /showchanges DC02  a24e121b-e3cd-4f81-9a17-f35a43a25e4f   DC=corp,DC=internal,DC=mycompany,DC=com > changes.txt

 

changes.txt showed many, many users whose ProxyAddresses attribute ( multivalued ) was being replicated, and each user's attribute had 700 to ~850 entries.

 

Something like:

 

    737> proxyAddresses:

BadApp:Processed Today102846@;

BadApp:Processed Today113936@;

BadApp:Processed Today80986@

BadApp:Processed Today47008@;

BadApp:Processed Today100442@;

BadApp:Processed Today47258@;

BadApp:Processed Today29570@;

Etc..

 

 

--------------

 

Overhead on a DC from excessive SQL Server queries for SID lookups.

 

NOTE: If you see this same pattern of traffic from random XP machines – there is a bug in the System Restore service which will overload a DC with these requests - see KB 834288.

 

2 minute trace shows 368 lookups

 

Majority of these are:

SID lookups for these users\groups ( names of the groups changed to protect the innocent ):

       Character Array: Group1

Character Array: Group2

Character Array: Group3

Character Array: Group4

Character Array: Group5

Character Array: Group6

Character Array: Group7

Character Array: Group8

Character Array: Group9

etc...

 

368 lookups successfully retrieved.

Each request appears to take about 30 frames – rough estimates say this accounts for ~85% of the traffic from this server.

 

I use a combination of netmon and ethereal.

In this case I wanted to be able to show them exactly which queries and groups were being resolved. In order to do this:

 

·         I filtered the data so it was just the LookupSid responses from the DC ( you will see these clearly in the trace ) and saved this off.

·         Opened the filtered trace in ethereal and did a print -> to file -> all packets expanded.

·         Findstr /i /c:"Character Array" data.txt > groups.txt – every line where a name was resolved contained this string, so this pulls them all into one text file ( a couple of quick variations are sketched below ).
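If you also want a rough tally from that same exported text, a couple of quick follow-ups ( plain command-line tools, nothing fancy ):

find /c "Character Array" data.txt                  <- counts the matching lines, i.e. the total number of lookups
findstr /i /c:"Character Array" data.txt | sort     <- sorted output makes the repeat offenders easy to eyeball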

 

----------

 

Excessive, inefficient LDAP queries from a SAMBA server to the DCs.

 

 

 IP: Source Address = 10.10.10.12

 

        LDAP: Filter

            LDAP: Filter Type = And

                LDAP: Filter Type = Equality Match

                    LDAP: Attribute Type =objectclass

                    LDAP: Attribute Value =posixAccount

                LDAP: Filter Type = Equality Match

                    LDAP: Attribute Type =uid

                    LDAP: Attribute Value =MAILER-DAEMON

        LDAP: Attribute Description List

            LDAP: Attribute Type =uid

            LDAP: Attribute Type =userPassword

            LDAP: Attribute Type =uidNumber

            LDAP: Attribute Type =gidNumber

            LDAP: Attribute Type =cn

            LDAP: Attribute Type =homeDirectory

            LDAP: Attribute Type =loginShell

            LDAP: Attribute Type =gecos

            LDAP: Attribute Type =description

            LDAP: Attribute Type =objectClass

 

 

15,089 frames from this machine, all the same query.

 

 

 

 

 

There are times when you unplug the NIC and the CPU won't drop.

 

So the source is some local processing from LSA... In those cases I generally use the same method described in Scenario 1 above ( the services.exe example ) with userdump, etc.

A few cases I can remember:

 

  • An admin had added 'domain users' to 'print operators'. Why did this affect anything? AdminSDHolder now had to process 53,000 users.
  • The database ( NTDS.DIT ) was 'wonky'. Ended up looking at the machine from a different perspective and ran an integrity check on the DB, which failed.
  • A bug in the code which processed group expansion.

 

 

Wow – since this is so long winded, I will talk about memory problems another time. Basically it comes down to: "If it is a memory leak – use UMDH".

 

 

Tools summary:

   Perfmon

   Netmon

   SPA

   ADPlus

   Userdump

   Debugger

   

 

Additional links:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnactdir/html/efficientadapps.asp

http://download.microsoft.com/download/d/7/b/d7bd44a1-f1e0-477d-8245-9eb3b21a13cd/ILL003.pdf

 

 

 

I'm sure there are a lot of other ways to tshoot any of these types of problems, but I thought I would present a few methods I have used in the past.

 

Hope this is interesting to someone....

 

 

 

spatdsg

 

Comments (4):
  • Nice Article. I've a question regarding the kernrate tool though - I'd be glad if you could provide more details -

    what bucket size have you typically found useful? Is there a rule of thumb on what bucket size to use? How do you know if we need to use a bigger/smaller number for bucket size?

  • Do you know of a tool like QSlice that will display activity for each CPU or that you can choose which CPU it will display.  I have a server with 16 CPUs and no services appear to be having an impact yet several times an hour at random one CPU (usually CPU 10, 11 or 12) will almost flatline for up to a minute which sets off our monitoring software.

  • Why do you care about which CPU specifically is impacted? Qslice won't give you that data, as you mentioned, but perfmon will give you the per-processor utilization. That doesn't really help find out what is causing it though - does that matter in this case?

  • Very nice article - I like the tip about - unplugging the NIC - to check for local/remote sources for the problems.

    In XP, probably you can also disable/enable the LAN in the Network Connections window.
