This was inspired from a case where we had multiple perf issues ( CPU utilization )
It is important to make sure you know where your problem really is, and 99% of the time I rely on perfmon to get accurate data related to this.
I generally think of these 4 components as potential bottlenecks: Disk, Network, Processor , and Memory.
Most of the items I work on, are CPU\Memory issues, and in these cases it important to know what the problem ( these are really symptoms from my POV) really is.
· Is it high CPU – pegged 100% for prolonged periods?
· What process is it?
· Is it CPU spikes? We hit 100% for a few min. and then down and up again?
· Is it abnormal CPU – but not 100%? Maybe 40-60% either steady or spiking?
· When it comes to memory – is it a memory leak or you just think the process should not be using “that much memory”
· This is important to understand, as a memory leak will continue to grow, but the other will plateau, and troubleshooting ( if even necessary ) would differ.
I can't stress how important it is to know exactly what is going on when it comes to the bullet points above. It is a complete waste to move forward based on assumptions or "we think it has high CPU at such and such a time..."
Perfmon can track all of the scenarios I listed above and a depending on how long the problem lasts, a good 1-3 day perfmon will be my best source of data to really find out whats going on ( from a high level )
In addition you can enable additional heap counters to assist in memory issues ( like fragmentation ). These are really nice sometimes...BTW
Abnormal CPU from services.exe - not hanging the machine at 100% but a continuous 40-60% usage.
Windows 2000 OS – SP4 with SRP
So, you know what process it is, but how to find out what it is doing?
For some issues you can use something like kernrate.exe
Output is something like:
----- Zoomed module ntoskrnl.exe (Bucket size = 16 bytes, Rounding Down) --------Percentage in the following table is based on the Total Hits for this Zoom Module
Time 292 hits, 25000 events per hit -------- Module Hits msec %Total Events/SecCcUnpinDataForThread 32 5796 10 % 138026KiDispatchInterrupt 27 5796 8 % 116459ZwYieldExecution 18 5796 5 % 77639FsRtlIsNameInExpression 14 5796 4 % 60386KiIpiServiceRoutine 13 5796 4 % 56073SeUnlockSubjectContext 9 5796 2 % 38819NtAllocateVirtualMemory 8 5796 2 % 34506ObReferenceObjectByHandle 8 5796 2 % 34506ExAllocatePoolWithTag 8 5796 2 % 34506NtRequestWaitReplyPort 7 5796 2 % 30193ExInterlockedPopEntrySList 7 5796 2 % 30193SeDeleteAccessState 6 5796 1 % 25879
As you can see, we have a lot of detail from this log, but I don’t get the whole picture. Many times I need to see WHY we end up in a particular call...nice to see a stack to really know whats going on.
I tend to obtain a userdump ( for user mode apps like AD ( lsass.exe ) obviously ) to find out whats going on, since it is not invasive ( most times ) and does not require me to do a live debug ( hard on a customer )
1. Use adplus -hang to dump services.exe 2. Wait for this to finish dumping Services.exe 3. When it has completed the dump, is services.exe CPU still high? Sometimes when you pause the process ( while the process dumps ) the CPU
usage will go down and not go up again.
4. After the first userdump, and if CPU is still high, then get anther one. And maybe another for good measuer :)
What does this do for me?
One thing is adplus will do, is run the !runaway command and log it in the “PID-876__<process name>.EXE__Date_12-24-2005__Time_11-57-34AM.log”
This extension is a quick way to find out which threads are spinning out of control or consuming too much CPU time.The display identifies each thread by the debugger's internal thread numbering and by the thread ID in hexadecimal.
0:000> * ------ Thread CPU usage information ------
User Mode Time
5:388 0 days 0:00:00.050
9:c14 0 days 0:00:00.010
8:930 0 days 0:00:00.010
7:5bc 0 days 0:00:00.010
1:374 0 days 0:00:00.010
6:38c 0 days 0:00:00.000
4:380 0 days 0:00:00.000
3:37c 0 days 0:00:00.000
2:378 0 days 0:00:00.000
0:370 0 days 0:00:00.000
And, as opposed to a single dump, which only gives me that specific point in time, I can see what the process is doing over a period of time and perhaps see a common code loop over that time period.
In this specific case of high CPU for services.exe – I saw that we were processing a list of items for event logging – but in this case it was excessive - over 100,000 entries.
I don’t have a copy of the stack to show here, but I could see a common thread in each dump, where we were processing the exact same function in all 3 dumps.
Looking at the code I could see we were in a loop processing this list, and from the data in the dump I could see that they all were for the source “rasctr”.
In the end I tracked it back to a bug in Win2k
Services.exe - high CPU due to bug in rasctrs http://support.microsoft.com/kb/835446\
One more note - sometimes ( especially if I can get a live debug ) I like to quickly use qslice to ID the process and thread right off the bat. It has low overhead and with a pegged cpu - that's a good thing.
High LSASS CPU utilization.
Different folks like to tshoot this in different ways. Here are my preferred steps and reasoning.
Sometimes, if you are really lucky, you can relate these trends to some known process ( "Hey that’s when we do <insert action> every day as well!" )
Next – determine if the CPU utilization is caused by some local versus remote action. In most cases I will ask them to simply unplug the NIC ( not always possible ).
If you unplug the NIC and cpu drops – well you know its remote and a net trace would be my next step.
Once you have a good idea of the behavior and patterns from a high level - you can zoom in for a short period of time ( 3-5 minutes ) with SPA. This will aid in finding cause.
Some additional tools to use in these cases
Determined that the source was remote and took a net trace – I don’t like huge traces because they are too large to manage – just 2-3 minutes is all I ask for.
Generally the first thing I will do is run it though netmon’s “experts”. Just right click on the trace and select “launch experts”. This is only available in the full version of netmon.
Top users will tell me who is taking the lions share in \out for this DC.
Here are a few examples of this method and what was turned up from it:
Excessive Kerberos requests from clients workstations.
They had disabled 1500 employees accounts and the workstations of these user were hammering the DC's ( later found to be a specific app on the machines )
10.20.47.81 - is the client with 681 frames in 3 minutes.
KERBEROS: Server name (sname) =krbtgt/SomeCompany.NET
KERBEROS: Host address =NetBIOS: RU08-122-A
KERBEROS: Error code (error-code) = Clients credentials have been revoked
Now based on our 3 minute test - we have ~700 frames per request \ response.
Assume half are requests and the other half the server side response.
350(requests) * 3 (clients) = 1050 (requests)
1050 request per 3 minutes = 21000 requests per hour from just 3 clients
Excessive replication from another DC due to ProxyAddresses - I've seen this more than once
May also see events like:
DCB00001-001\DCB00001227 via RPC
Last attempt @ 2005-04-27 09:03.28 failed, result 8461:
The replication operation was preempted.
Last success @ 2005-04-22 04:19.34.
0 consecutive failure(s).
QUESTION- what SPECIFICALLY are we getting from these DC’s upstream partners ? How much data and what data?
repadmin /showchanges DC02 a24e121b-e3cd-4f81-9a17-f35a43a25e4f DC=corp,DC=internal,DC=mycompany,DC=com > changes.txt
changes.txt shows many many users who’s ProxyAddresses attribute ( multivalued ) was being replicated and each user’s attribute had 700-~850 entries.
Overhead on DC from SQL Server excessive queries for SID lookups
NOTE: If you see this same pattern of traffic from random XP machines – there is a bug in the system restore service which will overload a DC with these requests see - q834288.
2 minute trace shows 368 lookups
Majority of these are:
SID lookups for these users\groups: < changed the names of groups to protect the innocent>
Character Array: Group1
Character Array: Group2
Character Array: Group3
Character Array: Group4
Character Array: Group5
Character Array: Group6
Character Array: Group7
Character Array: Group8
Character Array: Group9
368 lookups successfully retrieved.
Each request appears to take about 30 frames. -- rough estimates say this accounts for ~85% of the traffic from this server.
I use a combination of netmon and ethereal.
In this case I wanted to be able to show them what exact queries and groups were being resolved. In order to do this:
· I filtered the data so it was just the LookupSid responses from the DC ( you will see this clearly in the trace ) and saved this off/
· Opened the filtered trace in ethereal and did a print -> to file -> all packets expanded option.
· Findstr /i /c:”Character Array” data.txt > groups.txt --- opening the text file showed all the lines where the name was resolved had this in the line.
Excessive inefficient LDAP queries from SAMBA server to DC's.
IP: Source Address = 10.10.10.12
LDAP: Filter Type = And
LDAP: Filter Type = Equality Match
LDAP: Attribute Type =objectclass
LDAP: Attribute Value =posixAccount
LDAP: Attribute Type =uid
LDAP: Attribute Value =MAILER-DAEMON
LDAP: Attribute Description List
LDAP: Attribute Type =uid
LDAP: Attribute Type =userPassword
LDAP: Attribute Type =uidNumber
LDAP: Attribute Type =gidNumber
LDAP: Attribute Type =cn
LDAP: Attribute Type =homeDirectory
LDAP: Attribute Type =loginShell
LDAP: Attribute Type =gecos
LDAP: Attribute Type =description
LDAP: Attribute Type =objectClass
15,089 frames from this machine all the same query.
There are times when you unplug and CPU won’t drop.
So the source is some local processing from LSA... In those cases I generally use the same method described in Scenario 1 above ( the services.exe example ) with userdump etc..
A few times I can remember:
Bug in the code which processed group expansion.
Wow – since this is so long winded I will talk about memory problems another time. Basically it comes down to: “ If it is a memory leak – use UMDH
I'm sure there are a lot of other ways to tshoot any of these type of problems, but I thought I would present a few methods I have used in the past
Hope this is interesting to someone....
Nice Article. I've a question regarding the kernrate tool though - I'd be glad if you could provide more details -
what bucket size have you typically found useful? Is there a rule of thumb on what bucket size to use? How do you know if we need to use a bigger/smaller number for bucket size?
Do you know of a tool like QSlice that will display activity for each CPU or that you can choose which CPU it will display. I have a server with 16 CPUs and no services appear to be having an impact yet several times an hour at random one CPU (usually CPU 10, 11 or 12) will almost flatline for up to a minute which sets off our monitoring software.
Why do you care about which CPU specifically is impacted? Qslice wont give the data, as you mentioned, but perfmon will give you the per processor utilization. This doesnt really help find out what is causing it though, does that matter in this case?
Very nice article - I like the tip about - unplugging the NIC - to check for local/remote sources for the problems.
In XP, probably you can also disable/enable the LAN in the Network Connections window.