If broken it is, fix it you should

Using the powers of the debugger to solve the problems of the world - and a bag of chips    by Tess Ferrandez, ASP.NET Escalation Engineer (Microsoft)

ASP.NET Hang scenario: Kernel Debugging for Dummies

ASP.NET Hang scenario: Kernel Debugging for Dummies

  • Comments 8

A local radio show has started giving out cookie points to their interviewees for personal development, and the other day my coworker decided to give me one for debugging a perf issue in kernel-mode...  YAY!!!

Having debugging as a profession, you might have thought that I would have done a lot of kernel debugging, but since I deal mostly with ASP.NET and .NET development most of our issues are in usermode so there is rarely a need to debug the kernel (drivers etc.).  In fact, I have only debugged a few crashes in kernel mode and that is very different from debugging a kernel mode performance issue since debugging crashes usually only involves running !analyze -v to get the crashing stack.

Problem description:

The problem in this case was very simiar to the post I just made on ASP.NET hanging on startup because of antivirus software blocking/delaying I/O operations, and my first suggestion was for them to test the startup on a quarantined machine without the antivirus software, but that didn't do the trick this time.

In this case the ASP.NET process stalls about 5 minutes after a restart of the process or an application domain.

Debugging the issue (user mode):

The usermode dumps we had of the process "hanging" showed a few interesting types of stacks...

The first type of stack, which appeared a lot, was a wait for a compilation mutex


0:109> !clrstack Loaded Son of Strike data table version 5 from "C:\WINDOWS\Microsoft.NET\Framework\v1.1.4322\mscorsvr.dll" Thread 109 ESP EIP 0x06f5f328 0x7c82ed54 [FRAME: NDirectMethodFrameStandalone] [DEFAULT] I4 System.Web.UnsafeNativeMethods.InstrumentedMutexGetLock(ValueClass System.Runtime.InteropServices.HandleRef,I4) 0x06f5f338 0x025f332b [DEFAULT] [hasThis] Void System.Web.Compilation.CompilationMutex.WaitOne() 0x06f5f620 0x103e08dc [FRAME: ContextTransitionFrame] 0x06f5f700 0x103e08dc [FRAME: ComMethodFrame] 109 Id: 13a8.2028 Suspend: 1 Teb: 7ff0b000 Unfrozen ChildEBP RetAddr Args to Child 06f5f26c 7c822124 77e6bad8 000003dc 00000000 ntdll!KiFastSystemCallRet 06f5f270 77e6bad8 000003dc 00000000 00000000 ntdll!NtWaitForSingleObject+0xc 06f5f2e0 79e769a3 000003dc ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xac 06f5f2f4 79e76ba5 ffffffff 06f5f328 01c1bbd9 aspnet_isapi!InstrumentedMutex::GetLock+0x10 06f5f300 01c1bbd9 01943a10 ffffffff 00000043 aspnet_isapi!InstrumentedMutexGetLock+0x12 WARNING: Frame IP not in any known module. Following frames may be wrong. 06f5f42c 799b8dad 00000000 08c9b1dc 06f5f4a0 0x1c1bbd9 06f5f4cc 799a67a2 103e44f8 103e47fc 799b4168 mscorlib_79990000+0x28dad 00000000 00000000 00000000 00000000 00000000 mscorlib_79990000+0x167a2

This basically means that we are in the middle of compiling the site and we are waiting for compilation to finish.  Sometimes something happens during compilation such that we get stuck here indefinitely, and this something is usually some kind of exception or blocking caused by doing too many concurrent compilations. For example lots of site changes under heavy load.  Looking at this I asked them about their appdomain recycling details, i.e. things that would make the application domain recycle very often, and there were a few things like frequent updates of aspx pages, but nevertheless they got these blockings at startup even during regular recycles with only one user on the site. 

If you do see this type of blocking though, you should consider looking at this post about application recycling to make sure that you are not doing any unneccesary recycles during heavy load.   Reducing these will resolve this type of blocking in 90% of the cases... but again, in this case the blocking occurred even if they just did a single recycle without being under heavy load, so let's look at the next stack...  

The next set of stacks are also related to compilation. Here we are processing an ASP.NET request who's domain is not yet fully set up so we are blocking trying to get a reference to the domain...


0:081> kb ChildEBP RetAddr Args to Child 286cf5f8 7c822114 77e67143 00000001 286cf648 ntdll!KiFastSystemCallRet 286cf5fc 77e67143 00000001 286cf648 00000000 ntdll!NtWaitForMultipleObjects+0xc 286cf6a4 791e0b3b 00000001 0aaabd34 00000001 kernel32!WaitForMultipleObjectsEx+0x11a 286cf6d4 791e0bdd 00000001 0aaabd34 00000001 mscorsvr!Thread::DoAppropriateWaitWorker+0xc1 286cf728 791fccfe 00000001 0aaabd34 00000001 mscorsvr!Thread::DoAppropriateWait+0x46 286cf7ac 791fcc17 0a91a118 ffffffff 00000000 mscorsvr!AwareLock::EnterEpilog+0x9d 286cf7c8 791fd43e 1887a8f0 27567600 18879508 mscorsvr!AwareLock::Enter+0x78 286cf85c 025459f2 18879508 1887aa04 1887abdc mscorsvr!JITutil_MonContention+0x124 WARNING: Frame IP not in any known module. Following frames may be wrong. 286cf8e0 79217188 00000000 02b89fe8 1085a520 0x25459f2 286cf8c4 02146468 187e4c04 1887a638 02142fe5 mscorsvr!ComCallMLStubCache::CompileMLStub+0x429 286cf954 7923c206 02142f80 286cfa5c 06d0b3d8 0x2146468 286cf994 791cc3c8 286cfa04 791b3d64 286cfa80 mscorsvr!ComCallMLStubCache::CompileMLStub+0x1af 286cf9dc 7923c1c3 06d0b3d8 7923c1ce 286cfa04 mscorsvr!Thread::DoADCallBack+0x5c 286cfa44 01bda0bd 0a91a118 286cfa80 286cffdc mscorsvr!ComCallMLStubCache::CompileMLStub+0x2ba 286cfb48 77e660fe 00000000 020238d8 00000001 0x1bda0bd 286cfb8c 75da1408 00000000 00000000 286cfbdc kernel32!FlsSetValue+0xc7 75da1401 8375da10 89ff044e 40c03306 c25d5e5f sxs!_CRT_INIT+0x185 75da1401 00000000 89ff044e 40c03306 c25d5e5f 0x8375da10

0:081> !clrstack Thread 81 ESP EIP 0x286cf754 0x7c82ed54 [FRAME: GCFrame] 0x286cf810 0x7c82ed54 [FRAME: HelperMethodFrame] 0x286cf864 0x025459f2 [DEFAULT] Class System.Web.IHttpHandler System.Web.HttpApplicationFactory.GetApplicationInstance(Class System.Web.HttpContext) 0x286cf890 0x0214686f [DEFAULT] [hasThis] Void System.Web.HttpRuntime.ProcessRequestInternal(Class System.Web.HttpWorkerRequest) 0x286cf8cc 0x02146468 [DEFAULT] Void System.Web.HttpRuntime.ProcessRequest(Class System.Web.HttpWorkerRequest) 0x286cf8d8 0x02142fe5 [DEFAULT] [hasThis] I4 System.Web.Hosting.ISAPIRuntime.ProcessRequest(I,I4) 0x286cf9a0 0x79217188 [FRAME: ContextTransitionFrame] 0x286cfa80 0x79217188 [FRAME: ComMethodFrame]

If you remember my post on locks and critical sections you will probbly recall that the call to AwareLock::Enter above means that we are waiting on a thread holding a lock or a montor and that we can find out who is holding the lock by running !syncblk... 


0:081> !syncblk Index SyncBlock MonitorHeld Recursion Thread ThreadID Object Waiting 6840 0x0aaabd20 145 1 0xaa4fed8 0x360 39 0x1c316400 System.Web.HttpApplicationFactory Waiting threads: 21 22 23 24 25 38 40 42 43 44 45 46 47 48 49 50 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 ----------------------------- Total 6941 ComCallWrapper 3 ComPlusWrapper 0 ComClassFactory 0 Free 1329

So we have 72 ((145-1) / 2) threads waiting for a lock on the type HttpApplicationFactory held by thread 39.


0:081> ~39s eax=791b6a40 ebx=06da8fd0 ecx=0e8cf360 edx=024002b0 esi=242edcd8 edi=00000000 eip=7c82ed54 esp=0e8cf25c ebp=0e8cf2d8 iopl=0 nv up ei pl zr na pe nc cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246 ntdll!KiFastSystemCallRet: 7c82ed54 c3 ret 0:039> kb ChildEBP RetAddr Args to Child 0e8cf258 7c821904 77e612df 000014ac 00000000 ntdll!KiFastSystemCallRet 0e8cf25c 77e612df 000014ac 00000000 00000000 ntdll!ZwQueryDirectoryFile+0xc 0e8cf2d8 01bda78a 242edcd8 0aaf4540 00000004 kernel32!FindNextFileW+0xaa WARNING: Frame IP not in any known module. Following frames may be wrong. 0e8cf3d0 02548e0b 1c94d790 1086eae8 1086eae8 0x1bda78a 0e8cf418 02548cfa 1c316734 1086eae8 1030fe20 0x2548e0b 0e8cf43c 02548c1e 0e8cf46c 1c94d790 1086eae8 0x2548cfa 0e8cf440 0e8cf46c 1c94d790 1086eae8 1c316668 0x2548c1e 0e8cf444 1c94d790 1086eae8 1c316668 025489ac 0xe8cf46c 0e8cf544 791b8bc6 108686f8 791b8bcf 00000000 0x1c94d790 0e8cf5e0 79217188 00000000 00e41000 10861d1c mscorsvr!COMString::EqualsString+0x3f 0e8cf654 7923c206 02142f80 0e8cf75c 06d0b3d8 mscorsvr!ComCallMLStubCache::CompileMLStub+0x429 0e8cf694 791cc3c8 0e8cf704 791b3d64 0e8cf780 mscorsvr!ComCallMLStubCache::CompileMLStub+0x1af 0e8cf6dc 7923c1c3 06d0b3d8 7923c1ce 0e8cf704 mscorsvr!Thread::DoADCallBack+0x5c 0e8cf744 01bda0bd 0aa4fed8 0e8cf780 0e8cffdc mscorsvr!ComCallMLStubCache::CompileMLStub+0x2ba 0e8cf828 791cc195 791cc1d4 0aa4fed8 0aa4fed8 0x1bda0bd 0e8cf854 791cc1f5 000d9948 000d9948 0e8cf870 mscorsvr!EEHashTable,0>::FindItem+0x1b 0e8cf864 791cc204 000d9948 0e8cf890 791cc243 mscorsvr!EEHashTable,0>::FindItem+0xe 0e8cf870 791cc243 000d9948 0e8cf88c 0aa4fed8 mscorsvr!EEHashTable,0>::GetValue+0xb 0e8cf890 791cc310 000d9948 00000000 791cc32b mscorsvr!Thread::SetStaticData+0x5c 0e8cf8f0 0e8cf918 791cc122 791cc131 0ad15d54 mscorsvr!Thread::SafeSetStaticData+0x63



0:039> !clrstack Thread 39 ESP EIP 0x0e8cf360 0x7c82ed54 [FRAME: NDirectMethodFrameSlim] [DEFAULT] Boolean System.Web.UnsafeNativeMethods.FindNextFile(I,ByRef ValueClass WIN32_FIND_DATA) 0x0e8cf370 0x02549395 [DEFAULT] Void System.Web.Compilation.PreservedAssemblyEntry.RemoveOldTempFiles() 0x0e8cf3d8 0x02548e0b [DEFAULT] Void System.Web.Compilation.PreservedAssemblyEntry.DoFirstTimeInit(Class System.Web.HttpContext) 0x0e8cf420 0x02548cfa [DEFAULT] Void System.Web.Compilation.PreservedAssemblyEntry.EnsureFirstTimeInit(Class System.Web.HttpContext) 0x0e8cf444 0x02548c1e [DEFAULT] Class System.Web.Compilation.PreservedAssemblyEntry System.Web.Compilation.PreservedAssemblyEntry.GetPreservedAssemblyEntry(Class System.Web.HttpContext,String,Boolean) 0x0e8cf45c 0x025489ac [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemFromPreservedCompilation() 0x0e8cf474 0x025474c3 [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemInternal(Boolean) 0x0e8cf49c 0x025488fe [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItemWithNewConfigPath() 0x0e8cf4cc 0x02547394 [DEFAULT] [hasThis] Class System.Web.UI.ParserCacheItem System.Web.UI.TemplateParser.GetParserCacheItem() 0x0e8cf4fc 0x02545e3c [DEFAULT] Class System.Type System.Web.UI.ApplicationFileParser.GetCompiledApplicationType(String,Class System.Web.HttpContext,ByRef Class System.Web.UI.ApplicationFileParser) 0x0e8cf514 0x02545cfc [DEFAULT] [hasThis] Void System.Web.HttpApplicationFactory.CompileApplication(Class System.Web.HttpContext) 0x0e8cf524 0x02545c07 [DEFAULT] [hasThis] Void System.Web.HttpApplicationFactory.Init(Class System.Web.HttpContext) 0x0e8cf564 0x02545a0b [DEFAULT] Class System.Web.IHttpHandler System.Web.HttpApplicationFactory.GetApplicationInstance(Class System.Web.HttpContext) 0x0e8cf590 0x0214686f [DEFAULT] [hasThis] Void System.Web.HttpRuntime.ProcessRequestInternal(Class System.Web.HttpWorkerRequest) 0x0e8cf5cc 0x02146468 [DEFAULT] Void System.Web.HttpRuntime.ProcessRequest(Class System.Web.HttpWorkerRequest) 0x0e8cf5d8 0x02142fe5 [DEFAULT] [hasThis] I4 System.Web.Hosting.ISAPIRuntime.ProcessRequest(I,I4) 0x0e8cf6a0 0x79217188 [FRAME: ContextTransitionFrame] 0x0e8cf780 0x79217188 [FRAME: ComMethodFrame]

Thread 39 is processing an ASP.NET request. As part of doing so, since this is the first request, it has to compile the app, and remove the old temporary asp.net files in the C:\WINDOWS\Microsoft.NET\Framework\v1.1.4322\Temporary ASP.NET Files folder.

This is where it gets interesting, because looking at the timestamp of the request it has been working on this for several minutes...

To see exactly how long the request has been running we can run the following commands...


0:039> !dso Thread 39 ESP/REG Object Name ... 0xe8cf5a4 0x1030886c System.Web.Hosting.ISAPIWorkerRequestInProcForIIS6 0xe8cf5a8 0x1030b6cc System.Web.HttpRuntime 0xe8cf6b4 0x10308850 System.Runtime.Remoting.Messaging.LogicalCallContext 0:039> !do 0x1030886c Name: System.Web.Hosting.ISAPIWorkerRequestInProcForIIS6 MethodTable 0x02401fdc EEClass 0x02177fd0 Size 200(0xc8) bytes GC Generation: 2 mdToken: 0x02000115 (c:\windows\assembly\gac\system.web\1.0.5000.0__b03f5f7f11d50a3a\system.web.dll) FieldDesc*: 0x02401e68 MT Field Offset Type Attr Value Name 0x01c08a78 0x4000798 0x4 VALUETYPE instance start at 10308870 _startTime ... 0:039> !convertvtdatetodate 10308870 As a TimeSpan: 732847.09:25:40.5338928 As a DateTime: 06/21/2007 09:25:40

So the request started at 09:25:40 GMT, and the dump was taken at 09:28:51 GMT (11:28:51 GMT+2)... 


0:039> .time Debug session time: Thu Jun 21 11:28:51.000 2007 (GMT+2) System Uptime: 8 days 15:51:05.015 Process Uptime: 0 days 5:28:36.000 Kernel time: 0 days 0:15:54.000 User time: 0 days 0:02:23.000

...meaning that the request has been executing for ~3 minutes

 The obvious question is, why is it taking so long...

Going back to the native callstack we had the following at the top 


0e8cf25c 77e612df 000014ac 00000000 00000000 ntdll!ZwQueryDirectoryFile+0xc
0e8cf2d8 01bda78a 242edcd8 0aaf4540 00000004 kernel32!FindNextFileW+0xaa ...

 

The fact that ntdll!ZwQueryDirectoryFile is on top of the stack means that we are now operating in kernel mode, trying to get some kind of information about a file (which in this case is one of the files in the temporary asp.net files based on the .net callstack).

This is usually a super quick operation so it would rarely be caught in a dump unless something is delaying it and making it slow, for example a driver for antivirus scanning, backup software or some other type of low level software that intercepts file access for monitoring.

Since the usual suspects (antivirus scanners) were discarded, the next step was to debug at the kernel level.

 

Before I continue here I want to say that I would recommend checking the usual suspects first since it is a bit of a hassle to set up kernel debugging, but if all else fails, here we go...

 

Kernel debugging

The first and probably most difficult step is to get the server configured, and I had the customer follow this kb article to set the server up to take kernel dumps on a specific key combination.

I was thoroughly impressed at the speed that this was done, being that it was a production server, and when I got the dump, the fun began since I had to scavange the help files to figure out what commands to use in a kernel dump.  It is a completely different experience than debugging a usermode dump.

First things first,  I still opened the dump in windbg with File/Open Crash Dump

A regular kernel dump, unlike a usermode dump is a dump of all processes on the system, but only the kernel space.  I am sure that there are ways to get everything, in fact I know this, but all we needed here was the kernel mode space.

Since we know or assume that we are blocking we can run !locks

0: kd> !locks
**** DUMP OF ALL RESOURCE OBJECTS ****
KD: Scanning for held locks.................................................................................................................................................................................

Resource @ 0x88e9a518    Shared 1 owning threads
    Contention Count = 1
     Threads: 88e889a0-01<*> 
KD: Scanning for held locks............................
6541 total locks, 1 locks currently held

This tells us that there is a lock currently held by thread 88e889a0...

0: kd> !thread 88e889a0
THREAD 88e889a0  Cid 0d6c.06d4  Teb: 7ffaa000 Win32Thread: 00000000 WAIT: (Unknown) KernelMode Non-Alertable
    b954f1ec  NotificationEvent
IRP List:
    89e73b18: (0006,01d8) Flags: 00000000  Mdl: 00000000
    88d4d008: (0006,01d8) Flags: 00000000  Mdl: 00000000
    89055008: (0006,01d8) Flags: 00000000  Mdl: 00000000
    88f1d6d8: (0006,01d8) Flags: 00000000  Mdl: 00000000
    8a6aac38: (0006,01d8) Flags: 00000000  Mdl: 00000000
    88f77e20: (0006,01d8) Flags: 00000000  Mdl: 00000000
    8a5f5880: (0006,01d8) Flags: 00000000  Mdl: 00000000
    8a63ce20: (0006,01d8) Flags: 00000000  Mdl: 00000000
    88d4ac40: (0006,01d8) Flags: 00000000  Mdl: 00000000
    88f41988: (0006,01d8) Flags: 00000000  Mdl: 00000000
    88eecb80: (0006,01d8) Flags: 00000000  Mdl: 00000000
    88f1e008: (0006,01d8) Flags: 00000000  Mdl: 00000000
Not impersonating
DeviceMap                 e1b905a0
Owning Process            88d538d0       Image:         w3wp.exe
Wait Start TickCount      22980          Ticks: 1 (0:00:00:00.015)
Context Switch Count      12683             
UserTime                  00:00:01.656
KernelTime                00:00:00.656
Win32 Start Address 0x791fda8b
Start Address 0x77e6b5f3
Stack Init b9550000 Current b954ef48 Base b9550000 Limit b954d000 Call 0
Priority 8 BasePriority 8 PriorityDecrement 0
ChildEBP RetAddr  Args to Child              
b954ef60 8083e6a0 88e88a18 88e889a0 88e88a48 nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
b954ef8c 8083f154 89051e28 b954f2b0 89fe7d90 nt!KiSwapThread+0x284 (FPO: [Non-Fpo])
b954efd4 f7253fa3 b954f1ec 00000000 00000000 nt!KeWaitForSingleObject+0x346 (FPO: [Non-Fpo])
b954eff4 f7253712 b954f2b0 00001000 89fe7d90 Ntfs!NtfsWaitSync+0x1c (FPO: [Non-Fpo])
b954f1b0 f725175e b954f2b0 89051e28 89fe7d90 Ntfs!NtfsNonCachedIo+0x2f3 (FPO: [Non-Fpo])
b954f29c f72548de b954f2b0 89051e28 00000001 Ntfs!NtfsCommonRead+0xaf5 (FPO: [Non-Fpo])
b954f448 8083f9c0 89fe3248 89051e28 8a696910 Ntfs!NtfsFsdRead+0x113 (FPO: [Non-Fpo])
b954f45c f7323c53 8a696910 89eff9e8 b954f41c nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
b954f484 8083f9c0 8a562708 89051e28 89051fdc fltmgr!FltpDispatch+0x6f (FPO: [Non-Fpo])
b954f498 f7730da7 89eff9e8 89051fdc 89051e28 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be wrong.
b954f4c8 f773181e 89eff9e8 89051e28 00000ae4 SomeFiltr!OsrFilterDeregister+0x85
b954f528 8083f9c0 89eff930 89051e28 89051e28 SomeFiltr!OsrFilterDeregister+0xafc
b954f53c 8082f0d7 8a7387a8 88e889a0 8a738798 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
b954f554 8082f175 89fe8a0a 8a7387d0 8a7387b0 nt!IoPageRead+0x109 (FPO: [Non-Fpo])
b954f5d8 80849cd5 00000001 cce52000 c0333948 nt!MiDispatchFault+0xd2a (FPO: [Non-Fpo])
b954f634 80837d0a 00000000 cce52000 00000000 nt!MmAccessFault+0x64a (FPO: [Non-Fpo])
b954f634 8093837b 00000000 cce52000 00000000 nt!KiTrap0E+0xdc (FPO: [0,0] TrapFrame @ b954f64c)
b954f6fc f72880ac 89fe8a20 b954f72c 00000400 nt!CcMapData+0x8c (FPO: [Non-Fpo])
b954f71c f72886e6 b954fab0 89fe7d90 6fbd2000 Ntfs!NtfsMapStream+0x4b (FPO: [Non-Fpo])
b954f790 f729edd1 b954fab0 89fe3328 e2c47540 Ntfs!NtfsReadMftRecord+0x86 (FPO: [Non-Fpo])
b954f840 f728702b b954fab0 e22e5530 e14690d0 Ntfs!NtfsRetrieveOtherFileName+0xcf (FPO: [Non-Fpo])
b954fa60 f7286180 b954fab0 88d33008 89fe3328 Ntfs!NtfsQueryDirectory+0x90d (FPO: [Non-Fpo])
b954fa94 f7286452 b954fab0 e14690d0 8a562708 Ntfs!NtfsCommonDirectoryControl+0xbc (FPO: [Non-Fpo])
b954fc04 8083f9c0 89fe3248 88d33008 8a696910 Ntfs!NtfsFsdDirectoryControl+0xad (FPO: [Non-Fpo])
b954fc18 f7323c53 8a696910 89eff9e8 000007ff nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
b954fc40 8083f9c0 8a562708 88d33008 88d331bc fltmgr!FltpDispatch+0x6f (FPO: [Non-Fpo])
b954fc54 f7730da7 89eff9e8 88d331bc 88d33008 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
b954fc84 f773181e 89eff9e8 88d33008 00000ae4 SomeFiltr!OsrFilterDeregister+0x85
b954fce4 8083f9c0 89eff930 88d33008 01b0f710 SomeFiltr!OsrFilterDeregister+0xafc
b954fcf8 8092d3b8 b954fd64 01b0f710 80921a14 nt!IofCallDriver+0x45 (FPO: [Non-Fpo])
b954fd0c 80921a71 89eff930 88d33008 88dba440 nt!IopSynchronousServiceTail+0x10b (FPO: [Non-Fpo])
b954fd30 80834d3f 0000048c 00000000 00000000 nt!NtQueryDirectoryFile+0x5d (FPO: [Non-Fpo])
b954fd30 7c82ed54 0000048c 00000000 00000000 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b954fd64)
01b0f758 00000000 00000000 00000000 00000000 0x7c82ed54

And this is exactly what we have been looking for...  If we start from the top we can see that this thread is indeed a part of w3wp.exe so that is a good start.  Then if we look at the bottom of the stack we see that the entrypoint into kernel mode is nt!NtQueryDirectoryFile which jives extremely well with what we saw in usermode.  

From this point we go directly into some filter called SomeFiltr.sys  (in reality it is called something else, but since it doesn't really make a difference for this discussion what it is called I have changed the name to protect the innocent),  and from there it is going into to some reading and other things and finally it sits in a wait again, but this time in kernelmode.  I am glancing over the details here a bit since I don't know much about drivers.

At this point it is pretty much out of my hands and into the hands of the person who owns the filter in order to figure out what was causing it to be slow.  I looked the filtername up on the internet and it happened to belong to some backup/recovery software, and because of this and because we are blocking on trying to delete files from the temporary ASP.NET files which have no reason being backed up anyways, a reasonable suggestion would be to exclude this directory from backup.

 

Some other good commands I found in relation to kernel debugging were "!process 0 0" which lists all the processes, and "!process <process addr> f" which lists all the threads in a given process.

I feel like I should add a caveat here and say that there are probably other ways to get to this conclusion and more data that can be gotten from this, but I just wanted to share my first kernel debugging experience and hopefully someone out there can get some benefit from it:)

 

Laters y'all,

Tess





  • Link Listing - August 27, 2007

  • Nice article Tess! I get the feeling that you must keep a big list of "bad" software around to help w/ your troubleshooting. For the most part I keep my virus scanner in manual mode and pretty much disable most services that try to "help" keep my system healthy.  Instead of the services I usually see interesting problems in 3rd party software - gui controls for example - that don't clean up delegates correctly and end up causing memory leaks. Windbg definitely helps quickly track things down.

  • That's a very interesting kernel stack trace.

    Now I don't know much of anything about the Windows kernel internals, but I did notice the re-entrant call back to Ntfs through what looks like a page fault (KiTrap, MmAccessFault, MiDispatchFault, IoPageRead).  So it looks like it's trying to fault-in pages of data from disk.  That would seem to be a somewhat fragile state from which to be applying filters.

    So my guess is that the filter in question is deadlocking on an Ntfs lock due to inappropriate re-entrance when processing requests in this special context.  Fun stuff!

    Thanks for posting this.  =)

  • Hi Tess;

    Wonderful article as usual. Thank you for taking the time to share your insights.

    Just one question: would FileMon have picked up on the accesses to Microsoft.NET\Framework\*\Temporary ASP.NET Files? Interference there is generally a very common problem, wether it's Virus Scanners, Backup Software, or even Indexing Services.

  • Hi Nariman,

    I am not sure if it would have picked up what happens at the kernel level.  

    I want to point out though that this stack is not proof that there is anything wrong with the driver, in fact the problem could very well be that the amount of temp files is extremely large causing the slowness.

    The stack doesnt really show a deadlock it just shows that we have this filter intercepting all access which of course is an extra layer making it somewhat slower than direct access.

  • Hi Tess,

    Great article, but you kind of left us hanging...I know the problem is/was out of your hands by the end, but did you at least try having them exclude the ASP.NET Temporary Files folder from backup as you mentioned, and if so, did it have any effect?  Either way, we'd love to hear the follow-up if anything further is discovered...

    Great stuff!

    -Brian

  • Hi Brian,

    As it turned out, in this case the temporary asp.net files folder for this app happened to be very large, in fact extremely large because of different factors (80 GB), this together with the delay of the backup system caused this operation to be very slow.   It was fixed by dividing the app into smaller apps and turning batch compilation off so that only the small portion of the files that were actually used would be jit compiled causing a smaller set of temp asp.net files...  since there is a lot less temp files to deal with, deleting them is a lot quicker...

Page 1 of 1 (8 items)
Leave a Comment
  • Please add 1 and 5 and type the answer here:
  • Post