Heap corruption is by nature a complicated issue to troubleshoot, and in some cases luck matters as much as debugging knowledge. I got an assistance request from a customer who was seeing intermittent W3WP process crashes. He reported that the server was indeed under heavy load; sometimes it returned HTTP 500 errors, and sometimes the process simply terminated unexpectedly. It crashed with a second-chance access violation (C0000005) exception, and we were able to capture a crash dump with DebugDiag.
From the dump, we can see that the thread crashed in CUSTOM_ERROR_TABLE::FindCustomError. From the IIS source code (which is something I can’t share with you guys), I know that this function finds the applicable custom error entry for a given status/subcode.
Here was the call stack:
# ChildEBP RetAddr
00 06cee2cc 5a49fb48 w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x18
01 06cee428 5a42392b w3core!ISAPI_REQUEST::GetCustomError+0x8e
1b 06ceffb8 7c82482f msvcrt!_threadstartex+0x74
1c 06ceffec 00000000 kernel32!BaseThreadStart+0x34
The call stack was quite clean, with no third-party components to suspect. :) It is also very rare for a crash to happen inside an IIS module. As you may or may not know, w3core.dll is the core component of IIS 6 (iiscore.dll in IIS 7). A bug in it is very unlikely: IIS 6 has been released for more than 7 years, and FindCustomError has been called countless times. If there were a bug in it, it would not have survived a minute.
But the fact is that it crashed right there. Why? I had no idea, so I started by checking the registers: ESI is null, which seems to be the direct culprit.
eax=06ce0000 ebx=06ceedb4 ecx=017c4774 edx=000006e2 esi=00000000 edi=06ceedb4
eip=5a49fbd2 esp=06cee2c4 ebp=06cee2cc iopl=0 nv up ei ng nz ac po cy
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00010293
5a49fbd2 668b4608 mov ax,word ptr [esi+8] ds:0023:00000008=????
Next, we wanted to find out why ESI was null, so we analyzed the assembly below. I have to admit that I have rarely done any assembly programming since my undergraduate days; what I remember is as simple as push, pop, mov... So don't stop here if you are not familiar with assembly. :)
The assembly code below is still fairly simple. ECX itself holds a valid-looking address (017c4774), but the value loaded through it into ESI is null, and dereferencing ESI+8 caused the access violation.
w3core!CUSTOM_ERROR_TABLE::FindCustomError [d:\nt\inetsrv\iis\iisrearc\iisplus\ulw3\customerror.cxx @ 48]:
5a49fbac 8bff mov edi,edi
5a49fbae 55 push ebp
5a49fbaf 8bec mov ebp,esp
5a49fbb1 56 push esi
5a49fbb2 57 push edi
5a49fbb3 8b7d10 mov edi,dword ptr [ebp+10h]
5a49fbb6 85ff test edi,edi
5a49fbb8 0f8467fdffff je w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x59 (5a49f925)
5a49fbbe 837d1400 cmp dword ptr [ebp+14h],0
5a49fbc2 0f845dfdffff je w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x59 (5a49f925)
5a49fbc8 8b31 mov esi,dword ptr [ecx]<=================poi(ecx) -> esi ; poi(ecx) = null
5a49fbca 3bf1 cmp esi,ecx
5a49fbcc 0f8448fdffff je w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x38 (5a49f91a)
w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x18 [d:\nt\inetsrv\iis\iisrearc\iisplus\ulw3\customerror.cxx @ 70]:
5a49fbd2 668b4608 mov ax,word ptr [esi+8] <==============esi is null
5a49fbd6 663b4508 cmp ax,word ptr [ebp+8]
5a49fbda 7404 je w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x22 (5a49fbe0)
5a49fbdc 8b36 mov esi,dword ptr [esi]
5a49fbde ebea jmp w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x34 (5a49fbca)
5a49fbe0 668b460a mov ax,word ptr [esi+0Ah]
5a49fbe4 663b450c cmp ax,word ptr [ebp+0Ch]
5a49fbe8 0f841c780000 je w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x3f (5a4a740a)
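To make the disassembly easier to follow, here is a minimal Python sketch of the loop it appears to implement: a walk over a circular linked list that starts at the head in ECX and compares a 16-bit status at offset 8 and a 16-bit sub-error at offset 0Ah. This is not the IIS source. The names (read, find_custom_error, AccessViolation), the dict-based memory model, and the sample addresses are all made up for illustration, and the behavior after a status match with a sub-error mismatch is an assumption because that branch is not shown above.

```python
import struct

class AccessViolation(Exception):
    """Raised when the simulated code reads an address that is not mapped."""

def read(memory, address, fmt):
    """Read one little-endian value from a dict mapping base address -> bytes."""
    size = struct.calcsize(fmt)
    for base, block in memory.items():
        if base <= address and address + size <= base + len(block):
            return struct.unpack_from(fmt, block, address - base)[0]
    raise AccessViolation(hex(address))

def find_custom_error(memory, head, status, sub_error):
    """Walk the circular list from head; return the matching entry address or None."""
    entry = read(memory, head, "<I")                    # mov esi, dword ptr [ecx]
    while entry != head:                                # cmp esi, ecx
        if (read(memory, entry + 0x8, "<H") == status           # word ptr [esi+8]
                and read(memory, entry + 0xA, "<H") == sub_error):  # word ptr [esi+0Ah]
            return entry
        entry = read(memory, entry, "<I")               # mov esi, dword ptr [esi]
    return None

# Hypothetical layout: Flink, Blink, status, sub-error (addresses are invented).
HEAD, A, B = 0x1000, 0x2000, 0x3000
healthy = {
    HEAD: struct.pack("<II", A, B),
    A: struct.pack("<IIHH", B, HEAD, 404, 0),
    B: struct.pack("<IIHH", HEAD, A, 500, 100),
}
corrupted = dict(healthy)
corrupted[A] = struct.pack("<IIHH", 0, 0, 0, 0)         # zeroed entry, Flink is null
```

With the healthy list, find_custom_error(healthy, HEAD, 500, 100) returns the address of entry B. With the corrupted list, the walk follows the null Flink and the next read of offset 8 raises AccessViolation for address 0x8, which mirrors the faulting address 00000008 in the register dump.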
Why is the pointer read through ECX null? We know that the “thiscall” calling convention (used for calling C++ non-static member functions) passes the “this” pointer in ECX. As you can see, dt this shows the type name. Although the debugger says “this” is stored in edx, that is wrong! It should be in ecx, and what ecx points to is null.
0:033> dt this
Local var @ edx Type CUSTOM_ERROR_TABLE*
0:033> dd poi(ecx) L1
Then we dumped the CUSTOM_ERROR_ENTRY list to see why the list inside the CUSTOM_ERROR_TABLE object was broken.
0:033> !list "-t ntdll!_LIST_ENTRY.Flink -e -x \"dt w3core!CUSTOM_ERROR_ENTRY @$extret\" 017c4774"
dt w3core!CUSTOM_ERROR_ENTRY @$extret
+0x000 _listEntry : _LIST_ENTRY [ 0x1dc7540 - 0x1d60620 ]
+0x008 _StatusCode : 0xbda0
+0x00a _SubError : 0x1db
+0x00c _strError : STRU
+0x03c _fIsFile : 0n0
+0x000 _listEntry : _LIST_ENTRY [ 0x0 - 0x0 ]
+0x008 _StatusCode : 0
+0x00a _SubError : 0
0:033> ? 0xbda0
Evaluate expression: 48544 = 0000bda0
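Obviously, the custom error list is completely corrupted: a status code of 48544 is not a valid HTTP status, and the next entry is all zeros. So the lookup can't find the actual custom error entry for 500.100 (Internal Server Error - ASP error).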
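The sanity check applied to the dumped entry can be written down as a small heuristic. This is only a sketch: the function name looks_like_valid_entry is made up, and the 100 to 599 range is the usual HTTP status range, not something taken from IIS.

```python
def looks_like_valid_entry(status_code, sub_error):
    """Heuristic only: HTTP status codes normally fall between 100 and 599."""
    return 100 <= status_code <= 599 and 0 <= sub_error <= 0xFFFF

dumped_entry = (0xBDA0, 0x1DB)     # _StatusCode and _SubError from the dt output
requested = (0x1F4, 0x64)          # 500.100, the values the lookup was asked for
```

Here looks_like_valid_entry(*dumped_entry) is False while looks_like_valid_entry(*requested) is True, which is a quick way to tell that the entry holds garbage rather than a real custom error.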
0:033> .frame 0
00 06cee2cc 5a49fb48 w3core!CUSTOM_ERROR_TABLE::FindCustomError+0x18 [d:\nt\inetsrv\iis\iisrearc\iisplus\ulw3\customerror.cxx @ 70]
this = 0x000006e2
StatusCode = 0x1f4<=====500
SubError = 0x64<======100
pfIsFile = 0x06ceedb4
pstrError = 0x06cee334
We checked the list entry addresses 0x1d60620 and 0x1dc7540; they belong to heap 0x2b0000, which is the msvcrt heap.
0:033> !address 0x1d60620
Allocation Base: 01d50000
Base Address: 01d50000
End Address: 01e50000
Region Size: 00100000
Type: 00020000 MEM_PRIVATE
State: 00001000 MEM_COMMIT
Protect: 00000004 PAGE_READWRITE
More info: heap containing the address: !heap 0x2b0000
More info: heap entry containing the address: !heap -x 0x1d60620
Heap 3 - 0x002b0000
Heap Name msvcrt!_crtheap
Heap Description This heap is used by msvcrt
Reserved memory 3.13 MBytes
Committed memory 1.64 MBytes (52.38% of reserved)
Uncommitted memory 1.49 MBytes (47.63% of reserved)
Number of heap segments 3 segments
Number of uncommitted ranges 1 range(s)
Size of largest uncommitted range 1.43 MBytes
Calculated heap fragmentation 3.94%
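The !address result can be cross-checked by hand with simple range arithmetic. The sketch below uses only the numbers printed above; the helper name in_region is made up.

```python
def in_region(address, base, end):
    """True if address lies in the half-open range [base, end)."""
    return base <= address < end

BASE_ADDRESS = 0x01D50000
END_ADDRESS = 0x01E50000
REGION_SIZE = END_ADDRESS - BASE_ADDRESS   # 0x00100000, as reported by !address

# Flink and Blink values of the first dumped list entry
list_pointers = [0x1DC7540, 0x1D60620]
both_inside = all(in_region(p, BASE_ADDRESS, END_ADDRESS) for p in list_pointers)
```

Both pointers fall inside this committed read/write region, which is consistent with the debugger attributing them to the msvcrt heap.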
At this stage, we believed that this was a typical heap corruption. Debugging heap corruption is not an easy task, because the thread that hits the exception is usually not the thread that caused the corruption (FindCustomError is the victim in this case). Still, we could use pageheap.exe with the full switch to capture another round of IIS crash dumps. After several days of monitoring, we were able to collect what we wanted and find the culprit module. We were lucky that pageheap didn’t stay silent.
Something else we did was search memory for the 0001003f pattern, hoping to find some clues. No luck! The technique is really useful in some cases, although it takes some luck as well. A good post about the 0001003f pattern for your reference: http://blogs.msdn.com/slavao/archive/2005/01/30/363428.aspx
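For readers who want to see what such a pattern search does, here is a minimal sketch that scans a raw memory buffer for the little-endian dword 0x0001003f on 4-byte boundaries. My understanding, which you should treat as an assumption, is that this value matches the ContextFlags field of an x86 CONTEXT record with all flag bits set, so a hit may mark a saved register context. The function name find_dword and the sample buffer are invented for illustration.

```python
import struct

PATTERN = struct.pack("<I", 0x0001003F)    # stored in memory as 3f 00 01 00

def find_dword(buffer, base_address, pattern=PATTERN):
    """Return the addresses of every 4-byte-aligned occurrence of pattern."""
    hits = []
    for offset in range(0, len(buffer) - 3, 4):
        if buffer[offset:offset + 4] == pattern:
            hits.append(base_address + offset)
    return hits

# Invented sample: 8 zero bytes, the pattern, then 4 more zero bytes
sample = bytes(8) + PATTERN + bytes(4)
hits = find_dword(sample, 0x06CEE000)
```

On this sample, hits contains the single address 0x06cee008. In a real dump the buffer would be a thread's stack or another memory range read out of the dump file, and each hit is only a candidate that still has to be inspected by hand.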