If broken it is, fix it you should

Using the powers of the debugger to solve the problems of the world - and a bag of chips
by Tess Ferrandez, ASP.NET Escalation Engineer (Microsoft)

.NET Crash: Managed Heap Corruption calling unmanaged code


If someone asked me what the nastiest type of issue I get is, I would say managed heap corruption, closely followed by native heap corruption.

Btw, if you are new to debugging .net issues with windbg this case study might not be the best one to start with. This is about as hard as it gets when debugging .net issues, so take a stroll through my other posts first.

Ok, you’re still here:) let’s see how we tackle this one…

Problem description

Once in a while, completely at random, the application (a console application) crashes, or it gets weird exceptions like null reference exceptions when we know the object is not null, and invalid handle exceptions. You name it...

Gathering information

Since we crash, the first thing we can do is take a crash dump with adplus -crash. Or better yet, let's take a couple of crash dumps and compare. (Short note: normally I would get the first dump, determine that it is a managed heap corruption, and ask for a few more dumps, but why not be proactive and save some time:))

Debugging the issue

We ran adplus -crash twice, and in the crash mode directories we find the following dump files:

First set of dumps:

  • 1st chance access violation mini dump
  • 2nd chance access violation mini dump
  • 1st chance process shutdown full dump

Second set of dumps:

  • 1st chance invalid handle mini dump
  • 2nd chance invalid handle full dump
  • 1st chance process shutdown full dump

Starting off with the first set, we can skip the mini dump for now since we have a full dump, presumably from the same access violation. Looking at the timestamps, I can see that the three dumps were taken within a couple of seconds, so it appears that a 1st chance access violation escalated to a 2nd chance since it wasn't handled, and that led to the process shutdown.

I open up the 2nd chance access violation dump. Since the dump was triggered by an exception, I'm positioned at the faulting thread by default, and the following info is displayed in the debugger:

eax=00ad22dc ebx=0015c828 ecx=ffffffff edx=00000006 esi=00000004 edi=0015433c
eip=7921636f esp=02c5fecc ebp=02c5fee0 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202
mscorwks!CFinalize::SetSegForShutDown+0x94:
7921636f f70100000008 test dword ptr [ecx],0x8000000 ds:0023:ffffffff=????????

Assuming that the symbols are correct (which they are), we are access violating in mscorwks!CFinalize::SetSegForShutDown+0x94, at the assembly instruction "test dword ptr [ecx],0x8000000". We access violate because [ecx] means that we dereference ecx (i.e. we look at what is at the address that ecx points to), and since ecx is ffffffff (an invalid pointer), dereferencing it causes the access violation.

Well, this doesn’t really tell us much about what happened, from here we could probably run a search on the internet for similar issues, but unfortunately the search would be fruitless…

Let’s take a look at the stack...

0:002> kb
ChildEBP RetAddr  Args to Child              
02c5fee0 792166d0 00000000 00000000 791d0ed0 mscorwks!CFinalize::SetSegForShutDown+0x94
02c5ff24 791cede0 00000000 80915704 7ffdc000 mscorwks!GCHeap::FinalizerThreadStart+0x171
02c5ffb8 77e66063 0015c860 00000000 00000000 mscorwks!Thread::intermediateThreadProc+0x44
02c5ffec 00000000 791ced9f 0015c860 00000000 kernel32!BaseThreadStart+0x34

Ok, so it's the finalizer thread, hmm… looks like a bug in the framework maybe??? Well, I'm not going to tell you that there are no bugs in the framework, but more often than not, when we crash in a finalizer or GC stack it is not because of a framework bug, but because of managed heap corruption.

Let me stop here and define the term managed heap corruption.

Managed heap corruption

When you create an object in .net (string s = new string(...) for example) your object is allocated on the .net heap.

Objects on the .net heap are stored sequentially like this:

Address    Size  Type	
0x00ad21e0 16    System.Int32[] 
0x00ad21f0 16    System.Int32[] 
0x00ad2200 16    System.Byte[] 
0x00ad2210 16    System.Byte[] 
0x00ad2220 20    System.Byte[]

In this example we have an Int32[] (of 16 bytes, or 0x10 bytes) at address 0x00ad21e0, followed by another Int32[] (of 16 bytes) at address 0x00ad21f0 etc., stored one after another like sardines in a can.

When a new object is created it gets created at the end of the last segment, right after the last object in that heap segment.

Now let’s assume this happens... Someone loops through the elements of the Int32 array setting the values, and somehow loops too far and writes past the end of the array. It would start overwriting the next Int32 array, or whatever object comes after it, and that object would no longer be valid (voila, a heap corruption)...

Ok, in .net that can’t really happen. If you index past the end of the array you will get an IndexOutOfRangeException, since array accesses are bounds-checked. But what if it could still happen somehow? Hmm, interesting theory:)

Bear with me for a moment and assume that this could somehow happen. What would happen then when we tried to access the object that was "destroyed"?

Member variables of the object could be overwritten with addresses that point to nowhere, to another object, or to null, causing all sorts of funky errors. And when the garbage collector walks the objects to figure out what should be collected, it would try to access objects at all sorts of bogus addresses, causing access violations during garbage collection.

In short, managed heap corruption is particularly evil because it causes very unpredictable errors, and because cause and effect are hard to connect: the code that caused the problem is not the code that will suffer the effects.

Back to the problem at hand...

Let’s take a look at where we are again:

eax=00ad22dc ebx=0015c828 ecx=ffffffff edx=00000006 esi=00000004 edi=0015433c
eip=7921636f esp=02c5fecc ebp=02c5fee0 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202
mscorwks!CFinalize::SetSegForShutDown+0x94:
7921636f f70100000008 test dword ptr [ecx],0x8000000 ds:0023:ffffffff=????????

Btw, if you have cleared the output and want to see the info above again, you can re-display it by running the command r.

So we are executing code at 7921636f (our EIP, the current instruction pointer).

If we disassemble the code around this we get the following:

0:002> u 7921636f-10 7921636f+10
mscorwks!CFinalize::SetSegForShutDown+0xd0:
7921635f ffff             ???
79216361 ffff             ???
79216363 5e               pop     esi
79216364 5b               pop     ebx
79216365 8be5             mov     esp,ebp
79216367 5d               pop     ebp
79216368 c20400           ret     0x4
7921636b 8b07             mov     eax,[edi]
7921636d 8b08             mov     ecx,[eax]
7921636f f70100000008     test    dword ptr [ecx],0x8000000
79216375 8bcb             mov     ecx,ebx
79216377 0f85fb4e0400 jne mscorwks!CFinalize::SetSegForShutDown+0x9e (7925b278)
7921637d 6a06             push    0x6

From this we can tell that eax was assigned the value of what was at edi (edi dereferenced), then ecx got the value of what was at eax (eax dereferenced), and ecx ended up as ffffffff (an invalid pointer).

0:002> dc 0015433c
0015433c  00ad22dc 00ad2324 00ad2ab0 00000000  ."..$#...*......
0015434c  00000000 00000000 00000000 00000000  ................
0015435c  00000000 00000000 00000000 00000000  ................
0015436c  00000000 00000000 00000000 00000000  ................
0015437c  00000000 00000000 00000000 00000000  ................
0015438c  00000000 00000000 00000000 00000000  ................
0015439c  00000000 00000000 00000000 00000000  ................
001543ac  00000000 00000000 00000000 00000000  ................

0:002> dc 00ad22dc 
00ad22dc  ffffffff ffffffff ffffffff ffffffff  ................
00ad22ec  ffffffff ffffffff ffffffff ffffffff  ................
00ad22fc  00ad273c 00000000 00000100 00000101  <'..............
00ad230c  00000000 79b9f15c 00000000 0000fde9  ....\..y........
00ad231c  00000100 00010000 79bd9a14 00000000  ...........y....
00ad232c  00ad23e4 00000000 00ac3508 00ad2310  .#.......5...#..
00ad233c  00ad23f4 00ad2514 00ad2408 00000000  .#...%...$......
00ad234c  00000080 00010000 80000000 79b94638  ............8F.y

Since we are suspecting a managed heap corruption, there is a handy command in sos.dll called !verifyheap that will tell us whether the managed heap is "correct". It might give false positives if the garbage collector is in the middle of moving objects around, but most of the time it is very useful.

I did mention that managed heap corruptions were hard to debug, right?:)

0:002> !verifyheap
VerifyHeap will only produce output if there are errors in the heap
Bad MethodTable for Obj at 0x00ad228c
Last good object: 0x00ad2274

Hmm, interesting. Verifyheap tells us that there is a bad method table for the object at 0x00ad228c, pretty close to the address in eax, and that the last good object was found at 0x00ad2274. This means that something likely wrote past the bounds of the object at 0x00ad2274, corrupting the object that was supposed to be at 0x00ad228c, and maybe it even overwrote whatever was supposed to be at the address we are looking at, 0x00ad22dc.

Time to take a look at the last good object...

0:002> !do 0x00ad2274
Name: System.Byte[]
MethodTable 0x009b2c3c
EEClass 0x009b2bc4
Size 24(0x18) bytes
GC Generation: 0
Array: Rank 1, Type System.Byte
Element Type: System.Byte
Content: 10 items

This was supposed to be a byte array with 10 items (10 bytes), but it seems like someone stored a lot more than that in there... maybe we can get a clue if we dump out the contents...

0:002> dc 0x00ad2274
00ad2274  009b2c3c 0000000a e011cfd0 e11ab1a1  <,..............
00ad2284  00000000 00000000 00000000 00000000  ................
00ad2294  0003003e 0009fffe 00000006 00000000  >...............
00ad22a4  00000000 00000001 00000046 00000000  ........F.......
00ad22b4  00001000 00000048 00000001 fffffffe  ....H...........
00ad22c4  00000000 00000045 ffffffff ffffffff  ....E...........
00ad22d4  ffffffff ffffffff ffffffff ffffffff  ................
00ad22e4  ffffffff ffffffff ffffffff ffffffff  ................
0:002> d
00ad22f4  ffffffff ffffffff 00ad273c 00000000  ........<'......
00ad2304  00000100 00000101 00000000 79b9f15c  ............\..y
00ad2314  00000000 0000fde9 00000100 00010000  ................
00ad2324  79bd9a14 00000000 00ad23e4 00000000  ...y.....#......
00ad2334  00ac3508 00ad2310 00ad23f4 00ad2514  .5...#...#...%..
00ad2344  00ad2408 00000000 00000080 00010000  .$..............
00ad2354  80000000 79b94638 00000007 00000006  ....8F.y........
00ad2364  00740073 00650072 006d0061 00000000  s.t.r.e.a.m.....

I don't know about you, but that doesn't seem like anything I recognize.

This is where I scratch my head and ask for another dump:)

Next set of dumps...

If we open up the 2nd chance invalid handle dump from the second set, the debugger displays this:

(bf0.f1c): Invalid handle - code c0000008 (!!! second chance !!!)
eax=c0000008 ebx=0015c428 ecx=0012f5fc edx=7c82ed04 esi=0012f630 edi=0012f8b0
eip=7c82ed3b esp=0012f5a8 ebp=0012f5f8 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202
ntdll!KiRaiseUserExceptionDispatcher+0x37:
7c82ed3b 8b0424           mov     eax,[esp]         ss:0023:0012f5a8=c0000008

We can see that we're stopped in the exception dispatcher, i.e. in the code reporting the exception, so we'll have to look at the stack to get some information about where the exception actually occurred.

0:000> kb
ChildEBP RetAddr  Args to Child              
0012f5f8 7c82ed53 7c821144 77e6c1fe 544e4520 ntdll!KiRaiseUserExceptionDispatcher+0x37
0012f5fc 7c821144 77e6c1fe 544e4520 0012f654 ntdll!KiFastSystemCall+0x3
0012f600 77e6c1fe 544e4520 0012f654 0098a586 ntdll!ZwClose+0xc
0012f60c 0098a586 544e4520 7c82ed54 544e4520 kernel32!CloseHandle+0x59
WARNING: Frame IP not in any known module. Following frames may be wrong.
0012f69c 7923c069 0012f7b4 7923e0bc 0012f6f0 0x98a586
0012f6a4 7923e0bc 0012f6f0 00000000 0012f6c8 mscorwks!CallDescrWorker+0x30
0012f7b4 7923e2a7 00995333 0016f350 00000004 mscorwks!MethodDesc::CallDescr+0x1b8
0012f870 7923e315 00995333 0016f350 00402ac7 mscorwks!MethodDesc::CallDescr+0x4f
0012f898 7923a6b5 0012f928 00000000 0015c428 mscorwks!MethodDesc::Call+0x97
0012f950 7923a8fa 00995338 00000001 00000000 mscorwks!ClassLoader::CanAccess+0x1d6
0012fa64 7923a56f 0016f350 00000000 79041394 mscorwks!ClassLoader::ExecuteMainMethod+0x49d
0012fa7c 7923a4ba 00000000 0012fd70 00000000 mscorwks!Assembly::ExecuteMainMethod+0x21
0012fd60 791c6afa 00000000 00000001 0012ffe0 mscorwks!SystemDomain::ExecuteMainMethod+0x421
0012ffa0 791c69f2 8013141b 00000000 00000000 mscorwks!ExecuteEXE+0x1ce
0012ffb0 7917dce7 00000000 791b0000 0012fff0 mscorwks!_CorExeMain+0x59
0012ffc0 77e523cd 00000000 00000000 7ffde000 mscoree!_CorExeMain+0x30
0012fff0 00000000 7917dcbb 00000000 78746341 kernel32!BaseProcessStart+0x23

0:000> !clrstack
Thread 0
ESP         EIP       
0x0012f630  0x7c82ed3b [FRAME: NDirectMethodFrameStandalone] [DEFAULT] Boolean UsingPInvoke.FileReader.CloseHandle(I)
0x0012f640  0x02e302fe [DEFAULT] [hasThis] Boolean UsingPInvoke.FileReader.Close()
0x0012f65c  0x02e30196 [DEFAULT] I4 UsingPInvoke.Test.Main(SZArray String)
0x0012f8b0  0x7923c069 [FRAME: GCFrame] 
0x0012f9b0  0x7923c069 [FRAME: GCFrame] 
0x0012fa94  0x7923c069 [FRAME: GCFrame] 

From the Main method we call into FileReader.Close(), which calls FileReader.CloseHandle(int), which in turn makes the native call kernel32!CloseHandle, calling into ntdll!ZwClose, and there we get the exception.

We know we are looking for a bad handle, and since CloseHandle is a Windows API, we can look the call up in MSDN to figure out how the handle is passed in.

BOOL CloseHandle(
  HANDLE hObject
);
Parameters
hObject 
[in] Handle to an open object. This parameter can be a pseudo handle or INVALID_HANDLE_VALUE. 

Fair enough... so the first (and only) parameter passed in is the bad handle... From the kb output this would be the first "Args to child", i.e. 544e4520 below.

ChildEBP RetAddr  Args to Child  
...
0012f60c 0098a586 544e4520 7c82ed54 544e4520 kernel32!CloseHandle+0x59

0:000> !handle 544e4520
Handle 544e4520
  Type         	<Error retrieving type>

Yepp, certainly looks bogus... A handle normally looks something like this: 0000077c, and if you dump it out with !handle you would see something along the lines of:

0:000> !handle 0000077c
Handle 0000077c
  Type         	Event

So how did our handle get like that?

From the stack we know that it was passed in from FileReader.Close to FileReader.CloseHandle. Let's dump all the parameters and locals we can get to with !clrstack -all:

0:000> !clrstack -all
Thread 0
ESP         EIP       
ESP/REG    Object     Name
0x0012f630  0x7c82ed3b [FRAME: NDirectMethodFrameStandalone] [DEFAULT] Boolean UsingPInvoke.FileReader.CloseHandle(I)
ESP/REG    Object     Name
0x0012f640  0x02e302fe [DEFAULT] [hasThis] Boolean UsingPInvoke.FileReader.Close()
    EDI 0x00ad2124 ESI 0x00000000 EBX 0x00ad2124 EDX 0x7c82ed04 ECX 0x0012f5fc 
    EAX 0xc0000008 EBP 0x0012f654 ESP 0x0012f640 EIP 0x02e302fe 
  at [+0x16] [+0x0] g:\windbgdemos\usingpinvoke\class1.cs:78
    PARAM: this: 0x00ad2124 
    LOCAL: bool CS$00000003$00000000: false
ESP/REG    Object     Name
0x0012f648 0x00ad16b8 System.Object[]
0x0012f65c  0x02e30196 [DEFAULT] I4 UsingPInvoke.Test.Main(SZArray String)
    EDI 0x00ad16b8 ESI 0x00000000 EBX 0x00ad2124 EDX 0x7c82ed04 ECX 0x0012f5fc 
    EAX 0xc0000008 EBP 0x0012f69c ESP 0x0012f65c EIP 0x02e30196 
  at [+0x13e] [+0x74] g:\windbgdemos\usingpinvoke\class1.cs:116
    PARAM: class System.String[] args: 0x00ad16b8 
    LOCAL: class System.Text.ASCIIEncoding Encoding: 0x00ad213c 
    LOCAL: int32 bytesRead: 128
    LOCAL: int32 CS$00000003$00000000: 0
    LOCAL: unsigned int8[] buffer: 0x00ad210c 
    LOCAL: class UsingPInvoke.FileReader fr: 0x00ad2124 
ESP/REG    Object     Name
0x0012f694 0x00ad210c System.Byte[]
0x0012f6cc 0x00ad16b8 System.Object[]
0x0012f8b0  0x7923c069 [FRAME: GCFrame] 
ESP/REG    Object     Name
0x0012f8cc 0x00ad16cc System.String    c:\output.txt
0x0012f928 0x00ad16b8 System.Object[]
0x0012f92c 0x00ad16b8 System.Object[]
0x0012f9b0  0x7923c069 [FRAME: GCFrame] 
ESP/REG    Object     Name
0x0012fa94  0x7923c069 [FRAME: GCFrame]

Since the handle is passed in to the CloseHandle wrapper as an int, it will be passed in a register, so we can't really see it here; but we can dump out the code for Close to take a look at it:

0:000> !u 0x02e302fe 
Will print '>>> ' at address: 0x02e302fe
Normal JIT generated code
[DEFAULT] [hasThis] Boolean UsingPInvoke.FileReader.Close()
Begin 0x02e302e8, size 0x29
02e302e8 55               push    ebp
02e302e9 8bec             mov     ebp,esp
02e302eb 83ec08           sub     esp,0x8
02e302ee 57               push    edi
02e302ef 56               push    esi
02e302f0 53               push    ebx
02e302f1 8bf9             mov     edi,ecx
02e302f3 33f6             xor     esi,esi
02e302f5 8b4f04           mov     ecx,[edi+0x4]
02e302f8 ff15c0549900     call    dword ptr [009954c0] (UsingPInvoke.FileReader.CloseHandle)
>>> 02e302fe 0fb6d8           movzx   ebx,al
02e30301 0fb6c3           movzx   eax,bl
02e30304 8bf0             mov     esi,eax
02e30306 eb00             jmp     02e30308
02e30308 8bc6             mov     eax,esi
02e3030a 5b               pop     ebx
02e3030b 5e               pop     esi
02e3030c 5f               pop     edi
02e3030d 8be5             mov     esp,ebp
02e3030f 5d               pop     ebp
02e30310 c3               ret

Just prior to the CloseHandle call we can see that what is at address edi+0x4 gets stored in ecx; this is probably our handle...

EDI at this point was 0x00ad2124 from the !clrstack output, and in this method EDI holds the this pointer (so the FileReader): it came in through ecx, as usual for instance methods, and was copied to edi at the top of Close.

If we take a peek at 0x00ad2124+0x4, where our handle is supposed to be, we get this:

0:000> dc 0x00ad2124+0x4
00ad2128  544e4520 52505245 20455349 54494445   ENTERPRISE EDIT
00ad2138  204e4f49 444c4f47 4e490a0d 4e524554  ION GOLD..INTERN
00ad2148  45205445 4f4c5058 20524552 20302e36  ET EXPLORER 6.0 
00ad2158  20524f46 444e4957 2053574f 56524553  FOR WINDOWS SERV
00ad2168  32205245 20333030 444c4f47 49570a0d  ER 2003 GOLD..WI
00ad2178  574f444e 454d2053 20414944 59414c50  NDOWS MEDIA PLAY
00ad2188  39205245 52455320 20534549 00ad25d4  ER 9 SERIES .%..
00ad2198  00000000 00000100 00000101 00000000  ................

Ok, so we can clearly see where it got the 544e4520 from... but hey!!! Take a look at the right-hand side… that is text, and this was supposed to be a handle... ding ding ding ding!!! Looks an awful lot like we were right in thinking that someone is overwriting the memory...

My coworker Doug calls this debugging technique the "poking around a bit until you find something interesting" technique:) and more often than not, this is the only technique that will work:)

Time for !verifyheap again...

0:000> !verifyheap
VerifyHeap will only produce output if there are errors in the heap
Bad MethodTable for Obj at 0x00ad2124
Last good object: 0x00ad210c

The last good object is a byte[10] again. Does it seem familiar?:)

0:000> !do 0x00ad210c
Name: System.Byte[]
MethodTable 0x009b2c3c
EEClass 0x009b2bc4
Size 24(0x18) bytes
GC Generation: 0
Array: Rank 1, Type System.Byte
Element Type: System.Byte
Content: 10 items

This time it contains some data though...

0:000> dc 0x00ad210c
00ad210c  009b2c3c 0000000a 444e4957 2053574f  <,......WINDOWS 
00ad211c  56524553 32205245 2c333030 544e4520  SERVER 2003, ENT
00ad212c  52505245 20455349 54494445 204e4f49  ERPRISE EDITION 
00ad213c  444c4f47 4e490a0d 4e524554 45205445  GOLD..INTERNET E
00ad214c  4f4c5058 20524552 20302e36 20524f46  XPLORER 6.0 FOR 
00ad215c  444e4957 2053574f 56524553 32205245  WINDOWS SERVER 2
00ad216c  20333030 444c4f47 49570a0d 574f444e  003 GOLD..WINDOW
00ad217c  454d2053 20414944 59414c50 39205245  S MEDIA PLAYER 9
0:000> d
00ad218c  52455320 20534549 00ad25d4 00000000   SERIES .%......
00ad219c  00000100 00000101 00000000 79b9f15c  ............\..y
00ad21ac  00000000 0000fde9 00000100 00010000  ................
00ad21bc  79bd9a14 00000000 00ad227c 00000000  ...y....|"......
00ad21cc  00ac3508 00ad21a8 00ad228c 00ad23ac  .5...!..."...#..
00ad21dc  00ad22a0 00000000 00000080 00010000  ."..............
00ad21ec  80000000 79b94638 00000007 00000006  ....8F.y........
00ad21fc  00740073 00650072 006d0061 00000000  s.t.r.e.a.m.....

Some kind of string... I can't tell exactly where it comes from, but I think a good next step would be to go digging in the code for someone allocating a byte[10] and then writing past its boundaries.

Brief excursion

The most common reason for managed heap corruption is bad P/Invokes. We pass a buffer to a native (non-.net) API that is supposed to return some data in it, but the buffer is too small for the result. Since the API has no clue about .net, and no clue about the buffer's boundaries, it just happily writes its data.

Looking at the code

I wrote my application based on an MSDN sample showing how to use the ReadFile function from C#.

I then modified the code a little bit so that my main function has this code... Notice that the first parameter passed to fr.Read is a byte[10], while the 3rd parameter is 128...

byte[] buffer = new byte[10];
FileReader fr = new FileReader();

if (fr.Open(args[0]))
{
	int bytesRead;
	bytesRead = fr.Read(buffer, 0, 128);
	fr.Close();
	return 0;
}

My read function is basically a call to the ReadFile API, passing a buffer, a start index, and the number of bytes I want to read

public unsafe int Read(byte[] buffer, int index, int count)
{
	int n = 0;
	fixed (byte * p = buffer)
	{
		if (!ReadFile(handle, p + index, count, &n, 0))
		{
			return 0;
		}
	}
	return n;
}

And the ReadFile function is defined in my app like this

[System.Runtime.InteropServices.DllImport("kernel32", SetLastError = true)]
static extern unsafe bool ReadFile
(
	System.IntPtr hFile,      // handle to file
	void* pBuffer,            // data buffer
	int NumberOfBytesToRead,  // number of bytes to read
	int* pNumberOfBytesRead,  // number of bytes read
	int Overlapped            // overlapped buffer
);

So when I call Read(buffer, 0, 128) it will put the first 128 bytes of the file into the buffer, which I had erroneously defined as being only 10 bytes, and thus it will overwrite whatever object happened to come after my buffer on the managed heap, causing the managed heap corruption.

In the first case (access violation) I had read a Word doc, and the bytes I read in contained FFFFFFFF, which caused the access violation. In the second case I had read a text file, so what was supposed to be a handle was actually some text from the file.

The moral of the story? First off, if you call an API and one of the parameters is an [out] buffer, make sure that the buffer is large enough to store the result. Secondly, if you get this kind of issue, the first thing you should look for in the code is calls to unmanaged APIs.

Adam Nathan has created a site called http://www.pinvoke.net and an awesome add-in to Visual Studio for generating correct PInvoke/API signatures. Check it out on his blog at http://blogs.msdn.com/adam_nathan/archive/2004/05/06/127403.aspx

Laters...  

 





  • For situations like this, it's still hard to tell where the corruption is occurring.  In this case, the contents of the buffer provided a hint, but that's not always going to help.  One thing that may help is to use !gcroot on the last good object (the byte[10] in this case) - if the object is still alive, its root path should give an indication of where in the source code this object is being held (and therefore where it is being used).  Often by looking at how the specific object is being used in the source, you can quickly find some suspect PInvoke or other call which could be causing the corruption.

    In the long run, my team (CLR debugging support) would like to enable "time-traveling debugging" where you could record an execution trace (if the repro scenario permits) , and then just say "go back to the spot where this memory was overwritten".  In some cases like this, it could make tracking down these sorts of problems extremely easy!
  • Thanks for pointing out !gcroot, these issues are definitely not among the easier ones to get to the bottom of.

    I, for one, am really excited about the "time-traveling debugging". I really can't wait to see it.

    Currently, if we can't find the fault using the steps above or with !gcroot like Rick mentions, we have to run something called GCStress, which essentially triggers a GC on each allocation so that we get the exception closer to the faulting call. But of course, as you can imagine, this is really bad performance-wise, so it generally can't be done in production.  And this is why I'm doubly excited about the time-traveling stuff...  plus, anything involving the word time-travel just has to be exciting, right:)