Hello my name is Steven Toscano, I am an SDET Tech Lead on the Visual C++ team. I am currently working in the IDE group focusing on improving our testing methodologies and expanding our library of tools. I’ve been a member of many teams across Visual C++ during my 7 year career including front-end, back-end optimizers, libraries (MFC, ATL, CRT), and now IDE.
One of the many pleasures I get from doing my job is the ability to diagnose a problem and discover its root cause. I love digging into things as deep as I can until the asm is staring me in the face exposing those nasty little bugs. Sometimes going that deep isn’t necessary, sometimes you have to get creative about how to diagnose an issue. This is especially the case when you’re dealing with complex multi-threading bugs.
One particular bug I looked into turned into a three day investigation. The net effect of the bug looked like this:
1. Open devenv
2. Load a solution that contains two C++ projects
3. Go to the Options dialog and set the “maximum number of parallel project builds” to 2 (I think this is the default if you’ve installed VS on a dual-processor machine)
4. Build the solution
Result is an Access Violation at VCProjectEngine.dll!CBldFileRegEntry::ReleaseFRHRef()
That’s right, the IDE just CRASHES! The worst possible behavior a customer could encounter (aside from data loss). And even better, this crash is not consistently reproducible; it might happen in 1 out of 100 times or not at all.
So the typical process is to get exact repro steps, a full callstack when the AV hits and to attach a dump to the bug. After some quick searching in our bug database I found a bug with a similar callstack logged by someone outside our team about one month prior. I was a bit concerned because this bug was resolved as No Repro and included full dump info!
Since I was encountering this problem on my multi-proc machine I decided to take a deeper look. First of all the callstack up to the AV looked a bit weird. Notice the function at the top of the stack.
Doesn’t resolve to symbol name and sure enough a check in the memory window shows this address is pointing into outer space. Here is the relevant source code at this point:
// Prevent access to registry by other threads.
if (m_nRefCount <= 0)
if (m_nRefCount == 0)
The AV occurred when the SafeDelete function was called. I hit this bug once and left the debug session open on my machine for the whole three days. Given the working set tax of debugging this was slowing down my machine for my normal day-to-day duties, but I had to leave it just in-case I couldn’t repro it again.
Drilling into this mysterious AV at a callsite, I thought it was bad code gen, here is the relevant asm for the ReleaseFRHRef function:
if (m_nRefCount <= 0)
5B00B14A mov eax,dword ptr [edi+18h]
5B00B14D test eax,eax
5B00B14F jbe CBldFileRegEntry::ReleaseFRHRef+32h (5B00B15Eh)
5B00B151 dec eax
5B00B152 mov dword ptr [edi+18h],eax
if (m_nRefCount == 0)
5B00B155 jne CBldFileRegEntry::ReleaseFRHRef+32h (5B00B15Eh)
5B00B157 mov eax,dword ptr [edi]
5B00B159 mov ecx,edi
5B00B15B call dword ptr [eax+8]
The last three instructions is the compiler generated code for a call through a vtable. Sure enough after debugging the code SafeDelete is the first virtual function call that we are making since we started executing code in this class. EDI is the this pointer, dereferencing that gives you the entry into the vtable. An offset of 8 on the vtable should give you back the address of the SafeDelete virtual function (since it’s the third vfunction). But as we can see indirecting that address gives the access violation. So someone trashed the contents of our object including the vtable.
Now the question is who trashed our object? This was beginning to smell like a ref counting problem. I started to look at how the CBldFileRegEntry class was being used by looking at its implementation and where it is referenced. This class was retro-fitted to allow for simultaneous access through different threads. This was done through a shared reference or handle where uses of this handle were protected by Critical Sections sprinkled throughout the code.
After Everett shipped the Visual C++ team started to work on parallel project building for the Whidbey product (see this blog for details). This feature is great for multi-processor machines if you have a large solution with many leaf-node projects (i.e. projects that don’t have intra-dependencies). The set of classes that are on this callstack are some of the ones that were modified to get this functionality enabled in Whidbey.
So all of this cool functionality but with random crashes? I had to take a deeper look at what was going on. The ref counting theory was beginning to make sense. If one thread had decremented the reference count too early another thread might think there are no references left to the shared object and delete it. While another thread is still using the object – and the next access to the object’s contents results in an AV right in the middle of the code. The fact that we’re seeing an AV is a good thing – Visual C++ 2005 will dirty the memory if you delete an object (filling it with sequences of 0xfeeefeee) imagine if that wasn’t the case – you wouldn’t see any crash just random incorrect behavior.
There wasn’t much you could do with the information in the bug – if it was not easily reproducible you could not rerun the scenario and walk through the code. And with the dump that was attached you could only see the *end* result – the thread that did the damage is long gone.
So I took a different approach and extracted the relevant pieces of the code into my own simulation. This was easy since the class was pretty self-contained, any references or dependencies to things that weren’t important I removed. Then I created a housing for the code – lifted straight off this MSDN article. It was a simple app that created threads through the CreateThread() Win32 API. I set the number of threads to 30. And in my ThreadProc I added the code that accesses the shared resource the same way as the CBldFileRegEntry code. Sure enough I hit the AV like once every other run. So now I had a very consistent repro!
With this little setup I was able to start closing in on the bug by using a binary search technique. I put a big Critical Section block around the first half of the code and left the rest as is. I ran the scenario and it didn’t repro (since that whole piece of code was blocking all threads!). Since it didn’t hit I tried the second half and it started to repro. Now that I had the right half I continued the process until I was down to a function level. Here was the suspect function:
BldFileRegHandle CBldFileRegFile::GetFileHandle(LPCOLESTR szFileName, BOOL bVerifyCase)
szKey = path;
szKey = szFileName;
BldFileRegHandle hndFileReg = NULL;
// Prevent access to file map by other threads.
hndFileReg = LookupFileHandleByName(szKey, FALSE);
if (NULL == hndFileReg)
CBldFileRegFile * pregfile = new CBldFileRegFile(&path, FALSE /* already verified */);
if (NULL != pregfile)
hndFileReg = (BldFileRegHandle)pregfile;
VSASSERT(NULL != s_pFileNameMap, "File name map not initialized before we tried to use it.");
if (NULL != s_pFileNameMap)
ATLTRACE("Trying to use non-initialized file name map!\n");
#endif // TRACE_FILE_REGISTRY_ASSERTS_LEVEL2
Can you spot the bug? It turns out drilling in further I found the block of code that failed if the whole thing was not protected:
Now the bug isn’t that the Critical Section is in the wrong place – its that the AddFRHRef function is not in the same block as the Critical Section. So this code should really be:
Moved line in bold. Anytime you are grabbing a reference to the shared handle you need to increment the refcount. I could see how this could look of confusing since the implementation of AddFRHRef looked like this:
Makes you think: well if the Add Ref is protected then I am ok right? No! You need to match up your Critical Section blocks. This fun investigation led to a number of interesting ideas and possible tools to assist with debugging and even correctness testing. These will be used to improve the quality of our product for the next release. And yes this bug was fixed in RTM so your parallel-build experience should be crash free.
Its important to note that while all bugs can’t undergo this level of scrutiny – there are some that are particularly hairy and if not diagnosed could be problematic for large scale customer distribution. And of course if the investigation leads to tools and techniques to remedy the situation in other codebases then effort was well worth it!
Any thought about trying to enhance PREfast to catch errors like this?
A better approach to preventing this type of issue would be to use smart pointer objects to manage reference counting, ala boost::shared_pointer<>.
You reduce the need for special tools to detect this kind of problem if you take steps to prevent it in the first place.
Three-day debugging sessions are exactly the reason why proper hygiene must be employed. The code you were debugging is inherently unsafe, like handling razor blades with razor blades is unsafe. First, all the ref-counting should be abstracted away int a class that knows how to deal with it -- you have one in house, called CComPtr. There is a number of tricks, such as assignment to self, that your ATL team thought trough. Second, AddRef and Release should use InterlockedIncrement / InterlockedDecrement -- again, solved problems. Finally, you underestimate the number of different run-cases in your code. You haven't fixed the problem, you're still handling razor blades with bare hands. For example, your C-cast is potentially invalid, and should be a static_cast. The AddRef should have happened the moment new was called.
Thanks for all your comments and suggestions. Just to be clear, the example given was a code snippet taken from an old codebase. This problem was discovered as part of a regular review to get rid of this type of unsafe code. We have been making a concerted effort to replace these artifacts with things that are inherently more safe, such as CComPtr.
good post.. i enjoyed reading it. I am also glade to see that you are using a smart pointer to handle your reference counts.
keep up the good work !
Thats some seriously bad code right there...
> And yes this bug was fixed in RTM so your
> parallel-build experience should be crash
I've got news for you. And dumps, I've probably sent a dozen.
On one machine with hyperthreading and XP, I disabled hyperthreading a few months ago. VS2005 no longer parallel-builds and it has become more stable.
By the way I've seen "No repro" quite a lot, in Visual Studio and other products. To me those words just mean "Reproducible only outside of Microsoft".