The official source of product insight from the Visual Studio Engineering Team
This post describes a problem we encountered during the development of Visual Studio 2010, when we rewrote some components in managed code, and what we did to solve it. This is a rather lengthy technical discussion and I apologize in advance for its dryness. It’s not essential to understand every detail, but the lesson we learned here may be valuable for others working with large “legacy” code bases.
As I mentioned in the Background, we rewrote several components for Visual Studio 2010: specifically, the window manager, the command bars and the text editor. In previous versions of Visual Studio, these were native components written in C++. In Visual Studio 2010, each was rewritten in managed code using C#.
Extensions which plug into Visual Studio communicate with these components through COM interfaces. Moving to managed code doesn’t change how these extensions communicate with platform components. Indeed, that’s the promise of interface based programming – i.e. you don’t need to know implementation details in order to communicate with a component via its interface. In the case of the new editor, it so happens that we introduced a new, managed programming model for new extensions, but even so, we had to keep the existing COM interfaces for older extensions.
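Managed code and COM are brought together through the magic of COM Interop. Briefly, this allows two things to happen: managed code can call COM objects, and managed objects can implement COM interfaces so that native code can call them.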
Let’s take each of these in turn. (If you already know how COM Interop works, you can skip the following two sections.)
The Common Language Runtime, CLR or just the “Runtime”, can make a COM object look just like a regular managed object. It does this with a special kind of object called a “Runtime Callable Wrapper” or RCW. RCWs bridge the managed, garbage-collected world with the native, ref-counted world. An RCW is created when “an IUnknown enters the runtime” (IUnknown is the minimum interface that all COM objects must implement). When does that happen? Usually, as the result of an interop call to a native method which hands back a COM interface. In fact, typically, it’s the result of a method call on an existing COM object. Since that sounds a little bit like a “chicken and egg problem”, let me give a concrete example. At the heart of the Visual Studio platform lies the Global Service Provider. This service provider keeps track of services offered up (“proffered”) by components in the system. Other components can request a service by calling the IServiceProvider.QueryService method on the Global Service Provider object. If successful, the service returned to the caller will be another COM object, identified by a pointer to its IUnknown interface. If the component making the QueryService call is managed then, at the point where that pointer enters the Runtime, an RCW is created for the service. Of course, this still raises the question: “How did the managed component get hold of the Global Service Provider?” The answer is that the Global Service Provider was passed to the managed component by the platform when that component was first initialized.
The tools and the Runtime also make the other direction, implementing a COM interface on a managed object so that native code can call it, very easy. To implement a COM interface on a managed object, you first need to locate or create an interop assembly containing the managed equivalent of that COM interface. By referencing that interop assembly from managed code and writing classes which implement those interfaces, you create COM compatible managed classes. There are a few other requirements (e.g. your classes must also be marked as COM visible either at the assembly level or on a per class basis), but otherwise it’s straightforward. When an instance of one of these classes is passed through the interop layer to native code, the CLR creates a COM Callable Wrapper or CCW. The CCW, among other things, preserves all the COM rules about identity and the lifetime of the wrapped object. For example, for as long as at least one native component holds a reference on the CCW, the underlying managed object cannot be claimed by the Garbage Collector, even if there are no other managed roots. As far as the native code is concerned, it deals with an IUnknown, unaware that the object is really a managed object.
With that rather lengthy (my apologies) recap of COM Interop out of the way, let me describe the problem. Imagine, for the sake of a simplified example, that you have a component called the “Text Manager”. The Text Manager, as you might guess, handles requests about textual things in an editor. Other components communicate with the Text Manager via the ITextManager interface, which has methods such as “GetLines” or “HighlightWord”. ITextManager is a COM interface. Now imagine that there’s a second component that implements a “Search” facility for finding words in a document. The Search component is written in managed code. Obviously, this Search component will need access to the Text Manager to get its job done, and I’m going to lead you through the scenario of performing a “Find” twice: once when the Text Manager is implemented in native code, and a second time when the Text Manager is managed.
The ‘find’ operation begins with the Search component asking for the Text Manager service via the Global Service Provider. This succeeds and the Search component gets back a valid instance of ITextManager. Since, in this first walkthrough, the Text Manager is a native COM object, the IUnknown returned is wrapped by the runtime in an RCW. As far as the Search component is concerned, though, it sees ITextManager. It doesn’t know or care (yet) whether the actual implementation is native or managed. The find operation continues with the Search component making various calls through ITextManager to complete its task. When the task is done, the ‘find’ operation exits and life is good. Well… almost. The ITextManager is an RCW and, as such, it has the same kind of lifetime semantics as any other managed object, i.e. it will be cleaned up as and when the Garbage Collector runs. If there’s not much memory pressure in the system, then the Garbage Collector may not run for a long time, if at all, and here is where the native and managed memory models clash to create a problem. You see, as far as the Search component is concerned, it’s finished with the Text Manager, at least until the next ‘find’ operation is requested. If there were no other components needing the Text Manager, now would be a great time for the Text Manager to be cleaned up. Indeed, if the Search component were written in native code, at the point of exiting the ‘find’ routine, it would call “Release” on the ITextManager to indicate that it no longer needs the reference. Without that final “Release”, it looks like a reference counting leak of the Text Manager, at least until the next garbage collection. This is a special, though not unusual, case of non-deterministic finalization.
This is just an example, but situations just like it really happened many times during Visual Studio 2005 and 2008 development. The bug reports would say that ‘expensive’ components were being reported as leaked objects, usually at shutdown. The “solution”, as a few people discovered, was to insert a call to “Marshal.ReleaseComObject” at the point where the expensive component (the Text Manager in our example) was no longer needed. The RCW is released, causing its internal reference count to drop by one and, typically, releasing the underlying COM object. No more leaked references and problem solved! Well, at least for now, as we’ll see. Regretfully, once this “solution” appeared in the source code of a few components, it spread rapidly as the ‘quick fix’ for leaked components and that’s how we shipped. The trouble started when we began migrating some components from native code to managed code in VS 2010.
To explain, I’ll return to the ‘find’ scenario, this time with the Text Manager written in managed code. The Search component, as before, requests the Text Manager service via the Global Service Provider. Again, an ITextManager instance is returned and it’s an RCW. However, this RCW is now a wrapper over a COM object which is implemented in managed code, that is, a CCW. This double wrapping (an RCW around a CCW) is not a problem for the CLR and, indeed, it should be transparent to the Search component. Once the ‘find’ operation is complete, control leaves the Search component and life is good. Except that, on the way out, the Search component still calls “Marshal.ReleaseComObject” on the ITextManager’s RCW and, “oops!”, we get an ArgumentException with the message “The object's type must be __ComObject or derived from __ComObject.”. You see, the CLR is able to see through the double-wrapping to the underlying component and figure out that it is really a managed object.
There’s really no workaround for this except to find all the places where “ReleaseComObject” was called and remove them. Some have suggested that, before calling ReleaseComObject, we should first check whether it’s going to succeed by calling “Marshal.IsComObject” but, as we’ll see in the next section, there is another, more insidious problem still lurking.
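To make the failure concrete, here is a minimal toy model in Python. It is a sketch of the behavior described above, not the CLR's implementation: the names `enter_runtime`, `is_com_object`, `release_com_object` and `ArgumentError` are hypothetical stand-ins for the interop layer, Marshal.IsComObject, Marshal.ReleaseComObject and ArgumentException, and both Text Manager classes are invented for illustration.

```python
class ArgumentError(Exception):
    """Stands in for the CLR's ArgumentException in this toy model."""


class Rcw:
    """Toy Runtime Callable Wrapper around a native COM object."""

    def __init__(self, com_object):
        self.com_object = com_object

    def get_lines(self):
        return self.com_object.get_lines()


class NativeTextManager:
    """Pretend this is implemented in C++ and lives outside the runtime."""

    def get_lines(self):
        return ["native line"]


class ManagedTextManager:
    """Pretend this is the C# rewrite; the runtime knows it is managed."""

    def get_lines(self):
        return ["managed line"]


def enter_runtime(service):
    """Model of an interface pointer entering managed code.

    A native object gets wrapped in an RCW. A managed object is handed back
    as itself, because the runtime sees through the RCW-around-CCW wrapping.
    """
    if isinstance(service, NativeTextManager):
        return Rcw(service)
    return service


def is_com_object(obj):
    """Model of Marshal.IsComObject: true only for RCWs."""
    return isinstance(obj, Rcw)


def release_com_object(obj):
    """Model of Marshal.ReleaseComObject: rejects anything that is not an RCW."""
    if not is_com_object(obj):
        raise ArgumentError(
            "The object's type must be __ComObject or derived from __ComObject.")
    obj.com_object = None


def find(service):
    """The Search component's 'find', including the problematic cleanup call."""
    text_manager = enter_runtime(service)
    lines = text_manager.get_lines()
    release_com_object(text_manager)
    return lines


print(find(NativeTextManager()))       # works with the native implementation
try:
    find(ManagedTextManager())         # same caller, managed implementation
except ArgumentError as error:
    print("find failed:", error)
```

In this model, the caller's code is identical in both runs; only the implementation behind the interface changed. Guarding the cleanup with `is_com_object` would avoid the exception here, but it does nothing about the second problem.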
For this second problem, we’ll return to our original example, with the Text Manager implemented in native code. Even with the ‘safeguard’ of Marshal.IsComObject, the Search component calls ReleaseComObject and goes on its way. However, the RCW has now been poisoned. As far as the CLR is concerned, by calling ReleaseComObject, the program has declared that the RCW is no longer needed. However, it’s still a valid object, and that means it may be reachable from other managed code. If it is reachable, then the next time ITextManager is accessed from managed code through that RCW, the CLR will throw an InvalidComObjectException with a message of “COM object that has been separated from its underlying RCW cannot be used”.
How can that happen? There are several ways - some common and some subtle. The most common case of attempting to re-use an RCW is when the services are cached on the managed side. When services are cached, instead of returning to the Global Service Provider each time the Text Manager (for example) is requested, the code first checks in its cache of previously requested services, helpfully trying to eliminate a (potentially costly) call across the COM interop boundary. If the service is found in the cache, then the cached object (an RCW) is returned to the caller. If two components request the same service, then they will both get the same RCW. Note that this ‘cache’ doesn’t have to be particularly complicated or obvious – it can be as subtle as storing the service in a field (member variable) for later use.
I’ve called this use of Marshal.ReleaseComObject the “silent assassin” because, while the problem occurs at the point of the call to ReleaseComObject, it is not detected until later when another component innocently tries to access the poisoned RCW. At first glance, it appears that the second component has a bug, but it does not – the component that called ReleaseComObject is the assassin and ‘he has left the room’.
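The cached-service scenario can be sketched as a toy model in Python. Everything here is an assumption made for illustration: `Rcw`, `ServiceCache`, `search_find` and `spell_check` are hypothetical names, `spell_check` is an invented second component, and the real RCW bookkeeping inside the CLR is more involved. The model keeps only the two counts that matter: the native object's COM reference count and the RCW's own count of how many times the pointer was marshaled into the runtime (once, in this case, because the cache never goes back across the interop boundary).

```python
class InvalidComObjectError(Exception):
    """Stands in for the CLR's InvalidComObjectException in this toy model."""


class NativeTextManager:
    """Toy native COM object with an ordinary COM reference count."""

    def __init__(self):
        self.ref_count = 0

    def add_ref(self):
        self.ref_count += 1

    def release(self):
        self.ref_count -= 1

    def get_lines(self):
        return ["first line", "second line"]


class Rcw:
    """Toy RCW: holds one COM reference plus its own marshaling count."""

    def __init__(self, com_object):
        self._com_object = com_object
        self._marshal_count = 1      # one interop call produced this wrapper
        com_object.add_ref()

    def release_com_object(self):
        """Model of Marshal.ReleaseComObject; returns the remaining count."""
        if self._marshal_count > 0:
            self._marshal_count -= 1
            if self._marshal_count == 0:
                self._com_object.release()
                self._com_object = None   # the wrapper is now "poisoned"
        return self._marshal_count

    def get_lines(self):
        if self._com_object is None:
            raise InvalidComObjectError(
                "COM object that has been separated from its underlying RCW "
                "cannot be used")
        return self._com_object.get_lines()


class ServiceCache:
    """Crosses the interop boundary once, then hands out the same RCW."""

    def __init__(self, native_service):
        self._native_service = native_service
        self._rcw = None

    def get_text_manager(self):
        if self._rcw is None:
            self._rcw = Rcw(self._native_service)
        return self._rcw


def search_find(cache):
    """The assassin: uses the service, then applies the 'quick fix'."""
    text_manager = cache.get_text_manager()
    lines = text_manager.get_lines()
    text_manager.release_com_object()
    return lines


def spell_check(cache):
    """The victim: an innocent later caller that gets the cached RCW."""
    return cache.get_text_manager().get_lines()


native = NativeTextManager()
cache = ServiceCache(native)
search_find(cache)                   # succeeds, and the "leak" is gone
try:
    spell_check(cache)               # fails here, far from the real culprit
except InvalidComObjectError as error:
    print("spell_check failed:", error)
```

Notice that in this sketch the exception surfaces inside `spell_check`, which did nothing wrong, while `search_find` returned normally. That gap between cause and symptom is what makes these bugs so hard to track down.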
The lesson here is: If you’re tempted to call “Marshal.ReleaseComObject”, can you be 100% certain that no other managed code still has access to the RCW? If the answer is ‘no’, then don’t call it. The safest (and sanest) advice is to avoid Marshal.ReleaseComObject entirely in a system where components can be re-used and versioned over time. While you may be 100% certain of the way the components work today and believe that a ‘poisoned’ RCW could never be accessed, that belief may be shattered in the future when some of those components’ implementations change.
In VS 2010, we scrubbed our code for instances of Marshal.ReleaseComObject and asked component authors to either remove or justify each occurrence. In our own code we found many instances, including in common library code used by managed packages. We were so concerned about the problem of running these components that we actually created patched versions of our Managed Package Framework for VS 2005 and VS 2008 so that, when loaded in VS 2010 they would not have ReleaseComObject problems. You’ll see these patched versions appear as binding redirects in “devenv.exe.config” for Microsoft.VisualStudio.Shell and Microsoft.VisualStudio.Shell.9.0.
Microsoft Distinguished Engineer, Chris Brumme, offered some sage advice about Marshal.ReleaseComObject back in 2003. It’s worth a read because it shows that we were thinking about this problem way back then. In case it isn’t obvious, Visual Studio is in category #2 on Chris’ list at the end of the post.
Mason Bendixen’s Blog also has a nice collection of notes on COM interop and, in particular, this one on RCWs is germane because it talks about the per-AppDomain mapping of IUnknowns to RCWs.
Paul Harrington – Principal Developer, Visual Studio Platform Team Biography: Paul has worked on every version of Visual Studio .Net to date. Prior to joining the Visual Studio team in 2000, Paul spent six years working on mapping and trip planning software for what is today known as Bing Maps. For Visual Studio 2010, Paul designed and helped write the code that enabled the Visual Studio Shell team to move from a native, Windows 32-based implementation to a modern, fully managed presentation layer based on the Windows Presentation Foundation (WPF). Paul holds a master’s degree from the University of Cambridge, England and lives with his wife and two cats in Seattle, Washington.
Can I ask for a favor from the internets? Can we ban all versions of the phrase "X considered dangerous" for all values of X?
Sorry, the internets has denied your request :)
"it has the same kind of lifetime semantics as any other managed object – i.e. it will be cleaned up as and when the Garbage Collector runs"
This is wrong and that misunderstanding is the root of the problem. The Garbage Collector is there to reclaim memory, not to end lifetimes. The lifetime of .NET objects must be determined by their usage; and if cleanup is needed, it must be made at the end of the lifetime. That's what Dispose is for. That RCWs don't implement IDisposable leads to so many problems; and as Chris Brumme notes, ReleaseComObject is not a Dispose either.
Surely the fact that COM objects that are implemented in managed code behave differently than other COM objects is a bug, no?
If the CLR can do magic to make it faster, fine, but then it should do something sane when the COM Marshal methods are called. The fact that IsComObject returns false is a bug IMHO.
@Joerg Well said.
Thanks for your comments.
@Joerg, I simplified the story about RCW lifetimes and probably could have worded it more precisely. RCWs are tracked just like any other object allocated on the managed heap. When the garbage collector runs, any RCWs with no live references will be put into the “RCW cleanup list”. The RCWs on this list are waiting for their final Release calls to be made. The sentence you quoted was meant to show that you, the programmer, cannot determine exactly when the final Release call will be made – just as you can’t control when a managed object’s finalizer will run. You’re right that RCWs don’t implement IDisposable, but imagine for a minute that they did. What would Dispose do? Would it prevent the “silent assassin” problem?
@Fowl, “bug or not” is, in this area, a matter of opinion. What’s not debatable is the current observed behavior which is unlikely to change any time soon. The CLR does try to make COM objects and managed objects interchangeable and, for the most part, I think it succeeds very well. There are some areas where the gaps show and this article covers one of them.
@paul Indeed. I didn't mean to sound like I was blaming you ;)
I, the programmer, should be able to determine when the lifetime of an object has ended. The RCW should call Release when its lifetime has ended. Dispose may end the lifetime of the RCW, so it would call Release.
The 'silent assassin' problem is created when people don't manage the lifetimes in their code correctly. If objects are cached, they must not be disposed until they leave the cache (as in C++, an object must not be deleted if it is still used elsewhere). .NET programmers often seem to wish or even to believe that they need not care about the lifetime of objects; and if one doesn't care, it may even "succeed very well for the most part". But not for all parts, and then, we're "out of luck".
Joerg, thanks for the comments.
I claim that the programmer is unable to determine the lifetime of an object except in extremely narrow cases. Apart from that, I agree with pretty much everything else you wrote.
Have I misunderstood Chris Brumme's article? He says that an RCW does its own reference counting. I thought that if I cause the same COM object to be marshalled into the world of the CLR (say) 3 times, then I get back the same RCW 3 times, but that if I call Marshal.ReleaseComObject just 2 times, the RCW doesn't shut down yet because it has its own reference counting scheme (independent of normal COM AddRef/Release).
I thought the whole idea of this is that as long as every action that marshals a COM object into the CLR world across an interop boundary is balanced by exactly one call to Marshal.ReleaseComObject (and as long as you don't make those ReleaseComObject calls until you're actually done with the RCW), you won't see this "silent assassin" problem. The internal ref counting in the RCW means that if some other piece of code unknown to me in the same AppDomain happens to obtain a reference to the same object - maybe both pieces of code got it via the Global Service Provider - the first call to ReleaseComObject won't shut down the RCW -- it'll only shut down on the second call. I had internalised this as having a similar structure to the following rules of COM: any time something returns you a reference to a COM object (whether it's a reference to a 'new' object you've not seen before, or a reference to one you already had a reference to), it effectively comes pre-AddRefed, and so it's your responsibility to call Release exactly once for every reference returned to you. As long as I call ReleaseComObject once (and only once) for every interop call that returns me an RCW, I thought I was good.
I can see how your example of caching services would cause this problem, but that's just a design flaw in the cache. You can produce an equivalent buggy structure purely in unmanaged COM: I could cache interface pointers returned by the Global Service Provider. If my cache hands out these pointers to other parts of my code without first calling AddRef, but the code that gets these cached pointers calls Release, I've got my COM ref counting out of balance, and so I'll be guilty of using dangling references. The solution is either a) make the cache call AddRef before handing out a reference (which is the normal way of doing things in COM IIRC) or b) write your code in such a way that clients of the cache never call Release - only the cache calls Release. (And obviously if you take approach b) you need a scheme to ensure that the cache knows when items are in use so that it doesn't release them prematurely.)
Solution a) doesn't appear to be viable with a managed client because as far as I can tell there's no Marshal.AddRefComObject. (There's Marshal.AddRef, but that does something else. It's the counterpart of Marshal.Release.)
So surely the solution is b). If you're going to implement a cache that holds on to an RCW, it's your responsibility to ensure that you don't call ReleaseComObject until you remove the RCW from the cache. (In other words, clients of the cache shouldn't be calling ReleaseComObject, because they don't own the reference. It's not very different from the idea that in Windows Forms, you're not supposed to call Dispose on stock Brushes.) And as long as you stick to that, there won't be a problem, right? It seems harsh to describe this as a flaw in Marshal.ReleaseComObject when it sounds more like a bug in the cache.
Thanks for the detailed comment.
I think you have it correct. Your parenthetical “clients of the cache shouldn’t be calling ReleaseComObject” is absolutely true, and it was violations of that guidance which led to the silent assassinations. Clients don’t always realize they’re dealing with a shared value. More likely, they’re just accessing a property from a manager or container object. Once you’ve investigated a few of these bugs it’s not much of a stretch to strengthen that statement to “no-one should be calling ReleaseComObject” which, of course, is the theme of my article.
Now go back to your second paragraph where you talk about clients of the Global Service Provider. Your concluding statement was “As long as I call ReleaseComObject once (and only once) for every interop call that returns me an RCW, I thought I was good”. This statement is absolutely correct but it is impossible for a client to ensure the “once and only once” part. Clients cannot know whether the Global Service Provider returns a new RCW for each call, or a cached, shared value. So clients cannot know whether it’s safe to call ReleaseComObject.
From a guidance perspective, I’m very comfortable with this rule as it’s simple to state and easy to verify at the source code level.
Your post is a very thorough and helpful explanation of a very real problem in mixed .NET-COM applications. Unfortunately, its proposed solution, don't use ReleaseComObject, leaves the developers who work on server applications out in the cold.
As an example, I have a message-processing application that was partially migrated from a legacy technology to .NET which uses lots of COM to leverage existing code previously written in C++. At first, I made liberal use of ReleaseComObject in the message processing code which resulted in essentially flat memory usage even during heavy processing but also resulted in the occasional hard-to-locate InvalidComObjectException. As a test, I disabled all calls to ReleaseComObject and, as expected, I experienced no InvalidComObjectExceptions. Unfortunately, my process' memory usage was no longer anywhere near flat. It fluctuated wildly between 3-3.5x the peak memory usage of the process with ReleaseComObject enabled and CPU utilization was noticeably higher. I suspect this was due to the fact that RCWs themselves are relatively small objects and do not contribute much to the memory pressure on the GC even if they represent a COM object with a substantially larger memory footprint in the unmanaged heap.
Obviously, I cannot release this application to my enterprise users and tell them that its memory usage is higher by some large and indeterminate factor. They won't know how to scale their servers to meet their particular load requirements. I won't be able to provide support to the inevitably higher number of problem reports that I'll receive for insufficient memory conditions. And its higher CPU utilization makes it outright slower.
I would love to not have to manually release COM objects, but it seems that the RCW implementation actively works against my particular use case. I know I can manually add memory pressure to the GC, but that is very difficult to do accurately for COM objects that may have native memory footprints that vary significantly in size.
It seems to me that this problem might be able to be partially mitigated by allowing developers to specify a priority for RCW collection. I don't know how that would affect the GC implementation, but I know that for my particular application, 4 of the top 5 classes by memory consumption are COM objects. If all my COM objects with unreachable RCWs were collected first, I might be able to get away without explicit ReleaseComObject calls. Even then, GC sweeps might occur too infrequently because the size of unmanaged objects on the heap does not factor into the memory pressure on the GC.
All in all, I can conclude only that the RCW's attempt to make interop with COM transparent for managed clients works well for interactive applications or applications that use COM objects with small memory footprints. For server applications where throughput and predictable resource usage are important, it makes the problem substantially harder.
What's wrong with setting the reference to null after calling ReleaseComObject on it?
I have another blog which advises using just this technique (blogs.msdn.com/.../762884.aspx). It seems to me to be just one big mess, mixing deterministic COM with non-deterministic GC. Side remark: is MS aware of the horrible performance of VStudio 2010? Is this somehow related to the transition to managed code?