We’ve been seeing some cases over the past year where customers are using CDOEX or EXOLEDB in managed code on Windows Server 2008 and Exchange 2007 and are randomly getting RPC_S_UNKNOWN_IF (0x800706B5) errors. The issue only seems to occur under load. The same code worked fine on Windows Server 2003.
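If you're trying to match this error against your own logs, note that the same value can show up in several forms. Here is a minimal Python sketch (the helper name `decode_hresult` is made up for illustration) that splits the HRESULT into its standard fields and also shows the signed 32-bit form that managed code typically reports:

```python
def decode_hresult(hr: int) -> dict:
    """Split a 32-bit HRESULT into its standard fields."""
    hr &= 0xFFFFFFFF  # normalize negative (signed) inputs to unsigned
    return {
        "failure": bool(hr >> 31),       # severity bit: set means failure
        "facility": (hr >> 16) & 0x7FF,  # 7 is FACILITY_WIN32
        "code": hr & 0xFFFF,             # underlying Win32 error code
        "signed": hr - (1 << 32) if hr & 0x80000000 else hr,
    }


info = decode_hresult(0x800706B5)
print(info)
```

The facility comes out as 7 (FACILITY_WIN32) and the code as 1717, which is the Win32 value of RPC_S_UNKNOWN_IF. The signed form is -2147023179.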
While we still don’t know the root cause, here’s essentially what’s going wrong: In the Exchange Information Store (“store”) we have our COM server for EXOLEDB, which services remote calls to its interfaces. When you call SaveToContainer or SaveTo, under the hood, CDOEX leverages the ITransactionLocal interface of its OLEDB implementation to do a transactional commit to the store. If you have lots of these transactions going on, at some point you release a reference to one of these objects and that drops the ref count on the store side to 0. Shortly after that, if msado15 is loaded into store, the COM subsystem running in store will ask all the COM DLLs if they’re ready to unload. MSADO15 (which also implements ITransactionLocal) answers “yes” and is unloaded. As part of that process, it unregisters the ITransactionLocal interface from RPC, and the next call your client makes on that interface fails with the above error. You’d have to restart the store to get back to a good state.
What has made this so tricky is that all the release calls look valid and appropriate. When you’re talking about a scenario that only occurs under load, you’re talking about a lot of release calls. It’s the proverbial needle in a haystack.
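To make that sequence concrete, here is a toy Python model of it. It is only a sketch under stated assumptions: the classes `ToyRpcRuntime` and `ToyComDll` and the function `unload_idle_dlls` are hypothetical stand-ins, not real COM or RPC APIs, and the assumption that the count reaching 0 belongs to msado15 is a simplification of the description above.

```python
RPC_S_UNKNOWN_IF = 1717  # Win32 code behind HRESULT 0x800706B5


class ToyRpcRuntime:
    """Hypothetical stand-in for the RPC runtime inside the store process."""

    def __init__(self):
        self.registered = set()

    def call(self, interface):
        if interface not in self.registered:
            raise RuntimeError(f"RPC_S_UNKNOWN_IF ({RPC_S_UNKNOWN_IF}): {interface}")
        return "ok"


class ToyComDll:
    """Hypothetical stand-in for a COM DLL loaded in the store."""

    def __init__(self, name, interface, rpc):
        self.name, self.interface, self.rpc = name, interface, rpc
        self.ref_count = 0
        rpc.registered.add(interface)  # both DLLs share one registration

    def add_ref(self):
        self.ref_count += 1

    def release(self):
        self.ref_count -= 1

    def can_unload_now(self):
        return self.ref_count == 0

    def unload(self):
        # The step that breaks things: the registration is shared, so
        # removing it also affects the DLL that is still loaded.
        self.rpc.registered.discard(self.interface)


def unload_idle_dlls(dlls):
    """Ask every DLL whether it can unload; unload those that say yes."""
    idle = [d for d in dlls if d.can_unload_now()]
    for d in idle:
        d.unload()
    return [d.name for d in idle]


rpc = ToyRpcRuntime()
exoledb = ToyComDll("exoledb", "ITransactionLocal", rpc)
msado15 = ToyComDll("msado15", "ITransactionLocal", rpc)

exoledb.add_ref()  # the client is still using EXOLEDB
msado15.add_ref()
print(rpc.call("ITransactionLocal"))  # works while both are loaded

msado15.release()  # a perfectly valid release drops this count to 0
print(unload_idle_dlls([exoledb, msado15]))

try:
    rpc.call("ITransactionLocal")
except RuntimeError as err:
    print("next call fails:", err)
```

Nothing in the model looks wrong at any single step, which mirrors why the real problem is so hard to spot: every release is legitimate, and the damage only appears on the call after the unload.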
So what do you do?
Here are a few options from most immediate relief to more long-term:
Option number 3 is probably your best choice for the long term. For one, that guy you hate might go postal if he has to man the Services console for too long. For two, as a rule we never recommend going back to a previous version of a product, because you lose out on the great security and feature enhancements of the more recent version. That leaves number three, which should be something on your future plan list anyway.
CDOEX, along with all the other web store technologies like WebDAV and EXOLEDB, doesn’t even exist in Exchange 2010 and beyond, so your application can’t work against the current and future versions of Exchange. These technologies were deemphasized in Exchange 2007, which makes it very difficult to get the product team to agree to invest time creating a fix, especially when the workaround is “easy” (use EWS instead). Your only option is to use EWS in place of your CDOEX code. This does a few things for you. You can support Exchange 2007 and beyond using a highly tested and powerful technology. You now have the option (and it’s recommended) to run your application on a different server than your Exchange Server. Exchange admins will love this. As I mentioned, I’m sure all of you were planning to move to EWS for Exchange 2010 anyway; maybe this problem just pushes those plans up on the list a bit.
Are any of you using CDOEX and getting this error? If so, do you have a consistent repro (the shorter and simpler the better)? If you’re moving your CDOEX code to EWS and have questions about how to do in EWS what you’re currently doing in CDOEX, let’s hear them!
We have a product using EXOLEDB where we see this. Some of our customers experience this exactly as described. However, a common factor we have seen is that they all have network cards with a Broadcom NetXtreme II chip on them (either BCM570X or rebranded, like HP NC382i).
As we cannot just "rewrite for EWS", we are trying to change our code to minimize the chance of getting into the failed state. The calls we have seen using ITransactionLocal include at least Message.Send, DataSource.Save/SaveToContainer, and Connection.CommitTransaction/RollbackTransaction.
Now that I read this, I wonder if it's possible to have a service running that keeps ITransactionLocal references alive in the store process so it never reaches 0?
That's interesting about the Broadcom cards. However, when you use CDOEX locally on the mailbox server, it uses LRPC (local RPC), so it shouldn't ever make it down to the network card.
Is there a reason you can't use EWS instead? It sounds like you're willing to write new code to help solve this problem.
Also, are you using managed code? Just curious.
Yes, we use managed code. We can't use EWS because we rely on synchronous event sinks that EWS cannot provide.
I was actually just able to reproduce it here in my lab with help from Exchange Load Generator (win2008sp2, Ex2007sp2, Broadcom NIC in team, CCR and SAN storage).
For a long time we thought it was connected to clustering, SAN, or NIC teaming, but I guess these might just be present when there is high load...
Another common factor is that all incidents are on physical servers; we have never seen it or been able to reproduce it on a virtualized server.
We are more than willing to make changes or workarounds to code, but EWS is a no-go for this particular product, which btw also needs to run on Exchange 2003.