Often people will ask me a question like "What do you do as an Escalation Engineer at Microsoft?" Being an EE is a really exciting job because I get to see and do a lot of things that most other engineers can't or will never get a chance to do. I felt that I would talk a little about what I do as an EE. A normal day at work for me starts by sending out emails to the 10 or so customer cases that I have in my work bin. Once that is done I will triage some of the other cases that the guys have and offer help if need be. If everything is ok there I will then start on my own cases. One of the cases that I got was going on for quite some time and needed to be solved. Now that I have this case I can do what I do best.

This case was another interesting OAB issue for me as I have not seen this type of behavior before. Basically the problem was the customer was having issues generating the OAB, however this was not the typical OAB generation failure that you would normally see on a server. For some reason the active node on this two node cluster would just stop generating the OAB. Ok cool since this was a cluster the customer failed over to the passive node, and the OAB would generate for a short while and then start failing as well. Failing back the other node *did not* fix it and now both nodes were broken. So now I am starting to wonder what is going on here.

I started up an easy assist session so I could get live access to the customer's server so I could see the logs, etc. I was able to find that her was unable to rebuild the OAB because of a provider initialization error, or so I thought at that time. The errors looked like this:

Exchange System Manager Error
The information store could not be opened. The logon to the Microsoft Exchange server computer failed.
MAPI 1.0 ID no: 80040111-0286-00000000
ID no: C1050000 Exchange System Manager

Application Log Error
Event ID              : 9175
Event Type         : Error
Event Source     : MSExchangeSA
Event Category  : MAPI Session
Description        : The MAPI call 'OpenMsgStore' failed with the following error: The information store could not be opened. The logon to the Microsoft Exchange Server computer failed. MAPI 1.0 ID no: 80040111-0286-00000000

Now when troubleshooting issues like this I check to see if the customer is using the DS Servers registry value to load balance connections to a global catalog server. As this is really not a recommended practice I need to rule it out. For more information on this registry key, see this blog: The DS Server registry key and rebuilding an Offline Address List. What was interesting here is that the customer was not using the DS Servers registry, however he was failing with the same errors as listed in the blog and above.

I set up to do a live debug on the customers server and while rebuilding the OAB I was able to get to the following call stack in the debugger:

1647ecc4 616a9e94 emsabp32!NetBind+0x54
1647ece8 616ad524 emsabp32!BindServer+0x6e
1647ed44 616addb4 emsabp32!DsNetLogon+0x295
1647eea8 616acb5e emsabp32!GetABServer+0x2c7
1647f304 61e4b6b1 emsabp32!ABP_Logon+0x310
1647f358 61e4b477 mapi32!ABCustomLoad+0xde
1647f3f0 61e4bae4 mapi32!HrCommonLoadProviders+0x3d4
1647f494 61e4d34a mapi32!HrIntOpenAddressBook+0x57
1647f50c 61e4ce8b mapi32!MAPILogonX+0x494
1647f54c 0041b6e5 mapi32!MAPILogonEx+0x1b8

What is interesting to me about this call stack is that we end up in emsabp32!NetBind. Now in order for OABGen.dll to generate an Offline Address List, the System Attendant must first do the following things:

1. Initalize the MAPI subsystem and create a system attendant profile.
2. Log on to that profile.
3. Open and initialize the Address Book Provider (loads emsabp32.dll in to the mad.exe process space).
4. Load other customer providers that are needed.
5. Locate a domain controller that we can connect to.
6. Get the binding order from the registry on the local server so we can bind to the domain controller we are going to use.
7. Perform the actual Directory Service bind to the domain controller so we can perform our NSPI queries and request the data we want.

So while I was debugging the emsabp32!Netbind function I noticed that one of the bind operations failed, and the EAX debugger register which contains our return data from the called function contained this error code:

0:029> p
eax=80040111 ebx=80004005 ecx=77ce21ce edx=173c7bf0 esi=149b49e4 edi=00000000
eip=616a9d13 esp=1647ec6c ebp=1647ecc4 iopl=0 nv up ei pl nz ac pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000216

emsabp32!NetBind+0x54:
616a9d13 8bd8 mov ebx,eax

Interestingly enough this is the same error that is being displayed in the dialog box from the ESM, and is also in the application log. If you run 80040111 through the err.exe tool, and you will get the following output:

C:\WINXP\system32>err 80040111
# for hex 0x80040111 / decimal -2147221231
ecLoginFailure                            ec.h
MAPI_E_LOGON_FAILED         mapicode.h

Well, from looking at this error this is partially true because we did fail to logon to the domain controller, but this wasn’t the root cause. Being that I did not have my root cause yet I needed to keep researching. I dumped out the data structures that contained the domain controller information and now I could see where we were connecting to:

0:029> dt pServerGuid
Local var @ 0x1647ecd8 Type __MIDL_nspi_0001*
0x14a3ad10
+0x000 ab : [16] "Domain Controller Name"

Ok, now that I have the domain controller we are using I can dump out the local variables. If you look at the scode, it is -2147221231, which is the same error as 80040111.

0:029> dv
hRpc = 0x173c7b78
pConnect = 0x149b49e4
pServerGuid = 0x1647ee70
pszServer = 0x14a3ad10 "Domain Controller Name"
scode = -2147221231
status = 9
stat = struct __MIDL_nspi_0002
fRetry = 1

I run -2147221231 through error for an extra verification:

C:\WINXP\system32>err -2147221231
# for decimal
-2147221231 / hex 0x80040111
ecLoginFailure ec.h
MAPI_E_LOGON_FAILED mapicode.h
CLASS_E_CLASSNOTAVAILABLE winerror.h
# ClassFactory cannot supply requested class
# 3 matches found for "-2147221231"

Now I am seeing the same the errors all I need to do is a code review to get a better understanding at what we are doing. After I finished my code review I debugged a bit further. I could see that we first try to connect anonymously, then connect with credentials. In this particular case we are failing to connect to the domain controller when using credentials. Once we returned back from this function call I got the true error code which status 1726. Now I am getting somewhere!!!

0:029> dv
hRpc = 0x173c7b78
pConnect = 0x149b49e4
pServerGuid = 0x1647ee70
pszServer = 0x14a3ad10      "Domain Controller Name" <- The domain controller we are trying to bind too!!
scode = -2147221231
status = 1726
stat = struct __MIDL_nspi_0002
fRetry = 0

I ran this error through err.exe as well.

C:\WINXP\system32>err 1726
# for decimal 1726 / hex 0x6be
RPC_S_CALL_FAILED                     winerror.h

# The remote procedure call failed.
# as an HRESULT: Severity: SUCCESS (0), FACILITY_NULL (0x0), Code 0x6be
# for decimal 1726 / hex 0x6be
RPC_S_CALL_FAILED                     winerror.h
# The remote procedure call failed.
# 2 matches found for "1726"

With this information I now have the true error for why we failed to bind to the domain controller, the RPC call failed. All I have to do now is figure out why the RPC call is failing?? The customer and I checked the health of the domain controller and it was just fine.

This is where it gets very interesting. I can't mention certain aspects of the case, but I can say that I noticed that the customer had two applications installed on both cluster nodes that used low level kernel drivers. One of these applications was also reporting errors in the application log. Once we removed the application we suspected to be the problem and restarted the cluster we saw that the problem went away. The customer and I monitored the server closely for about 1 week and the problem never came back. To prove this was the problem we reinstalled the application and within 30 or so minutes of the application performing an update we hit this problem again. What we found out later was that the way the application was applying it updates, it was causing some sort of corruption to the TCP/IP stack (as the application was also hooked in to it) and this was derailing us.

The customer stated that there was no need for this particular application to be installed on the server, and the customer uninstalled it. At this point we were able to generate the OAB every day and the problem was solved!!!

I hope you guys found this as interesting as I did. This is pretty much what I do day in and out as an Escalation Engieer.

Until next time.

Dave