Sorting it all Out Michael Kaplan's random stuff of dubious value Be sure to read the disclaimer here first!
A little over two half decades ago, I made a particular technical decision for a project I was working on at Microsoft.
I mentioned the reason over a half decade ago in a blog in this Blog o' mine.
And a little under two days ago, a blog by Eric Lawrence brought it all home to roost.
His blog on EricLaw's IEInternals titled Brain Dump: Shims, Detours, and other “magic” is a good read, and describes a fascinating bug involving IE10, a third-party extension IE10 ships, and MSLU, the Microsoft Layer for Unicode.
You can read the full blog (it's a good read!) but I'll quote the relevant portion here:
I spent several hours pondering this question and aimlessly touring around in the debugger. I was whining about this scenario to a colleague, complaining about code so ancient that it was shipping with unicows.dll, when I realized that I’d never used this library myself, and in fact I’d never seen a toolbar use it before. When trying to explain what it did to the colleague, I decided that I’d probably stop hand-waving and pulled up unicows up on Wikipedia. And bam, there it was, plain as day:
By adding the UNICOWS.LIB to the link command-line [ ... ] the linker will resolve referenced symbols with the one provided by UNICOWS.LIB instead. When a wide-character function is called for the first time at runtime, the function stub in UNICOWS.LIB first receives control and [ ... ] if the OS natively supports the W version (i.e. Windows NT/2000/XP/2003), then the function stub updates the in-memory import table so that future calls will directly invoke the native W version without any more overhead.
…and there’s the problem!
When IE first loads a toolbar, the shims run against the module and wrap all calls to CreateWindow with a call to the compatibility wrapper function. But when IE loaded this toolbar, it didn’t find any calls to CreateWindow, because those calls had been pointed at a function inside unicows.dll instead of at the original function in user32.dll. As a result, the compatibility shim wasn’t applied, and the function call failed.
Now, this wouldn’t have happened if unicows did its import-table fixup the “normal” way, using the GetProcAddress function. That's because the compatibility shims are applied to GetProcAddress as well, and the fixup would have been applied properly at the time that unicows did the update of the import table. However, for reasons lost to the mists of time, the implementers of unicows instead copied the source code of GetProcAddress from user32 into their own DLL, so the shims had no way to recognize it. While we could add a new shim to handle unicows.dll, the obscurity and low priority of this scenario mean that we instead decided to outreach to the vendor and request that they update their build process to remove the long-defunct support for Windows ‘9x.
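To make the sequence Eric describes a bit more concrete, here is a minimal sketch in portable C. It is a simulation, not unicows or IE code: the "import table" is a single function pointer, and every name in it (import_slot, create_window_stub, resolve_native, apply_shim) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*create_window_fn)(int);

/* Hypothetical stand-in for the native W API in user32.dll. */
static int native_calls = 0;
static int native_create_window(int id) { native_calls++; return id + 1000; }

/* Hypothetical compatibility wrapper that a shim would want to install. */
static int shimmed_create_window(int id) { return native_create_window(id) + 1; }

static int stub_calls = 0;
static int create_window_stub(int id);

/* Simulated import-table slot. Linking against the stub library means
   it starts out aimed at the stub, not at the native function. */
static create_window_fn import_slot = create_window_stub;

/* Stand-in for the library's private lookup of the native export. */
static create_window_fn resolve_native(void) { return native_create_window; }

/* First call lands here: patch the slot, then forward the call. */
static int create_window_stub(int id) {
    stub_calls++;
    import_slot = resolve_native();
    assert(import_slot != create_window_stub); /* otherwise we would recurse */
    return import_slot(id);
}

/* Load-time shim pass: wraps the slot only if it already points at the
   function the shim knows about. Returns 1 if the shim was applied. */
static int apply_shim(void) {
    if (import_slot != native_create_window) return 0;
    import_slot = shimmed_create_window;
    return 1;
}
```

In this model, calling apply_shim() at load time finds nothing to wrap, because the slot still points at the stub. After the first call the slot points straight at the native function through a lookup the shim never saw, so the wrapper never ends up in the call path.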
Well, I'll object a little about the characterization that things that feel so recent to me are "lost to the mists of time". :-)
Though I won't complain too much, since the issue in question caused him to be randomized so obnoxiously!
The blog of mine that covers the issue is from point one of May of 2005's Why does MSLU wrap ________ ?:
1) There is, for example, the GetProcAddress function. It takes a string, but never a Unicode string, on NT or otherwise. So why would it need to be wrapped? Well, it turns out that the GetMonitorInfo function, defined in multimon.h, is not just a simple prototype. There is a bunch of complex code in it that conditionally calls various APIs, including GetProcAddress, to get a function pointer to replace any call to GetMonitorInfo. Because of this, MSLU could not wrap the GetMonitorInfo function, because the wrapper would never be used. The only way to allow the wrapper to work was to wrap GetProcAddress and look for where someone was trying to retrieve the address of GetMonitorInfoA or GetMonitorInfoW!
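The general shape of that idea can be sketched in a few lines of portable C. This is an illustration only and not the MSLU source: the lookup table, the wrapper functions, and the name wrapped_get_proc_address are all hypothetical, and the real GetProcAddress also accepts ordinals, which this sketch ignores. Note too that this sketch forwards unrecognized names to the "OS" lookup, whereas Eric's description above says unicows carried its own copy of the lookup code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*proc_t)(void);

/* Hypothetical stand-ins for OS exports; each returns a distinct tag. */
static int os_get_monitor_info_a(void) { return 1; }
static int os_create_window_ex_a(void) { return 2; }

/* Hypothetical wrappers supplied by the Unicode layer. */
static int wrapper_get_monitor_info_a(void) { return 101; }
static int wrapper_get_monitor_info_w(void) { return 102; }

/* Stand-in for the OS lookup on a Win9x-like system: no W export exists. */
static proc_t os_get_proc_address(const char *name) {
    if (strcmp(name, "GetMonitorInfoA") == 0) return os_get_monitor_info_a;
    if (strcmp(name, "CreateWindowExA") == 0) return os_create_window_ex_a;
    return NULL;
}

/* The wrapped lookup: intercept the two names of interest and hand back
   the layer's own wrappers; pass every other name through unchanged. */
static proc_t wrapped_get_proc_address(const char *name) {
    assert(name != NULL);
    if (strcmp(name, "GetMonitorInfoA") == 0) return wrapper_get_monitor_info_a;
    if (strcmp(name, "GetMonitorInfoW") == 0) return wrapper_get_monitor_info_w;
    return os_get_proc_address(name);
}
```

With the wrapped lookup in place, a request for GetMonitorInfoW gets a usable pointer even though the simulated OS has no such export, and everything else behaves as before.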
This was back in the heady days when I had the DaveC-like power to have influence on pretty much any function in multiple versions of Windows.
Even if the versions were Windows 95, Windows 98, and Windows Me.
I suppose there is a small procedural problem with trusting a troubled perfectionist such as myself to act as sole architect/program manager, principal developer, and only tester on a project.
But my manager at the time had quite a knack for making me feel slightly foolish while asking questions that in retrospect seem quite reasonable like
Don't we need someone with PM experience here?
Should I really be the only one in charge of testing code I wrote myself?
while simultaneously making me feel like I could get the job done.
So perhaps I can be forgiven this particular sin.
Though really I think I owe Eric lunch one of these days to apologize.
Eric -- sorry about that! Call me after I get back from Brisbane in a couple of weeks! :-)
So why couldn't MSLU call through to the OS GetProcAddress function?
Because there was no way to hook the multimon functions without hooking GetProcAddress.
Why did that hook make it impossible?
If you look at the SDK header file for multimon, you'll understand. It contains complex delayload logic that unicows.lib could not change except by overriding GetProcAddress.
When I first read that first sentence, my mind silently inserted "and a" between "two" and "half", which made me wonder at how you were working at MS when you were 16. :)
Yuhong, it says right in the quoted article (or rather it says that no one knows why):
However, ***for reasons lost to the mists of time***, the implementers of unicows instead copied the source code of GetProcAddress from user32 into their own DLL.
I know. I am asking why the override makes it impossible to call the *original OS* GetProcAddress function.
Check out the CODE in the header file, you will quickly understand. I promise. :-)
Yuhong, where did you read that it was impossible for MSLU to call GetProcAddress? All it says is that they didn't do it, and no one remembers why.
Actually, Eric saw that MSLU was wrapping it.
The question is why the multiple monitor support did the strange thing. The answer is because multiple-monitor support was added in Windows 98 and the MultiMon.h header was written to detect whether real support was available, and fall back to a basic implementation if not.
The implementation #defines GetMonitorInfo (for example) to xGetMonitorInfo, which is implemented in-line in the header. However, this is an ANSI/Unicode function due to the szDevice member of MONITORINFOEX. An ANSI build dynamically loads the underlying GetMonitorInfoA if it's available, while a Unicode build loads GetMonitorInfoW, which doesn't exist on Windows 98. In this case a Unicode program linked with unicows would not support multiple monitors (because the test is for the presence of that API) whereas the ANSI build would. The only way for MSLU to do the right thing is to intercept GetProcAddress.
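A rough model of that header behaviour, in portable C, might look like the following. It is a sketch under several assumptions: the lookup is passed in as a parameter so that two different "systems" can be compared, the function names and device strings are illustrative, and the stub probes on every call for brevity rather than caching the result.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*monitor_info_fn)(int monitor, char *device, size_t cap);
typedef monitor_info_fn (*lookup_fn)(const char *name);

/* Stand-in for the real ANSI API on a Windows 98-like system: two monitors. */
static int os_monitor_info_a(int monitor, char *device, size_t cap) {
    if (monitor < 0 || monitor > 1 || cap < 16) return 0;
    strcpy(device, monitor == 0 ? "\\\\.\\DISPLAY1" : "\\\\.\\DISPLAY2");
    return 1;
}

/* Lookup on a Windows 98-like system: only the A export exists. */
static monitor_info_fn win98_lookup(const char *name) {
    return strcmp(name, "GetMonitorInfoA") == 0 ? os_monitor_info_a : NULL;
}

/* Lookup with an intercepting layer: the W name now resolves to a wrapper.
   For brevity the "wrapper" here is simply the A function. */
static monitor_info_fn wrapped_lookup(const char *name) {
    if (strcmp(name, "GetMonitorInfoW") == 0) return os_monitor_info_a;
    return win98_lookup(name);
}

/* Fallback used when the API is absent: a single primary monitor only. */
static int fallback_monitor_info(int monitor, char *device, size_t cap) {
    if (monitor != 0 || cap < 8) return 0;
    strcpy(device, "DISPLAY");
    return 1;
}

/* Header-style stub: probe for the API by name, then call either the
   real function or the fallback. */
static int x_get_monitor_info(lookup_fn lookup, int unicode_build,
                              int monitor, char *device, size_t cap) {
    assert(lookup != NULL && device != NULL);
    monitor_info_fn real =
        lookup(unicode_build ? "GetMonitorInfoW" : "GetMonitorInfoA");
    return (real ? real : fallback_monitor_info)(monitor, device, cap);
}
```

In this model the ANSI build sees the second monitor, the Unicode build silently drops to the single-monitor fallback because the W name does not resolve, and swapping in the intercepting lookup restores the second monitor for the Unicode build.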
I'd have to argue the case of whether this API really needed A/W variants since the szDevice field is the display driver name, not a user-generated name. Still, that's what was done and that's the compatibility issue.
When you hook a function you need to leave a way to call its base. Unicows.dll should have done it that way. It can still be fixed now since the only reason you care is for shims.
This is the section I was referring to:
"Now, this wouldn’t have happened if unicows did its import-table fixup the “normal” way, using the GetProcAddress function. That's because the compatibility shims are applied to GetProcAddress as well, and the fixup would have been applied properly at the time that unicows did the update of the import table. However, for reasons lost to the mists of time, the implementers of unicows instead copied the source code of GetProcAddress from user32 into their own DLL, so the shims had no way to recognize it. "
MSLU took over the GetProcAddress function, and then instead of calling the OS version of GetProcAddress when it determined that it wasn't a function that it cared about, it re-implemented GetProcAddress itself. Yuhong is asking why it couldn't call the OS version instead of implementing it itself. We don't know that it *couldn't*, we just know that it *didn't*. It could be that the author simply *thought* that it couldn't do so.
It was easier (well, better performance) to copy it than do a string compare on every call to check for a particular function...
Hi, Michael-- Thanks for sharing your insights on the history of this code! (FWIW, IE10 doesn't ship the extension in question, we just found it during our compat-test pass.)