It's Schwieb again, this time with a look back deep into MacBU history. No, not any public history -- I'll talking about my own history in MacBU. Brad Post's blog entries about tracking down gnarly bugs made me remember a particular problem I worked on a very long time ago... I've got a lot of references in here to some pretty old OS technologies so for those of you newer or non-programmers out there, bear with me!
So, here's the scene: It is summer 1997. I've been at Microsoft for just under one year, and in the MacBU for about 7 months. We're well past code-complete and working feverishly to hammer out the bugs and ship Office 98. Beyond my few months in the MacBU, all my Mac experience was self-taught on my IIsi in college. I'd taken a course using 68k assembly and helped teach PPC assembly for a few semesters, but had worked my way through the New Inside Macintosh books myself. (Go search the net for UsenEdit or MacGrid, if you really want to embarrass me...)
There I am in my office, poking away at bugs in Mac Excel, when I get assigned this doozy. The steps are:
OK, that's weird... Excel 98 doesn’t actually do anything with Balloon help, especially not during boot. So, I first try to reproduce the bug. Sure enough, when I run through the scenario Excel crashes. The error message is "unimplemented instruction." I'm still trying to deny that the bug should exist and so I try it a few more times, all with the same results. I even try to change some of the steps -- I discover that if Balloon Help is off then there is no crash. Also, if I don't move the mouse in step 4 but leave it hovering over the app icon, then there is also no crash. Very weird!
For those Kon and Bal aficionados following along, care to guess what the problem is yet?
I myself had no clue. At this point I fired up Excel under the Microsoft remote debugger that we were using and tried to look and see what was going on. As my usual practice, I first just let Excel run until it crashed so that I could see what the CPU state was and where we were in the code. Lo and behold, when Excel crashed the Visual Studio debugger showed me complete and utter garbage! I had no stack crawl and no source matching the current address, and the PPC registers looked like they were full of nonsense. When I disassembled the code around the crashed address, none of it was even valid PPC assembly! Did we have a stack trasher somewhere that was corrupting a return address so that we jumped into random memory? I just had no idea.
Next I tried the bug under the MSVC debugger, but this time I stepped through Excel's boot sequence line by line. I saw that we crashed calling HideWindow on the Excel splash screen. Hmmm. The WindowPtr was completely valid, the stack looked ok before we called HideWindow, and yet we died a horrible death. Very very weird!
Anyone with an answer yet?
Ok, since I wasn't getting any good answers out of the source-level debugger, I decided to run Excel with MacsBug enabled. I switched to my 2nd Mac and tried the bug there. Whoa -- we didn’t crash! I try removing MacsBug on that machine and running Excel again. Still no crash! I go back to my primary Mac -- Excel crashes. So I install MacsBug on that Mac and try yet again.
Yep, crash. But things look very strange... MacsBug says we've crashed in 68k code -- in other words, we've crashed inside some code that is still 68k even on the PowerPC. But Excel is 100% fully PPC... Did we crash inside the OS somewhere? SC6 gives some gobbledygook, so I'm still thinking we might have a stack trasher. But why did we end up in 68k-land then? As I take a closer look, I see that MacsBug also says "unimplemented instruction" and the instruction pointer is pointing to an A-Trap. The A-Trap in question is ShowHide.
Ummm, wait, you say... Earlier I said we were crashing during a call to HideWindow, and now MacsBug says ShowHide? So which is it? I have no idea -- my code shows us calling HideWindow and MacsBug says ShowHide... In either case, why does MacsBug think the ATrap is unimplemented? Both system APIs have been around forever! Just for the heck of it, I tell MacsBug to step over the instruction. It works! Huh? Ok, so I tell MacsBug to step a little more. It does, and eventually gets back to PPC code, and suddenly the stack looks right! The PPC assembly even reports a branch to HideWindow.
Ok, now I'm very confused. Two different APIs, two different execution engines, an unimplemented instruction that works... What is going on here...? I tell MacsBug to continue, and Excel actually boots up. No crash! Very very weird!
So I start playing. As I watch the boot sequence very closely I see something not reported in the original bug. During boot, shortly after Excel puts up its splash screen, the OS puts up a balloon help window near my mouse cursor, over the Finder window in the background. Hmmm -- there's the link to Balloon Help. The text is the usual "this window belongs to the Finder. It is disabled because the Finder is in the background" or whatever Balloon Help used to say all those years ago. I also note that the crash always happens on my PowerPC 9600/233 and never on my 8100/100. While I continue testing out various scenarios I discover that MacsBug is quite consistent. If I set a breakpoint in our code on HideWindow and then step over, I end up crashing on a 68k call to ShowHide, but I can always continue and complete Excel's boot. Since the PPC stack always looks OK, I'm pretty sure the problem is not a stack trasher, but instead may be a bad patch somewhere. But where?
I start digging through Excel and find that Excel has patched 8-10 of the core Window Manager routines, including HideWindow and ShowHide. But the patches all look OK and are all purely PowerPC -- there is indeed no 68k code anywhere in Excel. So I dig a little further, and find that the shared Office code has also patched much of the Window Manager, including the same two APIs. In fact, different parts of Office that came from different projects originally have both patched the Window Manager! That looks ugly! But still, all the Office patches are completely PowerPC-based, and should not be causing problems with 68k ATraps.
My lead suggested I contact Apple Developer Technical Support, to see if they could help shed some light on the subject.
Anyone in DTS then or now have a solution? :)
Some time went by, I worked on other bugs, and then I got an email from Apple DTS. The DTS person too suspected a rogue patch but from a different angle. He asked if we had any self-modifying patches. Hmm, I dunno. So I went and dug a bit, and found that *gasp* yes we did! One of the patch sets in Office did not want to allow itself to be called recursively. It prevented this by overwriting its entry point with a jump to the original patch vector upon entry to the Office patch. It then called MakeDataExecutable() to ensure that the code cache for that entry point was flushed.
Ahh, what's that you say? MakeDataExecutable only works on the PowerPC cache? Why yes, that is indeed the problem! The Office patch tried to restore the link to the original vector by inserting a jump instruction in the same architecture as the original vector. If the original vector was a 68k patch we'd overwrite our routine descriptor with a "jmp <addr>" instruction. This meant we had a valid 68k instruction there but an invalid routine descriptor. The problem is that MakeDataExecutable was only flushing the PPC code cache, not the 68k code cache that was maintained by the 68k emulator in the operating system. This also explained why I only saw the crash on one of my machines! The 9600/233 had the newer dynamic recompiling emulator that was actually caching the routine descriptors; the original 68k emulator on the 8100 was not -- it was always fetching the 68k instructions from main memory and would read the modified code. MacsBug allowed us to continue after the crash because it always flushed all the code caches upon entry into MacsBug, thus masking the problem.
However, just identifying the problem was not enough... We had to fix it! DTS discovered that the function call we needed (FlushCodeCacheRange) was not implemented on the PowerPC because it affected 68k code only and thus we couldn't call it. (The funny thing is that TechNote 1127, which addressed this very API, wasn't written until a few months after this bug.) DTS wrote up some glue code that we could use to flush the emulator's cache and *poof* the bug was fixed.
Astute readers will notice that I still haven't identified the link to Balloon Help. Some more back-and-forth with DTS revealed that Balloon Help was the component that was still in 68k mode and it had its own patches on the Window Manager. If Balloon Help wasn't turned on, then there were no 68k patches in the patch chain for these APIs and we didn’t do the corruption dance. The window positioning was important because if the Finder window was under the splash screen Balloon Help wouldn’t put up a balloon about it, and if the mouse was over the application icon there wouldn't be a balloon either. Balloon Help had to be turned on and the mouse had to be over the empty content region of a finder window that was still visible during the splash screen sequence in order to trigger the bug!
Whew. That was a gnarly one to track down, particularly because while I knew the two architectures individually, I did not yet know the Mixed Mode Manager very well. Apple helped us out a lot and I'm sure had a good laugh at the convoluted nature of our patch code.
Looking back on this bug and the technologies involved so many years later, I'm pretty impressed with how far the Mac OS has come. Carbon events, Cocoa, HIObjects, Intel -- it all adds up to a nice robust platform upon which we can build anything. I've been in the MacBU for 10 years now (my 10-yr anniversary is this coming Saturday.) I can't imagine being anywhere else!
(By the way, one of the very first things we did for Office 2001 was rip out all uses of the Trap Manager.)