|
|
or: How I Learned to Stop Blaming Windows and Love the BSOD
-
In Part 1, I discussed a bit of the history and function of SMIs. How does this make them EEEEVIL, is the question?
Essentially, SMIs are the final word in what happens on a CPU, outside of removing power. They cannot be interrupted, even by a Non-Maskable Interrupt (NMI). Also, since they are not assertable from within software, it's impossible to use them or detect when they happen. The BIOS has control over everything that happens when it takes over. Since SMM is its own execution mode, the assumptions and mechanisms of the previous ones are ignored. Specifically, this means any hardware breakpoints you may have set in your debugger will not fire based on anything that is happening in SMM.
Now, when SMM was originally used only to implement power savings via Advanced Power Management (APM), this wasn't a huge problem. It became a problem when BIOS makers and their OEMs started using this ability to implement other functionality via SMM trickery. The most common application is implementing a USB keyboard handler for real-mode operation. This also happens to be one of the most frustrating issues we see, as it can cause all sorts of problems with the system's normal operation.
To understand why, think of the implications of an undetectable Hypervisor mode that has full access to the system. Necessarily, to implement a keyboard handler like that, it needs to touch the hardware. This means meddling with registers on devices, and even physical memory. Now if you implement the perfect SMM handler for this kind of work, fine. If you have a bug, however, havoc can ensue. You can be running along in a critical, essentially non-preemptible code path, and from one instruction to the next have a section of memory or a hardware register changed out from under you. This can result in all kinds of strange issues, from a crashed application, to a bluescreen, to a hang.
I'll cover some of the more common problems and symptoms in another article.
|
-
As a quick introduction, SMIs were introduced to the x86 world by the 386SL. They were created to allow systems designers to have access to the CPU while unspecified software of any type was running. The reasons for this are obvious when you look at the market the 386SL was aimed at. It was Intel's first attempt at a truly mobile CPU. SMIs allowed the BIOS to control various aspects of power management on the CPU, regardless of what kind of OS was running on top of it. That was a good thing in the days when DOS still ruled the land. DOS knew as much about power management as your average light bulb, and letting the system designer control how and when devices were turned on and off seemed like a great solution. The problem is how it was implemented.
To implement SMI, Intel created a new interrupt pin on their CPU, appropriately named the SMI# pin. When this pin was asserted (turned on, essentially), the system would halt everything it was doing, save state, and transition into System Management Mode (SMM). SMM is essentially another entire operating mode for the CPU, just like Real Mode, V86 Mode, and Protected Mode. The big difference here is that this mode can't be signaled from software, and everything that happens is 100% transparent to software. Once the system enters SMM, it's truly like time stands still. Everything you know about the state of the system can change from underneath you from one instruction to the next. This includes every part of the OS, which generally assumes (rightfully so) that things will always be a certain way until it says otherwise.
As you can imagine, this requires the systems programmer to get everything right, or disaster can ensue. Needless to say, the reason I am writing about it is because it often does. I'll go into some of the weirdness you can see, and what you might do about it in part 2.
|
-
Want to know why I started posting again just now?
Adi Oltean posted a great entry about his favorite hardware bug. This prompted Larry Osterman to post his favorite, and I started feeling left out. I have a ton to choose from; after all, I deal with new bleeding-edge hardware on a daily basis. I'm hard pressed to call them my favorites, because they generally cause me to suffer, but it's my job so I can't complain too much. They can be a ton of fun to work on too, much more abstract and asynchronous than just working on dry code.
Unfortunately, I can't really share the best ones since they're still pretty fresh on the market. All the stories would start like this: There was this one time a few <CENSORED> ago, when our OEM, <CENSORED>, was about to ship a new <CENSORED>, and it had this really strange behavior when you <CENSORED>...and end like this: In the end, we fixed it. OR: In the end, they hoped not many people would see it and just shipped anyway.
Riveting stuff, eh? I know I could be more generic, but since I work with these vendors on a daily basis I don't want to chance it. I can tell you what is the bane of any operating system developer: System Management Interrupts (SMIs). BIOSes that make heavy use of SMIs can wreak havoc with an OS, and there's very little to help you figure it out, and no way to stop it except a new BIOS. I'll go into the pain and misery of SMIs in my next post.
|
-
Well, I sort of had to stop blogging for a while there because I moved on to a slightly different role. I have the same job at the end of the day, but now I support more general portions of the OS, and of course one of the things I enjoy most: storage. This has always been one of my strong suits as a debugger, and over the last year I've had plenty of opportunities to improve. My group now owns support for most SAN/NAS solutions, storport.sys, scsiport.sys, and our new iSCSI 2.0 solution. Expect to see more posts in the coming days about these topics. I'm also still interested in random other debugging and hardware topics of course, since as I've said before, it's almost as much my hobby as my job. Those topics will never go away.
I'm going to attempt to be slightly more spontaneous with my posts, to keep myself from getting into a rut where I refuse to post because I am so busy and don't have enough information to make an informative post. That said, I'll try not to walk off into the weeds of random discussions too often. Look forward to more random hardware information shortly.
|
-
I know this is a slightly more esoteric topic, even for me, but I want to address cc:NUMA platforms, and how they matter to Windows and Windows applications. What is NUMA you ask? NUMA stands for Non-Uniform Memory Architecture. (The cc: stands for Cache Coherent, by the way, because there is non-cache coherent NUMA as well, but I won't address that here since there are no Windows-supported platforms that are non-cache coherent.) To understand why NUMA exists, we need to look at Symmetric Multiprocessing (SMP). SMP has a few core principles it is built around, and one is that every CPU in the system has an identical view of the system. Memory, the I/O subsystem, and other CPUs can all be treated the same by software. The problem comes when this assumption is no longer true. As you scale up the size of a system, it becomes harder and harder to keep everything close together, literally. The more switches and busses your data flows through, the longer it takes. This fact means that in order to squeeze the maximum amount of performance out of the system, it behooves the OS as well as the programmer to try and keep data as close to the place where it's needed as possible. By keeping track of which pages of memory and CPUs have the best locality to each other, decisions can be made when threads are scheduled and memory allocated that will squeeze that extra little bit out of the system. Until only a few years ago, this was exclusively the realm of large mainframe-style computers, not the PC world. But with the introduction of the Unisys ES7000 in 2000, the PC suddenly had something to gain from being NUMA-aware. Even then, this was something that mostly concerned large scale-up server implementations, not the average user or programmer. That is, until AMD announced their unique implementation of their new Opteron and Athlon64 processors. Suddenly, any system that has more than one of those CPUs could potentially benefit from NUMA optimizations.
I'll go into why in the next entry.
|
-
I just got another first-hand experience in the difficulty of trying to affect computing through social engineering. Our fax forwarding people do their forwarding based on the cover letter. Whoever is listed on the From: line gets a TIFF of the fax forwarded to their e-mail inbox. Normally, this works great. However, just a few minutes ago, a customer sent a fax where the cover letter just had a generic destination. Since there was an e-mail alias that matched this generic description, the fax got forwarded, and targeted literally thousands of people. My first reaction was confusion, since I hadn't looked at who it was intended for. (After all, you usually don't get a fax to a distribution list.) I looked it over, and when it didn't make any sense I finally looked at who it was intended for. I laughed, and thought about all the confused souls out there who will be getting this too. Then, the inevitable happened. Someone replied to all, saying the fax wasn't meant for them. Then the flood of Me Too's came. It reminded me in a painful way of the Bedlam DL3 fiasco that happened right after I joined Microsoft 7 years ago. The Exchange team covers it in all its painful detail here. People who have been here long enough, and are "technical" enough to know better, are just as guilty of replying to everyone as anyone else. In this case, the smaller scale, Exchange 2003 servers, and the quick action of others prevented the disaster that Bedlam caused. Still, I can't help but wonder how we're supposed to prevent people from taking self-damaging actions when it seems it's part of human nature to do so, and ignore the mistakes of the past.
|
-
Ok, my arm is warm now. Time to start tossing some theory bombs out there, and hope none get picked off. They said Italians couldn’t quarterback, but look at Vinny Testaverde! (Err…no, don’t.) The reason treating support processes like a manufacturing endeavor fails is because it doesn’t take into account the sheer mass of uncontrolled variables that go into fixing software, IMO. In an ideal world, you could have an unlimited number of truly gifted people performing every job function. In the real world, you have to balance the needs for customer service with technical savvy, troubleshooting with good communication, prompt response with cost, etc. People aren’t machines. Taking a deterministic approach to creating your workflows means “sticking your head in the sand” to the variables out of your control, and overloading the system should a bottleneck occur. Example: If you require all cases go to your next tier of support at N days, but you neglect to factor that all of Europe takes August off, your small, specialized upper tier will be overloaded with issues from people who aren’t answering phones, (and when they do it will all be on the same day.) Going in the opposite direction is fraught with peril as well. As anyone who has worked in a helpdesk or support environment knows (at least from the armchair quarterback standpoint), the only consistently effective way to improve the quality of support is to have more people available to handle your volume, many of whom are aces at what they do. As anyone who has been involved in managing a project like that knows, doing so is so expensive that you’d trash any profit your product might make. So the question is how to strike the right balance. The problem is, I don’t think there’s a good answer, at least not in terms of traditional support. The real answer lies with software. I’ll go into that next.
|
-
Ok, I know I said when I started this blog that I wouldn’t be going into the support aspects of my job much, but I lied. I can’t resist being an armchair quarterback, so I am going to warm up my arm today, and start tossing Hail Mary’s tomorrow. Just remember, this is coming from a tech guy with an eye for process, not a manager in charge of making the tough decisions. :) If there’s one thing that has been consistent in my 7 years in support, it’s that processes rule the day. Not the kind that people like to debug, the kind that tells you what you should do and when. After some grim time fighting the system, I came to see why they are important. After all, how can you manage and understand your costs when you don’t know what’s going on and why? It’s not like creating a product where you can see something coming out the other end of the line, like software, or toothpaste. So, you need to put some rules around your workflows, and figure out a way to get a very ephemeral and soft result: Happy Customers. I accept that now, even if it cramps my style. My problem comes from the data used to make some of these process flow decisions. Your choices can only be as good as your data, unless you get lucky. Playing carefree games with your data doesn’t help you make good decisions. Too many people spend a lot of time trying to find the best way to get the data to fit their ideas and preconceptions, and make horrible decisions based on their “findings”. Richard Feynman famously called it Cargo Cult Science. The result is often a system that doesn’t help customers, and makes employees unhappy to come to work every day. One of the biggest culture-shock changes comes when someone who is familiar with manufacturing process comes in and tries to “shape things up.” It usually doesn’t work out well, and I’ll go into why I think that is in my next entry.
|
-
This is something that most people in the mainframe business have taken for granted for decades now. To the PC world, it’s relatively new…and to the PC OS world, even newer.
Starting with the Pentium and Pentium Pro, Intel introduced the Machine Check Architecture (MCA), which was a way for the CPU and other components of the system to report internal inconsistencies to software, so that the operating system can make decisions about how best to protect the user and data and/or report the problem. For full information on how this works, see the IA-32 Architecture Software Developer’s Manual Volume 3: System Programming Guide, Chapter 14.
Now, that’s all well and good, but unfortunately Windows didn’t support anything but the most basic level of reporting until Server 2003. Before that release, we would stop the system if a fatal error occurred, but not much else. With Server 2003, however, the reporting mechanism became more sophisticated.
If your processor and platform support it, we can read and log events into the event log to tell you more clearly what happened. This might seem redundant, but not all Machine Check Exceptions (MCE) are fatal. Some are just informative. For example, you could have one particular region of memory that keeps returning corrected parity errors. Corrected is great, no problems with your data. The fact that they keep happening? Usually bad news. Go get it replaced!
The worst-case scenario, of course, is an unrecoverable error. Those are reported with a STOP 0x0000009C. If you encounter one of these, it’s best to contact your OEM instead of Microsoft. There’s really nothing we can do. This is a hardware problem, always. We might be able to help interpret, but it’s not likely. If the system is critical, get to hardware swapping.
|
-
I know, another title that seems ridiculous. Why in the world would anyone want a button that intentionally bluescreens your system?! When you’re confronted with a hard hang though, (no mouse or keyboard) you’re in for a heck of a time trying to figure out what’s wrong without one. That’s where the NMI button can come in handy.
Many people are already familiar with the mechanism introduced in Windows 2000 for these kinds of issues. The gist is that by setting a registry key, you can enable a key sequence (at the local keyboard only) that will bluescreen the machine. Thus if you’re having problems with hangs, you can get a memory.dmp and send it to your OEM or Microsoft for analysis.
However, this mechanism can’t cover every scenario that will result in a hang. The keyboard interrupt is typically a fairly low priority on the system in relation to the rest of the devices. If your hang isn’t the result of a deadlock in the kernel itself, the key sequence will never get through and initiate the crash. It’s simply too easy for other devices and drivers to turn off that interrupt while doing their own I/O.
This is where the Non-Maskable Interrupt (NMI) comes in to save the day. As the name implies, this is an interrupt that cannot be hidden by software. When the interrupt is generated, the CPU will always get it, and the interrupt handler (which you also must explicitly enable in the registry) will start the process of bluescreening the box. It will then break into the kernel debugger if attached, or generate a STOP 0x00000080 blue screen if not.
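For reference, here is a minimal sketch of the two registry settings involved, written as a small Python helper that renders a .reg file. The value names (`CrashOnCtrlScroll` for the keyboard sequence, `NMICrashDump` for the NMI handler) and their key paths are the ones I recall from Microsoft's documentation for Windows 2000 and Server 2003; treat them as assumptions and verify them against the documentation for your Windows version before importing anything. The helper name `render_reg_file` is made up for this example.

```python
# Sketch: render a .reg file that enables both manual crash mechanisms.
# Key paths and value names are recalled from Microsoft documentation and
# should be verified for your Windows version before use.

CRASH_VALUES = [
    # Keyboard-initiated crash (right Ctrl + Scroll Lock twice on a PS/2 keyboard)
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters",
     "CrashOnCtrlScroll", 1),
    # NMI-initiated crash (NMI button or out-of-band management card)
    (r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl",
     "NMICrashDump", 1),
]


def render_reg_file(values):
    """Return .reg file text that sets each (key, name, dword) triple."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key, name, dword in values:
        lines.append(f"[{key}]")
        lines.append(f'"{name}"=dword:{dword:08x}')
        lines.append("")  # blank line between sections
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_reg_file(CRASH_VALUES))
```

Both settings normally require a reboot to take effect, so they need to be in place before the hang you want to capture.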
Now if the NMI doesn’t work, you can be confident that something is seriously wrong with your system, and it’s probably hardware. The CPU typically has to move into an unknown state for this feature to fail. It’s time to contact your hardware vendor, and quick. If you’re wondering why no one uses this feature, you’d be surprised. A number of major server vendors do in fact ship systems with this button, but they keep it hidden (for good reason) and don’t really use it as a feature to sell the box. They consider it purely diagnostic.
Personally, I’d want every system in my server room to have this mechanism. I don’t want 2 or 3 hangs before I can even begin to troubleshoot. I want it done the first time, every time.
|
-
A comment from the earlier memory management entry posed a good question. How does PAE factor into the new No Execute (NX) mechanism enabled by the Opteron, Athlon64, and new Prescott-based Xeon?
In Windows XP SP2 and Server 2003 SP1, the two are inextricably linked. The two-level address translation scheme used by the non-PAE kernel does not have enough room to accommodate any further descriptive information about individual pages of memory. The three-level scheme that PAE necessitates allows the new NX attribute to be used. (It is simply a bit in the Page Table Entry (PTE) that indicates the memory in this location is not allowed to be referenced by the instruction pointer. “Do Not Run Under Penalty of Death”.)
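To make the "one bit in the PTE" idea concrete, here is a rough sketch in Python that decodes the two entry formats. It assumes the layout described in the processor manuals: a 32-bit non-PAE entry with a 20-bit frame number and 12 low flag bits, and a 64-bit PAE entry with the physical frame in bits 12 through 51 and the no-execute bit at bit 63. The function and constant names are invented for illustration, and the decoder ignores every flag except Present and NX.

```python
# Toy decoder contrasting non-PAE (32-bit) and PAE (64-bit) page table entries.
# Bit positions are assumptions taken from the x86 architecture manuals.

PRESENT_BIT = 1 << 0                     # bit 0: page is present (both formats)
NX_BIT = 1 << 63                         # bit 63: no-execute, PAE entries only
PAE_FRAME_MASK = 0x000F_FFFF_FFFF_F000   # bits 12..51: physical page address
LEGACY_FRAME_MASK = 0xFFFF_F000          # bits 12..31: physical page address


def decode_pae_pte(pte):
    """Split a 64-bit PAE page table entry into the fields discussed above."""
    if not 0 <= pte < 1 << 64:
        raise ValueError("a PAE PTE is 64 bits wide")
    return {
        "present": bool(pte & PRESENT_BIT),
        "no_execute": bool(pte & NX_BIT),
        "physical_address": pte & PAE_FRAME_MASK,
    }


def decode_legacy_pte(pte):
    """Split a 32-bit non-PAE entry; there is no bit left to carry NX."""
    if not 0 <= pte < 1 << 32:
        raise ValueError("a non-PAE PTE is 32 bits wide")
    return {
        "present": bool(pte & PRESENT_BIT),
        "no_execute": None,  # not representable in this format
        "physical_address": pte & LEGACY_FRAME_MASK,
    }


if __name__ == "__main__":
    data_page = NX_BIT | 0x1234_5000 | PRESENT_BIT
    print(decode_pae_pte(data_page))
    print(decode_legacy_pte(0x1234_5001))
```

The wider PAE entry is also why physical addresses above 4GB become possible at all: the frame field simply has more bits than a 32-bit entry can hold.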
When you use the /NoExecute switch on these OS’s, ntldr now loads the PAE kernel, but in a special mode. You don’t get access to over 4GB of RAM, and more importantly, your drivers don’t get physical addresses over the 4GB mark either. This is important because we’ve found that many devices and device drivers, especially in the consumer space, happily assume they’ll never have to address memory at an address over the 4GB boundary.
While you may only get to use a total of 4GB RAM in XP, that doesn’t mean that some of it can’t have a physical address above the 4GB boundary. The BIOS or devices may re-map memory up there with the assumption that it won’t be seen or used. When the PAE kernel starts handing out addresses to those pages of memory, things can get ugly. The easiest way to ensure everything works like it did in the past, while allowing the new feature, is to make sure we don’t hand out any addresses over that boundary.
If you add the /PAE switch, you get the normal PAE behavior, and all bets are off. Of course, this is exactly what you want in the server space; after all, you got that extra RAM for a reason, right? Also note that the properties aren’t transitive. While both /PAE and /NoExecute use the same kernel file (ntkrnlpa.exe or ntkrpamp.exe) and address translation mechanism, you need both switches in place to enable both features.
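The switch combinations described above can be summarized in a few lines of Python. This is only a model of the behavior as stated in this post, not of ntldr itself: the `KernelConfig` type and `kernel_config` function are hypothetical names, and the sketch treats any /NoExecute value as "NX on", ignoring the individual policy settings the switch accepts.

```python
# Model of the boot switch behavior described in the post (not of ntldr itself).
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelConfig:
    pae_kernel: bool           # ntkrnlpa.exe / ntkrpamp.exe is loaded
    nx_enabled: bool           # the NX attribute is honored
    addresses_above_4gb: bool  # physical addresses over 4GB are handed out


def kernel_config(boot_switches):
    """Derive the resulting configuration from a boot.ini switch string."""
    # Normalize "/NoExecute=OptIn" to "/noexecute" so only the switch name matters.
    switches = {s.split("=")[0].lower() for s in boot_switches.split()}
    pae = "/pae" in switches
    nx = "/noexecute" in switches
    return KernelConfig(
        pae_kernel=pae or nx,     # either switch loads the PAE kernel
        nx_enabled=nx,            # /PAE alone does not turn NX on
        addresses_above_4gb=pae,  # /NoExecute alone stays below 4GB
    )


if __name__ == "__main__":
    for line in ("/fastdetect", "/NoExecute=OptIn", "/PAE", "/PAE /NoExecute=OptIn"):
        print(f"{line!r:28} -> {kernel_config(line)}")
```

Laid out this way, the non-transitive part is easy to see: the PAE kernel is shared, but each feature follows its own switch.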
[Added 8/6/2004 @ 2:56PM, thanks to Adam]
Note that the above information applies only to NX on processors in x86 mode. The IA-64 and the x64 platforms natively support NX, which is being enabled for the first time with the release of Server 2003 and XP for 64bit. Those platforms both already use a 3-level address translation scheme, but they're not related to PAE in any way. They address 64bits natively, and the memory structures have room to include the NX information. We just needed to add the support to the OS.
|
-
You should get out of the PC kitchen. This is another silent system killer that most people don’t want to acknowledge. (Though I will admit it’s gotten easier the last 2-3 years, as Intel, AMD, nVidia, and ATI have cranked up the wattage to the point where even the most stubborn have to recognize heat as a design issue.) While not often a problem for brand-name computers (which are built with high tolerances and an eye towards good heat dissipation characteristics), it can kill a homemade desktop or server. I won't go into solutions, just explain why it's hard to work on these when you're on the other end of a phone line or e-mail thread.
This is another one that can be a nightmare to work on, at least from the perspective of someone troubleshooting the operating system. The way it manifests itself is very similar to random memory problems: Blue screens and access violations with no discernable pattern. The way I usually go at it? Open the case up and stick a big ol’ box fan pointing into the case. Low tech? Sure. Effective? Heck yeah.
One of the axioms we live by is that a software problem should be consistently reproducible. Sometimes figuring out the parameters for reproducing a problem can be tricky, but if we see a closely related set of behaviors around multiple failures, we can feel good that it is something we can fix in software. Bad hardware, on the other hand, plays by no one’s rules.
Someone taking a 30,000-foot view of the problem might say: “It is consistent. I run for this long, and it always blue screens!” When we dig into the details though, a different picture emerges. What the CPU was doing at the time of each failure, in terms of software, could be drastically different. Running notepad, SQL, minesweeper, core OS functions, it doesn’t matter. You have to look at the state of the system itself, and see if the CPU is doing exactly what it should be, or if RAM has conspicuous patterns that don’t match anything software would likely create, or a device is returning noise instead of data. Getting to the root cause of a problem like this can be terribly difficult without the right tools, especially when you only have a snapshot of the system provided by a memory.dmp file, instead of the live (or more appropriately, freshly dead) system sitting in front of you.
|
-
IMO, it's not what anyone else might think. SQL, Exchange, and Web Services get all the hype, but I think Terminal Services will get the most immediate benefit from the backwards-compatible nature of the x64 architecture. Let's look at some of the benefits that the platform provides over x86 and IA-64:
- Huge virtual address (VA) space for the whole system. The current limit for x86 TS implementations is limited kernel memory space. Once you run out, that's it...no more users. What surprises many people is that it often happens well before 4GB of RAM gets exhausted. Forget about /PAE, that just makes the problem worse.
- Full-speed execution of 32bit x86 code, with full Win32 support. This means no one has to port business apps to see the benefit of the new platform. You can move to it on your own schedule. This is the way most Enterprises want to work.
- Shared pages for 32bit code. Besides the slow emulation solution for current IA-64 platforms, there's also a problem in that there are no shared code pages, due to the page size difference from x86 to IA-64 (4KB for x86 vs. 8KB for IA-64). That means every copy of Outlook needs 150MB, whereas on a shared memory system, many of those pages could be used by different processes for hundreds of users.
- No Execute (NX) means that there's an added level of security that wasn't available previously. This might cause heartache for some in-house apps, but thanks to the configurable nature of the NX support, most companies can use this feature to their benefit, without having to purchase all new workstations.
I don't have numbers, but I expect that you would get FAR more users onto a 4-proc (Opteron) 4GB or 8GB machine running 64bit Server 2003, than the same machine running 32bit Server 2003.
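To put rough numbers on the shared-pages point, here is a back-of-the-envelope sketch in Python. The 150MB per-process figure comes from the Outlook example above; the 60% shareable share and the 100-user count are purely hypothetical values chosen for illustration, and `total_memory_mb` is a made-up helper name.

```python
# Back-of-the-envelope memory estimate for a Terminal Server with many users
# running the same application. All inputs other than the 150MB figure are
# hypothetical.

def total_memory_mb(users, per_process_mb, shareable_percent):
    """Estimate total MB when shareable pages are mapped once for all users."""
    if not 0 <= shareable_percent <= 100:
        raise ValueError("shareable_percent must be between 0 and 100")
    shared_mb = per_process_mb * shareable_percent // 100  # mapped once
    private_mb = per_process_mb - shared_mb                # needed per user
    if shared_mb == 0:
        return users * private_mb
    return shared_mb + users * private_mb


if __name__ == "__main__":
    users = 100
    no_sharing = total_memory_mb(users, 150, 0)     # every copy pays full price
    with_sharing = total_memory_mb(users, 150, 60)  # 60% assumed shareable
    print(f"no shared pages:   {no_sharing} MB")
    print(f"with shared pages: {with_sharing} MB")
```

With these assumed inputs, the unshared case needs 15000 MB and the shared case 6090 MB, which is the kind of gap that determines how many users fit on one box.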
This posting is provided "AS IS" with no warranties, and confers no rights.
|
-
That could probably help 90% of the developers at Microsoft, to be honest. Kernel mode debugging is sometimes equated to black magic for devs who spend most of their time in the highly friendly (and deterministic) world of user mode.
An analogy I like is to compare kernel mode to cutthroat corporate life, and user mode to bucolic academic life. In user mode, there are rules that you can’t break, and are enforced from the outside by the faculty (the OS). Breaking the rules doesn’t take down the whole university, only one member. You only hurt yourself.
On the other hand, kernel mode has a set of agreed-upon rules, but they’re not nearly as strongly enforced…results are more important after all! You’re expected to follow them, you can cheat if you want (and even get away with it for a while), but when you screw up, chances are you’ll take down the whole enterprise.
You can’t take anything for granted in kernel mode, because there can be any number of kids playing in the same sandbox. Someone can take your CPU away, take your memory away, dump garbage on your data, and not even call you in the morning. Paradoxically, while it’s more critical here than anywhere else that everyone follows the rules, it’s not in our interest to strongly enforce them. Too much error and behavior checking here could bring the system down to an unusable crawl. So we let driver writers have the power to destroy worlds, and hope they use them for good and not evil.
As anyone who has used Windows NT and up knows, this doesn’t always go well.
In coming entries, I’ll cover some of the basics of how to open and analyze memory dump files, so you can at least feel like you have a starting place when you get a blue screen on one of your systems. I’ll move on to more advanced topics if there’s an interest.
|
-
On my way into work today, I realized that Seattle traffic is one of those human systems that visibly conform to the second law of thermodynamics… OK, that’s a gross oversimplification, but it’s still amazing to watch people intentionally move from a high energy state to a low one, with no good explanation.
Example: On the exit I take to get to work every day, there are two lanes that turn left. Invariably, there is a line of 20+ cars in one lane, and 1 or 2 in the other. Why no one else uses the other lane is beyond me. I might understand better if it was always the same lane, but it often changes.
Another Example: If someone is doing 60 mph in their lane, in a line of cars, with good spacing, and a spot opens up in another lane (left or right)…someone will invariably dive into that spot, but not change speeds at all, thus blocking that lane for faster traffic.
On a related note, if you ever need to calculate the noise from a segment of road you’re planning, the National Physical Laboratory in the UK has the calculations for you here.
|
|
|
|