August, 2005

Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    Where do "checked" and "free" come from?


    People who have MSDN or the DDK know that Windows is typically built in two different flavors, "Checked" and "Free".  The primary difference between the two is that the "checked" build has traces and asserts, but the free build doesn't.

    Where did those names "checked" and "free" come from?  They're certainly not traditional; the traditional words are "Debug" and "Retail" (or "Release").

    When we were doing the initial development of Windows NT, we started by using the same "Debug" and "Retail" names that most people use.

    The thing is, it turns out that there are actually four different sets of options that make up the "Debug" and "Retail" split.

    You have:

    1. Compiler Optimization: On/Off
    2. Debug Traces: On/Off
    3. Assertions: Enabled/Disabled
    4. Sanity checks: Enabled/Disabled

    Traditionally, "Debug" is "Optimization:off, Traces:on, Assertions: on" and "Retail" is "Optimization:on, Traces:off, Assertions: off".  Sanity checks were something the NT team added.  The idea was that there would be additional sanity checks built in for our internal development that would be removed before we shipped.

    So the NT build team wanted to build "Optimization:on, Traces:on, Assertions: on, sanity checks:on" and "Optimizations:on, traces:off, assertions: off, sanity checks: on" and "optimizations:on, traces:off, assertions:off, sanity checks: off".

    The last was what was traditionally called "Retail" - no debugging whatsoever.  However, the question still remained - what to call the "O:on, T:on, A:on, S:on" and "O:on, T:off, A:off, S:on" build - the first wasn't "Debug" because the optimizer was enabled, the latter wasn't "Retail", since the sanity checks were enabled.

    So clearly there needed to be some other name to differentiate these cases.  After some internal debate, we settled on "Checked" for the "O:on, T:on, A:on, S:on" and "Free" for the "O:on, T:off, A:off, S:on" build.  Checked because it had all the checks enabled, and free because it was "check free".

    As the NT 3.1 project progressed, the team eventually realized that (a) since they'd never actually tested the "retail" build, they had no idea what might break when they started making retail builds, and (b) the free build had been perf tested and met the perf criteria.  So the team decided to ship the free build as the final version.
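
    To make the four switches easier to compare, here is a small sketch that tabulates the flavors described above.  The names BuildFlavor, FLAVORS and name_for are made up for illustration (they are not anything from the NT build system), and the traditional "Debug" row uses None for sanity checks because that switch didn't exist outside NT.

```python
from typing import NamedTuple, Optional


class BuildFlavor(NamedTuple):
    """One combination of the four switches from the post."""
    optimization: bool
    traces: bool
    assertions: bool
    sanity_checks: Optional[bool]  # None: the switch did not exist for this flavor


# Illustrative table only; the names and layout are assumptions.
FLAVORS = {
    "debug (traditional)": BuildFlavor(False, True, True, None),
    "checked": BuildFlavor(True, True, True, True),
    "free": BuildFlavor(True, False, False, True),
    "retail": BuildFlavor(True, False, False, False),
}


def name_for(optimization, traces, assertions, sanity_checks):
    """Return the flavor name for a combination of switches, or None if unnamed."""
    wanted = BuildFlavor(optimization, traces, assertions, sanity_checks)
    for name, flavor in FLAVORS.items():
        if flavor == wanted:
            return name
    return None
```

    Laid out this way, it's easy to see that "checked" and "free" differ only in traces and assertions, and that "free" and "retail" differ only in the sanity checks.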



  • Larry Osterman's WebLog



    Today, I'm celebrating my 21st anniversary at Microsoft (it was actually Saturday).  Anyone on campus can feel free to stop by my office and enjoy some of the 21 bagels (and cream cheese) I brought in to celebrate.


    Sorry Esther, I guess I have to turn in my badge :)


  • Larry Osterman's WebLog

    My favorite hardware bug


    Adi Oltean asks: What's your favorite Bug?

    My personal favorite was on the ICL PWS-400.  The ICL PWS-400 was a custom hardware design built by ICL. I was on the team of 5 (two from Microsoft, three from ICL) whose job it was to port MS-DOS 4.1 to this new hardware.  The cool thing about the PWS-400 was that it had some custom hardware that allowed real mode applications to access bank switched memory in 4K pages.  This allowed apps to run in the background without impacting running applications.

    Since the five of us were the entire development team, we also did a lot of ad-hoc testing.  One of my personal favorites was running a game that Valorie had brought me from school.  I'm not sure which game it was now, but every time I played it, when it got to a specific spot, the machine would spontaneously reboot.

    We put the machine under an ICE (in circuit emulator - a hardware tool that lets you see what's going on inside and outside the CPU) and discovered that the CPU was being externally reset.  That ruled out some weird software bug.

    The hardware guys took the game and the machine and started looking.

    After a couple of days, they came back to me and announced they'd found the problem.  It turns out that the trace on the motherboard for the PC speaker was too close to the trace on the motherboard for the CPU reset line.  When you played a specific sound on the PC speaker, EMF emissions from the speaker trace would cause the CPU reset to go high, which caused the CPU to reboot.


    Gotta love working with hardware :)


  • Larry Osterman's WebLog

    Modifying software when running as LUA


    Lee Holmes (a developer over in the Monad group) had a problem.

    It seems that he had an application that just insisted on his being an admin to work correctly...

    Being a blogger, of course he wrote about it: here.

    I don't recommend anyone doing this, but it's an interesting approach to fixing a broken application.


    And yeah, I know that if the application was open source, it wouldn't have been a problem.  Just don't go there :)


  • Larry Osterman's WebLog

    The application called an interface that was marshalled for a different thread.

    Another one from someone sending a comment:

    I came across your blog and was wondering if what to do when encountering above error message in an application program.

    The error occurs once in a while when printing out from a Windows application.

    Is there some setting missing in computer administration or in the Registry or can it only be solved in the code?

    Appreciate your help!

    Yech.  This one's ugly.  It's time for Raymond's Psychic Powers(tm) of detection.

    If you take the error message text and look inside winerror.h, you'll see that the error message mentioned is exactly the text for the RPC_E_WRONG_THREAD error.

    If you then do an MSDN search for RPC_E_WRONG_THREAD, the first hit is: "INFO: Explanation of RPC_E_WRONG_THREAD Error".  Essentially, the error's a side effect of messing up threading models.  I wrote about them about 18 months ago in "What are these threading models, and why do I care?". 

    So, knowing that the app's dutifully reporting RPC_E_WRONG_THREAD to the user, what had to have happened to cause this error?

    It means that the application did a CoCreateInstance of a Single Threaded Apartment COM object in one thread, but used it in another thread.

    Given the comment that it only happens once in a while, we can further deduce that the application called CoCreateInstance from a thread in a pool of worker threads, and attempted to use it in a function queued to that pool of threads (otherwise it would fail all the time and the author of the app would have found the problem).  Given that it only happens when printing (an operation that's usually handled in a background thread), this makes sense.
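
    For anyone who hasn't run into thread affinity before, here's a rough analogy.  This is not COM and doesn't use any COM API; ApartmentBound, WrongThreadError and call_on_new_thread are hypothetical names invented for this sketch.  The object remembers which thread created it and refuses calls from any other thread, which is roughly the situation that produces RPC_E_WRONG_THREAD.

```python
import threading


class WrongThreadError(RuntimeError):
    """Hypothetical stand-in for the RPC_E_WRONG_THREAD failure."""


class ApartmentBound:
    """Toy object that may only be used on the thread that created it."""

    def __init__(self):
        # Remember the creating thread, as a single threaded apartment would.
        self._owner = threading.get_ident()

    def print_page(self):
        if threading.get_ident() != self._owner:
            raise WrongThreadError("object used on a thread that did not create it")
        return "printed"


def call_on_new_thread(obj):
    """Call obj.print_page() on a fresh thread and report what happened."""
    result = {}

    def worker():
        try:
            result["value"] = obj.print_page()
        except WrongThreadError as exc:
            result["error"] = exc

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result
```

    If work items land on a pool of threads, a call only succeeds when it happens to run on the creating thread, which would explain a failure that shows up "once in a while".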

    Unfortunately for the person who asked the question, they don't really have any choice but to contact the vendor that created the app and hope that they have an update that fixes the problem, because there's no workaround you can do outside the app :(

  • Larry Osterman's WebLog

    Why don't critical sections work cross process?

    I could have sworn this was answered in a previous blog by someone else (Raymond, Eric Lippert, etc), but...

    Someone sent me feedback asking:

    Q> Why can't critical section objects be used across processes compared to mutexes?

    Originally, I thought "Man, that's a silly question, it's obvious".

    But then I realized that it's not obvious, because critical sections are different from every other external synchronization mechanism in Windows (there are some internal synchronization mechanisms that share characteristics with critical sections, but they're not public).

    You see, a critical section isn't a native object type in Windows.  All the other synchronization primitives (mutexes, events, semaphores, etc) are native objects - the user mode semaphore (or mutex, or event) is an object maintained by NT's object manager.  As such, it has an ACL, a name, all the things that go with being a native object. 

    And since these synchronization primitives are maintained by the object manager, they can be shared across processes - another process can open a named handle, or you can dup the handle into another process, or you can have the handle be inherited by a child process.

    But critical sections are special.

    You see, the flexibility that you get by being maintained by the NT object manager has a cost associated with it - every operation that's performed on the semaphore/mutex/event requires a user mode to kernel mode transition, as does waiting on the object.

    Sometimes that cost is too high - there's a need for a highly performant lock structure that can be used to protect a region of code.  That's where the critical section comes into play.

    A critical section is just a structure; it contains a whole bunch of opaque fields.  Inside the EnterCriticalSection routine is code that uses interlocked instructions to acquire the structure without entering the kernel - it's what makes the critical section so fast.

    Of course, the fact that the critical section is just a structure is also why it can't be shared between processes - since it's just a chunk of memory that's in the process's address space, it's not accessible to other processes.

    The clever observer now realizes that this raises the question: What happens if I initialize a critical section in a shared memory region - after all, it's just a chunk of memory, I can share a memory region between two processes, and just initialize a critical section in the shared memory region.

    This might actually work, for a while.  But the thing about critical sections is that they're more than just a spin lock.  There's also a semaphore that's acquired when the critical section has contention.  And that semaphore isn't shared between processes (actually, the semaphore isn't even "allocated" until there's contention (it's not allocated, per se)).  If that wasn't enough, there are also fields within the critical section that point to other external data structures as well - those structures won't exist in the process that didn't initialize the critical section.  There's no way of knowing what will happen if the other process enters the critical section.  If you're lucky, the process will crash.  If you're not, you might "just" corrupt memory.
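
    Here's a toy model of that shape: a counter that's updated atomically on the fast path, plus a wait object that only comes into existence when there's contention.  It is emphatically not the real critical section layout or algorithm.  ToyCriticalSection and its methods are invented for this sketch, the lock is non-recursive, and since Python has no interlocked instructions a tiny internal lock stands in for them.

```python
import threading


class ToyCriticalSection:
    """Toy lock: atomic counter on the fast path, semaphore only on contention."""

    def __init__(self):
        self._atomic = threading.Lock()  # stand-in for interlocked instructions
        self._count = 0                  # threads inside or waiting to get in
        self._semaphore = None           # wait object, created on first contention

    def _add(self, delta):
        # Emulates an interlocked add and returns the previous value.
        with self._atomic:
            previous = self._count
            self._count += delta
            return previous

    def _wait_object(self):
        # Lazily create the semaphore; it lives only inside this process.
        with self._atomic:
            if self._semaphore is None:
                self._semaphore = threading.Semaphore(0)
            return self._semaphore

    def enter(self):
        if self._add(1) > 0:               # somebody else is already inside
            self._wait_object().acquire()  # slow path: block on the wait object

    def leave(self):
        if self._add(-1) > 1:              # at least one thread is waiting
            self._wait_object().release()  # hand the lock to one waiter

    def has_wait_object(self):
        return self._semaphore is not None

    def thread_count(self):
        with self._atomic:
            return self._count


def run_counter_demo(threads, iterations):
    """Increment a shared counter under the toy lock and return the total."""
    cs = ToyCriticalSection()
    total = [0]

    def worker():
        for _ in range(iterations):
            cs.enter()
            total[0] += 1
            cs.leave()

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return total[0]
```

    The counter is plain data that you could imagine putting in shared memory, but the semaphore is an object that belongs to the process that created it.  That's the part that has no meaning in a second process.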

    This is a really long answer to a really short question, but sometimes it's worth digging into it a bit.

  • Larry Osterman's WebLog

    Larry goes to Layer Court

    Two weeks ago, my boss, another developer in my group, and I had the opportunity to attend "Layer Court".

    Layer Court is the end product of a really cool part of the quality gate process we've introduced for Windows Vista.  This is a purely internal process, but the potential end-user benefits are quite cool.

    As systems get older, and as features get added, systems grow more complex.  The operating system (or database, or whatever) that started out as a 100,000 line of code paragon of elegant design slowly turns into fifty million lines of code that have a distinct resemblance to a really big plate of spaghetti.

    This isn't something specific to Windows, or Microsoft, it's a fundamental principle of software engineering.  The only way to avoid it is extreme diligence - you have to be 100% committed to ensuring that your architecture remains pure forever.

    It's no secret that regardless of how architecturally pure the Windows codebase was originally, lots of spaghetti-like issues have crept into the product over time.

    One of the major initiatives that was ramped up with the Longhorn (now Windows Vista) reset was the architectural layering initiative.  The project had existed for quite some time, but with the reset, the layering team got serious.

    What they've done is really quite remarkable.  They wrote tools that perform static analysis of the Windows binaries and work out the architectural and engineering dependencies between various system components.

    These can be as simple as DLL dependencies (program A references DLLs B and C, DLL B references DLL D, DLL D in turn references DLL C), they can be as complicated as RPC dependencies (DLL A has a dependency on process B because DLL A contacts an RPC server that is hosted in process B).

    The architectural layering team then went out and assigned a number to every single part of the system starting at ntoskrnl.exe (which is the bottom, at layer 0).

    Everything that depended only on ntoskrnl.exe (things like win32k.sys or kernel32.dll) was assigned layer 1 , the pieces that depend on those (for example, user32.dll) got layer 2, and so forth (btw, I'm making these numbers up - the actual layering is somewhat more complicated, but this is enough to show what's going on).

    As long as the layering is simple, this is pretty straightforward.  But then the spaghetti problem starts to show up.  Raymond may get mad, but I'm going to pick on the shell team as an example of how a layering violation can appear.  Consider a DLL like SHELL32.DLL.  SHELL32 contains a host of really useful low level functions that are used by lots of applications (like PathIsExe, for example).  These functions do nothing but string manipulation of their input parameters, so they have virtually no lower level dependencies.  But other functions in SHELL32 (like DefScreenSaverProc or DragAcceptFiles) manipulate windows and interact with a large number of lower level components.  As a result of these high level functions, SHELL32 sits relatively high in the architectural layering map (since some of its functions require high level functionality).

    So if a relatively low level component (say the Windows Audio service) calls into SHELL32, that's what is called a layering violation - the low level component has taken an architectural dependency on a high level component, even if it's only using the low level functions (like PathIsExe).
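
    The numbering and the violation check can be sketched in a few lines.  This is only an illustration of the idea, not the layering team's tooling: compute_layers and find_violations are made-up names, the dependency lists are simplified guesses based on the examples above, and the layer numbers are just as invented as the ones in the text.

```python
def compute_layers(deps):
    """Assign layer 0 to components with no dependencies, else 1 + deepest dependency."""
    layers = {}
    visiting = set()

    def layer_of(name):
        if name in layers:
            return layers[name]
        if name in visiting:
            raise ValueError(f"dependency cycle through {name}")
        visiting.add(name)
        below = deps.get(name, ())
        value = 0 if not below else 1 + max(layer_of(d) for d in below)
        visiting.discard(name)
        layers[name] = value
        return value

    for name in deps:
        layer_of(name)
    return layers


def find_violations(assigned_layers, deps):
    """Return (component, dependency) pairs where the dependency sits in a higher layer."""
    return [
        (component, dependency)
        for component, below in deps.items()
        for dependency in below
        if assigned_layers[dependency] > assigned_layers[component]
    ]


# Simplified, illustrative dependency map (not the real Windows graph).
EXAMPLE_DEPS = {
    "ntoskrnl.exe": [],
    "kernel32.dll": ["ntoskrnl.exe"],
    "win32k.sys": ["ntoskrnl.exe"],
    "user32.dll": ["kernel32.dll", "win32k.sys"],
}
```

    With the example map, compute_layers puts ntoskrnl.exe at 0, kernel32.dll and win32k.sys at 1, and user32.dll at 2.  Once layers are fixed, find_violations flags any edge that points upward, such as a (hypothetical) layer 3 audio service calling into a layer 6 shell32.dll.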

    They also looked for engineering dependencies - when low level component A gets code that's delivered from high level component B - the DLLs and other interfaces might be just fine, but if a low level component A gets code from a higher level component, it still has a dependency on that higher level component - it's a build-time dependency instead of a runtime dependency, but it's STILL a dependency.

    Now there are times when low level components have to call into higher level components - it happens all the time (windows media player calls into skins which in turn depend on functionality hosted within windows media player).  Part of the layering work was to ensure that when this type of violation occurred that it fit into one of a series of recognized "plug-in" patterns - the layering team defined what were "recognized" plug-in design patterns and factored this into their analysis.

    The architectural layering team went through the entire Windows product and identified every single instance of a layering violation.  They then went to each of the teams in turn, and asked them to resolve their dependencies (either by changing their code (good) or by explaining why their code matches the plug-in pattern (also good), or by explaining the process by which their component will change to remove the dependency (not good, because it means that the dependency is still present)).  For this release, they weren't able to deal with all the existing problems, but at least they are preventing new ones from being introduced.  And, since there's a roadmap for the future, we can rely on the fact that things will get better in the future.

    This was an extraordinarily painful process for most of the teams involved, but it was totally worth the effort.  We now have a clear map of which Windows components call into which other Windows components.  So if a low level component changes, we can now clearly identify which higher level components might be affected by that change.  We finally have the ability to understand how changes ripple throughout the system, and more importantly, we now have mechanisms in place to ensure that no lower level components ever take new dependencies on higher level components (which is how spaghetti software gets introduced).

    In order to ensure that we never introduce a layering violation that isn't understood, the architectural layering team has defined a "quality gate" that ensures that no new layering violations are introduced into the system (there are a finite set of known layering violations that are allowed for a number of reasons).  Chris Jones mentioned "quality gates" in his Channel9 video, essentially they are a series of hurdles that are placed in front of a development team - the team is not allowed to check code into the main Windows branches unless they have met all the quality gates.  So by adding the architectural layering quality gate, the architectural layering team is drawing a line in the sand to ensure that no new layering violations ever get added to the system.

    So what's this "layer court" thingy I talked about in the title?  Well, most of the layering issues can be resolved via email, but for some set of issues, email just doesn't work - you need to get in front of people with a whiteboard so you can draw pretty pictures and explain what's going on.  And that's where we were two weeks ago - one of the features I added for Beta2 restored some functionality that was removed in Beta1, but restoring the functionality was flagged as a layering violation.  We tried, but were unable to resolve it via email, so we had to go to explain what we were doing and to discuss how we were going to resolve the dependency.

    The "good" news (from our point of view) is that we were able to successfully resolve the issue - while we are still in violation, we have a roadmap to ensure that our layering violation will be fixed in the next release of Windows.  And we will be fixing it :)


  • Larry Osterman's WebLog

    Off for the week


    Valorie, the kids, and I are heading back to Albany for the week, so no posts until next week.



  • Larry Osterman's WebLog

    Things I learned while watching Scapina at SCT last night


    One-note-song time again (Larry harping on Seattle Children's Theater).

    Last night, Valorie and I went to see the opening night of "As You Like It" by Shakespeare (adapted by Don Flemming), and "Scapina" by Moliere (adapted by Todd Jefferson Moore).

    I haven't laughed that hard since I saw "The Producers" on Broadway.

    These shows were SO funny.

    As You Like It had huge amounts of physical comedy, and was wonderfully acted.  The actors in the show had almost perfect timing, they really nailed the show.  There's a wrestling match that's straight out of the WWE in the middle of it, that had the audience roaring.

    Scapina, in particular was an eye opener.  You see, it was produced as a joint production of SCT and SCT's Deaf Youth Drama Program.  As such, about half the cast was hearing, the other half was either deaf or hearing impaired.  It was simultaneously signed and spoken - while the deaf cast members were signing, other cast members would speak their lines.

    This play was a total hoot - lots of physical comedy and (quite literally) Keystone Kops style hijinks, Valorie and I enjoyed ourselves thoroughly.

    Some things I didn't realize...

    1. ASL clapping is done by waving your hands, fingers outstretched in the air.
    2. Sidekicks are essentially required accessories for deaf and hearing impaired people - from what I saw in the audience (and on stage), they're ubiquitous.  Whenever an actor in the play received a message, it was on a sidekick.
    3. There is a huge opportunity for physical comedy when the actors in the play need to use their hands to speak.  There's one scene where one character is holding a sausage.  When they have to speak, they hand the sausage to a neighboring actor, recite their lines, and take the sausage back.  Later on in the show, the character complains because it looks like the sausage has been trampled on.
    4. Theater for the deaf is as much about physical presence as it is about skill in reading the lines.  As much of the play was acted out on people's faces and body language as it was from the words they signed/spoke.
    5. It's REALLY cool seeing hearing and deaf actors on stage together in the same production - it brings a whole new dimension to the show.  I know it's been done in "Big River", but this is the first time I've seen the technique used.  My mom saw Big River at Roundabout, but I hadn't had a chance to see it.  She was quite sceptical before she saw the show but she absolutely raved about it when we talked the next day.

    I know it's late (the shows only run for tonight and tomorrow night), but if you're in the Seattle area, you absolutely should go and see these shows - just show up at the SCT box office at either 6:45 or 8:00 and you should be able to get tickets.

    Edit: Added credit to As You Like It, they got short shrift in the original.


  • Larry Osterman's WebLog

    Re-applying the decal


    Some of you might know this, others might not even care, but one of my private passions is an MMORPG called "Asheron's Call" that Microsoft originally published with Turbine games.

    Valorie started playing it first (she was an early playtester for it), and I got hooked when I put out my back several years ago.

    It's been a long strange trip, AC is no longer published by Microsoft, it's now being published by Turbine directly, but the game's still pretty fun to play.

    One of the reasons that AC is compelling (IMHO) is that Turbine has not discouraged 3rd parties from writing tools that enhance the gameplay experience of AC - there's an extremely rich infrastructure of tools that enable all sorts of interactions with the game beyond those that the designers anticipated (automating crafting interactions, auto-salvaging of loot, monster detection, etc).  These plugins all run in a framework known as "decal".

    Turbine recently released an expansion pack which rolled out an entirely new client (based on Turbine's AC2/D&DO/LOTRO client).  This has broken all the 3rd party applications that were built for AC, because of the massive infrastructure changes that were introduced.

    Adam Wright (Asriel), one of the decal developers, has been posting a fascinating series of articles that describe the process that the decal developers are going through to re-create the decal functionality on top of this new client.  It's well worth reading, even if you don't care about playing AC.


  • Larry Osterman's WebLog

    Why does Microsoft "Time Bomb" its beta releases?


    One question that periodically comes up is "Is <x> beta time bombed"?

    First off, what's a time bomb?  It's a chunk of code that's intended to disable a beta release sometime after the beta ships.

    I believe that all MS beta products have to be time bombed, I know that all the products I've worked on recently have been time bombed.  The time bomb can be mild (you lose the ability to send or receive new email), or it can be severe (the product refuses to start/boot), but all beta products are time bombed.

    As far as I know, the reason is buried deep in the roots of Exchange.  When Exchange 4.0 shipped (back in 1994ish) there had been several previous beta releases.  What Microsoft didn't realize at the time was that these early beta releases of Exchange were "good enough" for the sites that were running that beta.

    But the beta had a bunch of bugs in it - not serious enough to make the product unusable, but enough to cause interoperability problems with existing email systems (I really don't remember the details, but the problem was something minor, like the SMTP gateway generating uuencoded TNEF blobs instead of converting to MIME, or something like that).

    We fixed the problem long before RTM, it only existed in the one beta release of the product.  However these beta servers continued to be run by companies that had received the beta for several years after we released the product.  The consequence of this was that Microsoft continued to have people reporting that Microsoft Exchange was producing illegal messages.

    The thing is that no shipping product had those bugs.  The problems were seen because some very small subset of customers hadn't upgraded to the shipping version of the product.

    As a result of this (and other similar problems), Microsoft started time bombing beta releases - that way Microsoft can guarantee that beta releases don't cause problems long after the product RTMs.
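
    In its simplest form a time bomb is just a date comparison at startup.  The sketch below is a guess at the general shape, not how any Microsoft product actually does it: the 180 day lifetime, the function names and the returned strings are all assumptions made for the example.

```python
from datetime import date, timedelta

# Assumed lifetime for the example; real betas pick their own expiry.
BETA_LIFETIME = timedelta(days=180)


def is_expired(build_date, today, lifetime=BETA_LIFETIME):
    """True once today is past the build date plus the allowed lifetime."""
    return today > build_date + lifetime


def startup_action(build_date, today, severe):
    """Pick what the beta does at startup: mild and severe bombs differ only here."""
    if not is_expired(build_date, today):
        return "run normally"
    if severe:
        return "refuse to start"
    return "run with features disabled"
```

    For a beta built on January 1, 2005 with this lifetime, June 30 is the last good day: startup_action(date(2005, 1, 1), date(2005, 7, 1), severe=True) returns "refuse to start".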

  • Larry Osterman's WebLog

    Life in a faraday cage


    There was an internal discussion about an unrelated topic recently, and it reminded me of an early experience in my career at Microsoft.

    When I started, my 2nd computer was a pre-production PC/AT (the first was an XT). The AT had been announced by IBM about a week before I started, so our pre-production units were allowed to be given to other MS employees (since I had to write the disk drivers for that machine, it made sense for me to own one of them).

    Before I got the machine, however, it was kept in a room that we semi-affectionately called "the fishtank" (it was the room where we kept the Salmons (the code name for the PC/AT)).

    IBM insisted that we keep all the pre-production computers we received from them in this room - why?

    Two reasons.  The first was that there was a separate lock on the door that would limit access to the room.

    The other reason was that IBM had insisted that we build a faraday cage around the room.  They were concerned that some ne'er-do-well would use the RF emissions from the computer (and monitor) to read the contents of the screen and RAM.  I was told that they had technology that would allow them to read the contents of an individual screen from across the street, and they were worried about others being able to do the same thing.

    Someone at work passed this link along to a research paper by Wim van Eck that discusses the technical details behind the technology.


  • Larry Osterman's WebLog

    Another pet peeve. Nounifying the word "ask"


    Sorry about not blogging, my days are filled with meetings trying to finish up our LH beta2 features - I can't wait until people see this stuff, it's that cool.

    But because I'm in meetings back-to-back (my calendar looks like a PM's these days), I get subjected to a bunch of stuff that I just hate.

    In particular, one "meme" that seems to have taken off here at Microsoft is nounifying the word "ask".

    I can't tell how many times I've been in a meeting and had someone say: "So what are your team's asks for this feature?" or "Our only ask is that we have the source process ID added to this message".

    For the life of me, I can't see where this came from, but it seems like everyone's using it.

    What's wrong with the word "request"?  It's a perfectly good noun and it means the exact same thing that a nounified "ask" means.


  • Larry Osterman's WebLog

    Dos Ain't Done 'til Lotus Won't Run


    I was originally going to do a post on this, but Adam (who interviewed me on the topic before he came back to Microsoft) just posted this article and did a far better job of it than I could have ever done (he actually went out and did research and stuff).

    So go to Adam's blog and enjoy Adam's thorough debunking of a canard.


  • Larry Osterman's WebLog

    Where do they get those names?


    I've stayed out of the Windows Vista naming thingy, I figure that the name will grow on me.  But whatever you might think of Windows Vista, it's got to be better than:

    Microsoft Windows Server Base Operating Systems Management Pack for Microsoft Operations Manager 2005

    I apologize in advance to anyone on the MWSBOSMPfMOM team, I'm quite sure that it's a wonderful product (pack?) and is critical functionality, but surely Microsoft could have come up with a better name for that one :)

    Sorry, I'm in spec purgatory today, and for some reason I felt like venting a bit :)

    Edit: I've been informed by the MOM team that this is a poor example of a badly named product, since it's an add-on for a separate product (MOM 2K5).

    So I'll pick on someone else instead: :)

    Microsoft® WinFX™ Software Development Kit for Microsoft® Pre-Release Windows Operating System Code-Named "Longhorn", Beta 1 Web Setup

    Thanks Jonathan :)

