Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    Threat Modeling Again, Threat Modeling Rules of Thumb

    • 12 Comments

    I wrote this piece up for our group as we entered the most recent round of threat models.  I've cleaned it up a bit (removing some Microsoft-specific stuff), and there's stuff that's been talked about before, but the rest of the document is pretty relevant. 

     

    ---------------------------------------

    As you go about filling in the threat model threat list, it’s important to consider the consequences of entering threats and mitigations.  While it can be easy to find threats, it is important to realize that all threats have real-world consequences for the development team.

    At the end of the day, this process is about ensuring that our customers’ machines aren’t compromised. When we’re deciding which threats need mitigation, we concentrate our efforts on those where the attacker can cause real damage.

     

    When we’re threat modeling, we should ensure that we’ve identified as many of the potential threats as possible (even if you think they’re trivial). At a minimum, the threats we list that we choose to ignore will remain in the document to provide guidance for the future.

     

    Remember that the feature team can always decide that we’re ok with accepting the risk of a particular threat (subject to the SDL security review process). But we want to make sure that we mitigate the right issues.

    To help you guide your thinking about what kinds of threats deserve mitigation, here are some rules of thumb that you can use while performing your threat modeling.

    1. If the data hasn’t crossed a trust boundary, you don’t really care about it.

    2. If the threat requires that the attacker is ALREADY running code on the client at your privilege level, you don’t really care about it.

    3. If your code runs with any elevated privileges (even if your code runs in a restricted svchost instance) you need to be concerned.

    4. If your code invalidates assumptions made by other entities, you need to be concerned.

    5. If your code listens on the network, you need to be concerned.

    6. If your code retrieves information from the internet, you need to be concerned.

    7. If your code deals with data that came from a file, you need to be concerned (these last two are the inverses of rule #1).

    8. If your code is marked as safe for scripting or safe for initialization, you need to be REALLY concerned.

     

    Let’s take each of these in turn, because there are some subtle distinctions that need to be called out.

    If the data hasn’t crossed a trust boundary, you don’t really care about it.

    For example, consider the case where a hostile application passes bogus parameters into our API. In that case, the hostile application lives within the same trust boundary as the application, so you can simply certify the threat. The same thing applies to window messages that you receive. In general, it’s not useful to enumerate threats within a trust boundary. [Editor's Note: Yesterday, David LeBlanc wrote an article about this very issue - I 100% agree with what he says there.]

    But there’s a caveat (of course there’s a caveat, there’s ALWAYS a caveat). Just because your threat model diagram doesn't have a trust boundary on it, it doesn't mean that the data being validated hasn't crossed a trust boundary on the way to your code.

    Consider the case of an application that takes a file name from the network and passes that filename into your API. And further consider the case where your API has an input validation bug that causes a buffer overflow. In that case, it’s YOUR responsibility to fix the buffer overflow – an attacker can use the innocent application to exploit your code. Before you dismiss this issue as being unlikely, consider CVE-2007-3670. The Firefox web browser allows the user to execute scripts passed in on the command line, and registered a URI handler named “firefoxurl” with the OS with the start action being “firefox.exe %1” (this is a simplification). The attacker simply included a “firefoxurl:<javascript>” in a URL and was able to successfully take ownership of the client machine. In this case, the Firefox browser assumed that there was no trust boundary between firefox.exe and the invoker, but it didn’t realize that it introduced such a trust boundary when it created the “firefoxurl” URI handler.
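    To make that trust boundary concrete, here's roughly what a URI protocol handler registration looks like in the registry. This is a simplified, hypothetical sketch (it is not Firefox's actual registration, and the path is made up); the point is that once these keys exist, anything on the machine that resolves a "firefoxurl:" link - a web page rendered in another browser, an email message, a document - can feed attacker-controlled data into that command line:

        REGEDIT4

        [HKEY_CLASSES_ROOT\firefoxurl]
        @="Firefox URL"
        "URL Protocol"=""

        [HKEY_CLASSES_ROOT\firefoxurl\shell\open\command]
        @="\"C:\\Program Files\\Mozilla Firefox\\firefox.exe\" \"%1\""

    The handler's author never sees the invoker, which is exactly why the command line has to treat %1 as hostile input.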

    If the threat requires that the attacker is ALREADY running code on the client at your privilege level, you don’t really care about it.

    For example, consider the case where a hostile application writes values into a registry key that’s read by your component. Writing those keys requires that there be some application currently running code on the client, which requires that the bad guy first be able to get code to run on the client box.

    While the threats associated with this are real, it’s not that big a problem and you can probably state that you aren’t concerned by those threats because they require that the bad guy run code on the box (see Immutable Law #1: “If a bad guy can persuade you to run his program on your computer, it’s not your computer anymore”).

    Please note that this item has a HUGE caveat: it ONLY applies if the attacker’s code is running at the same privilege level as your code. If that’s not the case, you have the next rule of thumb:

    If your code runs with any elevated privileges, you need to be concerned.

    We DO care about threats that cross privilege boundaries. That means that any data communication between an application and a service (which could be an RPC, it could be a registry value, it could be a shared memory region) must be included in the threat model.

    Even if you’re running in a low privilege service account, you still may be attacked – one of the privileges that all services get is the SE_IMPERSONATE_NAME privilege. This is actually one of the more dangerous privileges on the system because it can allow a patient attacker to take over the entire box. Ken “Skywing” Johnson wrote about this in a couple of posts (1 and 2) on his excellent blog Nynaeve. David LeBlanc has a subtly different take on this issue (see here), but the reality is that both David and Ken agree more than they disagree on this issue. If your code runs as a service, you MUST assume that you’re running with elevated privileges. This applies to all data read – rule #2 (requiring an attacker to run code) does not apply when you cross privilege levels, because the attacker could be writing code under a low privilege account to enable an elevation of privilege attack.
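    As a concrete illustration of why this rule matters, here's a small sketch (my own example, not anything a particular Windows service actually does) that checks whether the current process token even holds SeImpersonatePrivilege - a useful sanity check when you're deciding whether your code is running with privileges an attacker would love to borrow:

        #include <windows.h>

        // Returns true if the current process token holds SeImpersonatePrivilege
        // (whether or not it's currently enabled). Services get this privilege
        // by default, which is why service code must assume it's a target.
        bool TokenHoldsImpersonatePrivilege()
        {
            HANDLE token = NULL;
            if (!OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &token))
                return false;

            LUID impersonateLuid = {};
            LookupPrivilegeValueW(NULL, L"SeImpersonatePrivilege", &impersonateLuid);

            BYTE buffer[4096] = {};
            DWORD needed = 0;
            bool found = false;
            if (GetTokenInformation(token, TokenPrivileges, buffer, sizeof(buffer), &needed))
            {
                const TOKEN_PRIVILEGES* tp = reinterpret_cast<const TOKEN_PRIVILEGES*>(buffer);
                for (DWORD i = 0; i < tp->PrivilegeCount; i++)
                {
                    if (tp->Privileges[i].Luid.LowPart == impersonateLuid.LowPart &&
                        tp->Privileges[i].Luid.HighPart == impersonateLuid.HighPart)
                    {
                        found = true;
                        break;
                    }
                }
            }
            CloseHandle(token);
            return found;
        }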

    In addition, if your component has a use scenario that involves running the component elevated, you also need to consider that in your threat modeling.

    If your code invalidates assumptions made by other entities, you need to be concerned

    The reason that the firefoxurl problem listed above was such a big deal was that the firefoxurl handler invalidated some of the assumptions made by the other components of Firefox. When the Firefox team threat modeled Firefox, they made the assumption that Firefox would only be invoked in the context of the user.  As such it was totally reasonable to add support for executing scripts passed in on the command line (see rule of thumb #1).  However, when they threat modeled the firefoxurl: URI handler implementation, they didn’t consider that they had now introduced a trust boundary between the invoker of Firefox and the Firefox executable.

    So you need to be aware of the assumptions of all of your related components and ensure that you’re not changing those assumptions. If you are, you need to ensure that your change doesn’t introduce issues.

    If your code retrieves information from the internet, you need to be concerned

    The internet is a totally untrusted resource (no duh). But this has profound consequences when threat modeling. All data received from the Internet MUST be treated as totally untrusted and must be subject to strict validation.

    If your code deals with data that came from a file, then you need to be concerned.

    In the previous section, I talked about data received over the internet. Microsoft has issued several bulletins this year that required an attacker tricking a user into downloading a specially crafted file over the internet; as a consequence, ANY file data must be treated as potentially malicious. For example, MS07-047 (a vulnerability in WMP) required that the attacker force the user to view a specially crafted WMP skin. The consequence of this is that ANY file parsed by our code MUST be treated as coming from a lower level of trust.

    Every single file parser MUST treat its input as totally untrusted – MS07-047 is only one example of an MSRC vulnerability; there have been others. Any code that reads data from a file MUST validate the contents. It also means that we need to work to ensure that we have fuzzing in place to validate our mitigations.

    And the problem goes beyond file parsers directly. Any data that can possibly be read from a file cannot be trusted. <A senior developer in our division> brings up a codec as a perfect example. The file parser parses the container and determines that the container isn't corrupted. It then extracts the format information and finds the appropriate codec for that format. The parser then loads the codec and hands the format information and file data to the codec.

    The only thing that the codec knows is that the format information that’s been passed in is valid. That’s it. Beyond the fact that the format information is of an appropriate size and has a verifiable type, the codec can make no assumptions about the contents of the format information, and it can make no assumptions about the file data. Even though the codec doesn’t explicitly parse the file, it’s still dealing with untrusted data read from the file.
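    To make that concrete, here's the shape of the validation a codec has to do on format information it didn't parse itself. This is a hedged sketch - the structure and the limits are invented for illustration, not taken from any real codec:

        #include <cstddef>

        // Hypothetical format blob handed to a codec by a container parser.
        // The codec must treat every field as untrusted.
        struct AudioFormatInfo
        {
            unsigned int formatTag;
            unsigned int channels;
            unsigned int samplesPerSecond;
            unsigned int bitsPerSample;
        };

        bool ValidateFormat(const void* blob, size_t blobSize)
        {
            // Check the size before touching any field.
            if (blob == nullptr || blobSize < sizeof(AudioFormatInfo))
                return false;

            const AudioFormatInfo* fmt = static_cast<const AudioFormatInfo*>(blob);

            // Range-check every field against what this codec actually supports;
            // reject anything out of range rather than trying to "repair" it.
            if (fmt->channels == 0 || fmt->channels > 8)
                return false;
            if (fmt->samplesPerSecond < 8000 || fmt->samplesPerSecond > 192000)
                return false;
            if (fmt->bitsPerSample != 8 && fmt->bitsPerSample != 16 && fmt->bitsPerSample != 24)
                return false;

            return true;
        }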

    If your code is marked as “Safe For Scripting” or “Safe for Initialization”, you need to be REALLY concerned.

    If your code is marked as “Safe For Scripting” (or if your code can be invoked from a control that is marked as Safe For Scripting), it means that your code can be executed in the context of a web browser, and that in turn means that the bad guys are going to go after your code. There have been way too many MSRC bulletins about issues with ActiveX controls.

    Please note that some of the issues with ActiveX controls can be quite subtle. For instance, in MS02-032 we had to issue an MSRC fix because one of the APIs exposed by the WMP OCX returned a different error code depending on whether a path passed into the API was a file or a directory – that constituted an Information Disclosure vulnerability and an attacker could use it to map out the contents of the user's hard disk.

    In conclusion

    Vista raised the security bar for attackers significantly. As Vista adoption spreads, attackers will be forced to find new ways to exploit our code. That means that it’s more and more important that we do a good job of ensuring that attackers have as few opportunities as possible to make life difficult for our customers.  The threat modeling process helps us understand the risks associated with our features and understand where we need to look for potential issues.

  • Larry Osterman's WebLog

    The Endian of Windows

    • 17 Comments

    Rick's got a great post on what big and little endian are, and what the Apple switch has to do with Word for the Mac.

    In the comments, Alicia asked about Windows...

    I tried to make this a comment on his blog but the server wouldn't take it.

     

    The answer is pretty simple: there are parts of Windows that are endian-neutral (for instance, the CIFS protocol handlers and all of the DCE RPC protocol), but the vast majority of Windows is little-endian.

    A decision was made VERY long ago that Windows would not be ported to a big-endian processor.  And as far as I can see, that's going to continue.  Since almost all the new processors coming out are either little-endian, or swing both ways (this is true of all the RISC machines Windows has supported, for example), this isn't really a big deal.
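    For what it's worth, the usual way code stays endian-neutral is by never reinterpreting raw memory as multi-byte integers - it serializes byte by byte. A small sketch of the idea:

        #include <stdint.h>

        // Write and read a 32-bit value in little-endian wire order, regardless
        // of the host CPU's byte order. Code written this way behaves identically
        // on little-endian and big-endian machines.
        void put_u32_le(uint8_t* buf, uint32_t value)
        {
            buf[0] = (uint8_t)(value & 0xFF);
            buf[1] = (uint8_t)((value >> 8) & 0xFF);
            buf[2] = (uint8_t)((value >> 16) & 0xFF);
            buf[3] = (uint8_t)((value >> 24) & 0xFF);
        }

        uint32_t get_u32_le(const uint8_t* buf)
        {
            return (uint32_t)buf[0] |
                   ((uint32_t)buf[1] << 8) |
                   ((uint32_t)buf[2] << 16) |
                   ((uint32_t)buf[3] << 24);
        }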

  • Larry Osterman's WebLog

    Compressible Encryption

    • 23 Comments

    Time to spread a smidge of dirt on Microsoft :).

    One of my favorite dialog boxes is found in Outlook.  If you dig deep enough into your email accounts, you'll find the following dialog box:

      Outlook Offline File Settings Dialog

    The reason I like this dialog box is the default setting "Compressible Encryption".  Why?  Because if you select it, you're not encrypting ANYTHING.  "Compressible Encryption" is really "compressed".  When this option is selected, the data in the specified OST file is compressed (I'm not sure of the algorithm).

    Calling a compressed OST file "encrypted" is sort of like saying that a ZIP file is an encrypted version of the file.  After all, if you look at the contents of the ZIP file, you won't find any information that directly represents the original file (ok, the filenames might be in the archive uncompressed but that's about it).  But of course it's not encrypted.

    If you specify "High Encryption" then you get a truly encrypted OST file.  I'm not sure of the algorithms they use, but it IS really encrypted.

    So why on earth do they call it compressible encryption?  Well, I'm not 100% sure, but I suspect that the answer is that some executive decided to type their PST file (or OST file) and noticed that their email was found in clear text within the file.

    They also noticed that if they used compression on the PST file, then they weren't able to see the contents of the file.  So they equated compression with encryption (hey, they couldn't see the data, could they?).  And thus "compressible encryption" was born.

    It's really just a silly affectation - they should never have called it "encryption" because someone might think that the data's actually hidden, but...  If the dialog were being designed today (the actual dialog is over 10 years old), the term "encryption" would never be used; nowadays it's sort of a historical oddity.

    If you do a search for "compressible encryption", the first Google and MSN search hit is Leo Notenboom's article on compressible encryption; here's the official KB article on compressible encryption.

    There are other examples where similar obfuscation has occurred, and I'm sure that other vendors have done similar things (or worse).  For example, the Exchange MAPI client-to-Exchange Server protocol is obfuscated because an executive noticed that if he took a network sniff of the traffic between his client and the Exchange server, he could see his email messages going by.  So he asked the team to obfuscate the stream - we knew that it did nothing, and so did the executive, but as he pointed out, it's enough to protect from casual attackers.  If you really want encrypted communications, specify the "Encrypt data between Microsoft Office Outlook and Microsoft Exchange Server" option in the Security tab of the Microsoft Exchange Server dialog; that sets RPC_C_AUTHN_LEVEL_PKT_PRIVACY, which uses an encryption mechanism to protect the data (I believe it's DES-56 but I'm not 100% sure).  I believe that this option is the default in all current versions of Outlook, but I'm not 100% sure.
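    For the curious, on the RPC client side requesting packet privacy boils down to a call shaped like the sketch below. The principal name and authentication service here are placeholders - Outlook and Exchange set this up internally, and I'm not describing their exact code:

        #include <windows.h>
        #include <rpc.h>

        // Sketch: ask RPC to encrypt (and integrity-protect) every packet on
        // this binding. The SPN below is a made-up placeholder.
        RPC_STATUS RequestPacketPrivacy(RPC_BINDING_HANDLE binding)
        {
            return RpcBindingSetAuthInfoW(
                binding,
                (RPC_WSTR)L"exchange/server.example.com",  // placeholder server principal name
                RPC_C_AUTHN_LEVEL_PKT_PRIVACY,             // encrypt every packet
                RPC_C_AUTHN_GSS_NEGOTIATE,                 // negotiate Kerberos/NTLM
                NULL,                                      // use the caller's credentials
                RPC_C_AUTHZ_NONE);
        }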

  • Larry Osterman's WebLog

    Laptops and Kittens....

    • 38 Comments
    I mentioned the other day that we have four cats currently.  Three of them are 18 month old kittens (ok, at 18 months, they're not kittens anymore, but we still refer to them as "the kittens").

    A while ago, one of them (Aphus, we believe) discovered that if they batted at Valorie's laptop, they could remove the keys from the laptop, and the laptop keys made great "chase" toys.  Valorie has taken to locking her laptop up in a tiny computer nook upstairs as a result, but even with that, they somehow made off with her "L" key.  We've not been able to find it even after six months of looking.  To get her computer up and running, we replaced the "L" key with the "windows" key. Fortunately she's a touch typist, and thus never looks at her keyboard - when she does, she freaks out.

    Last night, I left a build running on my laptop when I went to bed.  Valorie mentioned that it would probably be a bad idea to do this, since the kittens were on the loose.

    Since I couldn't close the laptop without shutting down the build, I hit on what I thought was a great solution.  I put the laptop in two plastic bags, one on each side of the laptop (sorry about the mess on the table :)):

    I went to bed confident that I'd outsmarted the kittens.  My laptop would remain safe.

    Well, this morning, I got up, and went downstairs (you can see Sharron's breakfast cereal on the table to the top right).  I asked the kids if there had been any problems, and Daniel, with his almost-teenager attitude said "Yeah, the kittens scattered the keys on your laptop all over the kitchen".

    I  figured he was just twitting me, until I went to check on the computer...

    Oh crud...

    There were the keys, sitting in a pile where Sharron had collected them...

    I love my cats, I really do...

    The good news is that I managed to find all the keys, although I was worried about the F8 key for a while.

  • Larry Osterman's WebLog

    Why I removed the MSN desktop search bar from IE

    • 16 Comments

    I was really quite excited to see that the MSN Desktop Search Team had finally released the final version of their MSN Desktop Search toolbar.

    I've been using it for quite a while, and I've been really happy with it (except for the minor issue that the index takes up 220M of virtual memory, but that's just VA - the working set of the index is quite reasonable).

    So I immediately downloaded it and enabled the toolbar on IE.

    As often happens with toolbars, the toolbar was in the wrong place.  No big deal, I unlocked the toolbar and repositioned it to where I want it (immediately to the right of the button bar, where it takes up less real-estate).

    Then I locked the toolbar.  And watched as the MSN desktop search toolbar repositioned itself back where it was originally.

    I spent about 10 minutes trying to figure out a way of moving the desktop search bar next to the button bar, to no success.  By positioning it in the menu bar, I was able to get it to move into the button bar when I locked the toolbar, but it insisted on being positioned to the left of the button bar, not the right.

    Eventually I gave up.  I'm not willing to give up 1/4 inch of screen real-estate to an IE toolbar - it doesn't give me enough value to justify the real-estate hit.

    Sorry guys.  I'm still using the desktop search stuff (it's very, very cool), including the taskbar toolbar, but not the IE toolbar.  I hate it when my toolbars have a mind of their own.

    Update: Someone on the CLR team passed on a tip: The problem I was having is because I run as a limited user.  But it turns out that if you exit IE and restart it, the toolbar sticks where you put it!

    So the toolbar's back on my browser.

  • Larry Osterman's WebLog

    Building a flicker free volume control

    • 30 Comments

    When we shipped Windows Vista, one of the really annoying UI problems with the volume control was that whenever you resized it, it would flicker.

    To be more specific, the right side of the control would flicker – the rest didn’t flicker (which was rather strange).

     

    Between the Win7 PDC release (what we called M3 internally) and the Win7 Beta, I decided to bite the bullet and see if I could fix the flicker.  It seemed like I tried everything to make the flickering go away, but I wasn’t able to do it until I ran into the WM_PRINTCLIENT message, which allowed me to direct all of the internal controls on the window to paint themselves.

    Basically on a paint call, I’d take the paint DC and send a WM_PRINTCLIENT message to each of the controls in sndvol asking them each to paint themselves to the new DC.  This worked almost perfectly – I was finally able to build a flicker free version of the UI.  The UI wasn’t perfect (for instance the animations that faded in the “flat buttons” didn’t fire) but the UI worked just fine and looked great so I was happy that I’d finally nailed the problem.  That happiness lasted until I got a bug report in that I simply couldn’t figure out.  It seems that if you launched the volume mixer, set the focus to another application then selected the volume mixer’s title bar and moved the mixer, there were a ton of drawing artifacts left on the screen.

    I dug into it a bunch and was stumped.  It appeared that the clipping rectangle sent in the WM_PAINT message to the top level window didn’t include the entire window, thus portions of the window weren’t erased.  I worked on this for a couple of days trying to figure out what was going wrong and I finally asked for help on one of our internal mailing lists.

    The first response I got was that I shouldn’t use WM_PRINTCLIENT because it was going to cause me difficulty.  I’d already come to that conclusion – by trying to control every aspect of the drawing experience for my app, I was essentially working against the window manager – that’s why the repaint problem was happening.  By calling WM_PRINTCLIENT I was essentially putting a band-aid on the real problem but I hadn’t solved the real problem, all I’d done is to hide it.

     

    So I had to go back to the drawing board.  Eventually (with the help of one of the developers on the User team) I finally tracked down the original root cause of the problem and it turns out that the root cause was somewhere totally unexpected.

    Consider the volume UI:

    [Screenshot: the volume mixer UI]

    The UI is composed of two major areas: The “Devices” group and the “Applications” group.  There’s a group box control wrapped around the two areas.

    Now let's look at the group box control.  For reasons that are buried deep in the early history of Windows, a group box is actually a form of the “button” control.  If you look at the window styles for a button in SpyXX, you’ll see:

    [Screenshot: the window class styles for the button class, as shown in Spy++]

     

    Notice the CS_VREDRAW and CS_HREDRAW window class styles.  The MSDN documentation for class styles says:

    CS_HREDRAW - Redraws the entire window if a movement or size adjustment changes the width of the client area.
    CS_VREDRAW - Redraws the entire window if a movement or size adjustment changes the height of the client area.

    In other words, every window whose class has the CS_HREDRAW or CS_VREDRAW style will always be fully repainted whenever it is resized (including all the controls inside the window).  And ALL buttons have these styles.  That means that whenever you resize any buttons, they’re going to flicker, and so will all of the content that lives below the button.  For most buttons this isn’t a big deal but for group boxes it can be a big issue because group boxes contain other controls.
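    If you'd rather confirm this programmatically than in Spy++, a quick sketch like the following will tell you whether a given window's class forces a full repaint on resize:

        #include <windows.h>

        // Returns true if the window's class has CS_HREDRAW or CS_VREDRAW set,
        // i.e. the whole window will be repainted whenever it's resized.
        bool ClassRedrawsOnResize(HWND hwnd)
        {
            ULONG_PTR classStyle = GetClassLongPtr(hwnd, GCL_STYLE);
            return (classStyle & (CS_HREDRAW | CS_VREDRAW)) != 0;
        }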

    In the case of sndvol, when you resize the volume control, we resize the applications group box (because it’s visually pinned to the right side of the dialog).  Which causes the group box and all of its contained controls to repaint and thus flicker like crazy.  The only way to fix this is to remove the CS_HREDRAW and CS_VREDRAW styles from the window class for the control.

    The good news is that once I’d identified the root cause, the solution to my problem was relatively simple.  I needed to build my own custom version of the group box which handled its own painting and didn’t have the CS_HREDRAW and CS_VREDRAW class styles.  Fortunately it’s really easy to draw a group box – if themes are enabled, a group box can be drawn with the DrawThemeBackground API with the BP_GROUPBOX part, and if theming is disabled, you can use the DrawEdge API to draw the group box.  Once I added the new control and dealt with a number of other clean-up issues (making sure that the right portions of the window were invalidated when the window was resized, making sure that my top level window had the WS_CLIPCHILDREN style, and making sure that each of the sub windows had the WS_CLIPSIBLINGS style), I had a version of sndvol that was flicker free AND that let the window manager handle all the drawing complexity.  There are still some minor visual gotchas in the UI (for example, if you resize the window using the left edge, the right side of the group box “shudders” a bit – this is apparently an artifact that’s outside my control; other apps have similar issues when resized on the left edge) but they’re acceptable.
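    To give a flavor of what the replacement control's painting looks like, here's a minimal sketch of a WM_PAINT handler for a home-grown group box whose window class was registered without CS_HREDRAW/CS_VREDRAW. This is my illustration of the approach described above, not the actual sndvol code; label drawing and error handling are omitted:

        #include <windows.h>
        #include <uxtheme.h>   // link with uxtheme.lib
        #include <vsstyle.h>   // BP_GROUPBOX / GBS_NORMAL part and state IDs

        void PaintGroupBox(HWND hwnd)
        {
            PAINTSTRUCT ps;
            HDC hdc = BeginPaint(hwnd, &ps);

            RECT rc;
            GetClientRect(hwnd, &rc);

            HTHEME theme = OpenThemeData(hwnd, L"BUTTON");
            if (theme != NULL)
            {
                // Themed: draw the group box frame using the visual styles API.
                DrawThemeBackground(theme, hdc, BP_GROUPBOX, GBS_NORMAL, &rc, NULL);
                CloseThemeData(theme);
            }
            else
            {
                // Classic: fall back to the etched rectangle.
                DrawEdge(hdc, &rc, EDGE_ETCHED, BF_RECT);
            }

            EndPaint(hwnd, &ps);
        }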

    As an added bonus, now that I was no longer painting everything manually, the fade-in animations on the flat buttons started working again!

     

    PS: While I was writing this post, I ran into this tutorial on building flicker free applications.  I wish I’d run into it while I was trying to deal with the flickering problem, because it nicely lays out how to solve the problem.

  • Larry Osterman's WebLog

    UUIDs are only unique if you generate them...

    • 28 Comments

    We had an internal discussion recently and the upshot of the discussion was that it turns out that some distributed component on the web appears to have used the UUID of a sample COM component.

    Sigh.

    I wonder sometimes why people do this.  It's not like it's hard to run uuidgen and then copy the relevant GUIDs to your RGS file (and/or IDL file, or however it is you're defining and registering your class).

    I guess the developers of the distributed component figured that they didn't have to follow the rules because everyone else was going to follow them.

    And, no, I don't know what component it was, or why they decided to copy the sample.

    So here's a good rule of thumb.  When you're designing a COM component, you should probably use UUIDGEN (or UuidCreate()) to generate unique (and separate) GUIDs for the Interface ID, Class ID, Library ID, and App ID.
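    Generating a fresh GUID programmatically is only a couple of lines; here's a minimal sketch (link with rpcrt4.lib):

        #include <windows.h>
        #include <rpc.h>
        #include <stdio.h>

        int main(void)
        {
            // Generate a brand new GUID - never copy one out of sample code.
            UUID uuid;
            if (UuidCreate(&uuid) != RPC_S_OK)
                return 1;

            RPC_CSTR str = NULL;
            if (UuidToStringA(&uuid, &str) == RPC_S_OK)
            {
                printf("%s\n", (const char*)str);
                RpcStringFreeA(&str);
            }
            return 0;
        }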

     

  • Larry Osterman's WebLog

    Tipping Points

    • 84 Comments

    One of my birthday presents was the book "The Tipping Point" by Malcolm Gladwell.

    In it, he talks about how epidemics and other flash occurrences happen - situations that are stable, where a small thing changes and suddenly the world changes overnight.

    I've been thinking a lot about yesterday's blog post, and I realized that not only is it a story about one of the coolest developers I've ever met, it also describes a tipping point for the entire computer industry.

    Sometimes, it's fun to play the "what if" game, so...

    What if David Weise hadn't gotten Windows applications running in protected mode?  Now, keep in mind, this is just my rampant speculation, not what would have happened.  Think of it kinda like the Marvel Comics "What if..." series (What would have happened if Spiderman had rescued Gwen Stacy, etc [note: the deep link may not work, you may have to navigate directly]).

    "What If David Weise hadn't gotten Windows applications running in protected mode..."[1]

    Well, if Windows 3.0 hadn't had Windows apps running in protected mode, then it likely would not have been successful.  That means that instead of revitalizing Microsoft's interest in the MS-DOS series of operating systems, Microsoft would have continued working on OS/2.  Even though working under the JDA was painful for both Microsoft and IBM, it was the best game in town.

    By 1993, Microsoft and IBM would have debuted OS/2 2.0, which would have supported 32-bit applications and had MVDM support built in.

    Somewhere over the next couple of years, the Windows NT kernel would have come out as the bigger, more secure brother of OS/2, and it would have kept the Workplace Shell that IBM wrote (instead of the Windows 3.1 Task Manager).

    Windows 95 would never have existed, since the MS-DOS line would have withered and died off.  Instead, OS/2 would be the 32-bit operating system for lower end machines.  And instead of Microsoft driving the UI story for the platform, IBM would have owned it.

    By 2001, most PC class machines would have OS/2 running on them (probably OS/2 2.5) with multimedia support.  NT OS/2 would also be available for business and office class machines.  With IBM's guidance, instead of the PCI bus becoming dominant, the MCA was the dominant bus form factor.  The nickname for the PC architecture wasn't "Wintel", instead it was "Intos" (OS2tel was just too awkward to say).  IBM, Microsoft and Intel all worked to drive the hardware platform, and, since IBM was the biggest vendor of PC class hardware, they had a lot to say in the decisions.

    And interestingly enough, when IBM came to the realization that they could make more money selling consulting services than selling hardware, instead of moving to Linux, they stuck with OS/2 - they had a significant ownership stake in the platform, and they'd be pushing it as hard as they could.

    From Microsoft's perspective, the big change would be that instead of Microsoft driving the industry, IBM (as Microsoft's largest OEM, and development partner in OS/2) would be the driving force (at least as far as consumers were concerned).  UI decisions would be made by IBM's engineers, not Microsoft's.

    In my mind, the biggest effect of such a change would be on Linux.  Deprived of the sponsorship of a major enterprise vendor (the other enterprise players followed IBM's lead and went with OS/2), Linux remained primarily an 'interesting' alternative to Solaris, AIX, and the other *nix based operating systems sold by hardware vendors.  Instead, AIX and Solaris became the major players in the *nix OS space, and flourished as an alternative.

     

    Anyway, it's all just silly speculation, about what might have happened if the industry hadn't tipped, so take it all with a healthy pinch of salt.

    [1] I'm assuming that all other aspects of the industry remain the same: The internet tidal wave hit in the mid 90s, computers got faster at the same rate they always had, etc. - this may not be a valid set of assumptions, but it's my fantasy.  I'm also not touching on what effects the DoJ would have had on the situation.

  • Larry Osterman's WebLog

    AARDvarks in your code.

    • 29 Comments

    If there was ever a question that I’m a glutton for punishment, this post should prove it.

    We were having an email discussion the other day, and someone asked:

    Isn't there a similar story about how DOS would crash when used with [some non-MS thing] and only worked with [some MS thing]? I don't remember what the "thing" was though =)

    Well, the only case I could think of where that was the case was the old AARD code in Windows.  Andrew Schulman wrote a great article on it back in the early 1990’s, which dissected the code pretty thoroughly.

    The AARD code in Windows was code to detect when Windows was running on a cloned version of MS-DOS, and to disable Windows on that cloned operating system.  By the time that Windows 3.1 shipped, it had been pulled from Windows, but the vestiges of the code were left behind.  As Andrew points out, the code was obfuscated, and had debugger-hiding logic, but it could be reverse engineered, and Andrew did a great job of doing it.

    I can’t speak as to why the AARD code was obfuscated, I have no explanation for that, it seems totally stupid to me.  But I’ve got to say that I totally agree with the basic concept of Windows checking for an alternative version of MS-DOS and refusing to run on it.

    The thing is that the Windows team had a problem to solve, and they didn’t care how they solved it.  Windows decided that it owned every part of the system, including the internal data structures of the operating system.  It knew where those structures were located, it knew what the size of those data structures was, and it had no compunction against replacing those internal structures with its own version.  Needless to say, from a DOS developer’s standpoint, keeping Windows working was an absolute nightmare.

    As a simple example, when Windows started up, it increased the size of MS-DOS’s internal file table (the SFT, that’s the table that was created by the FILES= line in config.sys).  It did that to allow more than 20 files to be opened on the windows system (a highly desirable goal for a multi-tasking operating system).  But it did that by using an undocumented API call, which returned a pointer to a set of “interesting” pointers in MS-DOS. It then indexed a known offset relative to that pointer, and replaced the value of the master SFT table with its own version of the SFT.  When I was working on MS-DOS 4.0, we needed to support Windows.  Well, it was relatively easy to guarantee that our SFT was at the location that Windows was expecting.  But the problem was that the MS-DOS 4.0 SFT was 2 bytes larger than the MS-DOS 3.1 SFT.   In order to get Windows to work, I had to change the DOS loader to detect when win.com was being loaded, and if it was being loaded, I looked at the code at an offset relative to the base code segment, and if it was a “MOV” instruction, and the amount being moved was the old size of the SFT, I patched the instruction in memory to reflect the new size of the SFT!  Yup, MS-DOS 4.0 patched the running windows binary to make sure Windows would still continue to work.

    Now then, considering how sleazy Windows was about MS-DOS, think about what would happen if Windows ran on a clone of MS-DOS.  It’s already groveling internal MS-DOS data structures.  It’s making assumptions about how our internal functions work, when it’s safe to call them (and which ones are reentrant and which are not).  It’s assuming all SORTS of things about the way that MS-DOS’s code works.

    And now we’re going to run it on a clone operating system.  Which is different code.  It’s a totally unrelated code base.

    If the clone operating system isn’t a PERFECT clone of MS-DOS (not a good clone, a perfect clone), then Windows is going to fail in mysterious and magical ways.  Your app might lose data.  Windows might corrupt the hard disk.   

    Given the degree with which Windows performed extreme brain surgery on the innards of MS-DOS, it’s not unreasonable for Windows to check that it was operating on the correct patient.

     

    Edit: Given that most people aren't going to click on the link to the Schulman article, it makes sense to describe what the AARD check was :)

    Edit: Fixed typo, thanks KC

  • Larry Osterman's WebLog

    How do you know what a particular error code means?

    • 16 Comments

    So you're debugging your program, and all of a sudden you get this weird error code - say error 0x00000011.  How do you know what that message means?

    Well, one way is to memorize the entire Win32 error return code set, but that's got some issues.

    Another way, if you have the right debugger extension, is to use the !error extension - it will return the error text associated with the error.  There's a similar trick for dev studio (although I'm not sure what it is since I don't use the devstudio debugger).

    But sometimes you're not running under windbg or devstudio and you've got a Win32 error code to look up.

    And here's where the clever trick comes in.  You see, there's a complete list of error codes built into the system.  It's buried in the NET.EXE command that's used for network administration.

    If you type "NET HELPMSG <errorno>" on the command line, you'll get a human readable version of the error code. 

    So:

    C:\>net helpmsg 17
    The system cannot move the file to a different disk drive.

    It's a silly little trick, but I've found it extraordinarily useful.
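    And if you ever need the same lookup from inside your own code, the FormatMessage API will hand you the message text for a Win32 error code. A quick sketch (the buffer size is arbitrary):

        #include <windows.h>
        #include <stdio.h>

        // Print the system message text for a Win32 error code - the same text
        // that NET HELPMSG displays.
        void PrintErrorText(DWORD error)
        {
            char buffer[512] = "(unknown error)\n";
            FormatMessageA(FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
                           NULL, error, 0, buffer, sizeof(buffer), NULL);
            printf("%lu: %s", error, buffer);
        }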

     

  • Larry Osterman's WebLog

    Hey, why am I leaking all my BSTR's?

    • 12 Comments

    IMHO, every developer should have a recent copy of the Debugging Tools for Windows package installed on their machine (it's updated regularly, so check to see if there's a newer version).

    One of the most useful leak tracking tools around is included in this package: UMDH.  UMDH allows you to take a snapshot of the heaps in a process and perform a diff of the heap over time - basically you run it once to take a snapshot, then run it a second time after running a particular test, and it allows you to compare the differences in the heaps.
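    The mechanics are simple (the image name and PID below are placeholders, and you'll want your symbol path set up so the call stacks resolve):

        rem Enable user-mode stack trace collection for the process you're chasing
        gflags /i myapp.exe +ust

        rem Snapshot the heaps before and after the scenario you suspect is leaking
        umdh -p:1234 -f:before.log
        rem ... run the leaky scenario ...
        umdh -p:1234 -f:after.log

        rem Diff the snapshots; allocations that grew show up with their call stacks
        umdh before.log after.log > leakdiff.txt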

    This tool can be unbelievably useful when debugging services, especially shared services.  The nice thing about it is that it provides a snapshot of the heap usage, there are often times when that's the only way to determine the cause of a memory leak.

    As a simple example of this, the Exchange 5.5 IMAP server cached user logons.  It did this for performance reasons: it could take up to five seconds for a call to LogonUser to complete, and that affected our ability to service large numbers of clients - all of the server threads ended up being blocked waiting on the domain controllers to respond.  So we put in a logon cache.  The cache took the user's credentials, performed a LogonUser with those credentials, and put the results into a heap.  On subsequent logons, the cache took the user's credentials, looked them up in the heap, and if they were found, it just reused the token from the cache (and no, it didn't do the lookup in clear text, I'm not that stupid).  Unfortunately, when I first wrote the cache implementation, I had an uninitialized variable in the hash function used to look up the user in the cache, and as a result, every user logon occupied a different slot in the hash table.  So when run over time, I had a hideous memory leak (hundreds of megabytes of VM).  But, since the cache was purged on exit, the built-in leak tracking logic in the Exchange store didn't detect any memory leaks.

    We didn't have UMDH at the time, but UMDH would have been a perfect solution to the problem.

    I recently went on a tear trying to find memory leaks in some of the new functionality we've added to the Windows Audio Service, and used UMDH to try to catch them.

    I found a bunch of the leaks, and fixed them, but one of the leaks I just couldn't figure out showed up every time we allocated a BSTR object.

    It drove me up the wall trying to figure out how we were leaking BSTR objects, nothing I did found the silly things.  A bunch of the leaks were in objects allocated with CComBSTR, which really surprised me, since I couldn't see how on earth they would leak memory.

    And then someone pointed me to this KB article (KB139071).  KB139071 describes the OLE caching of BSTR objects.  It also turns out that this behavior is described right on the MSDN page for the string manipulation functions, proving once again that I should have looked at the documentation :).

    Basically, OLE caches all BSTR objects allocated in a process to allow it to pool together strings.  As a result, these strings are effectively leaked "on purpose".  The KB article indicates that the cache is cleared when OLEAUT32.DLL's DLL_PROCESS_DETACH logic is run, which is good to know, but didn't help me to debug my BSTR leak - I could still be leaking BSTRs.

    Fortunately, there's a way of disabling the BSTR caching, simply set the OANOCACHE environment variable to 1 before launching your application.  If your application is a service, then you need to set OANOCACHE as a system environment variable (the bottom set of environment variables) and reboot.
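    For an ordinary (non-service) application, that's just the following from a command prompt (myapp.exe is a stand-in for your binary):

        C:\> set OANOCACHE=1
        C:\> myapp.exe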

    I did this and all of my memory leaks mysteriously vanished.  And there was much rejoicing.

     

  • Larry Osterman's WebLog

    What's wrong with this code, part lucky 13

    • 35 Comments
    Today's example is a smidge long, I've stripped out everything I can possibly imagine stripping out to reduce size.

    This is a very real world example that we recently hit - only the names have been changed to protect the innocent.

    I've used the built-in C++ decorations for interfaces, but that was just to get this stuff to compile in a single source file, it's not related to the bug.

    extern CLSID CLSID_FooDerived;
    [
        object,
        uuid("0A0DDEDC-C422-4BB3-9869-4FED020B66C5"),
    ]
    __interface IFooBase : IUnknown
    {
        HRESULT FooBase();
    };

    class CFooBase: public IFooBase
    {
        LONG _refCount;
        virtual ~CFooBase()
        {
            ASSERT(_refCount == 0);
        };
    public:
        CFooBase() : _refCount(1) {};
        virtual HRESULT STDMETHODCALLTYPE QueryInterface(const IID& iid, void** ppUnk)
        {
            HRESULT hr=S_OK;
            *ppUnk = NULL;
            if (iid == IID_FooBase)
            {
                AddRef();
                *ppUnk = reinterpret_cast<void *>(this);
            }
            else if (iid == IID_IUnknown)
            {
                AddRef();
                *ppUnk = reinterpret_cast<void *>(this);
            }
            else
            {
                hr = E_NOINTERFACE;
            }
            return hr;
        }
        virtual ULONG STDMETHODCALLTYPE AddRef(void)
        {
            return InterlockedIncrement(&_refCount);
        }
        virtual ULONG STDMETHODCALLTYPE Release(void)
        {
            LONG refCount;
            refCount = InterlockedDecrement(&_refCount);
            if (refCount == 0)
            {
                delete this;
            }
            return refCount;

        }
        STDMETHOD(FooBase)(void);
    };
    class ATL_NO_VTABLE CFooDerived :
        public CComObjectRootEx<CComMultiThreadModel>,
        public CComCoClass<CFooDerived, &CLSID_FooDerived>,
        public CFooBase
    {
        virtual ~CFooDerived();
        public:
        CFooDerived();
        DECLARE_NO_REGISTRY()
        BEGIN_COM_MAP(CFooDerived)
            COM_INTERFACE_ENTRY(IFooBase)
        END_COM_MAP()
        DECLARE_PROTECT_FINAL_CONSTRUCT()

    };

    OBJECT_ENTRY_AUTO(CLSID_FooDerived, CFooDerived)

     

    As always, tomorrow I'll post the answers along with kudos and mea culpas.

    Edit: Fixed missing return value in Release() - without it it doesn't compile.  Also added the addrefs - my stupid mistake.  mirobin gets major props for those ones.

  • Larry Osterman's WebLog

    Where's Larry been?

    • 10 Comments

    Ok, 5+ weeks and no posts, what gives?

     

    Basically, after Vista shipped, I took a much needed vacation.  I was looking at losing 2 weeks of vacation at the end of the year and I decided to spend some of the time I would have carried over and just take off until the new year. 

    It's funny.  I used to ask my friends who retired early what they did with all that time, and they invariably answered: "It's not a problem at all - there's tons of stuff to do".  Ya know, they were right - there IS tons of stuff to do.

    I mostly spent my time off wrapping Christmas presents (my side of the family's Christmas was 20 people this year so there are a lot of presents to wrap (it'll be 24 people next year since my brother's family joins us on odd years), adding in Valorie's family and friends here in Seattle means we do presents for something like 40 people).  I also spent time schlepping the kids to various events and shopping and generally just puttering around.  It was really quite relaxing (and quite the change to how my life was before we shipped).

     

    And then we had the windstorm.  On the night of the storm, Valorie and I sat on our front porch and watched the trees across the street bend back and forth.  It was truly freaky - there was essentially no wind at street level, while the trees across the street were bending at least 20 degrees.  I was convinced they were going to snap.

    We lost power at about 6PM on Thursday the 14th.  What we didn't realize until later was that we lived at essentially ground zero for the windstorm powerline damage - while there was no damage at all in our neighborhood, the same couldn't be said about the power lines around our house.  On Friday morning, we left the house to get supplies (always after the fact) and drove to Woodinville-Duvall road which was simply devastated - trees across the road held up only by power lines, power poles snapped in half, traffic and street lights lying twisted in the middle of the road.

    While power was out, we never bothered with the shelter thing, we just ate out at restaurants and drove around taking care of last minute things (silly stuff like dealing with the load of laundry that was running in the washing machine when power went out).  Fortunately power came back at Microsoft on Saturday afternoon, so we were able to go and get showered in the locker rooms (thanks Lisa for the towels).  On Sunday night there were a half a dozen families scattered around my building who had decided to camp out for the duration.

    We left for the east coast on Monday the 18th still without power, the power didn't come back at home until sometime on the 22nd (I called into home daily and when the answering machine picked up, I knew we had power). 

     

    Christmas itself was very fun, it was great seeing everyone in our family again - we spent the first week with my mom in NYC (we saw Spamalot and the Big Apple Circus (a holiday tradition, this year's show was particularly impressive)).

    We next drove up to Boston and celebrated Christmas with the rest of the family, which was again great - lots of great food, conversations, etc.  Gift highlights for Christmas were: Sharron's getting her ears pierced, and Daniel receiving the most remarkable pair of Chuck Taylor Converse All Stars I've ever seen (I didn't even know they MADE shoes in metallic gold).  Here's a link to the shoes.

    And of course getting to see all the "littles" again was really cool - my cousins have 5 kids all under 6 between them so there are tons of really cute kids running around all the time. 

    We then returned to Albany for a couple of days with my immediate family - always a lot of fun, including a birthday dinner hosted by my sister Edie (we had chicken puff (an old family recipe ) and a Carvel cake (we don't get them on the West Coast)) :).

     

    And after all that, we then returned to Seattle and started dealing with the aftermath of the storm - everything in our freezers is toast, so we're restocking the house. 

    Oh, and now that I'm back at work, I'm dealing with six weeks of email backlog.  That's also "exciting", but in a different way.

    Coming up?  Well, Daniel's been cast as Demetrius in the Village Theatre's Kidstage production of "A Midsummer Night's Dream", so he's going to be in rehearsals essentially non-stop for the next 5 weeks, then he is in the ensemble for the Overlake School's production of "The Robber Bridegroom".  So we're gonna be doing a ton of schlepping :).

  • Larry Osterman's WebLog

    Some final thoughts on Threat Modeling...

    • 16 Comments

    I want to wrap up the threat modeling posts with a summary and some comments on the entire process.  Yeah, I know I should have done this last week, but I got distracted :). 

    First, a summary of the threat modeling posts:

    Part 1: Threat Modeling, Once again.  In which our narrator introduces the idea of a threat model diagram

    Part 2: Threat Modeling Again. Drawing the Diagram.  In which our narrator introduces the diagram for the PlaySound API

    Part 3: Threat Modeling Again, Stride.  Introducing the various STRIDE categories.

    Part 4: Threat Modeling Again, Stride Mitigations.  Discussing various mitigations for the STRIDE categories.

    Part 5: Threat Modeling Again, What does STRIDE have to do with threat modeling?  The relationship between STRIDE and diagram elements.

    Part 6: Threat Modeling Again, STRIDE per Element.  In which the concept of STRIDE/Element is discussed.

    Part 7: Threat Modeling Again, Threat Modeling PlaySound.  Which enumerates the threats against the PlaySound API.

    Part 8: Threat Modeling Again, Analyzing the threats to PlaySound.  In which the threat modeling analysis work against the threats to PlaySound is performed.

    Part 9: Threat Modeling Again, Pulling the threat model together.  Which describes the narrative structure of a threat model.

    Part 10: Threat Modeling Again, Presenting the PlaySound threat model.  Which doesn't need a pithy summary, because the title describes what it is.

    Part 11: Threat Modeling Again, Threat Modeling in Practice.  Presenting the threat model diagrams for a real-world security problem.[1]

    Part 12: Threat Modeling Again, Threat Modeling and the firefoxurl issue. Analyzing the real-world problem from the standpoint of threat modeling.

    Part 13: Threat Modeling Again, Threat Modeling Rules of Thumb.  A document with some useful rules of thumb to consider when threat modeling.

     

    Remember that threat modeling is an analysis tool. You threat model to identify threats to your component, which then lets you know where you need to concentrate your resources.  Maybe you need to encrypt a particular data channel to protect it from snooping.  Maybe you need to change the ACLs on a data store to ensure that an attacker can't modify the contents of the store.  Maybe you just need to carefully validate the contents of the store before you read it.  The threat modeling process tells you where to look and gives you suggestions about what to look for, but it doesn't solve the problem.  It might be that the only thing that comes out from your threat modeling process is a document that says "We don't care about any of the threats to this component".  That's ok, at a minimum, it means that you considered the threats and decided that they were acceptable.

    The threat modeling process is also a living process. I'm 100% certain that 2 years from now, we're going to be doing threat modeling differently from the way that we do it today.  Experience has shown that every time we apply threat modeling to a product, we realize new things about the process of performing threat modeling, and find new, more efficient ways of going about the process.   Even now, the various teams involved with threat modeling in my division have proposed new changes to the process based on the experiences of our current round of threat modeling.  Some of them will be adopted as best practices across Microsoft, some of them will be dropped on the floor.

     

    What I've described over these posts is the process of threat modeling as it's done today in the Windows division at Microsoft.  Other divisions use threat modeling differently - the threat landscape for Windows is different from the threat landscape for SQL Server and Exchange, which is different from the threat landscape for the various Live products, and it's entirely different for our internal IT processes.  All of these groups use threat modeling, and they use the core mechanisms in similar ways, but because each group that does threat modeling has different threats and different risks, the process plays out differently for each team.

    If your team decides to adopt threat modeling, you need to consider how it applies to your components and adopt the process accordingly.  Threat Modeling is absolutely not a one-size-fits-all process, but it IS an invaluable tool.

     

    EDIT TO ADD: Adam Shostack on the Threat Modeling Team at Microsoft pointed out that the threat modeling team has a developer position open.  You can find more information about the position by going here:  http://members.microsoft.com/careers/search/default.aspx and searching for job #207443.

    [1] Someone posting a comment on Bruce Schneier's blog took me to task for using a browser vulnerability.  I chose that particular vulnerability because it was the first that came to mind.  I could have just as easily picked the DMG loading logic in OSX or the .ANI file code in Windows for examples (actually the DMG file issues are in several ways far more interesting than the firefoxurl issue - the .ANI file issue is actually relatively boring from a threat modeling standpoint).

  • Larry Osterman's WebLog

    COM registration if you need a typelib

    • 8 Comments
    The problem with the previous examples I posted on minimal COM object registration is that they don't always work.  As I mentioned, if you follow the rules specified, while your COM object will work just fine from Win32 applications, you'll have problems if you attempt to access it from a managed environment (either an app running under the CLR or another managed environment such as the VB6 runtime or the scripting host).

    For those environments, you need to have a typelib.  Since typelibs were designed primarily for interoperating with Visual Basic, they don't provide full access to the functionality that's available via MIDL (for instance, unnamed unions get turned into named unions, the MIDL boolean type isn't supported, etc), but if you gotta interoperate, you gotta interoperate.

    So you've followed the examples listed here and you've registered your COM object, now how do you hook it up to the system?

    First, you could call the RegisterTypeLib function, which will perform the registration, but that would be cheating :)  More importantly, there are lots of situations where it's inappropriate to use RegisterTypeLib - for instance, if you're building an app that needs to be installed, you need to enumerate all the registry manipulations done by your application so they can be undone.

    So if you want to register a typelib, it's a smidge more complicated than registering a COM component or interface.

    To register a typelib, you need (from here):

    Key: HKEY_CLASSES_ROOT\Typelib\<LibID>\
    Key: HKEY_CLASSES_ROOT\Typelib\<LibID>\<major version>.<minor version>\   
        Default Value: <friendly name for the library> Again, not really required, but nice for oleview
    Key: HKEY_CLASSES_ROOT\Typelib\<LibID>\<major version>.<minor version>\HELPDIR   
        Default Value: <Directory that contains the help file for the type library>
    Key: HKEY_CLASSES_ROOT\Typelib\<LibID>\<major version>.<minor version>\FLAGS   
        Default Value: Flags for the ICreateTypeLib::SetLibFlags call (typically 0)
    Key: HKEY_CLASSES_ROOT\Typelib\<LibID>\<major version>.<minor version>\<LCID for library>
    Key: HKEY_CLASSES_ROOT\Typelib\<LibID>\<major version>.<minor version>\<LCID>\<Platform>
        Default Value: <File name that contains the typelib>

    Notes:

    If your typelib isn't locale-specific, you can specify 0 for the LCID.  Looking at my system, that's typically what most apps do.

    <Platform> can be win32, win64 or win16 depending on the platform of the binary.
     

    But this isn't quite enough to get the typelib hooked up  - the system still doesn't know how to get access to the type library.  To do that, you need to enhance your CLSID registration to let COM know that there's a typelib available.  With the typelib, a managed environment can synthesize all the interfaces associated with a class.  To do that, you enhance the class registration:

    Key: HKEY_CLASSES_ROOT\CLSID\<CLSID>\TypeLib = <LibID>

    But we're still not quite done.  For each of the interfaces in the typelib, you can let the system do the marshaling of the interface for you without having to specify a proxy library.  To do this, you can let the standard proxy marshaler do the work.  The universal marshaler has a clsid of {00020424-0000-0000-C000-000000000046}, so instead of using the interface registration mentioned in the last article, you can replace it with:

    Key: HKEY_CLASSES_ROOT\Interface\<IID>\
        Default Value: <friendly name for the interface> Again, not really required, but nice for oleview
    Key: HKEY_CLASSES_ROOT\Interface\<IID>\ProxyStubClsid32\
        Default Value: {00020424-0000-0000-C000-000000000046}
    Key: HKEY_CLASSES_ROOT\Interface\<IID>\TypeLib\
        Default Value: <LibID>

    Now instead of using the proxy code in a proxy DLL, the system will do the marshaling for you.
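    Pulling all of that together, a registration might look something like the following .reg sketch.  Every GUID, path, and name here is a made-up placeholder - substitute your own LibID, CLSID, and IID:

        REGEDIT4

        [HKEY_CLASSES_ROOT\TypeLib\{AAAAAAAA-0000-0000-0000-AAAAAAAAAAAA}\1.0]
        @="Sample Type Library"

        [HKEY_CLASSES_ROOT\TypeLib\{AAAAAAAA-0000-0000-0000-AAAAAAAAAAAA}\1.0\FLAGS]
        @="0"

        [HKEY_CLASSES_ROOT\TypeLib\{AAAAAAAA-0000-0000-0000-AAAAAAAAAAAA}\1.0\HELPDIR]
        @="C:\\Program Files\\Sample"

        [HKEY_CLASSES_ROOT\TypeLib\{AAAAAAAA-0000-0000-0000-AAAAAAAAAAAA}\1.0\0\win32]
        @="C:\\Program Files\\Sample\\sample.dll"

        [HKEY_CLASSES_ROOT\CLSID\{BBBBBBBB-0000-0000-0000-BBBBBBBBBBBB}\TypeLib]
        @="{AAAAAAAA-0000-0000-0000-AAAAAAAAAAAA}"

        [HKEY_CLASSES_ROOT\Interface\{CCCCCCCC-0000-0000-0000-CCCCCCCCCCCC}]
        @="IMyInterface"

        [HKEY_CLASSES_ROOT\Interface\{CCCCCCCC-0000-0000-0000-CCCCCCCCCCCC}\ProxyStubClsid32]
        @="{00020424-0000-0000-C000-000000000046}"

        [HKEY_CLASSES_ROOT\Interface\{CCCCCCCC-0000-0000-0000-CCCCCCCCCCCC}\TypeLib]
        @="{AAAAAAAA-0000-0000-0000-AAAAAAAAAAAA}"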

    Next: Ok, but what if I don't want to deal with all those ugly GUID thingies?

  • Larry Osterman's WebLog

    What is a BUGBUG?

    • 27 Comments
    One of the internal software engineering traditions here at Microsoft is the "BUGBUG".

    Bugbugs are annotations that are added to the source code when the developer writing the code isn't sure if the code they're writing is "correct", or if there's some potential issue with the code that the developer feels needs further investigation.

    So when looking through source code, you sometimes find things like:

        // BUGBUG: I'm sure these GUIDs are defined somewhere but I'm not sure which library contains them, so defining them here.
        DEFINE_GUID(IID_IFoo, 0x12345678,0x1234,0x1234,0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12);
     

    The idea behind a BUGBUG annotation is that a BUGBUG is something that you should fix before you ship, but that you won't necessarily hold shipping the product for - as in the example above, it's not the end of the world if the definition of IID_IFoo is duplicated in this module, but it IS somewhat sloppy.  Typically every component has a P1 bug in the database to remove all the BUGBUGs - either turn them into real bugs, remove them, or ensure that unit tests exist to verify (or falsify) the BUGBUG.

    As far as I know, the concept of the BUGBUG was initially created by Alan Whitney, who was my first manager at Microsoft - I know he's the first person who explained their use to me.  Lately they've fallen out of favor, replaced by more structured constructs, but conceptually, I still like them.

  • Larry Osterman's WebLog

    What's wrong with this code, part 8 - Email Address Validation

    • 23 Comments

    It's time for another "What's wrong with this code".

    Today's example is really simple, and hopefully easy.  It's a snippet of code I picked up from the net that's intended to validate an email address (useful for helping to avoid SQL injection attacks, for example).

     

        /// <summary>
        /// Validate an email address provided by the caller.
        ///
        /// Taken from http://www.codeproject.com/aspnet/Valid_Email_Addresses.asp
        /// </summary>
        public static bool ValidateEmailAddress(string emailAddress)
        {
            string strRegex = @"^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}" +
                              @"\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\" +
                              @".)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$";
            System.Text.RegularExpressions.Regex re = new System.Text.RegularExpressions.Regex(strRegex);
            if (re.IsMatch(emailAddress))
                return (true);
            else
                return (false);
        }

    As always, my next post (Monday) will include the answers and kudos to all those who got it right.

  • Larry Osterman's WebLog

    Concurrency, Part 10 - How do you know if you've got a scalability issue?

    • 21 Comments
    Well, the concurrency series is finally running down (phew, it's a lot longer than I expected it to be)...

    Today's article is about determining how you know if you've got a scalability problem.

    First, a general principle: All non-trivial, long-lived applications have scalability problems.  It's possible that the scalability issues don't matter to your application.  For example, if your application is Microsoft Word (or mIRC, or Firefox, or just about any other application that interacts with the user), then scalability isn't likely to be an issue for your application - the reality is that the user isn't going to try to make your application faster by throwing more resources at the application.

    As I wrote the previous paragraph, I realized that it describes the heart of scalability issues - if the user of your application feels it's necessary to throw more resources at your application, then your application has to worry about scalability.  It doesn't matter if the resources being thrown at your application are disk drives, memory, CPUs, GPUs, blades, or entire computers; if the user decides that your system is bottlenecked on a resource, they're going to try to throw more of that resource at your application to make it run faster.  And that means that your application needs to be prepared to handle it.

    Normally, these issues are only for server applications living in data farms, but we're starting to see the "throw more hardware at it" idea trickle down into the home space.  As usual, the gaming community is leading the way - the AlienWare SLI machines are a great example of this - to improve your 3d graphics performance, simply throw more GPUs at the problem.

    I'm not going to go into diagnosing bottlenecks in general, there are loads of resources available on the web for it (my first Google hit on Microsoft.com was this web cast from 2003).

    But for diagnosing CPU bottlenecks related to concurrency issues, there's actually a relatively straightforward way of determining if you've got a scalability issue associated with your locks.  And that's to look at the "Context Switches/sec" perfmon counter.  There's an article on how to measure this in the Windows 2000 resource kit here, so I won't go into the details, but in a nutshell, you start the perfmon application, select all the threads in your application, and look at the context switches/sec for each thread.

    You've got a scalability problem related to your locks if the context switches/second is somewhere above 2000 or so.
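
    As an aside, if you'd rather sample the counter programmatically than stare at perfmon, the PDH API can do it.  Here's a minimal sketch that reads the system-wide context switch rate (the per-thread counters described in the resource kit article are more precise, but the idea is the same); rate counters need two samples, hence the Sleep:

        // Minimal sketch: sample the system-wide "Context Switches/sec" counter via PDH.
        // Link against pdh.lib.  The counter path assumes an English locale.  Error handling omitted.
        #include <windows.h>
        #include <pdh.h>
        #include <stdio.h>
        #pragma comment(lib, "pdh.lib")

        int main()
        {
            PDH_HQUERY query;
            PDH_HCOUNTER counter;
            PDH_FMT_COUNTERVALUE value;

            PdhOpenQueryW(nullptr, 0, &query);
            PdhAddCounterW(query, L"\\System\\Context Switches/sec", 0, &counter);

            PdhCollectQueryData(query);      // first sample
            Sleep(1000);
            PdhCollectQueryData(query);      // second sample, one second later
            PdhGetFormattedCounterValue(counter, PDH_FMT_DOUBLE, nullptr, &value);

            printf("Context switches/sec: %.0f\n", value.doubleValue);
            PdhCloseQuery(query);
            return 0;
        }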

    And that means you need to dig into your code to find the "hot" critical sections.  The good news is that it's not usually too hard to detect which critical section is "hot" - hook a debugger up to your application, start your stress run, and put a breakpoint in the ntdll!RtlEnterCriticalSection routine.  You'll get a crazy number of hits, but if you look at your call stacks, the "hot" critical sections will start to show up.  It sounds tedious (and it is, somewhat) but it is surprisingly effective.   There are other techniques for detecting the "hot" critical sections in your process, but they are not guaranteed to work on all releases of Windows (and will make Raymond Chen very, very upset if you use them).

    Sometimes, your CPU bottleneck is simply that you're doing too much work on a single thread - if it simply takes too much time to calculate something, then you need to start seeing if it's possible to parallelize your code - you're back in the realm of making your code go faster and out of the realm of concurrent programming.  Another option that you might have is the OpenMP language extensions for C and C++ that allow the compiler to start parallelizing your code for you.
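
    To give a flavor of what that looks like, here's a minimal OpenMP sketch (the function and data are made up for illustration): a single pragma asks the compiler to spread the loop iterations across the available CPUs and combine the per-thread partial sums at the end.  Compile with /openmp (MSVC) or -fopenmp (gcc/clang):

        #include <omp.h>
        #include <vector>

        double SumOfSquares(const std::vector<double>& data)
        {
            double total = 0.0;

            // Each thread accumulates a private partial sum; OpenMP combines them when the loop ends.
            #pragma omp parallel for reduction(+:total)
            for (int i = 0; i < static_cast<int>(data.size()); ++i)
                total += data[i] * data[i];

            return total;
        }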

    But even if you do all that and ensure that your code is bottleneck free, you still can have scalability issues.  That's for tomorrow.

    Edit: Fixed spelling mistakes.

     

  • Larry Osterman's WebLog

    Why should I even bother to use DLL's in my system?

    • 9 Comments

    At the end of this blog entry, I mentioned that when I drop a new version of winmm.dll on my machine, I need to reboot it.  Cesar Eduardo Barros asked:

    Why do you have to reboot? Can't you just reopen the application that's using the dll, or restart the service that's using it?

    It turns out that in my case, it’s because winmm’s listed in the “Known DLLs” for Longhorn.  And Windows treats “KnownDLLs” as special – if a DLL is a “KnownDLL” then it’s assumed to be used by lots of processes, and it’s not reloaded from the disk when a new process is created – instead the pages from the existing DLL are just remapped into the current process.

    But that and a discussion on an internal alias got me to thinking about DLL’s in general.  This also came up during my previous discussion about the DLL C runtime library.

    At some point in the life of a system, you decide that you’ve got a bunch of code that’s being used in common between the various programs that make up the system. 

    Maybe that code’s only used in a single app – one app, 50 instances.

    Maybe that code’s used in 50 different apps – 50 apps, one instance.

    In the first case, it really doesn’t matter if you refactor the code into a separate library or not.  You’ll get code sharing regardless.

    In the second case, however, you have two choices – refactor the code into a library, or refactor the code into a DLL.

    If you refactor the code into a library, then you’ll save in complexity because the code will be used in common.  But you WON’T gain any savings in memory – each application will have its own set of pages dedicated to the contents of the shared library.

    If, on the other hand, you decide to refactor the library into its own DLL, then you still save in complexity, and you get the added benefit that the working set of ALL 50 applications is reduced – the pages occupied by the code in the DLL are shared between all 50 instances.

    You see, NT's pretty smart about DLL's (this isn’t unique to NT btw; most other operating systems that implement shared libraries do something similar).  When the loader maps a DLL into memory, it opens the file, and tries to map that file into memory at its preferred base address.  When this happens, memory management just says “The memory from this virtual address to this other virtual address should come from this DLL file”, and as the pages are touched, the normal paging logic brings them into memory.

    If the pages are already in memory (because another process already has the DLL mapped at that address), it doesn't go to disk to get them; it just remaps the pages from the existing mapping into the new process.  It can do this because the relocation fixups have already been applied (the relocation fixup table is basically a table within the executable that contains the address of every absolute jump in the code for the executable – when an executable is loaded in memory, the loader patches up these addresses to reflect the actual base address of the executable), so absolute jumps will work in the new process just like they would in the old.  The pages are backed with the file containing the DLL – if a page containing the DLL’s code is ever discarded from memory, the memory manager will simply go back to the DLL file to reload it. 

    If the preferred address range for the DLL isn’t available, then the loader has to do more work.  First, it maps the pages from the DLL into the process at a free location in the address space.  It then marks all the pages as copy-on-write so it can perform the fixups without messing up the pristine copy of the DLL (it wouldn’t be allowed to write to the pristine copy of the DLL anyway).  It then proceeds to apply all the fixups to the DLL, which causes a private copy of each page containing a fixup to be created – and those private pages can’t be shared.

    This causes the overall memory consumption of the system to go up.   What’s worse, the fixups are performed every time the DLL is loaded at an address other than its preferred address, which slows down process launch time.

    One way of looking at it is to consider the following example.  I have a DLL.  It’s a small DLL; it’s only got three pages in it.  Page 1 is data for the DLL, page 2 contains resource strings for the DLL, and page 3 contains the code for the DLL.  Btw, DLL’s this small are, in general, a bad idea.  I was recently enlightened by some of the Office guys as to exactly how bad this is; at some point I’ll write about it (assuming that Raymond or Eric don’t beat me to it).

    The DLL’s preferred base address is at 0x40000 in memory.  It’s used in two different applications.  Both applications are based starting at 0x10000 in memory, the first one uses 0x20000 bytes of address space for its image, the second one uses 0x40000 bytes for its image.

    When the first application launches, the loader opens the DLL, maps it into its preferred address.  It can do it because the first app uses between 0x10000 and 0x30000 for its image.  The pages are marked according to the protections in the image – page 1 is marked copy-on-write (since it’s read/write data), page 2 is marked read-only (since it’s a resource-only page) and page 3 is marked read+execute (since it’s code).  When the app runs, as it executes code in the 3rd page of the DLL, the pages are mapped into memory.  The instant that the DLL writes to its data segment, the first page of the DLL is forked – a private copy is made in memory and the modifications are made to that copy. 

    If a second instance of the first application runs (or another application runs that also can map the DLL at 0x40000), then once again the loader maps the DLL into its preferred address.  And again, when the code in the DLL is executed, the code page is loaded into memory.  And again, the page doesn’t have to be fixed up, so memory management simply uses the physical memory that contains the page that’s already in memory (from the first instance) into the new application’s address space.  When the DLL writes to its data segment, a private copy is made of the data segment.

    So we now have two instances of the first application running on the system.  The space used for the DLL is consuming 4 pages (roughly, there’s overhead I’m not counting).  Two of the pages are the code and resource pages.  The other two are two copies of the data page, one for each instance.

    Now let’s see what happens when the second application (the one that uses 0x40000 bytes for its image) launches.  The loader can’t map the DLL to its preferred address (since the second application occupies from 0x10000 to 0x50000).  So the loader maps the DLL into memory at (say) 0x50000.  Just like the first time, it marks the pages for the DLL according to the protections in the image, with one huge difference: since the code pages need to be relocated, they’re ALSO marked copy-on-write.  And then, because it knows that it wasn’t able to map the DLL into its preferred address, the loader patches all the relocation fixups.  These cause the page that contains the code to be written to, and so memory management creates a private copy of the page.  After the fixups are done, the loader restores the page protection to the value marked in the image.  Now the code starts executing in the DLL.  Since it’s been mapped into memory already (when the relocation fixups were done), the code is simply executed.  And again, when the DLL touches the data page, a new copy is created for the data page.

    Once again, we start a second instance of the second application.  Now the DLL’s using 5 pages of memory – there are two copies of the code page, one for the resource page, and two copies of the data page.  All of which are consuming system resources.

    One thing to keep in mind is that the physical memory page that backs the resource page in the DLL is going to be kept in common among all the instances, since there are no relocations applied to that page and the page contains no writable data - thus the page is never modified.

    Now imagine what happens when we have 50 copies of the first application running.  There are 52 pages in memory consumed by the DLL – 50 pages for the DLL’s data, one for the code, and one for the resources.

    And now, consider what happens if we have 50 copies of the second application running.  Now we get 101 pages in memory, just from this DLL!  We’ve got 50 pages for the DLL’s data, 50 pages for the relocated code, and still the one remaining for the resources.  Twice the memory consumption, just because the DLL wasn’t rebased properly.

    This increase in physical memory isn’t usually a big deal when it happens only once. If, on the other hand, it happens a lot, and you don’t have the physical RAM to accommodate it, then you’re likely to start to page.  And that can result in “significantly reduced performance” (see this entry for details of what can happen if you page on a server).

    This is why it's so important to rebase your DLL's - it guarantees that the pages in your DLL will be shared across processes.  This reduces the time needed to load your process and means your process working set is smaller.   For NT, there’s an additional advantage – we can tightly pack the system DLL’s together when we create the system.  This means that the system consumes significantly less of the application’s address space.  And on a 32 bit processor, application address space is a precious commodity (I never thought I’d ever write that an address space that spans 2 gigabytes would be considered a limited resource, but...).
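
    If you're curious whether a particular DLL has a sensible preferred base address, you can read it straight out of the PE header.  Here's a minimal sketch (error handling omitted, and it assumes the DLL has the same bitness as the program reading it); the path is just an example:

        // Minimal sketch: print a DLL's preferred load address from its PE headers.
        #include <windows.h>
        #include <stdio.h>

        int main()
        {
            const wchar_t* path = L"C:\\Windows\\System32\\winmm.dll";   // example target

            HANDLE file    = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, nullptr,
                                         OPEN_EXISTING, 0, nullptr);
            HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
            void*  view    = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

            // The DOS header points at the NT headers, which contain the preferred ImageBase.
            auto dos = static_cast<const IMAGE_DOS_HEADER*>(view);
            auto nt  = reinterpret_cast<const IMAGE_NT_HEADERS*>(
                           static_cast<const BYTE*>(view) + dos->e_lfanew);

            printf("Preferred base address: 0x%llx\n",
                   static_cast<unsigned long long>(nt->OptionalHeader.ImageBase));

            UnmapViewOfFile(view);
            CloseHandle(mapping);
            CloseHandle(file);
            return 0;
        }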

    This isn’t just restricted to NT by the way.  Exchange has a script that’s run on every build that knows what DLLs are used in what processes, and it rebases the Exchange DLL’s so that they fit into unused slots regardless of the process in which the DLL is used.  I’m willing to bet that SQL server has something similar.

    Credits: Thanks to Landy, Rick, and Mike for reviewing this for technical accuracy (and hammering the details through my thick skull).  I owe you guys big time.

     

  • Larry Osterman's WebLog

    Moving Offices

    • 27 Comments
    Well, last week, we had yet another office move.

    Office moves are sort-of a tradition at Microsoft, this one's something like my 20th.  Personally I think that management schedules them just to make sure we don't collect too much junk in our offices... 

    For me, it doesn't help - I moved 14 boxes of stuff this time (and a boatload of legos that were stashed in my grandmanager's office).

    As I said, moving's a regular occurrence - I'm in my 4th office in this building alone.  Fortunately, intra-building moves aren't NEARLY as painful as inter-building moves, but they're still a pain in the neck.

    My longest time in an office was something like two years, my shortest was 2 weeks (they moved us out of building one into building four for two weeks while they moved another group out of building two, then moved us from building four back into building two).  I've had corner offices (twice, once in building two, another time in 25), I've had window offices and I've had interior offices.  I've got to say that I REALLY hate corner offices - my office has a whiteboard, a corkboard and two bookshelves, but in a corner office, you lose one of your walls, which means that you can only have two of the 4 items (we have modular shelving and corkboard units in our offices, in an interior office, you get two walls full of hanging shelving racks, in a corner office, you only get one, plus a partial one).  The great view doesn't even come close to making up for the loss of a bookshelf.  In my case, one of my bookshelves is filled with lego models, but who's counting :)

    I can't wait to see the view from my new office though - it faces more-or-less northeast, which means that I get to see the Cascades.  I took the opportunity to reorient my office as well - traditionally, I have had my office laid out like this:

    But I'm laying my new office out like this:

    just to take advantage of the view (Ignore the units, they're Visio goop from when I made the drawing).  I like facing the door (so I can see who's coming), but I figured that the view would be worth the startle effect.  I suspect I'll end up getting a mirror to put into the window so I can see people at the door...  The cool thing about the new layout is that I'll be able to add a round table to the office, so I'll be able to get the manipulative puzzles off my main desk onto the round table.

    Unfortunately, this morning, just before I came into work to unpack, the fan motor on the AC blower feeding into my office gave up the ghost, filling the office (and the corridor) with REALLY noxious fumes, so I'm currently installed in an empty office near my office (I'd forgotten how heavy a 21 inch CRT monitor is).

    Anyway, today's tech-light, hopefully I'll get bandwidth to do more tomorrow.

    Edit: Clarified text around new office layout, it was awkward.

     

  • Larry Osterman's WebLog

    Office Decorations

    • 18 Comments

    One of the long-standing traditions here at Microsoft is decorating other employees’ offices.

    Over the years, people have come up with some extraordinarily creative ways to trash others’ offices.  It’s truly awe-inspiring how people use their imagination when they want to make mischief.

    One of my all-time favorites was done to one of the Xenix developers for his birthday.

    This particular developer had recently taken up golf.  So the members of his team came in one night, removed all the furniture from the office, and brought in sod to cover the office floor.

    They then cut a hole in the sod for the golf cup, mounted a golf pole (stolen from a nearby golf course, I believe), and put all the office furniture back in the office, making him his own in-office putting green.

    You could smell the sod from one side of the building to the other, it was that strong.

    I don’t want to think of how they ended up cleaning it up.

     

  • Larry Osterman's WebLog

    Little Lost APIs

    • 32 Comments
    When you have an API set as large as the Win32 API set, sometimes APIs get "lost".  Either by forgetfulness, or by the evolution of the hardware platform.

    We've got one such set of APIs here in multimedia-land, they're the "aux" APIs.

    The "aux" APIs (auxGetNumDevs, auxGetDevCaps, auxGetVolume, auxSetVolume, and auxOutMessage) are intended to control the volume of the "aux" port on your audio adapter.

    It's a measure of how little used these are that when I asked around my group what the aux APIs did, the general consensus was "I don't know" (this isn't exactly true, but it's close).  We certainly don't know of any applications that actually use these APIs.

    And that's not really surprising since the AUX APIs are used to control the volume of either the AUX input jack on your sound card or the output volume from a CDROM drive (if connected via the analog cable).

    What's that you say? Your sound card doesn't have an "AUX" jack?  That's not surprising, I'm not sure that ANY sound card has been manufactured in the past 10 years with an AUX input jack (they typically have a "LINE-IN" jack and a "MIC" jack).  And for at least the past 5 years, hardware manufacturers haven't been connecting the analog CD cable to the sound card (it enables them to save on manufacturing costs).

    Since almost every PC system shipped in the past many years (at least 5) has used digital audio extraction to retrieve the CD audio, the analog cable's simply not needed on most systems (there are some exceptions, such as laptop machines, which use the analog connector to save battery life when playing back CD audio).  And even if a sound card were to add an AUX input, the "mixer" APIs provide a more flexible mechanism for managing that input anyway.

    So with the "aux" APIs, you have a set of APIs that were designed to support a series of technologies that are at this point essentially obsolete.  And even if your hardware used them, there's an alternate, more reliable set of APIs that provide the same functionality - the mixer APIs.  In fact, if you launch sndvol32.exe (the volume control applet), you can see a bunch of sliders to the right of the volume control - they're labeled things like "wave", "sw synth", "Line in", etc.  If your audio card has an "AUX" line, then you'll see an "Aux" volume control - that's the same control that the auxSetVolume and auxGetVolume API controls.  Similarly, there's likely to be a "CD Player" volume control - that's the volume for the CD-ROM control (and it works for both digital and analog CD audio).  So all the "aux" API functionality is available from the "mixer" APIs, but the mixer version works in more situations.

    But even so, the "aux" APIs still exist in the system in the event that someone might still be calling them...  Even if there's no hardware on the system which would be controlled by these APIs, they still exist.

    These APIs are one of the few examples of APIs where it's actually possible that we might be able to end-of-life the APIs - they'll never be removed from the system, but a time might come in the future where the APIs simply stop working (auxGetNumDevs will return 0 in that case indicating that there are no AUX devices on the system).
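
    For the curious, here's a minimal sketch of what calling these APIs looks like - on most modern machines the loop never runs because auxGetNumDevs returns 0.  Link against winmm.lib:

        #include <windows.h>
        #include <mmsystem.h>
        #include <stdio.h>
        #pragma comment(lib, "winmm.lib")

        int main()
        {
            UINT count = auxGetNumDevs();            // 0 on most modern systems
            printf("AUX devices: %u\n", count);

            for (UINT i = 0; i < count; i++)
            {
                AUXCAPSA caps;
                if (auxGetDevCapsA(i, &caps, sizeof(caps)) == MMSYSERR_NOERROR)
                {
                    DWORD volume = 0;
                    if (caps.dwSupport & AUXCAPS_VOLUME)
                        auxGetVolume(i, &volume);
                    printf("  %u: %s (volume 0x%08lx)\n", i, caps.szPname, volume);
                }
            }
            return 0;
        }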

    Edit: Clarified mixer and aux API relationship a bit to explain how older systems would continue to work.

  • Larry Osterman's WebLog

    What's the big deal with the Moore's law post?

    • 19 Comments
    In yesterday's article, Jeff made the following comment:

    I don't quite get the argument. If my applications can't run on current hardware, I'm dead in the water. I can't wait for the next CPU.

    The thing is that that's the way people have worked for the past 20 years.  A little story goes a long way toward describing how the mentality works.

    During the NT 3.1 ship party, a bunch of us were standing around Dave Cutler, while he was expounding on something (aside: Have you ever noticed this phenomenon?  Where everybody at a party clusters around the bigwig?  Sycophancy at its finest).  The topic on hand at this time (1993) was Windows NT's memory footprint.

    When we shipped Windows NT, the minimum memory requirement for the system was 8M, the recommended was 12M, and it really shined at somewhere between 16M and 32M of memory.

    The thing was that Windows 3.1 and OS/2 2.0 were both targeted at machines with between 2M and 4M of RAM.  We were discussing why NT was so big.

    Cutler's response was something like "It doesn't matter that NT uses 16M of RAM - computer manufacturers will simply start selling more RAM, which will put pressure on the chip manufacturers to drive their RAM prices down, which will make this all moot". And the thing is, he was right - within 18 months of NT 3.1's shipping, memory prices had dropped to the point where it was quite reasonable for machines to come out with 32M and more RAM. Of course, the fact that we put NT on a severe diet for NT 3.5 didn't hurt (NT 3.5 was almost entirely about performance enhancements).

    It's not been uncommon for application vendors to ship applications that only ran well on cutting edge machines with the assumption that most of their target customers would be upgrading their machine within the lifetime of the application (3-6 months for games (games are special, since gaming customers tend to have bleeding edge machines since games have always pushed the envelope), 1-2 years for productivity applications, 3-5 years for server applications), and thus it wouldn't matter if their app was slow on current machines.

    It's a bad tactic, IMHO - an application should run well on both the current generation and the previous generation of computers (and so should an OS, btw).  I previously mentioned one tactic that was used (quite effectively) to ensure this - for the development of Windows 3.0, the development team was required to use 386/20's, even though most of the company was using 486s.

    But the point of Herb's article is that this tactic is no longer feasible.  From now on, CPUs won't continue to improve exponentially.  Instead, the CPUs will improve in power by getting more and more parallel (and by having more and more cache, etc).  Hyper-threading will continue to improve, and while the OS will be able to take advantage of this, applications won't unless they're modified.

    Interestingly (and quite coincidentally) enough, it's possible that this performance wall will affect *nix applications more than it will affect Windows applications (and it will especially affect *nix derivatives that don't have a preemptive kernel and fully asynchronous I/O like current versions of Linux do).  Since threading has been built into Windows from day one, most of the high-concurrency application space is already multithreaded.  I'm not sure that that's the case for *nix server applications - for example, applications like the UW IMAP daemon (and other daemons that run under inetd) may have quite a bit of difficulty being ported to a multithreaded environment, since they were designed to be single threaded (other IMAP daemons (like Cyrus) don't have this limitation, btw).  Please note that platforms like Apache don't have this restriction since (as far as I know) Apache fully supports threads.

    This posting is provided "AS IS" with no warranties, and confers no rights.

  • Larry Osterman's WebLog

    The dirty little secret of Windows volume

    • 8 Comments

    Here's a dirty little secret about volume in Windows.

    If you look at the documentation for waveOutSetVolume it very clearly says:

    Volume settings are interpreted logarithmically. This means the perceived increase in volume is the same when increasing the volume level from 0x5000 to 0x6000 as it is from 0x4000 to 0x5000.

    The implication of this is that you can implement a linear slider for volume control and use the position of the slider to represent the volume.  This is pretty cool.

    But if you've ever written an application that uses the waveform volume (say an app that plays content with a volume slider attached to it), you'll notice that your volume control is far more responsive when it's on the low end of the slider and less responsive on the high end of the slider.

    (Image: a logarithmic volume curve)

    That's weird.  The volume settings are supposed to be logarithmic, but a slider that's more responsive at the low end of the scale than the high end of the scale is an indicator that the slider's controlling LINEAR volume.

    And that's the dirty little secret.  Even though the wave volume is supposed to be logarithmic, the wave volume is actually linear.
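
    One consequence: if you want a slider that feels even to the ear on top of that linear control, the application has to apply its own taper.  Here's a minimal sketch (my own illustration, not what the XP SP2 taper work does) that maps a 0-100 slider position onto a dB-based curve before handing the value to waveOutSetVolume; the 60 dB range is an arbitrary choice:

        #include <windows.h>
        #include <mmsystem.h>
        #include <math.h>
        #pragma comment(lib, "winmm.lib")

        // Convert a 0..100 slider position to the DWORD waveOutSetVolume expects
        // (low word = left channel, high word = right channel, each 0x0000-0xFFFF).
        DWORD SliderToWaveVolume(int sliderPos)
        {
            if (sliderPos <= 0)
                return 0;                                       // mute at the bottom of the slider

            const double rangeDb = 60.0;                        // total attenuation range (arbitrary)
            double db        = -rangeDb * (100 - sliderPos) / 100.0;   // 0 dB at full volume
            double amplitude = pow(10.0, db / 20.0);            // dB -> linear amplitude
            WORD   channel   = static_cast<WORD>(amplitude * 0xFFFF);

            return MAKELONG(channel, channel);                  // same level for both channels
        }

        // Usage (hwo is an open HWAVEOUT handle):
        //     waveOutSetVolume(hwo, SliderToWaveVolume(sliderPosition));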

    What's worse is that we didn't notice this until we shipped Media Center Edition.  The PM for my group was playing with his MCE machine and noticed that the volume was linear.   To confirm it, he whipped out his sound pressure meter (he's a recording  artist so he has stuff like that in his house).  And yup, the volume control was linear.

    When he came back to work the next day, panic ensued.  I can't explain WHY nobody had noticed this, but they hadn't.

    In response, we added support (for XP SP2) for customized volume tapers for the audio APIs.  The results of that are discussed in this article.

     

    Interestingly enough, it appears that this problem is well known.  The article from which I stole this image discusses the problem of linear vs. logarithmic tapers and discusses how to find the optimal volume taper.

     

    Edit: Cleared up some ambiguities in the language.

  • Larry Osterman's WebLog

    Nathan's laws of software

    • 16 Comments
    Way back in 1997, Nathan Myhrvold (CTO of Microsoft at the time) wrote a paper entitled "The Next Fifty Years of Software" (Subtitled "Software: The Crisis Continues!")  which was presented at the ACM97 conference (focused on the next 50 years of computing).

    I actually attended an internal presentation of this talk, it was absolutely riveting. Nathan's a great public speaker, maybe even better than Michael Howard :).

    But an email I received today reminded me of Nathan's First Law of Software:  "Software is a Gas!"

    Nathan's basic premise is that as machines get bigger, the software that runs on those computers will continue to grow. It doesn't matter what kind of software it is, or what development paradigm is applied to that software.  Software will expand to fit the capacity of the container.

    Back in the 1980's, computers were limited.  So software couldn't do much.  Your spell checker didn't run automatically, it needed to be invoked separately.  Nowadays, the spell checker runs concurrently with the word processor.

    The "Bloatware" phenomenon is a direct consequence of Nathan's First Law.

    Nathan's second law is also fascinating: "Software grows until it becomes limited by Moore's Law". 

    The second law is interesting because we're currently nearing the end of the cycle of CPU growth brought on by Moore's law.  So in the future, the growth of software is going to become significantly constrained (until some new paradigm comes along).

    His third law is "Software growth makes Moore's Law possible".  Essentially he's saying that because software grows to hit the limits of Moore's law, software regularly comes out that pushes the boundaries.  And that's what drives hardware sales.  And the drive for ever increasing performance drives hardware manufacturers to make even faster and smaller machines, which in turn makes Moore's Law a reality.

    And I absolutely LOVE Nathan's 4th law.  "Software is only limited by human ambition and expectation."   This is so completely true.  Even back when the paper was written, the capabilities of computers today were mere pipe dreams.  Heck, in 1997, you physically couldn't have a computer with a large music library - a big machine in 1997 had a 600M hard disk.

    What's also interesting is the effort that goes into fighting Nathan's first law.  It's a constant fight, waged by diligent performance people against the hordes of developers who want to add their new feature to the operating system.  All the developers want to expand their features.  And the perf people need to fight back to stop them (or at least make them justify what they're doing).  The fight is ongoing, and unending.

    Btw, check out the slides - they're worth reading.  Especially when he gets to the part where the stuff that makes you genetically unique fits on a 3 1/2" floppy disk.

    He goes on from that point - at one point in his presentation, he pointed out that the entire human sensory experience can be transmitted easily over a 100Mb Ethernet connection.

     

    Btw, for those of you who would like, there's a link to two different streaming versions of the talk here: http://research.microsoft.com/acm97/

     

    Edit: Added link to video of talk.

     
