Software Engineering

Larry Osterman's WebLog

Confessions of an Old Fogey
  • Larry Osterman's WebLog

    Read-Only and Write-Only computer languages

    A colleague and I were chatting the other day and we were talking about STL implementations (in the context of a broader discussion about template meta-programming and how difficult it is).

     

    During our discussion, I described the STL implementation as “read-only” and he instantly knew what I was talking about.  As we dug in further, I realized that you can characterize many computer languages as read-only or write-only.[1]

    Of course there’s a huge amount of variation here – it’s always possible to write incomprehensible code, but there are languages that just lend themselves to being read-only or write-only.

    A “read-only” language is a language that anyone can understand when reading it, but you wouldn’t even begin to know how to write (or modify) code in that language.  Languages that are read-only tend to have very subtle syntax – it looks like something familiar, but there are magic special characters that change the meaning of the code.  As I mentioned above, template meta-programming can be thought of as read-only; if you’ve ever worked with COBOL code, it could also be considered read-only.
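
    To give a feel for what template meta-programming looks like, here is a minimal sketch of the classic compile-time factorial.  The names (Factorial, value) are purely illustrative and are not taken from any STL implementation; the point is only that the result is easy to read off, while the recursion-plus-specialization pattern is not something you would stumble onto by yourself.

    ```cpp
    #include <cstdint>

    // Primary template: Factorial<N> is defined in terms of Factorial<N - 1>.
    template <std::uint64_t N>
    struct Factorial {
        static constexpr std::uint64_t value = N * Factorial<N - 1>::value;
    };

    // Full specialization for 0 terminates the compile-time recursion.
    template <>
    struct Factorial<0> {
        static constexpr std::uint64_t value = 1;
    };

    // The compiler evaluates this; no code runs at execution time.
    static_assert(Factorial<5>::value == 120, "5! should be 120");
    ```

    Remove the specialization for 0 and the compiler recurses until it gives up, which is the kind of subtlety that makes this style hard to write.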

    A “write-only” language is a language where only the author of the code understands what it does.  Languages can be write-only because of their obscure syntax, or they can be write-only because of their flexibility.   The canonical example of the first type of write-only language is Teco (which was once described to me as “the only computer language whose syntax is indistinguishable from line noise”[2]).  But there are other languages that are also write-only.   For instance JavaScript and Perl are often considered to be write-only – the code written is often indecipherable to a knowledgeable viewer (but is almost always totally understandable to the author of the code).  It’s possible to write legible JS and Perl, but all too often, the code is impenetrable to the casual observer.

     

    Of course, for someone who’s very familiar with a particular language, the code written in that language is often understandable – back when I was coding in Teco on a daily basis (and there was a time when I spent weeks working on Emacs (the original Emacs written by RMS, not the replacement written by Jim Gosling) extensions), I could easily read Teco code.  But that’s only when you spend all your time living and breathing the code.

    [1] I can’t take credit for the term “read-only”, I first heard the term from Miguel de Icaza at the //Build/ conference a couple of weeks ago.

    [2] “line noise” – that’s the random characters that are inserted into the character stream received by an acoustic modem – these beasts no longer exist in today’s broadband world, but back in the day, line noise was a real problem.

  • Larry Osterman's WebLog

    Getting started with test driven development

    I'm at the build conference in Anaheim this week, and I was in the platform booth when a customer asked me a question I'd not been asked before: "How do you get started with test driven development".  My answer was simply "just start - it doesn't matter how much existing code you already have, just start writing tests alongside your new code.  Get a good unit test framework like the one in Visual Studio, but it really doesn't matter what framework you use, just start writing the tests".

    This morning, I realized I ought to elaborate on my answer a bit.

    I'm a huge fan of Test Driven Development.  Of all the "eXtreme Programming" methodologies, TDD is by far the one that makes the most sense.  I started using TDD back in Windows 7.  I had read about TDD over the years, and was intrigued by the concept but like the customer, I didn't really know where to start.  My previous project had extensive unit tests, but they really didn't use any kind of methodology when developing them.  When it came time to develop a new subsystem for the audio stack for Windows 7 (the feature that eventually became the "capture monitor/listen to" feature), I decided to apply TDD when developing the feature just to see how well it worked.  The results far exceeded my expectations.

    To be fair, I don't follow the classic TDD paradigm where you write the tests first, then write the code to make sure the tests pass.  Instead I write the tests at the same time I'm writing the code.  Sometimes I write the tests before the code, sometimes the code before the tests, but they're really written at the same time.

    In my case, I was fortunate because the capture monitor was a fairly separate piece of the audio stack - it is essentially bolted onto the core audio engine.  That meant that I could develop it as a stand-alone system.  To ensure that the capture monitor could be tested in isolation, I developed it as a library with a set of clean APIs.  The interface with the audio engine was just through those clean APIs.  By reducing the exposure of the capture monitor APIs, I restricted the public surface I needed to test.

    But I still needed to test the internal bits.  The good news is that because it was a library, it was easy to add test hooks and enable the ability to drive deep into the capture monitor implementation.  I simply made my test classes friends of the implementation classes and then the test code could call into the protected members of the various capture monitor classes.  This allowed me to build test cases that had the ability to simulate internal state changes which allowed me to build more thorough tests.
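
    As a rough illustration of the “friend” trick, here is a minimal sketch.  CaptureMonitor, its State enum and the test class below are hypothetical stand-ins, not the real audio stack classes; the only thing taken from the description above is the idea that a friend test class can reach the protected members and simulate an internal state change.

    ```cpp
    #include <cstddef>

    // Hypothetical stand-in for one of the capture monitor implementation classes.
    class CaptureMonitor {
    public:
        // Public API: the only surface the rest of the system would see.
        void Start() { m_state = State::Running; }
        void ProcessBuffer(std::size_t frames) {
            if (m_state == State::Running) {
                m_framesProcessed += frames;   // buffers are dropped in any other state
            }
        }
        std::size_t FramesProcessed() const { return m_framesProcessed; }

    protected:
        enum class State { Stopped, Running, DeviceLost };
        void SetState(State state) { m_state = state; }

    private:
        State m_state = State::Stopped;
        std::size_t m_framesProcessed = 0;

        // The test hook: the test class may call protected and private members.
        friend class CaptureMonitorTests;
    };

    class CaptureMonitorTests {
    public:
        // Forces an internal state that is awkward to reach through the public
        // API, then checks that later buffers are ignored.
        static bool BuffersAreDroppedAfterDeviceLoss() {
            CaptureMonitor monitor;
            monitor.Start();
            monitor.ProcessBuffer(480);
            monitor.SetState(CaptureMonitor::State::DeviceLost);
            monitor.ProcessBuffer(480);
            return monitor.FramesProcessed() == 480;
        }
    };
    ```

    The production code pays nothing for this beyond the one friend declaration, and the public surface stays as small as it was.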

    I was really happy with how well the test development went, but the proof of the benefits of TDD really showed when it was deployed as a part of the product.

    During the development of Windows 7, there were extremely few (maybe a half dozen?) bugs found in the capture monitor that weren't first found by my unit tests.  And because I had such an extensive library of tests, I was able to add regression test cases for those externally found bugs.

    I've since moved on from the audio team, but I'm still using TDD - I'm currently responsible for two tools in the Windows build system/SDK and both of them have been developed with TDD.  One of them (the IDL compiler used by Windows developers for creating Windows 8 APIs) couldn't be developed using the same methodology as I used for the capture monitor, but the other (mdmerge, the metadata composition tool) was.  Both have been successful - while there have been more bugs found externally in both the IDL compiler and mdmerge than were found in the capture monitor, the regression rate on both tools has been extremely low thanks to the unit tests.

    As I said at the beginning, I'm a huge fan of TDD - while there's some upfront cost associated with creating unit tests as you write the code, it absolutely pays off in the long run with a higher initial quality and a dramatically lower bug rate.

  • Larry Osterman's WebLog

    Nobody ever reads the event logs…

    In my last post, I mentioned that someone was complaining about the name of the bowser.sys component that I wrote 20 years ago.  In my post, I mentioned that he included a screen shot of the event viewer.

    What was also interesting was the contents of the screen shot.

    “The browser driver has received too many illegal datagrams from the remote computer <redacted> to name <redacted> on transport NetBT_Tcpip_<excluded>.  The data is the datagram.  No more events will be generated until the reset frequency has expired.”

    I added this message to the browser 20 years ago to detect computers that were going wild sending illegal junk on the intranet.  The idea was that every one of these events indicated that something had gone horribly wrong on the machine which originated the event and that a developer or network engineer should investigate the problem (these illegal datagrams were often caused by malfunctioning networking hardware (which was not uncommon 20 years ago)).
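
    The “no more events will be generated until the reset frequency has expired” part of that message describes a throttle.  Here is a minimal sketch of how such a throttle could look; it is not the actual bowser.sys logic, and the class name, the threshold and the reset interval are all assumptions made for illustration.

    ```cpp
    #include <chrono>

    // Hypothetical throttle: asks for one event once a threshold of illegal
    // datagrams is reached, then stays quiet until the reset interval elapses.
    class IllegalDatagramThrottle {
    public:
        using Clock = std::chrono::steady_clock;

        IllegalDatagramThrottle(unsigned threshold, Clock::duration resetInterval)
            : m_threshold(threshold), m_resetInterval(resetInterval) {}

        // Call once per illegal datagram.  Returns true when the caller should
        // write a single entry to the event log.
        bool OnIllegalDatagram(Clock::time_point now) {
            if (m_suppressed) {
                if (now - m_suppressedSince < m_resetInterval) {
                    return false;          // still inside the quiet period
                }
                m_suppressed = false;      // quiet period is over, count afresh
                m_count = 0;
            }
            if (++m_count < m_threshold) {
                return false;              // not "too many" yet
            }
            m_suppressed = true;
            m_suppressedSince = now;
            return true;                   // log exactly one event
        }

    private:
        unsigned m_threshold;
        Clock::duration m_resetInterval;
        unsigned m_count = 0;
        bool m_suppressed = false;
        Clock::time_point m_suppressedSince{};
    };
    ```

    Passing the current time in as a parameter (instead of reading the clock inside the class) is a deliberate choice in this sketch: it makes the quiet period easy to unit test.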

    But you’ll note that the person reporting the problem only complained about the name of the source of the event log entry.  He never bothered to look at the contents of this “error” event log entry to see if there was something that was worth reporting.

    Part of the reason that nobody bothers to read the event logs is that too many components log to the eventlog.  The event logs on customers’ computers are filled with unactionable meaningless events (“The <foo> service has started.  The <foo> service has entered the running state.  The <foo> service is stopping.  The <foo> service has entered the stopped state.”).  And they stop reading the event log because there’s never anything actionable in the logs.

    There’s a pretty important lesson here: Nobody ever bothers reading event logs because there’s simply too much noise in the logs. So think really hard about when you want to write an event to the event log.  Is the information in the log really worth generating?  Is there important information that a customer will want in those log entries?

    Unless you have a way of uploading troublesome logs to be analyzed later (and I know that several enterprise management solutions do have such mechanisms), it’s not clear that there’s any value to generating log entries.

  • Larry Osterman's WebLog

    Reason number 9,999,999 why you don’t ever use humorous elements in a shipping product

    I just saw an email go by on one of our self hosting aliases:

    From: <REDACTED>
    Sent: Saturday, April 30, 2011 12:27 PM
    To: <REDACTED>
    Subject: Spelling Mistake for browser in event viewer

    Not sure which team to assign this to – please pick up this bug – ‘bowser’ for ‘browser’

    And he included a nice screen shot of the event viewer pointing to an event generated by bowser.sys.

    The good news is that for once I didn’t have to answer the question.  Instead my co-workers answered for me:

    FYI: People have been filing bugs for this for years. Larry Osterman wrote a blog post about it. :)

    http://blogs.msdn.com/b/larryosterman/archive/2006/03/14/551368.aspx

    <Redacted>

    From: <Redacted>
    Sent: Saturday, April 30, 2011 1:54 PM
    To: <Redacted>

    Subject: RE: Spelling Mistake for browser in event viewer

    The name of the service is (intentionally) bowser and has been so for many releases.

    My response:

    “many releases”.  That cracks me up.  If I had known that I would literally spend the next 20 years paying for that one joke, I would have reconsidered it.

    And yes, bowser.sys has been in the product for 20 years now.

     

    So take this as an object lesson.  Avoid humorous names in your code or you’ll be answering questions about them for the next two decades and beyond.  If I had named the driver “brwsrhlp.sys” (at that point setup limited us to 8.3 file names) instead of “bowser.sys” it would never have raised any questions.  But I chose to go with a slightly cute name and…

     

    PS: After posting this, several people have pointed out that the resources on bowser.sys indicate that its name should be "browser.sys".  And they're right.  To my knowledge, nobody has noticed that in the past 20 years...

  • Larry Osterman's WebLog

    The case of the inconsistent right shift results…

    One of our testers just filed a bug against something I’m working on.  They reported that if they compiled code which calculated: 1130149156 >> -05701653 it generated different results on 32bit and 64bit operating systems.  On 32bit machines it reported 0 but on 64bit machines, it reported 0x21a.

    I realized that I could produce a simple reproduction for the scenario to dig into it a bit deeper:

    int _tmain(int argc, _TCHAR* argv[])
    {
        __int64 shift = 0x435cb524;
        __int64 amount = 0x55;
        __int64 result = shift >> amount;
        std::cout << shift << " >> " << amount << " = " << result << std::endl;
        return 0;
    }

    That’s pretty straightforward and it *does* reproduce the behavior.  On x86 it reports 0 and on x64 it reports 0x21a.  I can understand the x86 result (you’re shifting right more than the processor size, it shifts off the end and you get 0) but not the x64. What’s going on?

    Well, for starters I asked our C language folks.  I know I’m shifting by more than the processor word size (85), but the results should be the same, right?

    Well no.  The immediate answer I got was:

    From C++ 03, 5.8/1: The behavior is undefined if the right operand is negative, or greater than or equal to the length in bits of the promoted left operand.

    Ok.  It’s undefined behavior.  But that doesn’t really explain the difference.  When in doubt, let’s go to the assembly….

    000000013F5215D3  mov         rax,qword ptr [amount]  
    000000013F5215D8  movzx       ecx,al  
    000000013F5215DB  mov         rax,qword ptr [shift]  
    000000013F5215E0  sar         rax,cl  
    000000013F5215E3  mov         qword ptr [result],rax  
    000000013F5215E8  mov         rdx,qword ptr [shift] 

    The relevant instruction is the “sar rax,cl”.  It’s doing a shift arithmetic right of “shift” by “amount”.

    What about the x86 version?

    00CC14CA  mov         ecx,dword ptr [amount]  
    00CC14CD  mov         eax,dword ptr [shift]  
    00CC14D0  mov         edx,dword ptr [ebp-8]  
    00CC14D3  call        @ILT+85(__allshr) (0CC105Ah)  
    00CC14D8  mov         dword ptr [result],eax  
    00CC14DB  mov         dword ptr [ebp-28h],edx  

    Now that’s interesting.  The x64 version is using a processor shift function but on 32bit machines, it’s using a C runtime library function (__allshr).  And the one that’s weird is the x64 version.

    While I don’t have an x64 processor manual, I *do* have a 286 processor manual from back in the day (I have all sorts of stuff in my office).  And in my 80286 manual, I found:

    “If a shift count greater than 31 is attempted, only the bottom five bits of the shift count are used. (the iAPX 86 uses all eight bits of the shift count.)”

    A co-worker gave me the current text:

    The destination operand can be a register or a memory location. The count operand can be an immediate value or the CL register. The count is masked to 5 bits (or 6 bits if in 64-bit mode and REX.W is used). The count range is limited to 0 to 31 (or 63 if 64-bit mode and REX.W is used). A special opcode encoding is provided for a count of 1.

    So the mystery is now solved.  The shift of 0x55 only considers the low 6 bits.  The low 6 bits of 0x55 is 0x15 or 21.  0x435cb524 >> 21 is 0x21a.
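
    The arithmetic is easy to check in portable code by applying the mask by hand.  This is a small sketch that emulates what the documentation quoted above says the 64-bit shift instruction does with its count; the function name is my own invention.

    ```cpp
    #include <cstdint>

    // Emulates the count handling of a 64-bit "sar": only the low 6 bits of
    // the count are used.  Because the masked count is always in 0..63, the
    // C++ shift itself is well defined here.  (Right-shifting a negative value
    // is implementation-defined before C++20; the usual compilers shift
    // arithmetically.)
    std::int64_t ShiftRightLikeX64(std::int64_t value, std::uint8_t count) {
        return value >> (count & 0x3F);
    }
    ```

    Calling ShiftRightLikeX64(0x435cb524, 0x55) shifts by 0x55 & 0x3F = 0x15 = 21 and returns 0x21a, which matches what the x64 build reported.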

    One could argue that this is a bug in the __allshr function on x86 but you really can’t argue with “the behavior is undefined”.  Both scenarios are doing the “right thing”.  That’s the beauty of the “behavior is undefined” wording.  The compiler would be perfectly within spec if it decided to reformat my hard drive when it encountered this (although I’m happy it doesn’t :)).

    Now our feature crew just needs to figure out how best to resolve the bug.
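
    I don’t know what the feature crew will pick, but one possible approach (and it is only a sketch, with a made-up function name) is to stop relying on the undefined case at all and decide explicitly what an oversized shift means.  The version below picks the “everything shifts off the end” answer that the x86 build happened to give.

    ```cpp
    #include <cstdint>

    // Hypothetical helper: a right shift whose result is the same on every
    // platform.  Counts of 64 or more shift every bit out, leaving only sign
    // fill: 0 for non-negative values, -1 for negative ones.
    std::int64_t DefinedShiftRight(std::int64_t value, std::uint64_t count) {
        if (count >= 64) {
            return value < 0 ? -1 : 0;
        }
        return value >> count;   // count is 0..63, so this is well defined
    }
    ```

    With this helper, shifting 0x435cb524 right by 0x55 gives 0 on both architectures.  Rejecting the out-of-range count with an error would be an equally defensible choice.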

  • Larry Osterman's WebLog

    Why does Windows still place so much importance on filenames?

    Earlier today, Adrian Kingsley-Hughes posted a rant (his word, not mine) about the fact that Windows still relies on text filenames.

    The title says it all really. Why is it that Windows still place so much importance on filenames.

    Take the following example - sorting out digital snaps. These are usually automatically given daft filenames such as IMG00032.JPG at the time they are stored by the camera. In an ideal world you’d only ever have one IMG00032.JPG on your entire system, but the world is far from perfect. Your camera might decide to restart its numbering system, or you might have two cameras using the same naming format. What happens then?

    I guess I’m confused.  I could see a *very* strong argument against Windows dependency on file extensions, but I’m totally mystified about why having filenames is such a problem.

    At some level, Adrian’s absolutely right – it IS possible to have multiple files on the hard disk named “recipe.txt”.  And that’s bad.  But is it the fault of Windows for allowing multiple files to have colliding names? Or is it the fault of the user for choosing poor names?  Maybe it’s a bit of both.

    What would a better system look like?  Well, Adrian gives an example of what he’d like to see:

    Why? Why is the filename the deciding factor? Why not something more unique? Something like a checksum? This way the operating system could decide is two files really are identical or not, and replace the file if it’s a copy, or create a copy if they are different. This would save time, and dramatically reduce the likelihood of data loss through overwriting.

    But how would that system work?  What if we did just that.  Then you wouldn’t have two files named recipe.txt (which is good).

    Unfortunately that solution introduces a new problem: You still have two files.  One named “2B1015DB-30CA-409E-9B07-234A209622B6” and the other named “5F5431E8-FF7C-45D4-9A2B-B30A9D9A791B”. It’s certainly true that those two files are uniquely named and you can always tell them apart.  But you’ve also lost a critical piece of information: the fact that they both contain recipes.

    That’s the information that the filename conveys.  It’s human specific data that describes the contents of the file.  If we were to go with unique monikers, we’d lose that critical information.

    But I don’t actually think that the dependency on filenames is really what’s annoying him.  It’s just a symptom of a different problem. 

    Adrian’s rant is a perfect example of jumping to a solution without first understanding the problem.  It also shows why it’s so hard for Windows UI designers to figure out how to solve customer problems – this example is a customer complaint asking that we remove filenames from Windows.  Obviously something happened to annoy Adrian that was related to filenames, but the question is: What?  He doesn’t describe the problem, but we can hazard a guess about what happened from his text:

    Here’s an example. I might have two files in separate folders called recipe.txt, but one is a recipe for a pumpkin pie, and the other for apple pie. OK, it was dumb of me to give the files the same name, but it’s in situations like this that the OS should be helping me, not hindering me and making me pay for my stupidity. After all, Windows knows, without asking me, that the files, even if they are the same size and created at exactly the same time, are different. Why does Windows need to ask me what to do? Sure, it doesn’t solve all problems, but it’s a far better solution than clinging to the notion of filenames as being the best metric by which to judge whether files are identical or not.

    The key information here is the question: “Why does Windows need to ask me what to do?”  My guess is that he had two “recipe.txt” files in different directories and copied a recipe.txt from one directory to the other.  When you do that, Windows presents you with the following dialog:

    [Image: Windows Copy Dialog]

    My suspicion is that he’s annoyed because Windows is forcing him to make a choice about what to do when there’s a conflict.  The problem is that there’s no one answer that works for all users and all scenarios.    Even in my day-to-day work I’ve had reason to choose all three options, depending on what’s going on.  From the rant, it appears that Adrian would like it to choose “Copy, but keep both files” by default.  But what happens if you really *do* want to replace the old recipe.txt with a new version?  Maybe you edited the file offline on your laptop and you’re bringing the new copy back to your desktop machine.  Or maybe you’re copying a bunch of files from one drive to another (I do this regularly when I sync my music collection from home and work).  In that case, you want to ignore the existing copy of the file (or maybe you want to copy the file over to ensure that the metadata is in sync).

    Windows can’t figure out what the right answer is here – so it prompts the user for advice about what to do.
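
    To make that concrete, here is a small sketch of what a checksum can and can’t tell you.  Everything in it is an assumption for illustration: it hashes in-memory strings instead of real files, it uses FNV-1a as a simple stand-in for whatever checksum Adrian has in mind, and the names are made up.

    ```cpp
    #include <cstdint>
    #include <string>

    // 64-bit FNV-1a, a simple non-cryptographic hash used here as the "checksum".
    std::uint64_t Fnv1a64(const std::string& bytes) {
        std::uint64_t hash = 14695981039346656037ULL;   // FNV offset basis
        for (unsigned char c : bytes) {
            hash ^= c;
            hash *= 1099511628211ULL;                   // FNV prime
        }
        return hash;
    }

    enum class CopyConflict { SameContent, DifferentContent };

    // Hypothetical helper for two files that share a name.  The hash is a cheap
    // first check; the byte comparison guards against hash collisions.
    CopyConflict ClassifyConflict(const std::string& existing,
                                  const std::string& incoming) {
        const bool same =
            Fnv1a64(existing) == Fnv1a64(incoming) && existing == incoming;
        return same ? CopyConflict::SameContent : CopyConflict::DifferentContent;
    }
    ```

    When the answer is SameContent, skipping the copy is safe.  When it is DifferentContent, the code still has no idea whether you meant to replace the old file, keep both, or skip, which is exactly the question the dialog is asking.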

    Btw, Adrian’s answer to his rhetorical question is “the reason is legacy”.  Actually that’s not quite it.  The reason is that filenames provide valuable information for the user that would be lost if we went away from them.

    Next time I want to spend a bit of time brainstorming about ways to solve his problem (assuming that the problem I identified is the real problem – it might not be). 

     

     

    PS: I’m also not sure why he picked on Windows here.  Every operating system I know of has similar dependencies on filenames.  I think that’s another indication that he’s jumping on a solution without first describing the problem.

  • Larry Osterman's WebLog

    Not Invented Here’s take on software security

    One of my favorite web comics is Not Invented Here by Bill Barnes and Paul Southworth.  I started reading Bill’s stuff with his other web comic Unshelved (a librarian comic).

     

    NIH is a web comic about software development and this week Bill and Paul have decided to take on software security…

    Here’s Monday’s comic:

    [Image: Not Invented Here strip for 2/15/2010]

     

    Check them out – Bill and Paul both have a good feel for how the industry actually works :).

  • Larry Osterman's WebLog

    I can make it arbitrarily fast if I don’t actually have to make it work.

    Digging way back into my pre-Microsoft days, I was recently reminded of a story that I believe was told to me by Mary Shaw back when I took her Computer Optimization class at Carnegie-Mellon…

    During the class, Mary told an anecdote about a developer “Sue” who found a bug in code written by another developer, “Joe”, that “Joe” had introduced with a performance optimization.  When “Sue” pointed the bug out to “Joe”, his response was “Oops, but it’s WAY faster with the bug”.  “Sue” exploded “If it doesn’t have to be correct, I can calculate the result in 0 time!” [1].

    Immediately after telling this anecdote, she discussed a contest that the CS faculty held for the graduate students every year.  Each year the CS faculty posed a problem to the graduate students with a prize awarded to the grad student who came up with the most efficient (fastest) solution to the problem.  She then assigned the exact same problem to us:

    “Given a copy of the “Declaration of Independence”, calculate the 10 most common words in the document”

    We all went off and built programs to parse the words in the document, inserting them into a tree (tracking usage) and read off the 10 most frequent words.  The next assignment was “Now make it fast – the 5 fastest apps get an ‘A’, the next 5 get a ‘B’, etc.”

    So everyone in the class (except me :)) went out and rewrote their apps to use a hash table so that their insertion time was constant and then they optimized the heck out of their hash tables[2].
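
    For reference, the hash table approach looks roughly like this in modern C++.  This is a sketch and not anything we wrote back then: it assumes words are runs of ASCII letters compared case-insensitively (so “it’s” counts as two words), and the function name is made up.

    ```cpp
    #include <algorithm>
    #include <cctype>
    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    using WordCount = std::pair<std::string, std::size_t>;

    // Counts every word in a hash table, then returns the topN most frequent
    // words (ties broken alphabetically).
    std::vector<WordCount> MostCommonWords(const std::string& text, std::size_t topN) {
        std::unordered_map<std::string, std::size_t> counts;
        std::string word;
        auto flush = [&] {
            if (!word.empty()) {
                ++counts[word];          // average constant-time insert or update
                word.clear();
            }
        };
        for (unsigned char c : text) {
            if (std::isalpha(c)) {
                word.push_back(static_cast<char>(std::tolower(c)));
            } else {
                flush();                 // any non-letter ends the current word
            }
        }
        flush();

        std::vector<WordCount> ranked(counts.begin(), counts.end());
        topN = std::min(topN, ranked.size());
        // Only the first topN positions need to be in order.
        std::partial_sort(ranked.begin(), ranked.begin() + topN, ranked.end(),
            [](const WordCount& a, const WordCount& b) {
                return a.second != b.second ? a.second > b.second : a.first < b.first;
            });
        ranked.resize(topN);
        return ranked;
    }
    ```

    For the assignment you would read the document into a string and call MostCommonWords(text, 10).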

    After our class had our turn, Mary shared the results of what happened when the CS grad students were presented with the exact same problem.

    Most of them basically did what most of the students in my class did – built hash tables and tweaked them.  But a couple of results stood out.

    • The first one simply hard coded the 10 most common words in their app and printed them out.  This was disqualified because it was perceived as breaking the rules.
    • The next one was quite clever.  The grad student in question realized that they could write the program much faster if they wrote it in assembly language.  But the rules of the contest required that they use Pascal for the program.  So the grad student essentially created an array on the stack and introduced a buffer overflow and he loaded his assembly language program into the buffer and used that as a way of getting his assembly language version of the program to run.  IIRC he wasn’t disqualified but he didn’t win because he circumvented the rules (I’m not sure, it’s been more than a quarter century since Mary told the class this story).
    • The winning entry was even more clever.  He realized that he didn’t actually need to track all the words in the document.  Instead he decided to track only some of the words in the document in a fixed array.  His logic was that each of the 10 most frequent words was likely to appear in the first <n> words in the document so all he needed to do was to figure out what “n” is and he’d be golden.
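
    I never saw the winning program, so the following is only my guess at the shape of the idea, with made-up names and a std::vector standing in for his fixed array.  Words that first show up within the first n words become candidates; every other word is ignored entirely.

    ```cpp
    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    using Candidate = std::pair<std::string, std::size_t>;

    // Counts only the words that first appear among the first n words of the
    // input, then returns the topN of those by frequency.
    std::vector<Candidate> GuessMostCommon(const std::vector<std::string>& words,
                                           std::size_t n, std::size_t topN) {
        std::vector<Candidate> candidates;
        for (std::size_t i = 0; i < words.size(); ++i) {
            auto it = std::find_if(candidates.begin(), candidates.end(),
                [&](const Candidate& c) { return c.first == words[i]; });
            if (it != candidates.end()) {
                ++it->second;                          // known candidate: count it
            } else if (i < n) {
                candidates.emplace_back(words[i], 1);  // still early: admit it
            }
            // Otherwise the word is dropped without being stored anywhere.
        }
        std::stable_sort(candidates.begin(), candidates.end(),
            [](const Candidate& a, const Candidate& b) { return a.second > b.second; });
        if (candidates.size() > topN) {
            candidates.resize(topN);
        }
        return candidates;
    }
    ```

    The catch is visible as soon as n is too small: a word that is frequent overall but first appears after position n is never counted, so the answer is only right if the guess about n is right.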

     

    So the moral of the story is “Yes, if it doesn’t have to be correct, you can calculate the response in 0 time.  But sometimes it’s ok to guess and if you guess right, you can get a huge performance benefit from the result”. 

     

     

    [1] This anecdote might also come from Jon L. Bentley’s “Writing Efficient Programs”, I’ll be honest and say that I don’t remember where I heard it (but it makes a great introduction to the subsequent story).

    [2] I was stubborn and decided to take my binary tree program and make it as efficient as possible but keep the basic structure of the solution (for example, instead of comparing strings, I calculated a hash for the string and compared the hashes to determine if strings matched).  I don’t remember if I was in the top 5 but I was certainly in the top 10.  I do know that my program beat out most of the hash table based solutions.
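
    A rough reconstruction of that idea in today’s C++ might look like the sketch below.  It is not my original program: the names are made up, FNV-1a is just a convenient hash, and unlike what I described above it falls back to a string compare when two hashes are equal so that a collision can’t merge two different words.

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <string>

    struct WordNode {
        std::uint64_t hash;
        std::string word;
        std::size_t count = 1;
        std::unique_ptr<WordNode> left, right;
        WordNode(std::uint64_t h, const std::string& w) : hash(h), word(w) {}
    };

    // 64-bit FNV-1a hash of the word.
    std::uint64_t HashWord(const std::string& word) {
        std::uint64_t hash = 14695981039346656037ULL;
        for (unsigned char c : word) {
            hash ^= c;
            hash *= 1099511628211ULL;
        }
        return hash;
    }

    // Inserts the word or bumps its count, and returns the updated count.
    // Nodes are ordered by hash, so most steps cost one integer comparison.
    std::size_t CountWord(std::unique_ptr<WordNode>& root, const std::string& word) {
        const std::uint64_t h = HashWord(word);
        std::unique_ptr<WordNode>* slot = &root;
        while (*slot) {
            WordNode& node = **slot;
            if (h == node.hash && word == node.word) {
                return ++node.count;
            }
            // Strings are compared only when the hashes collide.
            const bool goLeft = (h != node.hash) ? (h < node.hash) : (word < node.word);
            slot = goLeft ? &node.left : &node.right;
        }
        *slot = std::unique_ptr<WordNode>(new WordNode(h, word));
        return 1;
    }
    ```

    A side effect of ordering by hash is that the tree tends to stay reasonably balanced even when the input words arrive in alphabetical order.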

  • Larry Osterman's WebLog

    Digging into the history bin (AKA: Microsoft Developer says that Windows is useless)

    As I was writing my “25 years of Larry’s history at Microsoft in 1 year chunks” blog posts, I spent a fair amount of time digging through my email archives (trying to figure out exactly what happened at what time).  During this, I ran into a link to a post I’d made on the Info-IBMPC mailing list back in 1992:

    Date: Thu, 12 Mar 92 12:44:39 PST
    From: lar...@microsoft.com
    Subject: What do you do with your windows? (V92 #36)

    || >From: m...@Violin.CC.MsState.Edu (Mubashir Cheema)

    ||   I recently acquired Windows 3.0 and I don't seem to understand one
    || thing.  What is it for?  What do I do with it?  What major advantage
    || does it have over Dos?  (I don't see any except being able to use mouse
    || and also the thing is bit more colorful) I think it was made for lazy
    || people who couldn't learn couple of DOS commands.

    ||   Don't tell me I could multi-task with it. I've been using Amigas
    || extensively

    I've got to jump in here, even though I suspect that there will probably be some form of an "official" response from MS if anyone in the DOS/Windows group is listening...... :)

    I'm going to be brutally honest about this one. Basically, Windows by itself IS pretty useless. The thing that makes Windows great is the same thing that has made DOS the most popular operating system in history. It's the applications that are available for it.

    GUI's (Graphical User Interfaces) have been proven to be significantly easier for users to understand for beginning users, and are arguably the wave of the future. I don't know of a significant operating system being introduced for the PC market that doesn't have a GUI available on it, be it PM, X, GEM, or Windows. Windows is arguably the best GUI available for DOS based on what I consider the most significant criteria: What applications are available for the platform.

    Consider the list of available windows apps: Excel, WinWord, PageMaker, Corel Draw, WordPerfect, Lotus 123, etc just to name a couple off the
    top of my head.

    You also hit on one of the significant reasons to use Windows - Multi-tasking.

    Windows is a non pre-emptive multi-tasking operating system.  On a 386, it does an ok job of multi-tasking multiple DOS applications, but on a
    286 it functions as a simple task switcher like DOS 5 does.  It really shines when multi-tasking Windows applications however.

    In addition, when you couple the multi-tasking capabilities of Windows with a windows mechanism known as DDE (for Dynamic Data Exchange), you
    can generate some truly incredible synergy between Windows applications. With Win 3.0/Win 3.1 Microsoft has introduced a concept
    known as OLE (Open Linking and Embedding) which allows you to cut and past from multiple "applets" allowing applications to take advantage of
    the capabilities of other shipped applications.  This allows an applet like an equation editor to manage all the information about formatting
    an equation even when the equation is embedded in a word document. With OLE, you can simply double-click on the object and bring up the
    "agent" that manages it (in my example, the equation editor).

    For application developers, Windows gives developers the ability to develop their applications without knowing anything about the
    underlying hardware of the machine - a windows application that runs on a machine with a CGA adapter will also run on a machine with a graphics
    accelerator that runs in 1024x1024 with 24 bits of color.

    In addition, when you write an application for windows, your application instantly will support literally hundreds of printers
    transparently - Windows does all the work for you.

    To re-iterate, Windows as a stand-alone product is not extraordinarily interesting - there are lots of productivity packages that provide
    similar functionality to users, the real benefit of Windows is the applications that run on it.

    I will also point out that there are more than 5000 Windows applications available today and still more will come out with Win 3.1.
    The available windows applications span all ranges of applications from games (Microsoft's Entertainment pack, Berkley-Soft's After Dark, and
    Sierra's Laffer Utilities for Windows) to Spreadsheets (Microsoft Excel, Lotus 1-2-3) to Word processors (Microsoft Word For Windows,
    Lotus Ami), to Desktop publishing (Aldus Pagemaker, Microsoft Publisher), to presentation graphics (Microsoft Powerpoint), to
    development tools (Microsoft Visual Basic) etc......

    Larry Osterman

    Disclaimer:  The opinions above are my own.  They are not necessarily the same as those of Microsoft.  I only work here.

    Remember that this was written back in 1992 after Windows 3.0 had come out but before Windows 3.1.  There was no Win32, no web browser, no multimedia support, none of the things that we all take for granted in a modern system.  Back then a display card that supported 1Kx1K with 24bit color was considered a monster display card (and hard disks still came in “megabytes” – I remember buying a 2G hard disk back then for about a thousand dollars).

    Reading this again, I find it vaguely funny that in many ways my feelings about Windows haven’t really changed that much in 18 years – the value of the Windows platform is STILL the applications available for that platform (although the number of applications has grown from the 5000 or so back in 1992 to several million applications).

  • Larry Osterman's WebLog

    Thinking about Last Checkin Chicken

    Raymond Chen’s post today started me thinking about “Last Check-in Chicken” again.  Back in the days when we were close to shipping Windows Vista, I wrote about “Last Check-in Chicken”.  What I didn’t mention was who ultimately won the game for Windows Vista.

    It turns out that the very last change to Windows Vista was actually made by one of the developers on the sound team.

     

    When you reach the last few days of a project, the bar for taking changes is insanely high – the teams which approve changes to the product get increasingly conservative about taking changes – every change taken is an opportunity for regression and resets some amount of the testing which has gone before.  So the number of bugs that are accepted towards the end of a product gets smaller and smaller. You can think of the ability to take bugs as a series of increasingly high barriers – it starts fairly low – just about any bug fix will be accepted into the tree.  This is the normal state during most of product development.  As time goes on and the team gets closer to shipping, the bug bar gets raised and the bugs that are considered are only those that are going to affect customers directly (as opposed to bugs found during testing that won’t necessarily be encountered by customers).  Then the bar gets raised again (and again, and again) until eventually it gets to the point where the only bugs that are accepted are “recall class” bugs[1].

    The idea behind a “recall class bug” is that it’s a bug that is so bad that we’d be willing to call the manufacturer and pull the product off the assembly line (at a cost of millions of dollars) to fix it.  These are the worst-of-the-worst bugs, and typically involve major scenarios not working.   When the bug bar is at “recall class only”, there are typically only two or three bugs that are considered each day across all of Windows and even then most of the bugs brought up to the triage team aren’t accepted.

    At some point the bug bar gets beyond even “recall class only” – this is when you’re REALLY close to being done (typically the last two or three days of a product).  Normally builds of the product are done daily because there are one or two “recall class” bugs still being accepted.  But eventually all those bugs are fixed and the build team stops doing daily builds because there have been no changes since the previous build.  The test team is hard at work doing its final sign-off of the bits and everybody is on tenterhooks waiting for the final build to come out.  When you’re at this stage of the product, every once in a while a change comes in that would be really nice to have because it fixes a critical issue with an important scenario, but it’s just not important enough to justify cracking open the bits to take the change.  Raymond calls these types of changes “Remora Check-ins”.   The idea is that if another bug was discovered during the final testing phase that forced us to rebuild the system, we would take these “Remora Check-ins” along for the ride.

    In our case, the change we made was a Remora check-in – it was an important bug, but it wasn’t important enough to justify resetting the final test pass.  But someone else’s component had a critical bug that HAD to be fixed and our change came along for the ride (and no, I don’t remember exactly what either of the changes were, I just know that our check-in was chronologically the last one made).

     

    Nitpicker’s corner: None of the information in this post should be particularly controversial – much of what I’ve described here is software engineering 101.  There’s always a bar for taking bug fixes in every product – if there weren’t, you’d never ship the product (for example, the Mozilla Foundation shipped Firefox version 3.5 today (congrats!) and they still have several dozen critical bugs active in their database – I’m sure that these are all bugs that didn’t meet their bug bar).  Heck, there’s even a book that’s all about the process of shipping NT 3.1 that covers much of this information.

     

    ----

    [1] In the past these bugs would be called “Show Stoppers”.

    Everyone wants a shiny new UI

    Surfing around the web, I often run into web sites that contain critiques of various aspects of Windows UI.

    One of the most common criticisms on those sites is "old style" dialogs.  In other words, dialogs that don't have the most up-to-date theming.  Here's an example I ran into earlier today:

    [Image: AutoComplete dialog]

    Windows has a fair number of dialogs like this - they're often fairly old dialogs that were written before new theming elements were added (or contain animations that predate newer theming options).  They all work correctly but they're just ... old.

    Usually the web site wants the Windows team to update the dialog to match the newest styling because the dialog is "wrong".

    Whenever someone asks (or more often insists) that the Windows team update their particular old dialog, I sometimes want to turn around and ask them a question:

    "You get to choose: You can get this dialog fixed OR you can cut a feature from Windows, you can't get both.  Which feature in Windows would you cut to change this dialog?"

    Perhaps an automotive analogy would help explain my rather intemperate reaction:

    One of the roads near my house is a cement road and the road is starting to develop a fair number of cracks in it.  The folks living near the road got upset at the condition of the road and started a petition drive to get the county to repair the road.  Their petition worked and the county came out a couple of weeks later and inspected the road and rendered their verdict on the repair (paraphrasing):  We've looked at the road surface and it is 60% degraded.  The threshold for immediate repairs on county roads is 80% degradation.  Your road was built 30 years ago and cement roads in this area have a 40 year expected lifespan.  Since the road doesn't meet our threshold for immediate repair and it hasn't met the end of its lifespan, we can't justify moving this section of road up ahead of the hundreds of other sections of road that need immediate repair.

    In other words, the county had a limited budget for road repairs and there were a lot of other sections of road in the county that were in a lot worse shape than the one near my house.

    The same thing happens in Windows - there are thousands of features in Windows and a limited number of developers who can change those features.   Changing a dialog does not happen for free.  It takes time for the developers to fix UI bugs.  As an example, I just checked in a fix for a particularly tricky UI bug.  I started working on that fix in early October and it's now January.

    Remember, this dialog works just fine, it's just a visual inconsistency.  But it's going to take a developer some amount of time to fix the dialog.  Maybe it's only one day.  Maybe it's a week.  Maybe the fix requires coordination between multiple people (for example, changing an icon usually requires the time of both a developer AND a graphic designer).  That time could be spent working on fixing other bugs.  Every feature team goes through a triage process on incoming bugs to decide which bugs they should fix.  They make choices based on their limited budget (there are n developers on the team, there are m bugs to fix, and each bug takes t time to fix on average, which means the team needs (m*t)/n of calendar time to fix them all before we can ship).
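
    To put rough numbers on that budget, here's a minimal sketch.  The function name and the figures are made up for illustration; the only thing taken from the discussion above is that m bugs at t time each, spread across n developers working in parallel, cost roughly (m*t)/n of calendar time.

```cpp
#include <cassert>

// Hypothetical helper: calendar days needed to clear a backlog of `bugs`
// bugs when each takes `days_per_bug` on average and `developers` people
// work on them in parallel. This is the (m*t)/n estimate and nothing more.
double days_to_clear_backlog(int bugs, double days_per_bug, int developers) {
    assert(developers > 0);
    return bugs * days_per_bug / developers;
}
```

    With 120 bugs at 2 days each and 8 developers, days_to_clear_backlog(120, 2.0, 8) comes out to 30 days.  Every extra "just fix the ugly dialog" bug adds its t to the numerator, and that time has to come from somewhere.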

    Fixing a theming bug like this takes time that could be spent fixing other bugs.  And (as I've said before) the dialog does work correctly, it's just outdated.

    So again I come back to the question: "Is fixing a working but ugly dialog really more important than all the other bugs?"  It's unfortunate but you have to make a choice.

     

    PS: Just because we have to make choices like this doesn't mean that you shouldn't send feedback like this.   Just like the neighbors complaining to the county about the road, it helps to let the relevant team know about the issue. Feedback like this is invaluable for the Windows team (that's what the "Send Feedback" link is there for after all).  Even if the team decides not to fix a particular bug in this release it doesn't mean that it won't be fixed in the next release.

    Engineering 7: A view from the bottom

    About 2 months ago, Steven Sinofsky and Jon DeVaan started the “Engineering Windows 7” blog.  The instant I saw the blog, I wanted to contribute to the blog (because I love writing :)).

    I spent a fair amount of time thinking about what to write about and realized that one thing that wasn’t likely to be discussed was how the actual software engineering process of Windows 7 worked – not the data behind particular features, but how the hard core engineering work was managed.  So I wrote it and submitted it to Steven and Jon.

     

    My article (it’s too long to be considered a “post”) went live on the Engineering 7 blog sometime last night.

     

    Enjoy!

    Resilience is NOT necessarily a good thing

    I just ran into this post by Eric Brechner, who is the director of Microsoft's Engineering Excellence center.

    What really caught my eye was his opening paragraph:

    I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it's better to crash and let Watson report the error than it is to catch the exception and try to correct it.

    Wow.  I'm not going to mince words: What a profoundly stupid assertion to make.  Of course it's better to crash and let the OS handle the exception than to try to continue after an exception.

     

    I have a HUGE issue with the concept that an application should catch exceptions[1] and attempt to correct them.  In my experience handling exceptions and attempting to continue is a recipe for disaster.  At best, it turns an easily debuggable problem into one that takes hours of debugging to resolve.  At its worst, exception handling can either introduce security holes or render security mitigations irrelevant.

    I have absolutely no problems with fail fast (which is what Eric suggests with his "Restart" option).  I think that restarting a process after the process crashes is a great idea (as long as you have a way to prevent crashes from spiraling out of control).  In Windows Vista, Microsoft built this functionality directly into the OS with the Restart Manager: if your application calls the RegisterApplicationRestart API, the OS will offer to restart your application if it crashes or becomes non-responsive.  This concept also shows up in the service restart options in the ChangeServiceConfig2 API (if a service crashes, the OS will restart it if you've configured the OS to restart it).

    I also agree with Eric's comment that asserts that cause crashes have no business living in production code, and I have no problems with asserts logging a failure and continuing (assuming that there's someone who is going to actually look at the log and can understand the contents of the log, otherwise the  logs just consume disk space). 

     

    But I simply can't wrap my head around the idea that it's ok to catch exceptions and continue to run.  Back in the days of Windows 3.1 it might have been a good idea, but after the security fiascos of the early 2000s, any thoughts that you could continue to run after an exception has been thrown should have been removed forever.

    The bottom line is that when an exception is thrown, your program is in an unknown state.  Attempting to continue in that unknown state is pointless and potentially extremely dangerous - you literally have no idea what's going on in your program.  Your best bet is to let the OS exception handler dump core and hopefully your customers will submit those crash dumps to you so you can post-mortem debug the problem.  Any other attempt at continuing is a recipe for disaster.

     

    -------

    [1] To be clear: I'm not necessarily talking about C++ exceptions here, just structured exceptions.  For some C++ and C# exceptions, it's ok to catch the exception and continue, assuming that you understand the root cause of the exception.  But if you don't know the exact cause of the exception you should never proceed.  For instance, if your binary tree class throws a "Tree Corrupt" exception, you really shouldn't continue to run, but if opening a file throws a "file not found" exception, it's likely to be ok.  For structured exceptions, I know of NO circumstance under which it is appropriate to continue running.
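
    To make the distinction in that footnote concrete, here's a minimal sketch in portable C++.  All of the names (FileNotFound, read_first_line, load_setting_or_default, check_tree_invariant) are hypothetical; the point is only the shape: catch the one exception whose cause you fully understand, and fail fast on anything that means your state can't be trusted.

```cpp
#include <cassert>
#include <cstdlib>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical exception type for an expected, well-understood failure.
struct FileNotFound : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Reads the first line of a file; throws FileNotFound if it can't be opened.
std::string read_first_line(const std::string& path) {
    std::ifstream in(path);
    if (!in) throw FileNotFound("cannot open " + path);
    std::string line;
    std::getline(in, line);
    return line;
}

// Recoverable: a missing file has one known cause, so fall back to a default.
// There is deliberately no catch (...) here; anything else propagates.
std::string load_setting_or_default(const std::string& path,
                                    const std::string& fallback) {
    try {
        return read_first_line(path);
    } catch (const FileNotFound&) {
        return fallback;
    }
}

// Not recoverable: a broken invariant means the program state is unknown,
// so end the process right away instead of limping along.
void check_tree_invariant(bool tree_is_consistent) {
    if (!tree_is_consistent) std::abort();
}
```

    The missing catch (...) is the important design choice: a handler that swallows everything is exactly the "catch it and try to continue" behavior this post argues against.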

     

    Edit: Cleaned up wording in the footnote.

    When you're analyzing the strength of a password, make sure you know what's done with it.

    Every once in a while, I hear someone making comments about the strength of things like long passwords.

    For example, suppose you have a 255 character password that just uses the 26 roman upper and lower case letters, plus the numeric digits.  That means that your password has 62^255 possible values; even if an attacker can try a million million passwords per second, the time required to try them all would far exceed the time until the heat death of the universe.

     

    Wow, that's cool - it means that you can never break my password if I use a long enough password.

     

    Except...

    The odds are very good that something in the system's going to take your password and apply a one-way hash to that password - after all, it wouldn't do to keep that password lying around in clear text where an attacker could see it.  But the instant you take a hash of a secret, the strength of the secret degrades to the strength of the hash.

    It's another example of the pigeonhole principle in practice - if you put N+M items into N slots, you're going to have some slots with more than one entry.  The pigeonhole principle applies in this case as well.
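
    Here's a small sketch of the pigeonhole principle at work.  It uses a toy 8-bit hash (an FNV-1a hash folded down to one byte, chosen purely for illustration, NOT a real password hash) so the collision shows up after at most 257 candidates; the function names are my own.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Toy 8-bit hash: 32-bit FNV-1a folded down to a single byte.
std::uint8_t toy_hash8(const std::string& s) {
    std::uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return static_cast<std::uint8_t>(h ^ (h >> 8) ^ (h >> 16) ^ (h >> 24));
}

// Tries the candidates "0", "1", "2", ... and returns the first pair of
// distinct strings that share a hash value. With only 256 slots, a
// collision has to appear within the first 257 candidates.
std::pair<std::string, std::string> find_collision() {
    std::unordered_map<std::uint8_t, std::string> seen;
    for (int i = 0; i <= 256; ++i) {
        const std::string candidate = std::to_string(i);
        auto result = seen.emplace(toy_hash8(candidate), candidate);
        if (!result.second) return {result.first->second, candidate};
    }
    return {};  // unreachable: 257 distinct inputs can't fit in 256 slots
}
```

    A real hash has vastly more slots than 256, but the argument is the same: once there are more possible passwords than hash values, some of them must share a hash.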

     

    In other words, if the password database that holds your password uses a hash algorithm like SHA-1, your password with its 62^255 possible values just got reduced in strength to a hash with 256^20 possible values[1]. That means that any analysis that you've done on your password doesn't matter, because all an attacker needs to do is to find a different password that hashes to the same value as your password and they've broken your password.  Since your password strength exceeds the strength of the hash code, you know that there MUST be a collision with a weaker password.
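
    It's easier to compare those two numbers in bits.  The sketch below is just the arithmetic (the function names are mine): a 255 character password over a 62 character alphabet is worth roughly 1518 bits, while a 20 byte SHA-1 value is 160 bits, and the smaller of the two is what the attacker actually has to search.

```cpp
#include <cassert>
#include <cmath>

// Size of a search space of alphabet_size^length possibilities, in bits.
double search_space_bits(double alphabet_size, double length) {
    return length * std::log2(alphabet_size);
}

// Effective strength once the secret is stored only as a hash that is
// hash_bits wide: the attacker searches whichever space is smaller.
double effective_bits(double password_bits, double hash_bits) {
    return password_bits < hash_bits ? password_bits : hash_bits;
}
```

    For the example above, search_space_bits(62, 255) is a bit over 1518, search_space_bits(256, 20) is 160, and effective_bits of the two is 160.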

     

    The bottom line is that when you're calculating the strength of a  password, it's important that you understand what your password looks like to an attacker.  If your password is saved as an SHA-1 or MD5 hash, that's the true maximum strength of your password.

     

    [1] To be fair, 256^20 is something like 1.4E48, so even if you could try a million million passwords per second, you're still looking at something on the order of 10^28 years to brute force that database, but 256^20 is still far less than 62^255.

    Some final thoughts on Threat Modeling...

    I want to wrap up the threat modeling posts with a summary and some comments on the entire process.  Yeah, I know I should have done this last week, but I got distracted :). 

    First, a summary of the threat modeling posts:

    Part 1: Threat Modeling, Once again.  In which our narrator introduces the idea of a threat model diagram

    Part 2: Threat Modeling Again. Drawing the Diagram.  In which our narrator introduces the diagram for the PlaySound API

    Part 3: Threat Modeling Again, Stride.  Introducing the various STRIDE categories.

    Part 4: Threat Modeling Again, Stride Mitigations.  Discussing various mitigations for the STRIDE categories.

    Part 5: Threat Modeling Again, What does STRIDE have to do with threat modeling?  The relationship between STRIDE and diagram elements.

    Part 6: Threat Modeling Again, STRIDE per Element.  In which the concept of STRIDE/Element is discussed.

    Part 7: Threat Modeling Again, Threat Modeling PlaySound.  Which enumerates the threats against the PlaySound API.

    Part 8: Threat Modeling Again, Analyzing the threats to PlaySound.  In which the threat modeling analysis work against the threats to PlaySound is performed.

    Part 9: Threat Modeling Again, Pulling the threat model together.  Which describes the narrative structure of a threat model.

    Part 10: Threat Modeling Again, Presenting the PlaySound threat model.  Which doesn't need a pithy summary, because the title describes what it is.

    Part 11: Threat Modeling Again, Threat Modeling in Practice.  Presenting the threat model diagrams for a real-world security problem.[1]

    Part 12: Threat Modeling Again, Threat Modeling and the firefoxurl issue. Analyzing the real-world problem from the standpoint of threat modeling.

    Part 13: Threat Modeling Again, Threat Modeling Rules of Thumb.  A document with some useful rules of thumb to consider when threat modeling.

     

    Remember that threat modeling is an analysis tool. You threat model to identify threats to your component, which then lets you know where you need to concentrate your resources.  Maybe you need to encrypt a particular data channel to protect it from snooping.  Maybe you need to change the ACLs on a data store to ensure that an attacker can't modify the contents of the store.  Maybe you just need to carefully validate the contents of the store before you read it.  The threat modeling process tells you where to look and gives you suggestions about what to look for, but it doesn't solve the problem.  It might be that the only thing that comes out from your threat modeling process is a document that says "We don't care about any of the threats to this component".  That's ok, at a minimum, it means that you considered the threats and decided that they were acceptable.

    The threat modeling process is also a living process. I'm 100% certain that 2 years from now, we're going to be doing threat modeling differently from the way that we do it today.  Experience has shown that every time we apply threat modeling to a product, we realize new things about the process of performing threat modeling, and find new, more efficient ways of going about the process.   Even now, the various teams involved with threat modeling in my division have proposed new changes to the process based on the experiences of our current round of threat modeling.  Some of them will be adopted as best practices across Microsoft, some of them will be dropped on the floor.

     

    What I've described over these posts is the process of threat modeling as it's done today in the Windows division at Microsoft.  Other divisions use threat modeling differently - the threat landscape for Windows is different from the threat landscape for SQL Server and Exchange, which is different from the threat landscape for the various Live products, and it's entirely different for our internal IT processes.  All of these groups use threat modeling, and they use the core mechanisms in similar ways, but because each group that does threat modeling has different threats and different risks, the process plays out differently for each team.

    If your team decides to adopt threat modeling, you need to consider how it applies to your components and adapt the process accordingly.  Threat Modeling is absolutely not a one-size-fits-all process, but it IS an invaluable tool.

     

    EDIT TO ADD: Adam Shostack on the Threat Modeling Team at Microsoft pointed out that the threat modeling team has a developer position open.  You can find more information about the position by going here:  http://members.microsoft.com/careers/search/default.aspx and searching for job #207443.

    [1] Someone posting a comment on Bruce Schneier's blog took me to task for using a browser vulnerability.  I chose that particular vulnerability because it was the first that came to mind.  I could have just as easily picked the DMG loading logic in OSX or the .ANI file code in Windows for examples (actually the DMG file issues are in several ways far more interesting than the firefoxurl issue - the .ANI file issue is actually relatively boring from a threat modeling standpoint).

    What's wrong with this code, part 21 - A Psychic Debugging Example - The answers.

    So for the past couple of posts, I've been walking through a psychic debugging experience I had over the weekend.

    As I presented the problem, there were three pieces of information needed to debug the problem.

    An interface:

    class IPsychicInterface
    {
    public:
        virtual bool DoSomeOperation(int argc, _TCHAR *argv[]) = 0;
    };

    A test application:

    int _tmain(int argc, _TCHAR* argv[])
    {
        register int value1 = 1;
        IPsychicInterface *psychicInterface = GetPsychicInterface();
        register int value2 = 2;

        psychicInterface->DoSomeOperation(argc, argv);
        assert(value1 == 1);
        assert(value2 == 2);
        return 0;
    }

    and some assembly language code:

    0040106B  pop         edi 
    0040106C  pop         ebx 
    0040106D  mov         al,1
    0040106F  pop         esi 
    00401070  ret         0Ch 

    As I mentioned in my last post, the problem was tracked down to a stack imbalance when calling the DoSomeOperation method, and when I saw the postamble for DoSomeOperation, I quickly realized the answer to the problem.

    There are essentially 4 separate calling conventions supported by Microsoft's compilers - it turns out that you can figure out several of them from just looking at the code.  For the stdcall and thiscall calling conventions, input parameters are pushed onto the stack, and the callee is responsible for cleaning up the stack (this contrasts with the cdecl calling convention where the caller is responsible for cleaning the stack).  From the postamble, we know that this function is either a stdcall or a thiscall function, since the "ret" instruction adjusts the stack.

    I've already stated that this is x86 code, and the RET 0CH indicates that the routine pops 12 bytes of parameters off the stack.  This is clearly a problem, because the DoSomeOperation routine only takes two parameters (which would take 8 bytes; for thiscall, the "this" pointer travels in the ECX register, not on the stack). The RET 0CH implies that the implementation of DoSomeOperation took 3 parameters!
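
    The byte counting is simple enough to write down.  This is a sketch under the assumptions already in play here (32-bit x86, a callee-cleans convention, and every argument an int or a pointer, so one 4 byte stack slot each); the names are mine.

```cpp
#include <cassert>
#include <cstddef>

// On 32-bit x86, each int or pointer argument occupies one 4-byte stack slot.
constexpr std::size_t kStackSlotBytes = 4;

// Bytes that a callee-cleans function (stdcall, or thiscall with "this"
// in ECX) removes with its "ret N" instruction, given the number of
// slot-sized arguments passed on the stack.
constexpr std::size_t callee_cleanup_bytes(std::size_t stack_args) {
    return stack_args * kStackSlotBytes;
}

// The caller's view of DoSomeOperation: (argc, argv).
static_assert(callee_cleanup_bytes(2) == 0x08, "two arguments pop 8 bytes");
// What the disassembly shows: ret 0Ch.
static_assert(callee_cleanup_bytes(3) == 0x0C, "three arguments pop 12 bytes");
```

    The caller pushed 8 bytes and the callee popped 12, so ESP ends up 4 bytes off after the call, which is the stack imbalance.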

     

    This implies that we're dealing with a violation of the one definition rule (ODR).  The One Definition Rule is a part of the C++ standard (section 3.2) which states: "No translation unit shall contain more than one definition of any variable, function, class type, enumeration type or template."  The same section also requires that when a class is defined in more than one translation unit, every definition be identical.  In other words, when separate object files use the same class or function, you need to make sure that they were all compiled against the same definition.

    Most commonly, ODR violations show up when you change a header file but don't rebuild all the source files that depend on that file - there's a ton of work that's been done to automatically manage dependencies to avoid this particular issue.

    And if you look at the source code for the PsychicInterface logic, you'll see the problem immediately:

    class IPsychicInterface
    {
    public:
        virtual bool DoSomeOperation(int argc, _TCHAR *argv[], _TCHAR *envp[]) = 0;
    };

    bool CPsychicInterface::DoSomeOperation(int argc, _TCHAR *argv[], _TCHAR *envp[])
    {
        int count = argc;
        while (count--)
        {
            printf("%S", argv[count]);
        }
        return true;
    }

    The PsychicInterface code has its own private definition of IPsychicInterface which doesn't match the definition in the test application.

    Obviously this is an utterly contrived example.  The real problem was much more complicated than this - the violation was in an export from a DLL, and involved external components, which made this more complicated.  In many ways, it was similar to the problem that Raymond talked about here (except in this case, we're in a position to fix the code involved).

    What's wrong with this code, Part 21 - A psychic debugging example: The missing piece

    As I mentioned yesterday, one of the other developers in my group had hit a sticky problem, and he asked me for my opinion on what was going wrong.

    There were 3 pieces of information that I needed to diagnose the problem, and I gave you two of them yesterday:

    The interface:

    class IPsychicInterface
    {
    public:
        virtual bool DoSomeOperation(int argc, _TCHAR *argv[]) = 0;
    };

    And the test application:

    int _tmain(int argc, _TCHAR* argv[])
    {
        register int value1 = 1;
        IPsychicInterface *psychicInterface = GetPsychicInterface();
        register int value2 = 2;

        psychicInterface->DoSomeOperation(argc, argv);
        assert(value1 == 1);
        assert(value2 == 2);
        return 0;
    }

    Originally the problem was that the ESI register was being trashed.  Since the C and C++ calling conventions require that the ESI register be preserved across a call, that narrowed the failure down to three possible causes:

    • Somewhere inside DoSomeOperation, there was a buffer overflow on the stack that caused the saved version of ESI to be corrupted.  This was actually my first thought.
    • Somewhere inside DoSomeOperation, there was a stack imbalance, which would cause garbage to be restored when the ESI register was popped off the stack.  Normally the compiler catches these errors, so I originally discounted this possibility.
    • There was a horrible compiler bug or OS bug which caused the register to be trashed (which is extraordinarily unlikely (but has happened)).

    The other developer had chased the problem down further and realized that there was a stack imbalance on the call to DoSomeOperation.  There are basically two things that can cause a stack imbalance; most of the people who left comments on the original post caught one of them, and some caught the other:

    • A calling convention mismatch.
    • A parameter declaration mismatch.

    But I didn't have enough information to figure out which of the two it was.  That's when he gave me the final piece that let me accurately figure out what was going wrong.

    The final piece was the last bit of assembly language in the DoSomeOperation function:

    0040106B  pop         edi 
    0040106C  pop         ebx 
    0040106D  mov         al,1
    0040106F  pop         esi 
    00401070  ret         0Ch 

    Now that you have the last piece, what was the bug?  Be specific - we already know that the problem is a stack imbalance, but what's the root cause?

    For a bonus, why didn't the compiler catch it?

    Threat Modeling Again, Threat Modeling Rules of Thumb

    I wrote this piece up for our group as we entered the most recent round of threat models.  I've cleaned it up a bit (removing some Microsoft-specific stuff), and there's stuff that's been talked about before, but the rest of the document is pretty relevant. 

     

    ---------------------------------------

    As you go about filling in the threat model threat list, it’s important to consider the consequences of entering threats and mitigations.  While it can be easy to find threats, it is important to realize that all threats have real-world consequences for the development team.

    At the end of the day, this process is about ensuring that our customers’ machines aren’t compromised. When we’re deciding which threats need mitigation, we concentrate our efforts on those where the attacker can cause real damage.

     

    When we’re threat modeling, we should ensure that we’ve identified as many of the potential threats as possible (even if you think they’re trivial). At a minimum, the threats we list that we choose to ignore will remain in the document to provide guidance for the future.

     

    Remember that the feature team can always decide that we’re ok with accepting the risk of a particular threat (subject to the SDL security review process). But we want to make sure that we mitigate the right issues.

    To help you guide your thinking about what kinds of threats deserve mitigation, here are some rules of thumb that you can use while performing your threat modeling.

    1. If the data hasn’t crossed a trust boundary, you don’t really care about it.

    2. If the threat requires that the attacker is ALREADY running code on the client at your privilege level, you don’t really care about it.

    3. If your code runs with any elevated privileges (even if your code runs in a restricted svchost instance) you need to be concerned.

    4. If your code invalidates assumptions made by other entities, you need to be concerned.

    5. If your code listens on the network, you need to be concerned.

    6. If your code retrieves information from the internet, you need to be concerned.

    7. If your code deals with data that came from a file, you need to be concerned (these last two are the inverses of rule #1).

    8. If your code is marked as safe for scripting or safe for initialization, you need to be REALLY concerned.

     

    Let’s take each of these in turn, because there are some subtle distinctions that need to be called out.

    If the data hasn’t crossed a trust boundary, you don’t really care about it.

    For example, consider the case where a hostile application passes bogus parameters into our API. In that case, the hostile application lives within the same trust boundary as the application, so you can simply certify the threat. The same thing applies to window messages that you receive. In general, it’s not useful to enumerate threats within a trust boundary. [Editor’s Note: Yesterday, David LeBlanc wrote an article about this very issue - I 100% agree with what he says there.]

    But there’s a caveat (of course there’s a caveat, there’s ALWAYS a caveat). Just because your threat model diagram doesn't have a trust boundary on it, it doesn't mean that the data being validated hasn't crossed a trust boundary on the way to your code.

    Consider the case of an application that takes a file name from the network and passes that filename into your API. And further consider the case where your API has an input validation bug that causes a buffer overflow. In that case, it’s YOUR responsibility to fix the buffer overflow – an attacker can use the innocent application to exploit your code. Before you dismiss this issue as being unlikely, consider CVE-2007-3670. The Firefox web browser allowed the user to execute scripts passed in on the command line, and registered a URI handler named “firefoxurl” with the OS with the start action being “firefox.exe %1” (this is a simplification). The attacker simply included a “firefoxurl:<javascript>” in a URL and was able to successfully take ownership of the client machine. In this case, the Firefox browser assumed that there was no trust boundary between firefox.exe and the invoker, but it didn’t realize that it introduced such a trust boundary when it created the “firefoxurl” URI handler.
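
    As an illustration of the first half of that scenario, here's a minimal sketch of an API that refuses an oversized file name instead of overflowing a fixed buffer.  The function name, the 260 character limit and the buffer layout are all assumptions made up for the example, not anything from a real API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical fixed-size destination used by an imagined API.
constexpr std::size_t kMaxNameChars = 260;

// Copies `name` into `dest` only if it fits, terminator included.
// Returns false rather than writing past the end of the buffer.
bool copy_file_name(char (&dest)[kMaxNameChars], const std::string& name) {
    if (name.size() >= kMaxNameChars) return false;          // no room for the terminator
    if (name.find('\0') != std::string::npos) return false;  // embedded NUL would silently truncate
    std::memcpy(dest, name.c_str(), name.size() + 1);
    return true;
}
```

    The check lives in the API itself because the API can't know whether its caller got the name from a trusted source or straight off the network.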

    If the threat requires that the attacker is ALREADY running code on the client at your privilege level, you don’t really care about it.

    For example, consider the case where a hostile application writes values into a registry key that’s read by your component. Writing those keys requires that there be some application currently running code on the client, which requires that the bad guy first be able to get code to run on the client box.

    While the threats associated with this are real, it’s not that big a problem and you can probably state that you aren’t concerned by those threats because they require that the bad guy run code on the box (see Immutable Law #1: “If a bad guy can persuade you to run his program on your computer, it’s not your computer anymore”).

    Please note that this item has a HUGE caveat: it ONLY applies if the attacker’s code is running at the same privilege level as your code. If that’s not the case, you have the next rule of thumb:

    If your code runs with any elevated privileges, you need to be concerned.

    We DO care about threats that cross privilege boundaries. That means that any data communication between an application and a service (which could be an RPC, it could be a registry value, it could be a shared memory region) must be included in the threat model.

    Even if you’re running in a low privilege service account, you still may be attacked – one of the privileges that all services get is the SE_IMPERSONATE_NAME privilege. This is actually one of the more dangerous privileges on the system because it can allow a patient attacker to take over the entire box. Ken “Skywing” Johnson wrote about this in a couple of posts (1 and 2) on his excellent blog Nynaeve. David LeBlanc has a subtly different take on this issue (see here), but the reality is that both David and Ken agree more than they disagree on this issue. If your code runs as a service, you MUST assume that you’re running with elevated privileges. This applies to all data read – rule #2 (requiring an attacker to run code) does not apply when you cross privilege levels, because the attacker could be writing code under a low privilege account to enable an elevation of privilege attack.

    In addition, if your component has a use scenario that involves running the component elevated, you also need to consider that in your threat modeling.

    If your code invalidates assumptions made by other entities, you need to be concerned

    The reason that the firefoxurl problem listed above was such a big deal was that the firefoxurl handler invalidated some of the assumptions made by the other components of Firefox. When the Firefox team threat modeled firefox, they made the assumption that Firefox would only be invoked in the context of the user.  As such it was totally reasonable to add support for executing scripts passed in on the command line (see rule of thumb #1).  However, when they threat modeled the firefoxurl: URI handler implementation, they didn’t consider that they had now introduced a trust boundary between the invoker of Firefox and the Firefox executable.  

    So you need to be aware of the assumptions of all of your related components and ensure that you’re not changing those assumptions. If you are, you need to ensure that your change doesn’t introduce issues.

    If your code retrieves information from the internet, you need to be concerned

    The internet is a totally untrusted resource (no duh). But this has profound consequences when threat modeling. All data received from the Internet MUST be treated as totally untrusted and must be subject to strict validation.

    If your code deals with data that came from a file, then you need to be concerned.

    In the previous section, I talked about data received over the internet. Microsoft has issued several bulletins this year that required an attacker to trick a user into downloading a specially crafted file over the internet; as a consequence, ANY file data must be treated as potentially malicious. For example, MS07-047 (a vulnerability in WMP) required that the attacker force the user to view a specially crafted WMP skin. The consequence of this is that ANY file parsed by our code MUST be treated as coming from a lower level of trust.

    Every single file parser MUST treat its input as totally untrusted. MS07-047 is only one example of an MSRC vulnerability; there have been others. Any code that reads data from a file MUST validate the contents. It also means that we need to work to ensure that we have fuzzing in place to validate our mitigations.
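
    To make the "validate, then fuzz the validation" idea concrete, here is a minimal sketch. The record format (a 4-byte magic, a 2-byte length, a payload) and the names parse_record and fuzz are invented for illustration; they aren't taken from any real parser. The harness mutates a known-good input and treats anything other than a clean ValueError rejection as a failure.

```python
import random
import struct

MAGIC = b"REC1"  # hypothetical 4-byte file signature, for illustration only


def parse_record(data: bytes) -> bytes:
    """Parse MAGIC + 2-byte little-endian length + payload.

    Every malformed input is rejected with ValueError.
    """
    if len(data) < 6:
        raise ValueError("truncated header")
    if data[:4] != MAGIC:
        raise ValueError("bad magic")
    (length,) = struct.unpack_from("<H", data, 4)
    if length != len(data) - 6:
        raise ValueError("length field does not match payload size")
    return data[6:]


def fuzz(iterations: int, seed: int = 0) -> tuple[int, int]:
    """Feed mutated inputs to parse_record and return (accepted, rejected).

    Any exception other than ValueError propagates; that is the failure signal.
    """
    rng = random.Random(seed)
    valid = MAGIC + struct.pack("<H", 5) + b"hello"
    accepted = rejected = 0
    for _ in range(iterations):
        sample = bytearray(valid)
        # Overwrite one to three random bytes with random values.
        for _ in range(rng.randint(1, 3)):
            sample[rng.randrange(len(sample))] = rng.randrange(256)
        # Randomly truncate the input (possibly to zero bytes).
        sample = sample[: rng.randint(0, len(sample))]
        try:
            parse_record(bytes(sample))
            accepted += 1
        except ValueError:
            rejected += 1
    return accepted, rejected
```

    Calling fuzz(2000) should return without raising; if the length check in parse_record were removed, struct.error or an out-of-range read would surface here instead of in front of a customer.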

    And the problem goes beyond file parsers directly. Any data that can possibly be read from a file cannot be trusted. <A senior developer in our division> brings up the example of a codec as a perfect example. The file parser parses the container and determines that the container isn't corrupted. It then extracts the format information and finds the appropriate codec for that format. The parser then loads the codec and hands the format information and file data to the codec.

    The only thing that the codec knows is that the format information that’s been passed in is valid. That’s it. Beyond the fact that the format information is of an appropriate size and has a verifiable type, the codec can make no assumptions about the contents of the format information, and it can make no assumptions about the file data. Even though the codec doesn’t explicitly parse the file, it’s still dealing with untrusted data read from the file.

    If your code is marked as “Safe For Scripting” or “Safe for Initialization”, you need to be REALLY concerned.

    If your code is marked as “Safe For Scripting” (or if your code can be invoked from a control that is marked as Safe For Scripting), it means that your code can be executed in the context of a web browser, and that in turn means that the bad guys are going to go after your code. There have been way too many MSRC bulletins about issues with ActiveX controls.

    Please note that some of the issues with ActiveX controls can be quite subtle. For instance, in MS02-032 we had to issue an MSRC fix because one of the APIs exposed by the WMP OCX returned a different error code depending on whether a path passed into the API was a file or a directory. That constituted an Information Disclosure vulnerability, and an attacker could use it to map out the contents of the user's hard disk.
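
    A rough sketch of that class of bug, and of one way to close it, might look like the following. The status codes and function names are hypothetical and are not the WMP OCX's actual API; the point is only that the "leaky" version lets a script tell a directory from a missing path, while the "uniform" version collapses every failure outside an allowed directory into one answer.

```python
import os
import tempfile

# Hypothetical status codes, for illustration only.
OK = "OK"
E_IS_DIRECTORY = "E_IS_DIRECTORY"
E_NOT_FOUND = "E_NOT_FOUND"
E_FAIL = "E_FAIL"


def open_media_leaky(path: str) -> str:
    """Distinct failure codes let a caller tell directories and missing paths apart."""
    if os.path.isdir(path):
        return E_IS_DIRECTORY
    if not os.path.exists(path):
        return E_NOT_FOUND
    return OK


def open_media_uniform(path: str, allowed_dir: str) -> str:
    """Only files inside allowed_dir succeed; every other outcome is E_FAIL."""
    real = os.path.realpath(path)
    root = os.path.realpath(allowed_dir)
    try:
        inside = os.path.commonpath([real, root]) == root
    except ValueError:  # e.g. paths on different drives
        inside = False
    if inside and os.path.isfile(real):
        return OK
    return E_FAIL


# Probe a private directory and a missing path, as a hostile script might.
with tempfile.TemporaryDirectory() as tmp:
    private_dir = os.path.join(tmp, "private")
    media_dir = os.path.join(tmp, "media")
    os.mkdir(private_dir)
    os.mkdir(media_dir)
    missing = os.path.join(tmp, "missing")
    leaky = {open_media_leaky(private_dir), open_media_leaky(missing)}
    uniform = {
        open_media_uniform(private_dir, media_dir),
        open_media_uniform(missing, media_dir),
    }
```

    Here leaky ends up holding two different codes (enough to map the disk one probe at a time), while uniform holds only E_FAIL.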

    In conclusion

    Vista raised the security bar for attackers significantly. As Vista adoption spreads, attackers will be forced to find new ways to exploit our code. That means that it’s more and more important to ensure that we do a good job ensuring that they have as few opportunities as possible to make life difficult for our customers.  The threat modeling process helps us understand the risks associated with our features and understand where we need to look for potential issues.

    Threat Modeling Again, Threat Modeling and the firefoxurl issue.

    Yesterday I presented my version of the diagrams for Firefox's command line handler and the IE/URLMON's URL handler.  To refresh, here they are again:

     Here's my version of Firefox's diagram:

     And my version of IE/URLMON's URL handler diagram:

     

    As I mentioned yesterday, even though there's a trust boundary between the user and Firefox, my interpretation of the original design for the Firefox command line parsing says that this is an acceptable risk[1], since there is nothing that the user can specify via the chrome engine that they can't do from the command line.  In the threat model for the Firefox command line parsing, this assumption should be called out, since it's important.

     

    Now let's think about what happens when you add the firefoxurl URL handler to the mix.

     

    For that, you need to go to the IE/URLMON diagram.  There's a clear trust boundary between the web page and IE/URLMON.  That trust boundary applies to all of the data passed in via the URL, and all of the data should be considered "tainted".  If your URL handler is registered using the "shell" key, then IE passes the URL to the shell, which launches the program listed in the "command" verb replacing the %1 value in the command verb with the URL specified (see this for more info)[2].  If, on the other hand, you've registered an asynchronous protocol handler, then IE/URLMON will instantiate your COM object and will give you the ability to validate the incoming URL and to change how IE/URLMON treats the URL.  Jesper discusses this in his post "Blocking the FIrefox".

    The key thing to consider is that if you use the "shell" registration mechanism (which is significantly easier than using the asynchronous protocol handler mechanism), IE/URLMON is going to pass that tainted data to your application on the command line.

     

    Since the firefoxurl URL handler used the "shell" registration mechanism, it means that the URL from the internet is going to be passed to Firefox's command line handler.  But this violates the assumption that the Firefox command line handler made - they assume that their command line was authored with the same level of trust as the user invoking firefox.  And that's a problem, because now you have a mechanism for any internet site to execute code on the browser client with the privileges of the user.
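
    The mechanics are easy to reproduce in a few lines. This is a sketch under stated assumptions: the command template is the one quoted for the firefoxurl registration (minus the install path), the hostile URL is invented, and shlex.split stands in for the target's command line parser; the real Windows quoting rules differ in detail, but the quote-breaking effect is the same.

```python
import shlex
from urllib.parse import quote

# Command template from the "shell" registration, without the install path.
TEMPLATE = 'firefox.exe -url "%1" -requestPending'


def build_command_line(url: str) -> str:
    """Mimic the shell's purely textual substitution of %1 with the URL."""
    return TEMPLATE.replace("%1", url)


def parse_args(command_line: str) -> list[str]:
    """Stand-in for the target's argument parser (shlex, not the Windows rules)."""
    return shlex.split(command_line)


# A hypothetical hostile URL: the embedded quote closes the -url argument early.
hostile = 'firefoxurl:x" -chrome "javascript:alert(1)'

raw_args = parse_args(build_command_line(hostile))
encoded_args = parse_args(build_command_line(quote(hostile, safe=":/")))
```

    In raw_args the attacker's text has become a separate -chrome switch with its own value, which is exactly the input the command line handler assumed only the user could supply. In encoded_args the quote characters are percent-encoded and the URL stays a single argument, but it is still attacker-controlled data arriving on a channel that was assumed to be trusted.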

     

    How would a complete threat model have shown that there was an issue?  The Firefox command line threat model showed that there was a potential issue, and the threat analysis of that potential issue showed that the threat was an accepted risk.

    When the firefoxurl feature was added, the threat model analysis of that feature should have looked similar to the IE/URLMON threat model I called out above - IE/URLMON took the URL from the internet, passed it through the shell and handed it to Firefox (URL Handler above).  

     

    So how would threat modeling have helped to find the bug?

    There are two possible things that could have happened next.  When the firefoxurl handler team[3] analyzed their threat model, they would have realized that they were passing high risk data (all data from the internet should be treated as untrusted) to the command line of the Firefox application.  That should have immediately raised red flags because of the risk associated with the data.

    At this point in their analysis, the firefoxurl handler team needed to confirm that their behavior was safe, which they could do either by asking someone on the Firefox command line handling team or by consulting the Firefox command line handling threat model (or both).  At that point, they would have discovered the important assumption I mentioned above, and they would have realized that they had a problem that needed to be mitigated.  The actual form of the mitigation doesn't matter for the purposes of this discussion - I believe that the Firefox command line handling team removed their assumption, but I honestly don't know.

     

    As I mentioned in my previous post, I love this example because it dramatically shows how threat modeling can help solve real world security issues.

    I don't believe that anything in my analysis above is contrived - the issues I called out above directly follow from the threat modeling process I've outlined in the earlier posts. 

    I've been involved in the threat modeling process here at Microsoft for quite some time now, and I've seen the threat model analysis process find this kind of issue again and again.  The threat model either exposes areas where a team needs to be concerned about their inputs or it forces teams to ask questions about their assumptions, which in turn exposes potential issues like this one (or confirms that in fact there is no issue that needs to be mitigated).

     

    Next: Threat Modeling Rules of thumb.

     

    [1] Obviously, I'm not a contributor to Firefox and as such any and all of my comments about Firefox's design and architecture are at best informed guesses.  I'd love it if someone who works on Firefox or has contributed to the security analysis of Firefox would correct any mistakes I'm making here.

    [2] Apparently IE/URLMON doesn't URLEncode the string that it hands to the URL handler - I don't know why it does that (probably for compatibility reasons), but that isn't actually relevant to this discussion (especially since all versions of Firefox before 2.0.0.6 seem to have had the same behavior as IE).  Even if IE had URL encoded the URL before handing it to the handler, Firefox is still being handed untrusted input, which violates a critical assumption made by the Firefox command line handler developers.

    [3] Btw, I'm using the term "team" loosely.  It's entirely possible that the same one individual did both the Firefox command line handling work AND the firefoxurl protocol handler - it doesn't actually matter.

    Threat Modeling Again, Threat Modeling in Practice

    I've been writing a LOT about threat modeling recently but one of the things I haven't talked about is the practical value of the threat modeling process.

    Here at Microsoft, we've totally drunk the threat modeling Kool-Aid.  One of Adam Shostack's papers on threat modeling has the following quote from Michael Howard:

    "If we had our hands tied behind our backs (we don't) and could do only one thing to improve software security... we would do threat modeling every day of the week."

    I want to talk about a real-world example of a security problem where threat modeling would have hopefully avoided a potential problem.

    I happen to love this problem, because it does a really good job of showing how the evolution of complicated systems can introduce unexpected security problems.  The particular issue I'm talking about is known as CVE-2007-3670.  I seriously recommend people go to the CVE site and read the references to the problem; they provide an excellent background on the problem.

    CVE-2007-3670 describes a vulnerability in the Mozilla Firefox browser that uses Internet Explorer as an exploit vector. There's been a TON written about this particular issue (see the references on the CVE page for most of the discussion), and I don't want to go into the pros and cons of whether or not this is an IE or a Firefox bug.  I only want to discuss this particular issue from a threat modeling standpoint.

    There are four components involved in this vulnerability, each with their own threat model:

    • The Firefox browser.
    • Internet Explorer.
    • The "firefoxurl:" URI registration.
    • The Windows Shell (explorer).

    Each of the components in question plays a part in the vulnerability.  Let's take them in turn.

    • The Firefox browser provides a command line argument "-chrome" which allows you to load the chrome specified at a particular location.
    • Internet Explorer provides an extensibility mechanism which allows 3rd parties to register specific URI handlers.
    • The "firefoxurl:" URL registration uses the simplest form of URL handler registration, which simply instructs the shell to execute "<firefoxpath>\firefox.exe -url "%1" -requestPending".  Apparently this was added to Firefox to allow web site authors to force the user to use Firefox when viewing a link.  I believe the "-url" switch (which isn't included in the list of firefox command line arguments above) instructs firefox to treat the contents of %1 as a URL.
    • The Windows Shell which passes on the command line to the firefox application.

    I'm going to attempt to draw the relevant part of the diagrams for IE and Firefox.  These are just my interpretations of what is happening, it's entirely possible that the dataflow is different in real life.

    Firefox:

    image

    This diagram shows the flow of control from the user into Firefox (remember: I'm JUST diagramming a small part of the actual component diagram).  One of the things that makes Firefox's chrome engine so attractive is that it's easy to modify the chrome because the Firefox chrome is simply javascript.  Since the javascript being run runs with the same privileges as the current user, this isn't a big deal - there's no opportunity for elevation of privilege there.  But there is one important thing to remember here: Firefox has a security assumption that the -chrome command switch is only provided by the user - because it executes javascript with full trust, it effectively accepts executable code from the command line.

     

    Internet Explorer:

    image

    This diagram describes my interpretation of how IE (actually urlmon.dll in this case) handles incoming URLs.  It's just my interpretation, based on the information contained here (at a minimum, I suspect it's missing some trust boundaries).  The web page hands IE a URL, IE looks the URL up in the registry and retrieves a URL handler.  Depending on how the URL handler was registered, IE either invokes the shell on the path portion of the URL, or, if the URL handler was registered as an async protocol handler, it hands the URL to the async protocol handler.

    I'm not going to do a diagram for the firefoxurl handler or the shell, since they're either not interesting or are covered in the diagram above - in the firefoxurl handler case, the firefoxurl handler is registered as being handled by the shell.  In that case, Internet Explorer will pass the URL into the shell, which will happily pass it on to the URL handler (which, in this case, is Firefox).
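
    As a toy model of that dispatch (my own simplification, not how urlmon.dll is actually written), the two registration styles can be sketched like this. The HANDLERS dictionary stands in for the registry, and the "example" scheme and its handler are made up for illustration.

```python
from urllib.parse import urlsplit

# Toy stand-in for the registry: scheme -> registration. Entries are illustrative.
HANDLERS = {
    "firefoxurl": {
        "kind": "shell",
        "command": 'firefox.exe -url "%1" -requestPending',
    },
    "example": {
        "kind": "async",
        "handler": lambda url: ("validated", url),  # hypothetical handler object
    },
}


def dispatch(url: str):
    """Return what the toy URL layer would do with a URL handed over by a web page."""
    scheme = urlsplit(url).scheme
    entry = HANDLERS.get(scheme)
    if entry is None:
        return ("unhandled", url)
    if entry["kind"] == "shell":
        # Shell registrations get the URL pasted into a command line verbatim.
        return ("shell", entry["command"].replace("%1", url))
    # Async protocol handlers receive the URL as data and can inspect it first.
    return ("async", entry["handler"](url))
```

    The difference that matters for the threat model is visible in the two branches: the shell path turns web page data into a command line, while the async path hands it to code that has a chance to validate it.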

     

    That's a lot of text and pictures.  Tomorrow I'll discuss what I think went wrong and how using threat modeling could have avoided the issue.  I also want to look at BOTH of the threat models and see what they indicate.

     

    Obviously, the contents of this post are my own opinion and in no way reflect the opinions of Microsoft.

    Threat Modeling Again, Presenting the PlaySound Threat Model

    It's been a long path, but we're finally at the point where I can present the threat model for PlaySound.  None of the information in this post is new; it's all pulled from previous posts.

     ----------------

    PlaySound Threat Model

    The PlaySound API is a high level multimedia API intended to render system sounds ("dings").  It has three major modes of operation:

    • It can play the contents of a .WAV file passed in as a parameter to the API.
    • It can play the contents of a Win32 resource or other memory location passed in as a parameter to the API.
    • It can play the contents of a .WAV file referenced by an alias.  If this mode is chosen, it reads the filename from the registry under HKCU\AppEvents.

     For more information on the PlaySound API and its options, see: The MSDN documentation for PlaySound.

    PlaySound Diagram

    The PlaySound API's data flow can be represented as follows.

    PlaySound Elements

     

    1. Application: External Interactor - The application which calls the PlaySound API.
    2. PlaySound: Process - The code that represents the PlaySound API
    3. WAV file: Data Store - The WAV file to be played, on disk or in memory
    4. HKCU Sound Aliases: Data Store - The Windows Registry under HKCU\AppEvents which maps from aliases to WAV filenames
    5. Audio Playback APIs: External Interactor - The audio playback APIs used for PlaySound.  This could be MediaFoundation, waveOutXxxx, DirectShow, or any other audio rendering system.
    6. PlaySound Command (Application->PlaySound): DataFlow (Crosses Threat Boundary) - The data transmitted in this data flow represents the filename to play, the alias to look up or the resource ID in the current executable to play.
    7. WAVE Header (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - The data transmitted in this data flow represents the WAVEFORMATEX structure contained in the WAV file being played.
    8. WAV file Data (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - The data transmitted in this data flow represents the actual audio samples contained in the WAV file being played.
    9. WAV filename (HKCU Sound Aliases -> PlaySound) : DataFlow (Crosses Threat Boundary) - The data transmitted in this data flow represents the contents of the HKCU\AppEvents\Schemes\.Default\<sound>\.Current[1]
    10. WAV file Data (PlaySound -> Audio Playback APIs): DataFlow - The data transmitted in this data flow represents both the WAVEFORMATEX structure read from the WAV file and the audio samples read from the file.

    PlaySound Threat Analysis

    Data Flows

    WAVE Header (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Tampering
    WAVE Header (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Information Disclosure
    WAVE Header (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Denial of Service
    WAV file Data (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Tampering
    WAV file Data (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Information Disclosure
    WAV file Data (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Denial of Service
    WAV filename (HKCU Sound Aliases -> PlaySound) : DataFlow (Crosses Threat Boundary) - Tampering
    WAV filename (HKCU Sound Aliases -> PlaySound) : DataFlow (Crosses Threat Boundary) - Information Disclosure
    WAV filename (HKCU Sound Aliases -> PlaySound) : DataFlow (Crosses Threat Boundary) - Denial of Service
    WAV file Data (PlaySound -> Audio Playback APIs): DataFlow - Tampering
    WAV file Data (PlaySound -> Audio Playback APIs): DataFlow - Information Disclosure
    WAV file Data (PlaySound -> Audio Playback APIs): DataFlow - Denial of Service

    Because all the data flows are within a single process boundary, there are no meaningful threats to those dataflows - the Win32 process model protects against those threats.

    External Interactors

    Application: External Interactor - Spoofing

    It doesn't matter which application called the PlaySound API, so we don't care about spoofing threats to the application calling PlaySound.

    Application: External Interactor - Repudiation

    There are no requirements that the PlaySound API protect against Repudiation attacks.

    Audio Playback APIs: External Interactor - Spoofing

    The system default APIs are protected by Windows Resource Protection so they cannot be replaced.  If an attacker does successfully inject his logic (by overriding the COM registration for the audio APIs or by some other means), the attacker is running at the same privilege level as the user, so can do nothing that the user can't do.

    Audio Playback APIs: External Interactor - Repudiation

    There are no requirements that the PlaySound API protect against Repudiation attacks.

    Data Stores

    Since the data stores involved in the PlaySound API are under the control of the user, we must protect against threats to those data stores.

    WAV file: Data Store - Tampering

    An attacker can modify the contents of the WAV file data store.  To mitigate this attack, we will validate the WAVE header information; we're not going to check the actual WAV data, since it's just raw audio samples.  Bug #XXXX filed to validate this mitigation.
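
    As an illustration of what "validate the WAVE header information" could mean in practice, here is a minimal sketch for the PCM case. It is not the PlaySound implementation: the function name, the channel and sample rate limits, and the PCM-only restriction are assumptions made for the example. The field layout follows the first 16 bytes of WAVEFORMATEX.

```python
import struct

WAVE_FORMAT_PCM = 1
# wFormatTag, nChannels, nSamplesPerSec, nAvgBytesPerSec, nBlockAlign, wBitsPerSample
FMT_STRUCT = struct.Struct("<HHIIHH")


def validate_pcm_format(fmt: bytes) -> dict:
    """Check a PCM format block for internal consistency; raise ValueError if it fails."""
    if len(fmt) < FMT_STRUCT.size:
        raise ValueError("format block too short")
    tag, channels, rate, avg_bytes, block_align, bits = FMT_STRUCT.unpack_from(fmt)
    if tag != WAVE_FORMAT_PCM:
        raise ValueError("only PCM is handled in this sketch")
    if not 1 <= channels <= 8:  # assumed limit
        raise ValueError("implausible channel count")
    if not 8000 <= rate <= 192000:  # assumed limit
        raise ValueError("implausible sample rate")
    if bits not in (8, 16, 24, 32):
        raise ValueError("unsupported bit depth")
    # For PCM the derived fields must agree with the basic ones.
    if block_align != channels * bits // 8:
        raise ValueError("block align inconsistent with channels and bit depth")
    if avg_bytes != rate * block_align:
        raise ValueError("byte rate inconsistent with sample rate and block align")
    return {"channels": channels, "rate": rate, "bits": bits, "block_align": block_align}
```

    A header claiming 2 channels of 16-bit audio with a block align of 0, for example, is rejected here rather than being handed to code that might divide by that field.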

    WAV file: Data Store - Information Disclosure

    The WAV file is protected by NT's filesystem ACLs which prevent unauthorized users from reading the contents of the file.

    WAV file: Data Store - Repudiation

    Repudiation threats don't apply to this store.

    WAV file: Data Store - Denial of Service

    The PlaySound API will check for errors when reading from the store and will return an error indication to its caller (if possible). When PlaySound is running in the "resource or memory location" mode and the SND_ASYNC flag is specified, the caller may unmap the virtual memory associated with the WAV file.  In that case, PlaySound may hit an access violation while rendering the contents of the file[2].  Bug #XXXX filed to validate this mitigation.

    HKCU Sound Aliases: Data Store - Tampering

    An attacker can modify the contents of the sound aliases registry key.  To mitigate this attack, we will validate the contents of the key. Bug #XXXX filed to validate this mitigation.
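
    The threat model doesn't spell out what "validate the contents of the key" involves, so the following is only a guess at the kind of checks that might apply to a filename read from the aliases key. The function name and the ".wav" extension policy are assumptions; the 260 limit is the Windows MAX_PATH value.

```python
MAX_PATH = 260  # Windows MAX_PATH, which includes the terminating NUL


def validate_alias_value(value) -> str:
    """Return the alias filename if it looks reasonable, else raise ValueError."""
    if not isinstance(value, str):
        raise ValueError("alias value is not a string")
    if not value or len(value) >= MAX_PATH:
        raise ValueError("alias value has an unreasonable length")
    if "\x00" in value:
        raise ValueError("alias value contains an embedded NUL")
    if not value.lower().endswith(".wav"):  # assumed policy, not from the post
        raise ValueError("alias does not name a .wav file")
    return value
```

    Passing this check says nothing about the file itself; the WAV file data store has its own tampering threat, which has to be mitigated separately.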

    HKCU Sound Aliases: Data Store - Information Disclosure

    The aliases key is protected by the registry ACLs which prevent unauthorized users from reading the contents  of the key.

    HKCU Sound Aliases: Data Store - Repudiation

    Repudiation threats don't apply to this store.

    HKCU Sound Aliases: Data Store - Denial of service

    The PlaySound API will check for errors when reading from the store and will return an error indication to its caller (if possible).  Bug #XXXX filed to validate this mitigation.

    Processes

    PlaySound: Process - Spoofing

    Since PlaySound is the component we're threat modeling, spoofing threats don't apply.

    PlaySound: Process - Tampering

    The only tampering that can happen to the PlaySound process directly involves modifying the PlaySound binary on disk; if the user has the rights to do that, we can't stop them.  For PlaySound, the file is protected by Windows Resource Protection, which should protect the file from tampering.

    PlaySound: Process - Repudiation

    We don't care about repudiation threats to the PlaySound API.

    PlaySound: Process - Information Disclosure

    The NT process model prevents any unauthorized entity from reading the process memory associated with the Win32 process playing Audio, so this threat is out of scope for this component.

    PlaySound: Process - Denial of Service

    Again, the NT process model prevents unauthorized entities from crashing or interfering with the process, so this threat is out of scope for this component.

    PlaySound: Process - Elevation of Privilege

    The PlaySound API runs at the same privilege level as the application calling PlaySound, so it is not subject to EoP threats.

    PlaySound: Process - "PlaySound Command" crosses trust boundary: Elevation of Privilege/Denial of Service / Tampering

    The data transmitted by the incoming "PlaySound Command" data flow comes from an untrusted source.  Thus the PlaySound API will validate the data contained in that dataflow for "reasonableness" (mostly checking to ensure that the string passed in doesn't cause a buffer overflow).  Bug #XXXX filed to validate this mitigation.

    PlaySound: Process - "WAV file Data" data flow crosses trust boundary: Information Disclosure

    It's possible that the contents of the WAV file might be private, so if some attacker can somehow "snoop" the contents of the data they might be able to learn information they shouldn't.  Another way that this "I" attack shows up is described in CVE-2007-0675 and here.  So how do we mitigate that threat (and the corresponding threat associated with someone spoofing the audio APIs)?

    The risk associated with CVE-2007-0675 is out-of-scope for this component (if the threat is to be mitigated, it's more appropriate to handle that either in the speech recognition engine or the audio stack), so the only risk is that we might be handing the audio stack data that can be misused. 

    Since the API's entire purpose is to play the contents of the WAVE file, this particular threat is considered to be an acceptable risk.

    ---

    [1] The actual path is slightly more complicated because of SND_APPLICATION flag, but that doesn't materially change the threat model.

    [2] The DoS issues associated with this behavior are accepted risks.

    --------------

    Next: Let's look at a slightly more interesting case where threat modeling exposes an issue.

    Threat Modeling Again, Pulling the threat model together

    So I've been writing a LOT of posts about the threat modeling process and how one goes about doing the threat model analysis for a component.

    The one thing I've not talked about is what a threat model actually is.

    A threat model is a specification, just like your functional specification (a Program Management spec that defines the functional requirements of your component), your design specification (a development spec that defines the architecture that is required to implement the functional specification), and your test plan (a test spec that defines how you plan on ensuring that the design as implemented meets the requirements of the functional specification).

    Just like the functional, design and test specs, a threat model is a living document - as you change the design, you need to go back and update your threat model to see if any new threats have arisen since you started.

    So what goes into the threat model document?

    • Obviously you need the diagram and an enumeration and description of the elements in your diagram. 
    • You also need to include your threat analysis, since that's the core of the threat model.
    • For each mitigated threat that you call out in the threat analysis, you should include the bug # associated with the mitigation
    • You should probably have a one or two paragraph description of your component and what it does (it helps an outsider to understand your diagram).  Similarly, a list of contacts for questions, etc. is also quite useful.

    The third item I called out there reflects an important point about threat modeling that's often lost.

    Every time your threat model indicates that you have a need to mitigate a particular threat, you need to file at least one bug and potentially two.  The first bug goes to the developer to ensure that the developer implements the mitigation called out in the threat model, and the second bug goes to a tester to ensure that the tester either (a) writes tests to verify the mitigation or (b) runs existing tests to ensure that the mitigation is in place.
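
    One way to keep yourself honest about that follow-through is to track the bugs next to the threats and check for gaps mechanically. This is just a sketch: the Threat record, the missing_bugs helper and the "BUG-1" identifier are invented for the example, and it treats both bugs as required, which is a little stricter than "at least one and potentially two".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Threat:
    element: str
    category: str
    mitigation: Optional[str] = None  # None: accepted risk or not applicable
    dev_bug: Optional[str] = None     # bug asking a developer to implement the mitigation
    test_bug: Optional[str] = None    # bug asking a tester to verify the mitigation


def missing_bugs(threats: List[Threat]) -> List[Tuple[str, str]]:
    """List (threat label, missing bug kind) for each mitigated threat lacking follow-up."""
    gaps = []
    for t in threats:
        if t.mitigation is None:
            continue  # nothing to implement, so nothing to track
        label = f"{t.element} - {t.category}"
        if t.dev_bug is None:
            gaps.append((label, "developer bug"))
        if t.test_bug is None:
            gaps.append((label, "test bug"))
    return gaps


model = [
    Threat("WAV file: Data Store", "Tampering", "validate WAVE header", dev_bug="BUG-1"),
    Threat("Application: External Interactor", "Spoofing"),
]
gaps = missing_bugs(model)
```

    With the sample model above, gaps contains a single entry saying that the WAV file tampering mitigation has no test bug yet.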

    This last bit is really important.  If you're not going to follow through on the process and ensure that the threats that you identified are mitigated, then you're just wasting your time doing the threat model - except as an intellectual exercise, it won't actually help you improve the security of your product.

     

    Next: Presenting the PlaySound threat model!

    Threat Modeling Again, Threat Modeling PlaySound

    Finally it's time to think about threat modeling the PlaySound API.

    Let's go back to the DFD that I included in my earlier post, since everything flows from the DFD.

     

    This dataflow diagram contains a number of elements, they are:

    1. Application: External Interactor
    2. PlaySound: Process
    3. WAV file: Data Store
    4. HKCU Sound Aliases: Data Store
    5. Audio Playback APIs: External Interactor
    6. PlaySound Command (Application->PlaySound): DataFlow (Crosses Threat Boundary)
    7. WAVE Header (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary)
    8. WAV file Data (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary)
    9. WAV filename (HKCU Sound Aliases -> PlaySound) : DataFlow (Crosses Threat Boundary)
    10. WAV file Data (PlaySound -> Audio Playback APIs): DataFlow

    Now that we've enumerated the elements, we apply the STRIDE/Element methodology, which allows us to enumerate the threats that this component faces:

    1. Application: External Interactor - Spoofing
    2. Application: External Interactor - Repudiation
    3. PlaySound: Process - Spoofing
    4. PlaySound: Process - Tampering
    5. PlaySound: Process - Repudiation
    6. PlaySound: Process - Information Disclosure
    7. PlaySound: Process - Denial of Service
    8. PlaySound: Process - Elevation of Privilege
    9. WAV file: Data Store - Tampering
    10. WAV file: Data Store - Information Disclosure
    11. WAV file: Data Store - Repudiation
    12. WAV file: Data Store - Denial of Service
    13. HKCU Sound Aliases: Data Store - Tampering
    14. HKCU Sound Aliases: Data Store - Information Disclosure
    15. HKCU Sound Aliases: Data Store - Repudiation
    16. HKCU Sound Aliases: Data Store - Denial of service
    17. Audio Playback APIs: External Interactor - Spoofing
    18. Audio Playback APIs: External Interactor - Repudiation
    19. PlaySound Command (Application->PlaySound): DataFlow (Crosses Threat Boundary) - Tampering
    20. PlaySound Command (Application->PlaySound): DataFlow (Crosses Threat Boundary) - Information Disclosure
    21. PlaySound Command (Application->PlaySound): DataFlow (Crosses Threat Boundary) - Denial of Service
    22. WAVE Header (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Tampering
    23. WAVE Header (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Information Disclosure
    24. WAVE Header (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Denial of Service
    25. WAV file Data (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Tampering
    26. WAV file Data (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Information Disclosure
    27. WAV file Data (WAV file-> PlaySound) : DataFlow (Crosses Threat Boundary) - Denial of Service
    28. WAV filename (HKCU Sound Aliases -> PlaySound) : DataFlow (Crosses Threat Boundary) - Tampering
    29. WAV filename (HKCU Sound Aliases -> PlaySound) : DataFlow (Crosses Threat Boundary) - Information Disclosure
    30. WAV filename (HKCU Sound Aliases -> PlaySound) : DataFlow (Crosses Threat Boundary) - Denial of Service
    31. WAV file Data (PlaySound -> Audio Playback APIs): DataFlow - Tampering
    32. WAV file Data (PlaySound -> Audio Playback APIs): DataFlow - Information Disclosure
    33. WAV file Data (PlaySound -> Audio Playback APIs): DataFlow - Denial of Service

     Phew.  You mean that the PlaySound API can be attacked in 33 different ways?  That's unbelievable.

    It's true.  There ARE 33 ways that you can attack the PlaySound API; however, many of them are identical, and some of them are irrelevant.  That's the challenge of the next part of the process, which is the analysis phase.

    As I mentioned in the first STRIDE-per-element post, STRIDE-per-element is a framework for analysis.  That's where common sense and your understanding of the system come into focus.
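
    The enumeration step is mechanical enough to script. The sketch below reproduces the list of 33 from the element list; the mapping from element type to STRIDE categories is read off the enumeration above rather than taken from any official tool, and the names STRIDE_BY_TYPE, ELEMENTS and enumerate_threats are mine.

```python
# Which STRIDE categories apply to each element type, as used in the list above.
STRIDE_BY_TYPE = {
    "External Interactor": ["Spoofing", "Repudiation"],
    "Process": [
        "Spoofing", "Tampering", "Repudiation",
        "Information Disclosure", "Denial of Service", "Elevation of Privilege",
    ],
    "Data Store": ["Tampering", "Information Disclosure", "Repudiation", "Denial of Service"],
    "DataFlow": ["Tampering", "Information Disclosure", "Denial of Service"],
}

# The ten elements of the PlaySound data flow diagram.
ELEMENTS = [
    ("Application", "External Interactor"),
    ("PlaySound", "Process"),
    ("WAV file", "Data Store"),
    ("HKCU Sound Aliases", "Data Store"),
    ("Audio Playback APIs", "External Interactor"),
    ("PlaySound Command (Application->PlaySound)", "DataFlow"),
    ("WAVE Header (WAV file-> PlaySound)", "DataFlow"),
    ("WAV file Data (WAV file-> PlaySound)", "DataFlow"),
    ("WAV filename (HKCU Sound Aliases -> PlaySound)", "DataFlow"),
    ("WAV file Data (PlaySound -> Audio Playback APIs)", "DataFlow"),
]


def enumerate_threats(elements):
    """Cross every element with the STRIDE categories that apply to its type."""
    return [
        f"{name}: {kind} - {category}"
        for name, kind in elements
        for category in STRIDE_BY_TYPE[kind]
    ]


threats = enumerate_threats(ELEMENTS)
```

    len(threats) comes out to 33: two external interactors with 2 threats each, one process with 6, two data stores with 4 each, and five data flows with 3 each. The script only produces the list; deciding which entries matter is still the analyst's job.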

    And that's the next part in the series: Analyzing the threats enumerated by STRIDE-per-element.  This is the point at which all the previous articles come together.
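
    The enumeration above is mechanical enough to sketch in a few lines of Python.  This is only an illustration, not the tool that produced the list: the element types given to Application, PlaySound and WAV file are assumptions read off the data flow names, and the "Crosses Threat Boundary" annotation is left out.

```python
# Threat categories considered valid for each element type.
THREATS_BY_TYPE = {
    "External Interactor": ["Spoofing", "Repudiation"],
    "Process": ["Spoofing", "Tampering", "Repudiation",
                "Information Disclosure", "Denial of Service",
                "Elevation of Privilege"],
    "Data Store": ["Tampering", "Information Disclosure",
                   "Repudiation", "Denial of Service"],
    "DataFlow": ["Tampering", "Information Disclosure", "Denial of Service"],
}

# Elements of the PlaySound diagram.  The types of the first three
# entries are assumptions; the rest are taken from the list above.
ELEMENTS = [
    ("Application", "External Interactor"),   # assumed type
    ("PlaySound", "Process"),                 # assumed type
    ("WAV file", "Data Store"),               # assumed type
    ("HKCU Sound Aliases", "Data Store"),
    ("Audio Playback APIs", "External Interactor"),
    ("PlaySound Command (Application->PlaySound)", "DataFlow"),
    ("WAVE Header (WAV file-> PlaySound)", "DataFlow"),
    ("WAV file Data (WAV file-> PlaySound)", "DataFlow"),
    ("WAV filename (HKCU Sound Aliases -> PlaySound)", "DataFlow"),
    ("WAV file Data (PlaySound -> Audio Playback APIs)", "DataFlow"),
]


def enumerate_threats(elements):
    """Return one "element: type - threat" line per applicable threat."""
    return [f"{name}: {kind} - {threat}"
            for name, kind in elements
            for threat in THREATS_BY_TYPE[kind]]


threats = enumerate_threats(ELEMENTS)
for number, line in enumerate(threats, start=1):
    print(f"{number}. {line}")
```

    Under those assumptions the sketch yields 33 entries, which is why the list is so long: every element picks up every category that is valid for its type, whether or not the threat turns out to matter.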

    Threat Modeling Again, What does STRIDE have to do with threat modeling?

    In my last couple of posts, I've talked about the STRIDE categories.  As I mentioned, STRIDE provides a convenient classification mechanism for threats, and threat modeling is all about trying to identify the threats to your feature/component/whatever.

    When we first started threat modeling, we already had the idea of STRIDE categories, but we really didn't know how to apply them.  We'd go into the big threat modeling meeting and look at each of the pieces of our diagram and ask "what is the spoofing (or tampering, or whatever) threat against this component"?  We were thinking about the STRIDE categories as discrete elements, not as categories in which to collect threats.

    After a while, it became obvious that not only does this not work (again, it's very ad hoc), it also misses the point.  The point is to identify the threats and put them in the appropriate bucket so that you can work out how to mitigate each threat.

    One of the interesting aspects of threats is that they are permanent.  For a given design, the threats against that design are static: any data flow diagram has a fixed set of threats that apply to it.  There may be more than one threat in a particular category for a particular element, but every element is subject to certain threats.

    Once we had this mindset shift, we started thinking about how the STRIDE categories applied to various elements, and we came to an interesting realization.

    It turns out that some STRIDE threats only apply to particular types of elements.  If you think about it, it makes sense - for instance, an Elevation of Privilege threat doesn't apply to data stores (since a data store simply holds data, it operates at no privilege level).

     

    Remember that we consider four types of elements in a threat model: External Entities, Processes, Data Stores and Data Flows.  For each element type, the following threats are considered valid:

    External Entities: Spoofing, Repudiation.  Since an external entity could be anything, including the human being interacting with the component, Tampering, Information Disclosure, Denial of Service and Elevation of Privilege threats don't really make sense.  On the other hand, you can absolutely spoof a human being, and human beings can repudiate operations.

    Processes: Processes are subject to all of the STRIDE threats (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, and Elevation of Privilege).

    Data Stores: Tampering, Information Disclosure, Denial of Service and Repudiation (as I mentioned above, EoP etc. don't really apply to static stores).

    Data Flows: Tampering, Information Disclosure, Denial of Service. 

    Repudiation threats against data stores require special mention.  Data stores often come under attack to allow a repudiation attack to work (if you have a log located in a data store, the attacker might try to flood the data store with log entries to enable a repudiation attack).  In addition, logs held in data stores are almost always the mitigation against a repudiation threat.
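
    Laid out as a matrix, the mapping between element types and STRIDE categories is easier to take in at a glance.  Here is a small Python sketch that prints it; the names (`VALID`, `matrix_rows`) are just illustrative.

```python
CATEGORIES = ["Spoofing", "Tampering", "Repudiation",
              "Information Disclosure", "Denial of Service",
              "Elevation of Privilege"]

# Categories considered valid for each element type, as listed above.
VALID = {
    "External Entity": {"Spoofing", "Repudiation"},
    "Process": set(CATEGORIES),
    "Data Store": {"Tampering", "Information Disclosure",
                   "Denial of Service", "Repudiation"},
    "Data Flow": {"Tampering", "Information Disclosure",
                  "Denial of Service"},
}


def matrix_rows():
    """Return (element type, marks) pairs; marks has one character per
    category in S-T-R-I-D-E order, "x" if it applies and "-" if not."""
    return [(element_type,
             "".join("x" if category in valid else "-"
                     for category in CATEGORIES))
            for element_type, valid in VALID.items()]


print(f"{'':<16}S T R I D E")
for element_type, marks in matrix_rows():
    print(f"{element_type:<16}{' '.join(marks)}")
```

    Reading down the last column makes the earlier point concrete: Elevation of Privilege is only marked for processes.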

    And with that final realization, all the pieces have been brought together to describe Microsoft's current methodology for threat modeling, which we call STRIDE-per-element.

    Next: STRIDE-per-element.

    Threat Modeling Again, STRIDE Mitigations

    I described the 6 STRIDE categories the other day.  In that post, I mentioned that there are "well understood" mitigations for each of the STRIDE categories.  Of course this list isn't exhaustive, many of these are obvious, and some don't apply, but when you're looking at providing mitigations to the threats that your threat modeling discovers, these mitigations provide a good place to start looking.

    Spoofing

    As I mentioned the other day, a spoofing attack occurs when an attacker pretends to be someone they're not.  So how do you stop a spoofing attack?  You require authentication (yeah, I did say that some of these are obvious:)).  Authentication takes many forms: there are a boatload of authentication mechanisms available (basic auth, digest, Kerberos, PKI systems, IPsec, etc).  Most of these apply to data transferred over the wire, but there are other mechanisms to ensure validity.  For instance, the Authenticode mechanism provides a way of validating that code has been signed.  Sometimes authentication isn't the right mitigation.  For instance, if your data flow diagram has a client DLL that is making an RPC into a service that you own, an attacker can spoof the client DLL - they can generate the RPC calls directly from their code, bypassing your client DLL.  The mitigation for that type of attack is to add additional validation of the data transferred by the RPC in the server.
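
    To make that last point concrete, here is a minimal Python sketch of server-side validation.  The request shape (an alias name plus a flags word), the limits and the function name are all hypothetical; the point is only that the server re-checks everything instead of assuming its own client DLL made the call.

```python
# Hypothetical limits; a real service would take these from its own contract.
ALLOWED_FLAGS = 0x0003      # the only flag bits this server understands
MAX_ALIAS_LENGTH = 64


class InvalidRequest(ValueError):
    """Raised when a request fails server-side validation."""


def validate_play_request(alias, flags):
    """Validate a request as if it came from an arbitrary caller."""
    if not isinstance(alias, str) or not alias:
        raise InvalidRequest("alias must be a non-empty string")
    if len(alias) > MAX_ALIAS_LENGTH:
        raise InvalidRequest("alias too long")
    # Allow-list the characters rather than trying to block bad ones.
    if not all(ch.isascii() and (ch.isalnum() or ch in "._-") for ch in alias):
        raise InvalidRequest("alias contains unexpected characters")
    if not isinstance(flags, int) or isinstance(flags, bool):
        raise InvalidRequest("flags must be an integer")
    if flags < 0 or flags & ~ALLOWED_FLAGS:
        raise InvalidRequest("unknown flag bits set")
    return alias, flags
```

    A well-behaved client never trips any of these checks, so they cost legitimate callers nothing; they only matter when someone calls the interface directly.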

    Tampering

    Again, tampering attacks occur when the attacker modifies data in transit.  The standard mitigations for tampering attacks include digital signatures and message authentication codes.  Those work great for data transmitted on the wire, and are also valid for data stored in files on the disk.  Another mitigation for Tampering attacks is ACLs - for instance if only administrators need to write to a file or registry key, ACL it so that only administrators can write to the file/key.  Yet another is validation of input read from the data source.  You need to be careful in this case to make sure that the validation doesn't introduce the possibility of a DoS attack (we had a bug in an early beta of Windows Vista where a corrupted registry key could prevent the audio service from starting - we had validation which correctly detected that a particular key was corrupted and failed to start because of it).
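
    Here is a rough Python sketch that combines two of those ideas: a message authentication code to detect tampering, and a fallback to defaults so that detecting corruption doesn't become a DoS.  The settings format, the default values and the function names are made up for the example, and it assumes the key lives somewhere the attacker can't read.

```python
import hashlib
import hmac
import json

DEFAULT_SETTINGS = {"volume": 50}   # hypothetical safe defaults


def sign(payload: bytes, key: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over the stored payload."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def load_settings(payload: bytes, tag: bytes, key: bytes) -> dict:
    """Return the stored settings, or the defaults if they can't be trusted."""
    # Constant-time comparison avoids leaking how much of the tag matched.
    if not hmac.compare_digest(sign(payload, key), tag):
        return dict(DEFAULT_SETTINGS)
    try:
        settings = json.loads(payload)
    except ValueError:
        # Authentic but malformed data: still don't refuse to start.
        return dict(DEFAULT_SETTINGS)
    if not isinstance(settings, dict):
        return dict(DEFAULT_SETTINGS)
    return settings
```

    Whether falling back to defaults is the right reaction depends on the data; for something like a sound scheme it usually is, for a security policy it probably isn't.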

    Repudiation

    The standard mitigations for repudiation attacks include secure logs and audit records, coupled with using a strong authentication mechanism. 
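
    What "secure" means for a log varies, but one common ingredient is making the log tamper-evident.  The Python sketch below uses a hash chain for that.  This is a technique I'm substituting for illustration, not something the mitigation list prescribes, and it assumes both that `user` comes from an authenticated identity and that the most recent hash is kept somewhere the attacker can't rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64   # placeholder "previous hash" for the first entry


def _digest(previous_hash, record):
    """Hash a record together with the hash of the entry before it."""
    body = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(previous_hash.encode("ascii") + body).hexdigest()


def append_entry(log, user, action):
    """Append an audit record chained to the previous entry."""
    previous_hash = log[-1]["hash"] if log else GENESIS
    record = {"user": user, "action": action}
    log.append({"record": record, "hash": _digest(previous_hash, record)})


def verify_log(log):
    """Return True if no entry has been altered, removed or reordered."""
    previous_hash = GENESIS
    for entry in log:
        if entry["hash"] != _digest(previous_hash, entry["record"]):
            return False
        previous_hash = entry["hash"]
    return True
```

    Editing or deleting any entry other than the last one breaks every hash after it, which is what lets the log back up the claim "yes, you did do that".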

    Information disclosure

    Information Disclosure attacks occur when the bad guy can see stuff they're not supposed to be able to see.  Standard mitigations include encryption, especially for data transmitted on the wire - for example, RPC provides a fairly robust encryption mechanism if you specify the RPC_C_AUTHN_LEVEL_PKT_PRIVACY authentication level when establishing an RPC connection (the plain RPC_C_AUTHN_LEVEL_PKT level authenticates packets but doesn't encrypt them).  Other mitigations include ACLs (again).

    Denial of service

    It can be difficult to mitigate some classes of DoS attacks, but again, there are mechanisms that can mitigate many of them.  For instance, you can use ACLs (again) to protect the contents of files from being removed or modified (which also protects against tampering attacks), you can use firewall filter rules (both internal and external) to protect against some network-based attacks, and you can use disk and processor quotas to prevent excess disk or CPU consumption.  In addition, there are design patterns that allow for high availability even in the face of active attackers (you'd have to ask server people for details, but they DO exist).

    Elevation of privilege

    To mitigate against EoP attacks, once again, you can use ACLs and other forms of permission checks.  But (IMHO) by far the most effective source of protection against EoP attacks is input validation - if the input is verified to be correct, it's harder to cause problems (not impossible, but harder).  On the other hand, you also need to be very careful about your validation logic - it's quite easy to get it wrong.
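
    As an example of the kind of validation I mean, here is a minimal Python sketch that checks the 12-byte RIFF preamble at the start of a WAV file.  It only covers the preamble (a real parser has to validate every chunk the same way), and the function and exception names are made up.  The classic way to get this wrong is to trust a declared length without comparing it to the data you actually have.

```python
import struct


class InvalidWave(ValueError):
    """Raised when the data doesn't start with a sane RIFF/WAVE preamble."""


def validate_riff_header(data: bytes) -> int:
    """Check the RIFF/WAVE preamble and return the declared chunk size."""
    if len(data) < 12:
        raise InvalidWave("too short for a RIFF header")
    # Layout: "RIFF", 32-bit little-endian size, "WAVE".
    chunk_id, chunk_size, form_type = struct.unpack("<4sI4s", data[:12])
    if chunk_id != b"RIFF" or form_type != b"WAVE":
        raise InvalidWave("not a RIFF/WAVE file")
    # The size counts everything after the first 8 bytes, so it must cover
    # at least the 4-byte "WAVE" tag and must not run past the buffer.
    if chunk_size < 4 or chunk_size > len(data) - 8:
        raise InvalidWave("declared size does not fit the data supplied")
    return chunk_size
```

    Note that every comparison is against the length of the buffer in hand, never against another field from the same untrusted header.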

     

    As I said at the beginning of this discussion, these are just rough outlines.  Many of them don't apply.  Since I'm working on building the PlaySound threat model, I'll take two examples from that threat model:

    • For the PlaySound API, repudiation threats aren't particularly applicable.  As such, Repudiation threats are considered to be an acceptable risk.
    • Tampering threats aren't particularly relevant to any of the data flows, because they're all in-proc.  The only way that an attacker could manipulate the data flows is if they had injected code into the current process, and in order for them to do that, they need to be running at either the same or a higher privilege level - the Win32 process object model protects us from those threats.

     

    Next: How do we use STRIDE?
