Adam Shostack here.
I said recently that I wanted to talk more about what I do. The core of what I do is help Microsoft’s product teams analyze the security of their designs by threat modeling. So I’m very concerned about how well we threat model, and how to help folks I work with do it better. I’d like to start that by talking about some of the things that make the design analysis process difficult, then what we’ve done to address those things. As each team starts a new product cycle, they have to decide how much time to spend on the tasks that are involved in security. There’s competition for the time and attention of various people within a product team. Human nature is that if a process is easy or rewarding, people will spend time on it. If it’s not, they’ll do as little of it as they can get away with. So the process evolves, because, unlike Dr No, we want to be aligned with what our product groups and customers want.
There have been a lot of variants of things called “threat modeling processes” at Microsoft, and a lot more in the wide world. People sometimes want to argue because they think Microsoft uses the term “threat modeling” differently than the rest of the world. This is only a little accurate. There is a community which uses questions like “what’s your threat model” to mean “which attackers are you trying to stop?” Microsoft uses threat model to mean “which attacks are you trying to stop?” There are other communities whose use is more like ours. (In this paragraph, I’m attempting to mitigate a denial of service threat, where prescriptivists try to drag us into a long discussion of how we’re using words.) The processes I’m critiquing here are the versions of threat modeling that are presented in Writing Secure Code, Threat Modeling, and The Security Development Lifecycle books.
In this first post of a series on threat modeling, I’m going to talk a lot about problems we had in the past. In the next posts, I’ll talk about what the process looks like today, and why we’ve made the changes we’ve made. I want to be really clear that I’m not critiquing the people who have been threat modeling, or their work. A lot of people have put a tremendous amount of work in, and gotten some good results. There are all sorts of issues that our customers will never experience because of that work. I am critiquing the processes, saying we can do better, in places we are doing better, and I intend to ensure we continue to do better.
We ask feature teams to participate in threat modeling, rather than having a central team of security experts develop threat models. There’s a large trade-off associated with this choice. The benefit is that everyone thinks about security early. The cost is that we have to be very prescriptive in how we advise people to approach the problem. Some people are great at “think like an attacker,” but others have trouble. Even for the people who are good at it, putting a process in place is great for coverage, assurance and reproducibility. But the experts don’t expose the cracks in a process in the same way as asking everyone to participate.
Getting Started
The first problem with ‘the threat modeling process’ is that there are a lot of processes. People, eager to threat model, had a number of TM processes to choose from, which led to confusion. If you’re a security expert, you might be able to select the right process. If you’re not, judging and analyzing the processes might be a lot like analyzing cancer treatments. Drugs? Radiation? Surgery? It’s scary, complex, and the wrong choice might lead to a lot of unnecessary pain. You want expert advice, and you want the experts to agree.
Most of the threat modeling processes previously taught at Microsoft were long and complex, having as many as 11 steps. That’s a lot of steps to remember. There are steps which are much easier if you’re an expert who understands the process. For example, ‘asset enumeration.’ Let’s say you’re threat modeling the GDI graphics library. What are the assets that GDI owns? A security expert might be able to answer the question, but anyone else will come to a screeching halt, and be unable to judge if they can skip this step and come back to it. (I’ll come back to the effects of this in a later post.)
I wasn’t around when the processes were created, and I don’t think there’s a lot of value in digging deeply into precisely how it got where it is. I believe the core issue is that people tried to bring proven techniques to a large audience, and didn’t catch some of the problems as the audience changed from experts to novices.
The final problem people ran into as they tried to get started was an overload of jargon, and terms imported from security. We toss around terms like repudiation as if everyone should know what it means, and sometimes implied they’re stupid if they don’t. (Repudiation is claiming that you didn’t do something. For example, “I didn’t write that email!,” “I don’t know what got into me last night!” You can repudiate something you really did, and you can repudiate something you didn’t do.) Using jargon sent several unfortunate messages:
Of course, that wasn’t the intent, but it often was the effect.
The Disconnected Process
Another set of problems is that threat modeling can feel disconnected from the development process. The extreme programming folks are fond of only doing what they need to do to ship, and Microsoft shipped code without threat models for a long time. The further something is from the process of building code, the less likely it is to be complete and up to date. That problem was made worse because there weren’t a lot of people who would say “let me see the threat model for that.” So there wasn’t a lot of pressure to keep threat models up to date, even if teams had done a good job up front with them. There may be more pressure with other specs which are used by a broader set of people during development.
Validation
Once a team had started threat modeling, they had trouble knowing if they were doing a good job. Had they done enough? Was their threat model a good representation of the work they had done, or were planning to do? When we asked people to draw diagrams, we didn’t tell them when they could stop, or what details didn’t matter. When we asked them to brainstorm about threats, we didn’t guide them as to how many they should find. When they found threats, what were they supposed to do about them? This was easier when there was an expert in the room to provide advice on how to mitigate the threat effectively. How should they track them? Threats aren’t quite bugs: you can never remove a threat, only mitigate it. So perhaps it didn’t make sense to track them like that, but that left threats in a limbo.
"Return on Investment"
The time invested often didn’t seem like it was paying off. Sometimes it really didn’t pay off. (David LeBlanc makes this point forcefully in “Threat Modeling the Bold Button is Boring.”) Sometimes it just felt that way. Larry Osterman made that point, unintentionally, in “Threat Modeling Again, Presenting the PlaySound Threat Model,” where he said “Let's look at a slightly more interesting case where threat modeling exposes an issue.” Youch! But as I wrote in a comment on that post, “What you've been doing here is walking through a lot of possibilities. Some of those turn out to be uninteresting, and we learn something. Others (as we've discussed in email) were pretty clearly uninteresting.” It can be important to walk through those possibilities so we know they’re uninteresting. Of course, we’d like to reduce the time it takes to look at each uninteresting issue.
Other Problems
Larry Osterman lays out some other reasons threat modeling is hard in a blog post: http://blogs.msdn.com/larryosterman/archive/2007/08/30/threat-modeling-once-again.aspx
One thing that was realized very early on is that our early efforts at threat modeling were quite ad-hoc. We sat in a room and said "Hmm, what might the bad guys do to attack our product?" It turns out that this isn't actually a BAD way of going about threat modeling, and if that's all you do, you're way better off than you were if you'd done nothing.
Why doesn't it work? There are a couple of reasons:
It takes a special mindset to think like a bad guy. Not everyone can switch into that mindset. For instance, I can't think of the number of times I had to tell developers on my team "It doesn't matter that you've checked the value on the client, you still need to check it on the server because the client that's talking to your server might not be your code.".
Developers tend to think in terms of what a customer needs. But many times, the things that make things really cool for a customer provide a superhighway for the bad guy to attack your code.
It's ad-hoc. Microsoft asks every single developer and program manager to threat model (because they're the ones who know what the code is doing). Unfortunately that means that they're not experts on threat modeling. Providing structure helps avoid mistakes.
With all these problems, we still threat model, because it pays dividends. In the next posts, I’ll talk about what we’ve done to improve things, what the process looks like now, and perhaps a bit about what it might look like either in the future, or adopted by other organizations.
Scott Lambert here. I work on the Security Engineering Tools team where we're responsible for researching, developing and publishing tools to internal product and service teams. These include fuzzing, binary analysis and attack surface analysis tools.
Previously, James Whittaker posted a blog entry on Testing in the SDL in which he mentioned that many folks equate fuzz testing with security testing. While fuzz testing doesn't come close to describing how security testing is done at Microsoft, it does happen to be one of our most scalable testing approaches to detecting program failures that may have security implications.
As Michael Howard has pointed out before, we do our best to ensure that the SDL incorporates lessons learned from vulnerabilities that required us to release security updates. It turns out that the animated cursor bug patched in MS07-017 had a positive impact on the automatic triaging our fuzz testing tools perform. In this post, I'd like to shed some light on how we monitor for program failures when fuzzing parsers and how the recent animated cursor bug, MS07-017, caused us to revisit and ultimately improve our fuzzing tools.
Background
For our purposes, fuzz testing is a method for finding program failures (code errors) by supplying malformed input data to program interfaces (entry points) that parse and consume this data (e.g. file, network, registry, shared memory parsers). At Microsoft, we view fuzz testing as six distinct stages in which the output of each stage can impact or influence both the current and next iteration through the stages (e.g. after completing analysis work in stage 5 you could decide to change how you malform and deliver fuzzed data [stage 2 and 3], which exceptions get logged [stage 4], which tests you re-run [stage 6] and even which parsers you might decide to go after next [stage 1], etc). Below is a brief listing of each stage and its associated tasks.
Stage 1: Prerequisites
Stage 2: Creation of fuzzed data (malformed data)
Stage 3: Delivery of fuzzed data to the application under test
Stage 4: Monitoring of application under test for signs of failure
Stage 5: Triaging Results
Stage 6: Identify root cause, fix bugs, rerun failures, analyze coverage data (rinse and repeat)
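To make stages 2 through 4 concrete, here is a minimal sketch of a mutation-based fuzzing loop. It is not one of our internal tools: the toy parser, its deliberate bug, and all function names are hypothetical, and catching a Python exception stands in for the debugger-based monitoring described in the next section.

```python
import random


def mutate(data: bytes, rng: random.Random, flips: int = 4) -> bytes:
    """Stage 2: create malformed data by overwriting a few random bytes."""
    buf = bytearray(data)
    for _ in range(flips):
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def toy_parser(data: bytes) -> int:
    """Hypothetical parser under test: a 1-byte length header, then payload.

    Deliberately buggy: it trusts the header and may index past the buffer.
    """
    length = data[0]
    return data[length]  # IndexError when the header lies about the size


def fuzz(seed_input: bytes, iterations: int, seed: int = 0):
    """Stages 2-4: mutate, deliver, and record inputs that cause failures."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    failures = []
    for _ in range(iterations):
        fuzzed = mutate(seed_input, rng)      # stage 2: creation
        try:
            toy_parser(fuzzed)                # stage 3: delivery
        except Exception as exc:              # stage 4: monitoring
            failures.append((fuzzed, type(exc).__name__))
    return failures


if __name__ == "__main__":
    results = fuzz(bytes([3]) + b"abc", iterations=200)
    print(f"{len(results)} failing inputs recorded for triage")
```

The recorded inputs are what stages 5 and 6 would then triage and re-run. Seeding the random generator is a design choice that makes a failing run repeatable.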
How we do file fuzzing
There are a number of approaches taken by product teams to meet the SDL file fuzzing requirements. They often include the use of generation and mutation-based fuzzers as well as a combination of multiple internal and externally available fuzzing tools and/or frameworks.
When fuzzing file parsers, we monitor for both handled and unhandled exceptions in the application under test. Exceptions are events that typically represent error conditions encountered during the execution of an application. They can be generated both by the hardware (initiated by the CPU) and/or software (initiated by the executing program or the OS). To monitor for these exceptions, we created a mini-debugger using the Win32 Debugging APIs (For an example of how to integrate a debugger into your fuzz testing tool, check out Michael Howard and Steve Lipner's SDL Book at http://www.microsoft.com/MSPress/books/8753.asp). The mini-debugger launches the application under test and monitors the parent and all subsequent child processes and associated threads. When an exception occurred, the first version of this tool simply logged the file that caused the exception along with associated details such as the timestamp, exception code, exception address, stack trace and dump file. More recent versions have included the ability to monitor for CPU and memory spikes as well as enabling full page heap settings on all processes launched from the mini-debugger.
As a general rule, all exceptions must be triaged (reviewed) by the tester to determine if a bug needs to be filed. When fuzzing over a period of time however, we might generate hundreds of exceptions and it becomes a very labor-intensive process to sift through all of them. What we needed was a way to ease the burden placed on the tester.
To that end, the mini-debugger was extended to enable the automatic "bucketization" of logged exceptions to reduce the chance of having to look at duplicates during the triaging process. This was accomplished by creating unique bucket ids calculated from the stack trace using both symbols and offset when the information is available. The bucket id was used to name a folder that was created in the file system to refer to a unique application exception. When an exception occurred, we calculated a hash (bucket id) of the stack trace and determined if we had already seen this exception. If so, we logged the associated details in a sub-directory under the bucket id folder to which the exception belonged. The sub-directory name was created from the name of the fuzzed file that caused the exception. Thus, we were able to reduce the number of potential exceptions that a tester would have to look at during the triage process. It is often the case that certain exceptions are noisy and/or expected, so we also added the ability for the tester to dampen exceptions by exception code. Dampening ensured that those exceptions were not logged (recorded) for triage during a fuzz run. Nonetheless, despite our best efforts, it is still possible for two different stack traces to have the same underlying root cause.
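The bucketing scheme above can be sketched in a few lines. This is an illustration under stated assumptions, not the mini-debugger's code: the choice of SHA-1, the truncated id, the details.txt file name, and the function names are all hypothetical, and the stack trace is assumed to arrive as a list of "module!symbol+offset" strings.

```python
import hashlib
import tempfile
from pathlib import Path


def bucket_id(stack_trace):
    """Hash the stack frames (symbol plus offset) into a short bucket id."""
    joined = "\n".join(stack_trace).encode("utf-8")
    return hashlib.sha1(joined).hexdigest()[:12]  # hash choice is an assumption


def log_exception(root, fuzzed_file, exception_code, stack_trace,
                  dampened=frozenset()):
    """Record one exception under <root>/<bucket id>/<fuzzed file name>/.

    Returns the details directory, or None when the exception code has been
    dampened by the tester and is therefore not logged for triage.
    """
    if exception_code in dampened:
        return None
    details_dir = Path(root) / bucket_id(stack_trace) / Path(fuzzed_file).name
    details_dir.mkdir(parents=True, exist_ok=True)
    (details_dir / "details.txt").write_text(
        f"exception code: {exception_code:#010x}\n" + "\n".join(stack_trace) + "\n"
    )
    return details_dir


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    trace = ["parser!ReadChunk+0x1a", "app!main+0x40"]  # hypothetical frames
    log_exception(root, "case1.ani", 0xC0000005, trace)
    log_exception(root, "case2.ani", 0xC0000005, trace)  # same bucket as case1
    print(sorted(p.name for p in Path(root).iterdir()))
```

Two fuzzed files that crash with the same stack trace land in one bucket folder, so the tester opens one folder instead of two logs.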
Even with all of this automated assistance, the tester might still have several hundred cases to triage. In an effort to prioritize which cases should be triaged first, we introduced the notion of classifying exceptions. Again, we extended the mini-debugger to perform classification on the exception code and relevant details. In particular, we added an extra hierarchy over the automatically generated directory structure described above. To do this we introduced the following categories of exceptions:
I know what you're thinking, but remember that this classification doesn't exclude a tester from the requirement of having to triage all exceptions. The "Must Fix" category was composed of write access violations, read access violations on EIP, /GS and NX related access violations and read access violations where any one of the following was true*:
*Fully automating the classification of these cases is complex and almost always requires an entire execution trace. As such, teams are also provided with guidance to assist them during their analysis when our tool is unable to classify beyond "read and write access violations".
The "Further Investigation necessary" category was composed of read access violations that didn't meet the criteria above as well as other specific cases. Finally, the "Usually not exploitable" category was composed of other exceptions such as divide-by-zero, C++ exceptions and the like. Another thing to keep in mind is that the interpretation of "Usually not exploitable" is different for server-based components. In other words, a divide-by-zero exception in a server product is probably more than just a robustness issue...it might be a denial of service!
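As a rough illustration of the prioritization described above, the sketch below sorts an exception into one of the three categories. It is a simplification with several assumptions: the ExceptionEvent type and classify function are hypothetical, a /GS failure is assumed to surface as STATUS_STACK_BUFFER_OVERRUN, an NX violation is assumed to surface as an access violation with the execute access type, and unknown exception codes are conservatively sent to further investigation. The additional read access violation criteria need an execution trace and are not modeled, and neither is the server-side reinterpretation of divide-by-zero.

```python
from dataclasses import dataclass

# Windows exception codes
EXCEPTION_ACCESS_VIOLATION = 0xC0000005
STATUS_STACK_BUFFER_OVERRUN = 0xC0000409   # assumed signal of a /GS failure
EXCEPTION_INT_DIVIDE_BY_ZERO = 0xC0000094
CPP_EXCEPTION = 0xE06D7363                 # Visual C++ exception code

# Access type reported with an access violation: read, write, execute (NX)
AV_READ, AV_WRITE, AV_EXECUTE = 0, 1, 8

MUST_FIX = "Must Fix"
FURTHER_INVESTIGATION = "Further Investigation necessary"
USUALLY_NOT_EXPLOITABLE = "Usually not exploitable"


@dataclass
class ExceptionEvent:
    """Hypothetical record of what the debugger captured for one exception."""
    code: int
    access_type: int = AV_READ
    fault_address: int = 0
    instruction_pointer: int = 0


def classify(event: ExceptionEvent) -> str:
    """Assign a triage priority; every exception must still be triaged."""
    if event.code == STATUS_STACK_BUFFER_OVERRUN:
        return MUST_FIX
    if event.code == EXCEPTION_ACCESS_VIOLATION:
        if event.access_type in (AV_WRITE, AV_EXECUTE):
            return MUST_FIX
        if event.fault_address == event.instruction_pointer:
            return MUST_FIX  # read access violation on EIP
        return FURTHER_INVESTIGATION
    if event.code in (EXCEPTION_INT_DIVIDE_BY_ZERO, CPP_EXCEPTION):
        return USUALLY_NOT_EXPLOITABLE
    return FURTHER_INVESTIGATION  # conservative default (an assumption)


if __name__ == "__main__":
    write_av = ExceptionEvent(EXCEPTION_ACCESS_VIOLATION, AV_WRITE,
                              fault_address=0x41414141,
                              instruction_pointer=0x00401000)
    print(classify(write_av))
```

The category only decides what the tester looks at first. It never removes a case from the triage queue.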
Remember that regardless of this classification the tester is still required to triage all exceptions and file bugs accordingly. I'll defer more details on the subject of exploitability of program failures to the upcoming annual security issue of MSDN Magazine in November.
To recap, we had a debugging plug-in (mini-debugger) that not only monitored for exceptions but also reduced the number of exceptions to triage after a fuzzing session was completed. This also included monitoring for CPU and memory spikes as well as the use of page heap to capture heap corruptions that might not manifest themselves as an application crash (exception) during the fuzz session. What could go wrong? Enter MS07-017. The software responsible for invoking the vulnerable code [to parse animated cursors] made use of an exception handler to recover from pretty much any exception that could be generated and continue operating as if nothing had occurred (Read more about it at http://blogs.msdn.com/sdl/archive/2007/04/26/lessons-learned-from-the-animated-cursor-security-bug.aspx).
The Animated Cursor bug caused us to revisit our mini-debugger. Why? Put simply, we hadn't introduced the "bucketization" and classification mechanisms for first-chance exceptions. Naturally, this meant the tester was back to square one, with no assistance on the labor-intensive triaging process. To deal with the "recover from anything" exception handling code, we introduced the concept of classifying and bucketing "dangerous" first-chance exceptions to help reduce the number of first-chance cases the tester would need to triage. This means we look for both write access violations and read access violations on EIP. Additionally, we added support to continue after a first-chance exception, allowing exception handlers to be called so that execution can continue and possibly reach other, more interesting crashes.
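The "dangerous" first-chance filter can be sketched as a simple predicate. The function names and the tuple layout are hypothetical, and the sketch only decides which first-chance events get bucketed. In a real Win32 debugger loop, letting the application's own handlers run would correspond to continuing the debug event as not handled.

```python
EXCEPTION_ACCESS_VIOLATION = 0xC0000005
EXCEPTION_INT_DIVIDE_BY_ZERO = 0xC0000094
AV_READ, AV_WRITE = 0, 1  # access type reported with an access violation


def is_dangerous_first_chance(code, access_type, fault_address,
                              instruction_pointer):
    """True for write access violations and read access violations on EIP."""
    if code != EXCEPTION_ACCESS_VIOLATION:
        return False
    if access_type == AV_WRITE:
        return True
    return access_type == AV_READ and fault_address == instruction_pointer


def select_for_bucketing(first_chance_events):
    """Keep only dangerous first-chance events for bucketing and triage.

    Each event is a (code, access_type, fault_address, instruction_pointer)
    tuple. Events that are filtered out are simply not logged here.
    """
    return [e for e in first_chance_events if is_dangerous_first_chance(*e)]


if __name__ == "__main__":
    events = [
        (EXCEPTION_ACCESS_VIOLATION, AV_WRITE, 0x41414141, 0x00401000),
        (EXCEPTION_ACCESS_VIOLATION, AV_READ, 0x00000010, 0x00401000),
        (EXCEPTION_INT_DIVIDE_BY_ZERO, AV_READ, 0, 0x00401000),
    ]
    print(len(select_for_bucketing(events)), "of", len(events), "kept")
```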
As you can see, fuzz testing scales pretty well, but simplifying and scaling the triage process is not an easy task. Even more challenging is the integration of technology into an effective lifecycle. We're constantly working with teams within Microsoft to further advance our tools; you can learn more by viewing http://research.microsoft.com/research/pubs/view.aspx?id=1333&type=Technical+Report and http://research.microsoft.com/Pex/.
-Scott Lambert
Hello all - Dave here...
Booz Allen Hamilton recently released a State-of-the-Art Report (SOAR) on Software Security Assurance on behalf of the Information Assurance Technology Analysis Center (IATAC), an analysis and consulting group sponsored by the US Department of Defense. I had seen the report before, but hadn’t had the time to dive into it as it’s nearly 400 pages long. However, I had to travel for Microsoft business recently, and there’s really nothing like a long plane ride to allow one to catch up on back reading!
Upon closer inspection, it is a fairly exhaustive work that seeks to provide a snapshot of the current state of the software security assurance field. I was pleased to note significant mentions of the SDL and of other work done by Microsoft. The report made some salient points about SDL, questioning its suitability for use in certain circumstances (e.g. policy-driven non-technical risk management scenarios). However, the authors were also quick to point out that SDL is a technical, software development process, and could be paired with other methods to meet government requirements.
For the record – we make no claims about the universal applicability of SDL – it’s a constantly evolving, security-focused software development process – first, last and always. While the SDL is well suited to our work environment, we might have made different process tradeoffs in other environments. The important thing to focus on is process evolution – learning from customer pain, decisions made, and effectiveness of what you're doing – and using that information as a catalyst for change.
As with any report, there are points on which reasonable people will differ – however, it does a reasonably good job at presenting “one-stop shopping” for information on software security assurance. It’s definitely worth a look.
I’d be interested in hearing other opinions...
I've been meaning to talk more about what I actually do, which is help the teams within Microsoft who are threat modeling (for our boxed software) to do their jobs better. Better means faster, cheaper or more effectively. There are good reasons to optimize for different points on that spectrum (of better/faster/cheaper) at different times in different products. One of the things that I've learned is that we ask a lot of developers, testers, and PMs here. They all have some exposure to security, but terms that I've been using for years are often new to them.
Larry Osterman is a longtime MS veteran, currently working in Windows audio. He's been a threat modeling advocate for years, and has been blogging a lot about our new processes, and describes in great detail the STRIDE per element process. His recent posts are "Threat Modeling, Once Again," "Threat modeling again. Drawing the diagram," "Threat Modeling Again: STRIDE," "Threat modeling again, STRIDE mitigations," "Threat modeling again, what does STRIDE have to do with threat modeling," "Threat modeling again, STRIDE per element," "Threat modeling again, threat modeling playsound."
I wanted to chime in and offer up this handy chart that we use. It's part of how we teach people to go from a diagram to a set of threats. We used to ask them to brainstorm, and have discovered that that works a lot better with some structure.
| Property | Threat | Definition | Example |
|---|---|---|---|
| Authentication | Spoofing | Impersonating something or someone else. | Pretending to be any of billg, microsoft.com or ntdll.dll |
| Integrity | Tampering | Modifying data or code | Modifying a DLL on disk or DVD, or a packet as it traverses the LAN. |
| Non-repudiation | Repudiation | Claiming to have not performed an action. | “I didn’t send that email,” “I didn’t modify that file,” “I certainly didn’t visit that web site, dear!” |
| Confidentiality | Information Disclosure | Exposing information to someone not authorized to see it | Allowing someone to read the Windows source code; publishing a list of customers to a web site. |
| Availability | Denial of Service | Deny or degrade service to users | Crashing Windows or a web site, sending a packet and absorbing seconds of CPU time, or routing packets into a black hole. |
| Authorization | Elevation of Privilege | Gain capabilities without proper authorization | Allowing a remote internet user to run commands is the classic example, but going from a limited user to admin is also EoP. |
[Update: fixed the table so it displays all four columns.]
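For readers who like to see structure as data, here is a small sketch of how the chart can drive structured brainstorming over a diagram. The element names and the threat_prompts function are hypothetical. The sketch also asks all six questions of every element, which is cruder than the STRIDE per element process Larry describes, where only some threats apply to each kind of element.

```python
# Threat -> property it violates, taken from the chart above.
STRIDE = {
    "Spoofing": "Authentication",
    "Tampering": "Integrity",
    "Repudiation": "Non-repudiation",
    "Information Disclosure": "Confidentiality",
    "Denial of Service": "Availability",
    "Elevation of Privilege": "Authorization",
}


def threat_prompts(elements):
    """Yield one brainstorming question per (diagram element, threat) pair."""
    for element in elements:
        for threat, prop in STRIDE.items():
            yield f"{element}: how could {threat} undermine {prop}?"


if __name__ == "__main__":
    # Hypothetical elements from a data flow diagram.
    for prompt in threat_prompts(["audio service", "config file"]):
        print(prompt)
```

Each prompt is a starting point for discussion, not a finding. The team still decides which ones describe real threats.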