Update 12/31/2012: I updated the images since and made one very minor edit (replace 'Google' with 'Bing' ;-) ). No other changes made.
Note: In a lame attempt to get Google Bing hits, I have replaced every second instance of the word "modelling" with the incorrectly spelt version "modeling" :-)
No, I'm not talking about writing a threat model for a large, furry ape (although that would be fun); I'm talking about writing quick-and-dirty threat models when you don't have time to do the real thing. If you want to do threat modelling properly, I highly recommend you read Frank and Window's "Threat Modeling" [sic] book from Microsoft Press; but if you just need to get one done, you might not have the time or inclination for that.
This should go hand-in-hand with my previous post.
What I'm going to recommend here is a process you can follow to quickly get a threat model done. It's also the basic process I tell teams inside Microsoft to follow when they bring me in on their threat model meetings, and it's written from the perspective of "These are the things I would be looking for if you asked me to review your threat model." It will also help assess whether you need to dig into more detailed threat models for individual components listed in the main threat model, or whether the component isn't really interesting enough (from a security perspective) to warrant more effort.
So if you happen to work at Microsoft and you think I'll be working with you on threat models in the future... keep on reading to learn how to make the process quick(er) and painless(er) :-) Also if you work outside of Microsoft and will be working either internally or with a security consultant, you might find this useful as well.
A quick note: This post is kind of selfish in nature... it's more about what I would want from a threat model (and, extrapolating a bit, it's probably what any kind of external reviewer would want from a threat model at any company) and I've written it mostly so I can point people at it in the future and say "read this before you invite me to your threat model reviews" (or perhaps so 3rd party consultants can do the same with their clients). It's important to note that threat models are vitally important for the product team itself, and you should absolutely invest the time and effort in reading the threat modelling book and going through the process properly with your team to ensure everyone gets the maximum benefit out of the threat model. The last thing we want is for people to think that threat modeling is just another "tax" they have to pay to ship cool software; it should be seen as a key part of the software development process that actually helps improve the quality of the product and is a great way to communicate the design to others.
There are several basic sections to a threat model; different people will name them differently, but to me they basically boil down to:
· Description: A short description of what the component does, and how "big" or "scary" it is
· Data Flow Diagrams (DFDs): One or more diagrams that show all internal / external entities involved in this threat model, the trust boundaries separating them, and the data flows between them
· Checks and Balances: Tables of information enumerating your entry points, trust levels, protected assets, etc.
· Threats: Table of enumerated threats against your component and how they are mitigated (any unmitigated threat is a vulnerability)
To go along with this, I'll do a (very basic) sample threat model of a physical bank teller at a bank branch.
The Description isn't always necessary; it just depends on how (un)familiar I am with the component / technology. If I'm helping to threat model Feature XYZ of Microsoft Excel it probably needs no introduction, but if your product team is releasing a brand-new product with some funky internal code name I probably won't have a clue. The point here is to get a high-level overview of how scared I should be about your product / feature; to see what kind of attack surface it has. Is it Notepad or is it IIS? Is it a rich-client app or a network service? What account does it run under? Does it run in user mode or in kernel mode? Is it restricted to fully trusted (native) code or does it allow partially-trusted callers? Does it run by default every time the box boots, or is it a utility that the user runs once in a blue moon? Basic things like that.
For the bank teller example, the salient points are that random people come in off the street and ask you for money, and you have to make sure you only give it to the right people. The description might be:
The Bank Teller component gives out money to clients. It accepts requests from arbitrary clients, performs authentication and authorisation activities, and then processes valid requests for cash withdrawals. The teller is trusted to perform these actions, but is required to work within the confines of various corporate policies with regards to transaction amounts, auditing of transactions, etc.
The DFDs are really where it's all at. As the saying goes, a picture is worth a thousand words, and DFDs are no different. You can convey so much information in such a small amount of space / time using a good diagram that their utility cannot be overstated. A DFD is not a flow chart; it is a diagram showing flows of data between components. I could try and explain the differences, but since we already established that a picture is worth a thousand words, here's an example of a bad DFD for the bank teller (it's more like a flowchart):
A good DFD starts with a Context Diagram -- a diagram with your component in the middle and all the external entities it talks to on the outside. An example for the bank teller might be something like this:
Next, you start drilling into the details inside the component in a series of one or more Level N diagrams (where N gets higher the further you drill down), which might give you something like this (although I don't claim it is perfect!):
Here AuthN is shorthand for Authenticate -- making sure the person really is who they claim to be -- and AuthZ is shorthand for Authorise (but with the American spelling ;-) ) -- making sure the user has permission to do what they are attempting to do. These two abbreviations are commonly used at Microsoft (and possibly elsewhere), and they are also used to mean "Authentication" and "Authorisation".
Some interesting syntax for DFDs (not all these things are present in the diagram above):
· Rectangles represent external entities that your process deals with, but that you aren't specifically studying in this threat model. Rectangles can have data flows going in, coming out, or both, and are typically labelled with nouns
· Circles represent the data processing sub-components of the component you are studying in this threat model. Circles must have both data going in and data going out, and are typically labelled with verbs
· Double circles represent components that you will break down into more detailed diagrams later on in the threat model. Otherwise, they are the same as normal circles
· Double lines represent data stores (logical or physical) accessed by the components in this threat model. Double lines can have data flows going in, coming out, or both, and are typically labelled with nouns
· Lines represent data flows within and between your component and the external entities. Data flows are always one-way (you can't use double-headed arrows) and are labelled with nouns
· Dotted lines represent trust boundaries, machine boundaries, process boundaries, or other "interesting" transitions, and are labelled with nouns
Circles must have data both coming in and data going out because their purpose is to process an input and produce an output. If you have a circle that only accepts data or only hands out data, it might be better modelled as a data store or as an external component that you will later model in a separate data flow diagram. Also note that you typically don't need to "ask" for data from a data store; you just retrieve it (ie, you only need an outbound flow from the store to the process).
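If you like to sanity-check these things mechanically, here is a minimal sketch of the "circles need data in and data out" rule. The `Node`, `Flow`, and `check_processes` names are my own invention for illustration, and the node list is a hypothetical fragment of the teller diagram rather than the real thing.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    EXTERNAL = "rectangle"    # external entity
    PROCESS = "circle"        # data processing sub-component
    STORE = "double line"     # data store


@dataclass(frozen=True)
class Node:
    name: str
    kind: Kind


@dataclass(frozen=True)
class Flow:
    source: str   # name of the node the data leaves
    target: str   # name of the node the data arrives at
    label: str    # a noun describing the data


def check_processes(nodes, flows):
    """Return complaints about circles that lack an inbound or outbound flow."""
    problems = []
    for node in nodes:
        if node.kind is not Kind.PROCESS:
            continue  # rectangles and data stores may be in-only or out-only
        if not any(f.target == node.name for f in flows):
            problems.append(f"{node.name}: no data coming in")
        if not any(f.source == node.name for f in flows):
            problems.append(f"{node.name}: no data going out")
    return problems


# Hypothetical fragment of the bank teller diagram (names are assumptions)
nodes = [
    Node("Customer", Kind.EXTERNAL),
    Node("AuthN", Kind.PROCESS),
    Node("AuthZ", Kind.PROCESS),
    Node("Policies", Kind.STORE),
]
flows = [
    Flow("Customer", "AuthN", "ID and request"),
    Flow("AuthN", "AuthZ", "authenticated request"),
    Flow("Policies", "AuthZ", "policy"),  # retrieved, not "asked for"
]

for problem in check_processes(nodes, flows):
    print(problem)
```

In this fragment AuthZ receives data but never hands anything on, so it gets flagged; either a flow is missing or it isn't really a process.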
I also like to use colours on DFDs, with hopefully obvious meaning (apologies to the colour blind):
· Red things are untrusted. Only rectangles, lines, or double lines can be red; circles can never be red
· Green things are trusted. Any kind of object on the diagram can be green
· Blue things are the components being modelled (optional; I tend to just mark them as green). If you use this convention, only circles can be blue
· Orange things are partially trusted (optional; really only useful for high-level diagrams that you will later drill down into). If you use this convention, only rectangles and lines can be orange
· In theory, you could also do something like make the lines thicker if they contained Personally Identifiable Information (PII), but I've never done that. Any colour line could be thick
Note that you can't have red circles on the diagram because that would indicate you don't even trust your own code! If you have two parts to your product at different trust levels (eg, a client component and a server component) then it is best to model them separately and have the "other" component be a red rectangle. Alternatively, if the components are very tightly coupled, you could model them both as green circles but then have a spoofing threat that states the green thing might not be green after all (and mitigate appropriately). Note also that you can have untrusted data flows between trusted components (basically, one component is just passing the buck to another component) but you can never have trusted data flows between untrusted components :-).
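The colour rules can be written down as a small lookup table plus one extra check. This is only a sketch: `ALLOWED_COLOURS` and `check_colours` are hypothetical names, the sample objects are made up, and I've read "between untrusted components" as meaning both ends of the flow are red.

```python
# Which colours each shape may take, per the conventions above
ALLOWED_COLOURS = {
    "circle": {"green", "blue"},
    "rectangle": {"red", "green", "orange"},
    "double line": {"red", "green"},
    "line": {"red", "green", "orange"},
}


def check_colours(objects, flows):
    """objects: name -> (shape, colour); flows: list of (source, target, colour)."""
    problems = []
    for name, (shape, colour) in objects.items():
        if colour not in ALLOWED_COLOURS[shape]:
            problems.append(f"{name}: a {shape} cannot be {colour}")
    for source, target, colour in flows:
        if colour not in ALLOWED_COLOURS["line"]:
            problems.append(f"{source} -> {target}: a line cannot be {colour}")
        # Assumption: "between untrusted components" means both ends are red
        both_red = objects[source][1] == "red" and objects[target][1] == "red"
        if colour == "green" and both_red:
            problems.append(
                f"{source} -> {target}: trusted flow between untrusted components"
            )
    return problems


# Deliberately broken, hypothetical diagram
objects = {
    "Customer": ("rectangle", "red"),
    "AuthN": ("circle", "red"),          # wrong: circles can never be red
    "Policies": ("double line", "green"),
}
flows = [("Customer", "AuthN", "green")]  # wrong: green line between red things

for problem in check_colours(objects, flows):
    print(problem)
```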
It can also be a good idea to number your entities / processes / data stores for reference later on, and to also number your data flows in the order in which they occur during a typical sequence, although in this case I haven't bothered to do so since the example is pretty basic. A more complicated example with more data flows might benefit from having them labelled with their sequence numbers to make it easier to follow.
The very first thing I will do when looking at a DFD is to check if it conforms to the above rules. If not, you need to fix it, but generally we can keep going with the review assuming the appropriate corrections will be made. (It's not enough that you and I both understand the threat model on the day of the meeting; someone new to the project six months or two years from now should be able to read the threat model and understand the system without having to know a whole bunch of gotchas).
Assuming the diagram is in reasonably good shape, the next thing to do is look at the semantics of the diagram:
· The more red things you have in the diagram, the scarier the component is. We'll focus on the red things first
· The further the red lines penetrate into your component, the scarier the component is. We'll look at the longest paths first
· Any green lines going out towards red components could be areas of interest (leaking information)
When you get to the point where it's just green components with green lines between them, all inside the same trust environment (ie, no boundaries between them), the diagram becomes relatively uninteresting. That's not to say you can't find bugs in such diagrams -- you can, especially when the internals of a component are complex and have many interdependencies, and wherever possible you should implement defence-in-depth strategies that can mitigate implementation bugs in externally-facing components -- but the really scary EoP / DoS type issues tend to be found where there is red ink on the page. If you have two DFDs in your threat model, one of which is all green and the other of which is half red, I know which DFD I'd want you to be spending your time on.
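To make "how far does the red ink penetrate" a bit more concrete, here is a rough sketch that follows red flows from an untrusted entity and reports the longest chain it finds. `longest_red_path` is a name I made up, and the flows (and their colours) are assumptions for illustration, not a transcription of the teller diagram.

```python
def longest_red_path(flows, start, seen=()):
    """Longest chain of red flows reachable from `start`, as a list of node names.

    flows is a list of (source, target, colour) tuples. `seen` holds the nodes
    already on the current path so that loops in the diagram don't recurse forever.
    """
    best = [start]
    for source, target, colour in flows:
        if source != start or colour != "red":
            continue
        if target == start or target in seen:
            continue  # don't walk around a cycle
        candidate = [start] + longest_red_path(flows, target, seen + (start,))
        if len(candidate) > len(best):
            best = candidate
    return best


# Hypothetical colouring of some teller data flows
flows = [
    ("Customer", "AuthN", "red"),
    ("AuthN", "AuthZ", "red"),
    ("Policies", "AuthZ", "green"),
    ("AuthZ", "Process withdrawal", "green"),
]

print(" -> ".join(longest_red_path(flows, "Customer")))
```

The longer the chain printed for a given red entity, the earlier I'd want to look at it in the review.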
One common question I get asked is "How 'big' should the threat model be?" The traditional answer to this is that a threat model should cover a "component" (which of course leads to the next question, "How 'big' should a component be?"), and a "component" is generally considered to be a "system" or "feature" or "spec" or "binary" or "API" depending on the kind of product being threat modelled. This still isn't very helpful though, because on the one hand, you have the Bold feature in Word which probably has its own spec (but isn't really worthy of threat modelling), and on the other hand you have Excel.exe which is a single binary (but is incredibly complicated and needs way more than one threat model).
Yet another problem is that once you have a reasonable idea for what a "component" means for your product, you need to figure out whether all its data flows can be modelled in a single diagram, or whether you need to break them out into separate diagrams. For example, if you were threat modeling the File -> Open and File -> Save features of Notepad, would you put them on separate diagrams or could you put them both on the same diagram? My general advice here is that as a very rough rule of thumb, the diagram should have no more than half a dozen objects and no more than a dozen lines. Anything less than this, and you're probably causing too much work for yourself by breaking things out beyond what is reasonably necessary. Any more than this, and you end up with something that looks like a plate of spaghetti and it is impossible to analyse.
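The upper end of that rule of thumb is easy enough to express as a check. This is just a sketch with a made-up function name (`too_big`); I've only encoded the "plate of spaghetti" limit, since the lower bound is more of a judgement call.

```python
MAX_OBJECTS = 6    # "half a dozen objects"
MAX_LINES = 12     # "a dozen lines"


def too_big(object_count, line_count):
    """Return the reasons a diagram exceeds the rule of thumb (empty if it fits)."""
    reasons = []
    if object_count > MAX_OBJECTS:
        reasons.append(f"{object_count} objects (limit {MAX_OBJECTS})")
    if line_count > MAX_LINES:
        reasons.append(f"{line_count} lines (limit {MAX_LINES})")
    return reasons


print(too_big(5, 9))     # fits on one diagram
print(too_big(11, 20))   # probably needs breaking up
```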
For example, here's something that is over-simplified and can probably be combined with some other data flows (or maybe doesn't even warrant a threat model at all):
And something that badly needs to be broken up:
Of course, sometimes processes really are that simple (or complicated) and there's not much else you can do. Oh well.
Once the diagram is done, it's time to turn to the Checks and Balances section, which will contain a list of your component's entry points, protected assets, and so on. Ideally (in my humble opinion) these tables should be prepared by someone who was NOT involved in creating the diagrams, or (if it must be the same person) it should be done at a different time than the diagrams. Why? Because it's a way of double-checking that you really have correctly modelled your system. By doing a "clean room" implementation, each person is not influenced by any mistakes / assumptions the other person may have made.
For example, in the teller case, let's say the "Entry Points" table consists of the following:
Customer on other side of booth
List of "known bad" account numbers
In this table, we can see that the first two items match up with the DFD, but the list of "known bad" account numbers doesn't appear on the diagram and the transaction log doesn't appear in the table. We might decide that the list of account numbers is simply part of the "Policies" data store, or if there are significant differences in the way normal policies are handled ("Deny any request to withdraw more than $500") versus how known-bad account numbers are handled ("Call the cops!") we might need to add some more pieces to the DFD. In the case of the transaction log, it was just something that the person building the table forgot about (hence why it might be a good idea to have different people do the table and the diagram, although realistically they will probably both be built by a set of people from various disciplines). An updated view including the "known bad" accounts might look like this:
It might also be worth noting here what an "entry point" is exactly. Basically it is a way that someone (or something) can influence your component or be influenced by your component. Anything that sends / receives data to / from your component is an entry point, although exactly how we name them can be kind of arbitrary. For example, in the table above we list "the customer" as an entry point, whereas for a web server we might list "port 80" as the entry point (rather than "remote browser"). I guess it just sounds silly to list "Teller's eyes and ears" as an entry point in this case ;-)
In theory, you should be able to do this "cross check" between the diagram and the tables before getting someone like me involved to do the threat model review, because it doesn't really take any security expertise to ensure that they both correspond to each other. Nevertheless, if there are any discrepancies we'll go over those in the review and make sure that the correct set of entry points and other items are recorded in the threat model, adding and deleting from the tables / diagrams as necessary. This should be a quick process; you don't want to spend a long time arguing about whether or not an entry point that one person thought of is valid or not in the threat modelling meeting if you can at all avoid it. Do it before the meeting!
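Since the cross-check needs no security expertise, it is really just a set comparison. Here is a minimal sketch; `cross_check` is a hypothetical helper and the entry-point names are illustrative stand-ins rather than the exact rows of the table and diagram.

```python
def cross_check(diagram_entry_points, table_entry_points):
    """Report entry points that appear in only one of the two artefacts."""
    diagram = set(diagram_entry_points)
    table = set(table_entry_points)
    return {
        "missing_from_diagram": sorted(table - diagram),
        "missing_from_table": sorted(diagram - table),
    }


# Illustrative stand-ins for what the diagram and the table each list
diagram = ["Customer", "Policies", "Transaction log"]
table = ["Customer", "Policies", 'List of "known bad" account numbers']

for side, names in cross_check(diagram, table).items():
    print(side, names)
```

Anything that shows up on either list is something to sort out before the review meeting, not during it.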
One funny thing that happens is when these things haven't quite been ironed out before the threat model review, and then people on the team start to disagree about how certain portions of the system work. Then you end up with funny exchanges like the following:
· PM: Well I designed the feature like this, and I've got the spec to prove it
· Dev: No, no, no! I built the feature like this, and I've got the code to prove it
· QA: Ha! You're both wrong -- the feature actually works like this, and I've got the test cases to prove it!
QA always wins these arguments, because you can't disagree with the running code! Now that doesn't necessarily mean the code is doing what it is supposed to be doing (ie, what the PM designed or the developer built), but at least it tells you how the code is working at the moment :-)
So once you have the DFD done and you've made sure that the tables all match up, the fun part is looking at threats (it is a threat model after all, right?). Looking at threats is a process of taking each object on the diagram (rectangles, circles, lines, etc.) and thinking about the kinds of things a "bad guy" could do to mess with your system -- to get access to your assets. "Assets" in this sense can include both traditional assets that the attacker might want to steal (such as confidential business information or electronic funds) and less tangible assets that they might want to co-opt away from you, such as the process' address space itself (worms, spyware, etc. using your computing power to do evil). Attackers might not even want to "steal" your assets so much as they simply want to stop you from being able to access them (a "Denial of Service" attack) -- if an attacker can crash your web server, you can't take any orders from customers, and that can lead to problems for your business.
There are various ways to do this, but the two most popular are to look at each object in turn and brainstorm the attacks against it, or to go through each of the STRIDE categories in turn and brainstorm those threats against each object. (STRIDE stands for Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, and Elevation of privilege, and these are the broad classes of threats that we are concerned about). Either of these approaches works fine, although I find that doing it object-by-object makes verification of the threat model easier later on (ie, it's easy to see "here are the threats for the first object, here are the threats for the second object..." and be able to quickly tell if anything was missed, versus having all the spoofing threats, then all the tampering threats, etc. and having to make sure each class of threats was applied to each applicable object type).
Note: Most people understand what all these things mean except for "Repudiation" -- it's a little-used term but it basically means "claiming you didn't do something that you actually did." You have a repudiation vulnerability if it's possible for somebody to perform an action in your system in such a way that you have no way of proving they did it. This allows people (including authorised people -- "the good guys") to perform malicious actions with impunity because there is no way for them to get caught.
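The two brainstorming orders are really just two ways of walking the same grid of objects and STRIDE categories. A small sketch, assuming a hypothetical `worksheet` helper that produces the empty rows to fill in during the meeting:

```python
from itertools import product

STRIDE = [
    "Spoofing",
    "Tampering",
    "Repudiation",
    "Information disclosure",
    "Denial of service",
    "Elevation of privilege",
]


def worksheet(objects, by_object=True):
    """Return (object, category) pairs in the order they will be brainstormed."""
    if by_object:
        # All six categories for the first object, then the second, and so on
        return [(obj, cat) for obj, cat in product(objects, STRIDE)]
    # All objects for Spoofing, then all objects for Tampering, and so on
    return [(obj, cat) for cat, obj in product(STRIDE, objects)]


objects = ["Customer", "AuthN", "Policies"]  # hypothetical subset of the diagram

for obj, cat in worksheet(objects)[:6]:
    print(f"{obj}: {cat}")
```

Either ordering yields the same set of cells; the object-by-object order just keeps each object's threats together, which is what makes it easier to spot a missing one later.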
So let's look at some threats to the sample bank teller DFD, starting with the top-left entity (the customer) and going through the STRIDE categories in turn. First, we'll look at the potential threats to the system without worrying about their impact or whether or not they are mitigated. This ensures people don't get bogged down in arguments about mitigations, and it also fosters an environment where people feel OK to throw out threats that they might think are "silly" or "trivial" when they could in fact be real bugs.
· Spoofing: A bad guy can pretend to be a customer that he is not (eg, Bob pretends to be Frank)
· Tampering: You can't really "tamper" with a person, although you could tamper with their intentions -- maybe the customer is being blackmailed into withdrawing all their money by a bad guy
· Repudiation: The bad guys can attack the system, probing for weaknesses without being identified (or with insufficient proof to get law enforcement involved)
· Information Disclosure: The bad guy might be able to learn things about the bank (or its customers), such as whether an account is valid, how much money it has, etc. This can be used for future attacks on the system. For example, if "Bob Smith" comes up and asks to withdraw $1,000 from account 12345, it would be bad for the teller to reply "Sorry, there is only $500 in that account" before even checking his ID -- now the attacker knows that the account is valid and how much money it has in it
· Denial of Service: A bad guy can use up teller resources by being "difficult" or making lots of withdrawal requests. A bad guy could also report someone else's account number as being "bad." If a bad guy can withdraw money that doesn't belong to them, they are DoS-ing the rightful owner
· Elevation of Privilege: If a bad guy pretends to be someone else, he can withdraw their money (something he shouldn't be allowed to do).
One after the other, you would go through each object on the diagram and look at what threats are relevant to it. Often you will find that the same threats appear at multiple stages of the diagram; that's OK, just keep going. Only when all the threats are written down should you look at mitigations. In this example, the customer is clearly the most interesting thing to look at because they are on the "wrong" side of the trust boundary, but there are still interesting things to look at in the other processes. What if the policy file is tampered with? What if the teller is colluding with a customer to commit fraud? What if the phone line to the police station is cut? And so on. Any mitigations you come up with against threats "inside" the trust boundary will likely be defence-in-depth measures (at least for a component as simple as this) because, hey, you gotta trust somebody, right?
Anyway, for the threats against the customer, we have the following mitigations:
· Spoofing: Since the customer is untrusted to begin with, we don't really care if one untrusted entity is replaced with another untrusted entity (ie, until they are authenticated, all people are considered evil). If the customer tries to use a fake ID, they will be caught by the authentication process (thus it would be important to get the authentication process right!)
· Tampering: Customers being blackmailed or otherwise forced to withdraw money against their will are beyond the scope of this threat model. Nevertheless, large withdrawals (or withdrawals inconsistent with the customer's history) will be caught by the policy check and potentially through transaction log analysis after the fact.
· Repudiation: Ah, there is a legitimate bug here! At the moment, we only log successful transactions; an attacker can attempt to make as many unsuccessful transactions as they like without being caught. We need to add an audit step at each stage of the process.
· Information Disclosure: The mitigations here are implicit in the design, but since failure cases are not called out in the diagram it might not be obvious. At each stage of the process (authN, authZ, etc.) the teller will abort the process at the first sign of failure and will not give away any information. For example, we can see that the teller doesn't even look at the bank account information until the user is authenticated.
· Denial of Service: There is no real mitigation against "difficult" customers; part of the bank teller's job is to be polite and provide good service. It might be good to provide some escalation process though to move really difficult customers off to the manager. The process of adding entries to the "known bad list" is not covered in this process, but it would be a great threat to pass on to that team (and to list as one of the assumptions in this threat model). If the bad guy can withdraw someone else's money, that means all the other safeguards have been breached; there's nothing else that can be done here (this is kind of a meta-threat that basically says "assume the system fails catastrophically; what's the worst that could happen?")
· Elevation of Privilege: Pretending to be someone else is the same as the spoofing threat, so nothing new here
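Written down as data, the customer threats and their mitigations might look something like the sketch below. The `Threat` class and `vulnerabilities` function are names I've invented for illustration, and the descriptions are paraphrased from the lists above. The useful part is the last step: anything without a mitigation is, by definition, a vulnerability.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threat:
    target: str
    category: str
    description: str
    mitigation: Optional[str] = None  # None means no mitigation exists yet


def vulnerabilities(threats):
    """Any unmitigated threat is a vulnerability."""
    return [t for t in threats if t.mitigation is None]


threats = [
    Threat("Customer", "Spoofing", "Bob pretends to be Frank",
           "Fake ID is caught by the authentication process"),
    Threat("Customer", "Tampering", "Customer is blackmailed into withdrawing",
           "Out of scope; policy check and log analysis catch odd withdrawals"),
    Threat("Customer", "Repudiation", "Failed attempts are never logged"),
    Threat("Customer", "Information disclosure", "Teller reveals account details",
           "Teller aborts at the first failure and gives nothing away"),
    Threat("Customer", "Elevation of privilege", "Withdraw someone else's money",
           "Same as the spoofing threat"),
]

for t in vulnerabilities(threats):
    print(f"{t.target} / {t.category}: {t.description}")
```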
So now the diagram looks like this (note, you probably wouldn't re-draw the diagram during the threat modeling brainstorming meeting; that would be done after the fact):
An interesting thing to note here is that now we are logging all successful / failed attempts in the audit log, but we're not using that information. You might be inclined to add a check in the authentication phase that checks the audit log to see if there has been a lot of abuse on the account -- for example, if a bad guy has come in three times in one day claiming to be "Bob Smith" with various forms of fake ID, you might want to know that so when he comes in for the fourth time you can immediately turn him away without even looking at the ID. This is pretty much the same as having an account lockout policy and in some situations it can hurt more than it helps because now the bad guy can stop the real Bob Smith from getting money out of his account by just going in three times in one day. The bad guy doesn't really intend to take money out of Bob's account (in fact he knows he will fail the authentication process) but he just wants to make it impossible for Bob to withdraw his cash. If you've seen as many movies as I have, you shouldn't have any problems coming up with scenarios where a bad guy might want to stop a good guy from accessing their own money. Instead, the audit information might be used to feed back into the policy system, or maybe some other kinds of controls will be added in the future, or maybe it will be submitted to law enforcement as proof of malicious activity by someone. Again, this is something you would list as an assumption or external dependency, and communicate to the appropriate group.
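Here is a toy sketch of why the lockout idea backfires. The `Teller` class is hypothetical, the failure counter is a stand-in for the audit log, and the limit of three comes from the example above; none of this is how a real bank (or a real authentication system) would be built.

```python
from collections import Counter

MAX_FAILURES_PER_DAY = 3  # "three times in one day" from the example


class Teller:
    def __init__(self):
        # Stand-in for the audit log: claimed name -> failed attempts today
        self.failures = Counter()

    def authenticate(self, claimed_name, id_is_valid):
        if self.failures[claimed_name] >= MAX_FAILURES_PER_DAY:
            return "locked out"  # turned away without even looking at the ID
        if not id_is_valid:
            self.failures[claimed_name] += 1
            return "denied"
        return "ok"


teller = Teller()

# The bad guy shows up three times with fake ID, fully expecting to fail
for _ in range(3):
    print("attacker:", teller.authenticate("Bob Smith", id_is_valid=False))

# The real Bob Smith arrives with perfectly good ID
print("real Bob:", teller.authenticate("Bob Smith", id_is_valid=True))
```

The attacker never gets a cent, but the real Bob is refused too, so the spoofing mitigation has just created a denial-of-service threat.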
A tricky part of this addition to the diagram is the colour of the lines. The audit logs come from trusted processes (the authentication and authorisation processes) but the information they are logging isn't necessarily trustworthy in and of itself. For example, the authentication audit says:
"A guy claiming to be 'Bob Smith' came in at 11:15 am on 1/1/2005 but his ID didn't match so we failed the authentication. He was trying to withdraw $1,000 from account 12345"
In this case, the identity of the customer is not trustworthy information -- we don't really know who it was because authentication failed -- but we do trust that the authentication process is not lying about who the customer claimed to be. (Or, to put it another way: The customer may be lying, but the authentication process will faithfully report that lie to the audit log). Similarly, we don't know if the transaction was valid, but we know the attempt was correctly reported. Nevertheless, we do know that the result of the authentication (success or failure) is completely trustworthy. In an attempt to try to convey this information without overly complicating the diagram with more lines, I made up a new rule on the fly and made the data flow orange and colour-coded the label text.
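One way to think about the "orange" audit flow is as a record with two kinds of fields: things the authentication process vouches for, and things it is merely repeating. The sketch below uses a hypothetical `AuthAuditRecord` class with field names of my own choosing, filled in with the values from the example entry above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class AuthAuditRecord:
    # Trusted: produced by the authentication process itself
    timestamp: datetime
    succeeded: bool
    # Untrusted: whatever the customer claimed, faithfully reported
    claimed_name: str
    requested_amount: int
    requested_account: str

    def verified_name(self) -> Optional[str]:
        """The name is only trustworthy if authentication actually succeeded."""
        return self.claimed_name if self.succeeded else None


record = AuthAuditRecord(
    timestamp=datetime(2005, 1, 1, 11, 15),
    succeeded=False,
    claimed_name="Bob Smith",
    requested_amount=1000,
    requested_account="12345",
)

print(record.claimed_name)     # what the customer said
print(record.verified_name())  # what we actually know
```

Anything that reads the log later should treat the claimed fields as red data, even though the record as a whole came from a green process.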
Hmm, that discussion about auditing was a bit of a detour. Anyway, once you have your threats and mitigations (or lack of mitigations!) documented, are you done?
If the component is simple and there were no interesting trust boundaries or no interesting threats identified, you can probably just write up a simple threat model document and be done with it. But if the component is complex you might need to go into more details and do another set of reviews, or you could build threat trees for each of the identified threats, and so on. Threat trees are a topic for another day (this post is already too long!), but basically they are a way of visually displaying all the possible ways an attacker can reach their goal (such as "Spoof the legitimate customer 'Bob Smith' to withdraw his cash") and helping to define how the threats can be best mitigated at various stages of the attack.
But, as far as I'm (personally) concerned, the basic threat model is "done" and the threat trees are something for you and your product team to worry about; we've identified the threats and how they should be mitigated, and exactly how you do that is up to you. Obviously I will be willing to help build the threat trees and to give you my opinions on various mitigation strategies (whether you ask for them or not ;-) ) but, at least as far as "guerrilla threat modelling" is concerned, you're done for now.
This post wouldn't have been possible without all of the great work people at Microsoft are doing around threat modeling -- from people like Michael Howard and Frank & Window who are helping define the process itself, to all the teams I've worked with over the past few years who have helped me to understand the process better for myself and to figure out what works and what doesn't. This is truly one of those things that you just have to get the "feel" for and that you get good at with experience; practice makes perfect!