
Not surprisingly, I'm the security contact for my small part of the Windows organization (it's called the Devices&Media group, which is within the WEX division).  As such, I'm responsible for providing security guidance and reviewing the threat models for our group (I've done a lot of them over the past few months :)).

Earlier this morning, one of the PMs for one of the teams in D&M stopped by my office with a thank you gift for the work I've done with his team.  He had noticed my 20 year old office tool kit and his team decided to replace it with something newer (and way cooler):

 

[Photo: the new tool kit]

I sent them a private thank-you, but I've got to say publicly that I'm really touched - it was extraordinarily nice of them and I truly appreciate it.

 

 

PS: before anyone asks, the photo was taken with the toolkit resting on a test laptop (it's an old Toshiba M5).  In the background, you can see the Blibbet Hat I had made about 20 years ago at a local fair.

Apparently two years ago, someone ran a memory-debugging tool named "Valgrind" against OpenSSL as shipped in the Debian Linux distribution.  The Valgrind tool reported an issue with the OpenSSL package distributed by Debian, so the Debian team decided that they needed to fix this "security bug".

 

Unfortunately, the solution they chose to implement apparently removed all entropy from the OpenSSL random number generator.  As the OpenSSL team comments "Had Debian [contributed the patches to the package maintainers], we (the OpenSSL Team) would have fallen about laughing, and once we had got our breath back, told them what a terrible idea this was."

 

And it IS a terrible idea.  It means that for the past two years, all crypto done on Debian Linux distributions (and Debian derivatives like Ubuntu) has been done with a weak random number generator.  While this might seem to be geeky and esoteric, it's not.  It means that every cryptographic key that has been generated on a Debian or Ubuntu distribution needs to be recycled (after you pick up the fix).  If you don't, any data that was encrypted with the weak RNG can be easily decrypted.
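
To make the "not esoteric" point concrete, here's a minimal sketch of what losing entropy does to a key.  It is a toy model, not OpenSSL's code: the names toy_key_from_seed and recover_seed are made up for illustration, and the assumption that only 15 bits of seed survive is just that, an assumption.  The point is only that once the seed space is small, anyone can try every seed.

```cpp
#include <cstdint>
#include <random>

// Assumption for this sketch: the generator can only ever be seeded with
// one of 2^15 values. This is a toy model, not OpenSSL's actual code.
constexpr std::uint32_t kSeedSpace = 1u << 15;

// Toy "key generation": the key is the first 64-bit output of a
// deterministic generator seeded with `seed`.
std::uint64_t toy_key_from_seed(std::uint32_t seed) {
    std::mt19937_64 gen(seed);
    return gen();
}

// The attack: try every possible seed until one reproduces the observed key.
// Returns the matching seed, or -1 if none of the candidate seeds match.
int recover_seed(std::uint64_t observed_key) {
    for (std::uint32_t seed = 0; seed < kSeedSpace; ++seed) {
        if (toy_key_from_seed(seed) == observed_key) {
            return static_cast<int>(seed);
        }
    }
    return -1;
}
```

With a seed space this small the loop finishes almost instantly, which is why keys produced by a starved generator have to be treated as already known to the attacker.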

 

Bruce Schneier has long said that cryptography is too important to be left to amateurs (I'm not sure of the exact quote, so I'm using a paraphrase).  That applies to all aspects of cryptography (including random number generators) - even tiny changes to algorithms can have profound effects on the security of the algorithm.   He's right - it's just too easy to get this stuff wrong.

 

The good news is that there IS a fix for the problem.  Users of Debian or Ubuntu should read the advisory and take whatever actions are necessary to protect their data.

I just ran into this post by Eric Brechner who is the director of Microsoft's Engineering Excellence center.

What really caught my eye was his opening paragraph:

I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it's better to crash and let Watson report the error than it is to catch the exception and try to correct it.

Wow.  I'm not going to mince words: What a profoundly stupid assertion to make.  Of course it's better to crash and let the OS handle the exception than to try to continue after an exception.

 

I have a HUGE issue with the concept that an application should catch exceptions[1] and attempt to correct them.  In my experience, handling exceptions and attempting to continue is a recipe for disaster.  At best, it turns an easily debuggable problem into one that takes hours of debugging to resolve.  At its worst, exception handling can either introduce security holes or render security mitigations irrelevant.

I have absolutely no problems with fail fast (which is what Eric suggests with his "Restart" option).  I think that restarting a process after the process crashes is a great idea (as long as you have a way to prevent crashes from spiraling out of control).  In Windows Vista, Microsoft built this functionality directly into the OS with the Restart Manager: if your application calls the RegisterApplicationRestart API, the OS will offer to restart your application if it crashes or becomes unresponsive.  This concept also shows up in the service restart options in the ChangeServiceConfig2 API (if a service crashes, the OS will restart it if you've configured the OS to do so).

I also agree with Eric's comment that asserts that cause crashes have no business living in production code, and I have no problems with asserts logging a failure and continuing (assuming that there's someone who is going to actually look at the log and can understand the contents of the log, otherwise the logs just consume disk space).

 

But I simply can't wrap my head around the idea that it's ok to catch exceptions and continue to run.  Back in the days of Windows 3.1 it might have been a good idea, but after the security fiascos of the early 2000s, any thoughts that you could continue to run after an exception has been thrown should have been removed forever.

The bottom line is that when an exception is thrown, your program is in an unknown state.  Attempting to continue in that unknown state is pointless and potentially extremely dangerous - you literally have no idea what's going on in your program.  Your best bet is to let the OS exception handler dump core and hopefully your customers will submit those crash dumps to you so you can post-mortem debug the problem.  Any other attempt at continuing is a recipe for disaster.

 

-------

[1] To be clear: I'm not necessarily talking about C++ exceptions here, just structured exceptions.  For some C++ and C# exceptions, it's ok to catch the exception and continue, assuming that you understand the root cause of the exception.  But if you don't know the exact cause of the exception you should never proceed.  For instance, if your binary tree class throws a "Tree Corrupt" exception, you really shouldn't continue to run, but if opening a file throws a "file not found" exception, it's likely to be ok.  For structured exceptions, I know of NO circumstance under which it is appropriate to continue running.
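
Here's a minimal sketch of the distinction the footnote draws, for C++ exceptions only.  The names (FileNotFound, read_first_line, load_setting_or_default, require_invariant) are hypothetical and exist only for this illustration: a failure whose root cause is understood gets caught and handled, and everything else is left alone so the process fails fast.

```cpp
#include <cstdlib>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical exception type: a failure whose cause is fully understood.
struct FileNotFound : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Reads the first line of a file, throwing FileNotFound if it can't be opened.
std::string read_first_line(const std::string& path) {
    std::ifstream in(path);
    if (!in) {
        throw FileNotFound("cannot open " + path);
    }
    std::string line;
    std::getline(in, line);
    return line;
}

// "File not found" is expected and understood, so falling back is reasonable.
// There is deliberately no catch (...) here: any other exception propagates,
// and if nobody handles it the runtime terminates the process.
std::string load_setting_or_default(const std::string& path,
                                    const std::string& fallback) {
    try {
        return read_first_line(path);
    } catch (const FileNotFound&) {
        return fallback;
    }
}

// The "Tree Corrupt" case: the program state is unknown, so stop immediately
// instead of throwing something a caller might be tempted to swallow.
void require_invariant(bool holds) {
    if (!holds) {
        std::abort();
    }
}
```

The design choice is the narrow catch clause: the handler names the one exception it understands, so an unexpected failure still produces a crash dump at the point where things went wrong.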

 

Edit: Cleaned up wording in the footnote.

Robert Hensing linked to a post by Thomas Ptacek over on the Matasano Chargen blog. Thomas (who is both a good hacker AND a good writer) has a writeup of a “game-over” vulnerability that was just published by Mark Dowd over at IBM's ISS X-Force that affects Flash. For those that don’t speak hacker-speak, in this case, a “game-over” vulnerability is one that can be easily weaponized (his techniques appear to be reliable and can be combined to run an arbitrary payload). As an added bonus, because it’s a vulnerability in Flash, it allows the attacker to write a cross-browser, cross-platform exploit – this puppy works just fine in both IE and Firefox (and potentially in Safari and Opera).

This vulnerability doesn’t affect Windows directly, but it DOES show how a determined attacker can take what was previously thought to be an unexploitable failure (a null pointer dereference) and turn it into something that can be used to 0wn the machine.

Every one of the “except not quite” issues that Thomas writes about in the article represented a stumbling block that the attacker (who had no access to the source to Flash) had to overcome – there are about 4 of them, but the attacker managed to overcome all of them.

This is seriously scary stuff.  People who have Flash installed should run, not walk, over to Adobe to pick up the update.  Please note that the security update comes with the following warning:

"Due to the possibility that these security enhancements and changes may impact existing Flash content, customers are advised to review this March 2008 Adobe Developer Center article to determine if the changes will affect their content, and to begin implementing necessary changes immediately to help ensure a seamless transition."

Edit2: It appears that the Adobe update center I linked to hasn't yet been updated with the fix.  I followed their update procedure, and my Flash plugin still had the vulnerable version number.

Edit: Added a link to the relevant Adobe security advisory, thanks JD.

 

Michael Howard sent the following news article to one of our internal DL's this morning.  For some reason, I don't think it's going to hit the front page of Slashdot any time soon:

Serving as the latest reminder of that fact is Antioch University in Yellow Springs, Ohio, which recently disclosed that Social Security numbers and other personal data belonging to more than 60,000 students, former students and employees may have been compromised by multiple intrusions into its main ERP server.

The break-ins were discovered Feb. 13 and involved a Sun Solaris server that had not been patched against a previously disclosed FTP vulnerability, even though a fix was available for the flaw at the time of the breach, university CIO William Marshall said today.

                                                :

"When we went in and did a further investigation, we found that there was an IRC bot installed on the system," Marshall said.

So Antioch's Solaris systems were (a) compromised by an old vulnerability, and (b) being used as botnet clients, both of which the Slashdot crowd claims only happen on "Windoze" machines.

At what point do people pull their heads out of the sand and realize that computer security and patching disciplines are an industry-wide issue and not just a single platform issue?  Even after the Pwn2Own contest last month was won by a researcher who exploited a Flash vulnerability, the vast majority of the people commenting on the ZDNet article claimed that the issue was somehow "Windows only".  Ubuntu even published a blog post that claimed that they "won" (IMHO they didn't, because Shane has said that the only reason he chose not to attack the Ubuntu machine was that he was more familiar with Windows).  The reality is that nobody "wins" these contests (except maybe the security researcher who gets a shiny new computer at the end).  It's just a matter of time before the machine will get 0wned.

Ignoring stories like this makes people believe that somehow security issues are isolated to a single platform, and that in turn leaves them vulnerable to hackers.  It's far better to acknowledge that the IT industry as a whole has an issue with security and ask how to move forwards.

 

Edit: Ubunto->Ubuntu (oops :))

Daniel just returned from a 10 day trip to Italy where his school chamber choir performed at the 2008 Choir International Festival in Verona.

 

One of the Choir parents just sent out an email pointing to two clips of the choir performing:

Dravidian Dithyr:

 

 

Wanting Memories (they cut off the beginning and the end of the song):

It's cool to see the choir on the web.

 

Dan Fernandez over on the Channel 9 team just let me know that one of my earlier videos was featured in their new Video Spam Filter intro.

It's weird - I hadn't realized how much I swore.

 

Go figure that one out :).

I don't write about the SDL very much, because I figure that the SDL team does a good enough job of it on their blog, but I was reading the news a while ago and realized that one of the aspects of the SDL would have helped if our competitors were to adopt it.

 

A long time ago, I wrote a short post about "giblets", and they're showing up a lot in the news lately.  "Giblets" is a term coined by Steve Lipner, and it has entered the lexicon of "micro-speak".  Essentially a giblet is a chunk of code that you've included from a 3rd party.  Michael Howard wrote about them on the SDL blog a while ago (early January), and now news comes out that Google's Android SDK contains giblets that contain known exploitable vulnerabilities.

I find this vaguely humorous, and a bit troubling.  As I commented in my earlier post (almost 4 years ago), adding a giblet to your product carries with it the responsibility to monitor the security mailing lists to make sure that you're running the most recent (and presumably secure) version of the giblet.

What I found truly surprising was that Android development team had shipped code (even in beta) with those vulnerabilities.  Their development team should have known about the problem with giblets and never accepted the vulnerable versions in the first place.  That in turn leads me to wonder about the process management associated with the development of Android.

I fully understand that you need to lock down the components that are contained in your product during the development process; that's why fixes take time to propagate into distributions.  From watching FOSS bugs, the typical lifecycle of a security bug in FOSS code looks like this: a bug is found in the component and fixed quickly.  Then, over the next several months, the fix is propagated into the various distributions that contain the component.  So a fix for the bug is made very quickly (but is completely untested), and the teams that package up each distribution consume the fix and proceed to test it in the distribution.  As a result, distributions naturally lag behind fixes.  (Btw, the MSFT security vulnerabilities follow roughly the same sequence: the fix is usually known within days of the bug being reported, but it takes time to test the fix to ensure that it doesn't break things, especially since Microsoft patches vulnerabilities in multiple platforms and the fixes for all of them need to be released simultaneously.)

But even so, it's surprising that a team would release a beta that contained a version of one of its giblets that was almost 4 years old (according to the original report, it contained libPNG version 1.2.7, from September 12, 2004)!  This is especially true given the fact that the iPhone had a similar vulnerability found last year (ironically, the finder of this vulnerability was Tavis Ormandy of Google).  I'm also not picking on Google because of spite - other vendors like Apple and Microsoft were each bitten by exactly this vulnerability - 3 years ago.  In Apple's case, they did EXACTLY the same thing that the Android team did: They released a phone that contained a 3 year old vulnerability that had previously been fixed in their mainstream operating system.

 

So how would the SDL have helped the Android team?  The SDL requires that you track giblets in your code - it forces you to have a plan to deal with the inevitable vulnerabilities in the giblets.  In this case, SDL would have forced the development teams to have a process in place to monitor the vulnerabilities (and of course to track the history of the component), so they hopefully would never have shipped vulnerable components.  It also means that when a vulnerability is found after shipping, they would have a plan in place to roll out a fix ASAP.  This latter is critically important because history has shown us that when one component is known to have a vulnerability, the vultures immediately swoop in to find similar vulnerabilities in related code bases (on the theory that if you make a mistake once, you're likely to make it a second or third time).  In fact, that's another requirement of the SDL: When a vulnerability is found in a component, the SDL requires that you also look for similar vulnerabilities in related code bases.
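
As a rough illustration of what "tracking giblets" can look like in practice, here's a minimal sketch of an inventory check.  This is not how the SDL tooling works; the Giblet structure, the function names and the component names and version numbers used with it are all hypothetical placeholders, not real advisory data.

```cpp
#include <array>
#include <sstream>
#include <string>
#include <vector>

// A version as {major, minor, patch}; std::array compares lexicographically.
using Version = std::array<int, 3>;

// Parses "a.b.c" into a Version. Missing or malformed parts stay 0.
Version parse_version(const std::string& text) {
    Version v{{0, 0, 0}};
    char dot = 0;
    std::istringstream in(text);
    in >> v[0] >> dot >> v[1] >> dot >> v[2];
    return v;
}

// One entry in a (hypothetical) inventory of third-party components.
struct Giblet {
    std::string name;
    std::string shipped_version;   // what the product currently contains
    std::string minimum_version;   // oldest version the team considers safe
};

// Returns the names of giblets whose shipped version is older than the
// minimum the team has decided to accept.
std::vector<std::string> stale_giblets(const std::vector<Giblet>& inventory) {
    std::vector<std::string> stale;
    for (const Giblet& g : inventory) {
        if (parse_version(g.shipped_version) < parse_version(g.minimum_version)) {
            stale.push_back(g.name);
        }
    }
    return stale;
}
```

The code is the easy part.  The value comes from someone owning the inventory and updating the minimum versions whenever a security mailing list announces a fix.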

Yet another example where adopting the SDL would have helped to mitigate a vulnerability[1].

 

[1] Btw, I'm not saying that the SDL is the only way to solve this problem.  There absolutely are other methodologies that would allow these problems to be mitigated.  But when you're developing software that's going to be deployed connected to a network (any network), you MUST have a solution in place to manage your risk (and giblets are just one form of risk).  The SDL is Microsoft's way, and so far it's clearly shown its value.

Raymond sent me an email yesterday asking me to confirm an old Lan Manager slogan.

Back in the Lanman 2.0 days, Brian Valentine (who ran the Lanman group) made up a series of T-Shirts for the team with the words:

"Lan Manager... We're back and we're BAD".

I believe I still have one of those t-shirts.  It had a relatively snarky attitude, which I love (and why I loved working for Brian, he shared many of the same sentiments).  For the non-English speakers reading this, the use of "BAD" is an American idiom that means "nasty, in a really good way".

 

The reason that Raymond asked me the question was because of what apparently happened to the T-Shirt when it hit our international subsidiaries.  Not surprisingly, many of them wanted to print up their own version, but sometimes the results were... less than perfect.  According to one person, the Swedish had a hard time translating the "BAD" idiom.  So they apparently fell back on a literal translation of the slogan and printed up their own t-shirts, which said (translated back to English):

Lan Manager...  We are here again and we're not very good.

And now you know "The rest of the story(tm)"

A co-worker came by to ask what he thought was a coding "style" question that turned into a correctness issue, and I thought I'd share it.

 

Someone had defined two COM interfaces:

interface IFoo : IUnknown
{
    HRESULT FooMethod1();
    HRESULT FooMethod2();
}

They also had a factory interface IBar which had a method HRESULT GetFoo(IFoo **ppFoo).

 

As a result of new work, the team that owned IFoo wanted to extend IFoo.  To do this, they defined a new interface, IFoo2 that inherited from IFoo:

interface IFoo2 : IFoo
{
    HRESULT Foo2Method1();
    HRESULT Foo2Method2();
}

The team that owned IFoo (and IBar) decided that they didn't want to change the IBar interface to add a GetFoo2 method, feeling that the GetFoo method was "good enough".

My co-worker wanted to call GetFoo and cast the resulting IFoo object into an IFoo2 object (he knew that the IFoo he got was always going to be an IFoo2).  He was worried about the stylistic implications of doing the cast.

 

The problem with doing this turns out not to be a style issue, but instead to be a correctness issue.  Here's the problem.

Somewhere under the covers, there's a class CFoo that implements IFoo and IFoo2.  This is the object that will be returned by the IBar::GetFoo method.

When the compiler lays out this object, the compiler will lay out the data for the class as follows:

1 CFoo vtable
2 IFoo vtable with 1*sizeof(void *) adjustor thunk
3 IFoo2 vtable with 2*sizeof(void *) adjustor thunk
4 CFoo member variable 1 storage
5 CFoo member variable 2 storage
6 :
: :

When IBar::GetFoo returns, it returns a pointer to the 2nd element of the class (the adjustor thunk will ensure that the right thing happens when you call into the member functions).

The IFoo vtable is laid out in memory like this:

1 QueryInterface()
2 AddRef()
3 Release()
4 FooMethod1()
5 FooMethod2()

The IFoo2 vtable on the other hand is laid out in memory like this:

1 QueryInterface()
2 AddRef()
3 Release()
4 FooMethod1()
5 FooMethod2()
6 Foo2Method1()
7 Foo2Method2()

When the caller calls into a method on IFoo, the compiler will index into the vtable to find the pointer to the code that implements the specified method.  By casting from an IFoo to an IFoo2, my co-worker was telling the compiler "I know that this thing is also an IFoo2, so you should act like the vtable is really an IFoo2 vtable".

 

The first time that he called into Foo2Method1 or Foo2Method2 using this mechanism, if he was lucky, his code would crash.  If he was unlucky (and the random chunk of memory sitting at the end of the vtable for IFoo happened to be code that matched the function signature of Foo2Method2), he would simply corrupt memory.

All-in-all, a bad thing to do.

In addition, if the particular class implementation of IFoo chose to implement IFoo2 via COM aggregation (in other words, using the delegator pattern), his cast would defeat that.

 

The right thing to do in this case is to call QueryInterface on the IFoo object looking for IFoo2, then release the IFoo object since it's no longer needed.
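
Here's a minimal sketch of that pattern.  To keep it compilable anywhere it uses simplified stand-ins rather than the real COM headers: IUnknownLike, InterfaceId, the bool return values, the free function GetFoo (standing in for IBar::GetFoo) and the values the methods return are all assumptions made for the illustration.  Real code would use IUnknown, HRESULT and IIDs from the Windows SDK.

```cpp
// Stand-in for COM interface IDs (real code would use IIDs).
enum class InterfaceId { Unknown, Foo, Foo2 };

// Stand-in for IUnknown; bool replaces HRESULT to keep the sketch small.
struct IUnknownLike {
    virtual bool QueryInterface(InterfaceId id, void** out) = 0;
    virtual unsigned long AddRef() = 0;
    virtual unsigned long Release() = 0;
};

struct IFoo : IUnknownLike {
    virtual int FooMethod1() = 0;
    virtual int FooMethod2() = 0;
};

struct IFoo2 : IFoo {
    virtual int Foo2Method1() = 0;
    virtual int Foo2Method2() = 0;
};

// The object behind the interfaces. It decides which pointer to hand out
// for each interface; the caller never has to guess at the layout.
class CFoo final : public IFoo2 {
public:
    bool QueryInterface(InterfaceId id, void** out) override {
        switch (id) {
        case InterfaceId::Unknown: *out = static_cast<IUnknownLike*>(this); break;
        case InterfaceId::Foo:     *out = static_cast<IFoo*>(this);         break;
        case InterfaceId::Foo2:    *out = static_cast<IFoo2*>(this);        break;
        default:                   *out = nullptr;                          return false;
        }
        AddRef();  // every successful QueryInterface hands out a new reference
        return true;
    }
    unsigned long AddRef() override { return ++refs_; }
    unsigned long Release() override {
        const unsigned long left = --refs_;
        if (left == 0) delete this;
        return left;
    }
    int FooMethod1() override { return 1; }
    int FooMethod2() override { return 2; }
    int Foo2Method1() override { return 21; }
    int Foo2Method2() override { return 22; }

private:
    unsigned long refs_ = 1;  // the creator holds the first reference
};

// Hypothetical stand-in for IBar::GetFoo.
bool GetFoo(IFoo** ppFoo) {
    *ppFoo = new CFoo();
    return true;
}

// Ask the object for IFoo2, then release the IFoo that is no longer needed.
// Returns nullptr if the object doesn't support IFoo2.
IFoo2* GetFoo2ViaQueryInterface() {
    IFoo* foo = nullptr;
    if (!GetFoo(&foo)) return nullptr;

    void* raw = nullptr;
    IFoo2* foo2 = nullptr;
    if (foo->QueryInterface(InterfaceId::Foo2, &raw)) {
        foo2 = static_cast<IFoo2*>(raw);
    }
    foo->Release();
    return foo2;
}
```

The caller ends up owning exactly one reference (the one QueryInterface added) and releases it when done.  Because the object itself picks the pointer it returns, the same calling code keeps working if the implementation later moves IFoo2 to an aggregated object.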

 

Edit: Fixed typos.

Mostly because Larry's been swamped with work :).

 

I've been heads down working on stuff and it's pretty much monopolized my time for the past month or so.  The good news is that my schedule has relaxed a bit (although I still have what seems to be a never ending stream of threat models to review).

Michael Howard just announced that we've hired Crispin Cowan!

This is incredibly awesome, I have a huge amount of respect for Crispin, he's one of the most respected researchers out there.

Among other things, Crispin's the author and designer of AppArmor, which adds sandboxing capabilities to Linux.  Apparently he's going to be working on the core Windows Security team, which is absolutely cool.

 

I'm totally stoked to hear this - I literally let out a whoot when I read Michael's blog post.

 

 

Welcome aboard Crispin :).

Continuing in a stream of blog posts that started with "18 years ago, Today", "Nineteen years ago", and "20 years and going strong", today Valorie and I celebrate our 21st wedding anniversary.  We can finally go out and have a drink to celebrate :)[1].

I'm still as much in love with Valorie as I was on the day we were married, oh so many years ago.  It's been a long wonderful road. 

 

Valorie is without question my best friend, and I love her.

 

 

Happy Anniversary, Dear.

 

PS: One of the cards might be a bit hard to find this year.

 

[1] Of course, since Valorie doesn't drink and I only rarely drink this doesn't really change things, but whatever :).

Someone sent the following screen shot to one of our internal troubleshooting aliases.  They wanted to know what the "Name Not Available" slider meant.

[Screenshot: the volume mixer showing a "Name Not Available" slider]

 

The audio system on Vista keeps track of the apps that are playing sounds (it has to, to be able to display the information on what apps are playing sounds :)).  It keeps this information around for a period of time after the application has made the sound to enable the scenario where your computer makes a weird sound and you want to find out which application made the noise.

The system only keeps track of the PID for each application, it's the responsibility of the volume mixer to convert the PID to a reasonable name (the audio service can't track this information because of session 0 isolation).

This works great, but there's one possible problem: If an application exits between the time when the application made a noise and the system times out the fact that it played the noise, then the volume mixer has no way of knowing what the name of the application that made the noise was. In that case, it uses the "Name Not Available" text to give the user some information.
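
The lookup-with-fallback logic described above can be sketched in a few lines.  This is not the volume mixer's actual code: ProcessTable and display_name_for are hypothetical names, and the table is just a stand-in for "whatever processes are still running when the mixer asks".

```cpp
#include <map>
#include <string>

// Hypothetical snapshot of the processes that are still running,
// keyed by PID, with the display name the mixer would show.
using ProcessTable = std::map<unsigned long, std::string>;

// Converts a PID remembered by the audio system into a display name.
// If the process has already exited there is nothing to look up, so the
// caller gets the placeholder text instead.
std::string display_name_for(unsigned long pid, const ProcessTable& running) {
    const auto it = running.find(pid);
    if (it == running.end()) {
        return "Name Not Available";
    }
    return it->second;
}
```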

Those of you who know me (and my family) from beyond my blog know that among our many passions, one of the biggest is books.  And we've got a lot of them.

A couple of years ago, Valorie got me a Flic barcode scanner and a copy of the program Book Collector.  I've been using it steadily since then adding the books from the biggest of the 4 different libraries in our house (yeah, we've got 4 separate libraries in the house, I did say we read a lot of books - they are: grown up fiction, kids fiction, non fiction and teaching materials).

Since I was fortunate enough to take the month of December off this year (one of the benefits of working at Microsoft for as long as I have is that I get a lot of vacation time, some of which was due to disappear), I decided to take on the project of working through the books in the big library.

Since I've been working on it pretty much every day off and on for the past 4 weeks, I've looked at a LOT of books recently.

I've developed a pretty good workflow (it could unquestionably be improved, but this one works) for the process of scanning books:

  1. Start Book Collector.
  2. Head into the library from my computer with the Flic scanner in hand.
  3. Start where I last scanned.
  4. Take a book off the shelf: 
    1. If the back cover contains a UPC code, check the barcode - if it starts 978 or 979, scan it and go to step 5.
    2. Open the book to the inside front flap.  If it contains a UPC code, scan it and go to step 5.
    3. Put the book on the "to be scanned manually" pile.
  5. Take the books on the "to be scanned manually" pile back to the computer.
  6. Plug the scanner into the computer, which will cause the UPC codes to be read in and let the program search for the books.
  7. Add the books found by the scanner to the program.
  8. For each book on the "to be scanned manually" pile:
    1. Check for an ISBN number on the back of the book (usually near the UPC code).  If it's there, enter it and go to step 9.
    2. Check the spine of the book.  If one of the numbers there looks like an ISBN number, enter it and go to step 9 (I often combine step 8.1 and 8.2 together).
      1. If the number on the spine looks like an ISBN number but is 1 digit too short, try typing it in but add an "X" for the last digit.
    3. Look at the page after the title page of the book - sometimes there's an ISBN number there, if so, enter it and go to step 9.
    4. If no ISBN number is found, put the book on the "to be entered manually" pile.
  9. Have the program scan for the ISBN numbers you just entered, verifying each one as it's found, then add the books if they're correct.
  10. For each book on the "to be entered manually" pile:
    1. Enter the title and author for the book manually
    2. Search for the book.
    3. Walk through the items found adding the best fit ("best fit" can be subjective, I try to find an entry with accurate cover art or an accurate publishers number - but the only books that fall into that final set of books are typically more than 20 years old, so your ability to find accurate information on those books is spotty).
  11. Pick up the books you took from the library and go back to step 2.
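
Two of the checks in the workflow above (the 978/979 prefix test in step 4.1 and the "add an X" trick in step 8.2.1) come straight from how ISBNs are built, and they can be sketched in code.  The function names are made up for this illustration, and this is not what Book Collector does internally.  The check character uses the standard ISBN-10 rule: weight the first nine digits 10 down to 2, and pick the last character so the total is divisible by 11, with "X" standing for 10.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Given the first nine digits of an ISBN-10, returns the check character
// ('0'-'9' or 'X'). Returns '\0' if the input isn't exactly nine digits.
char isbn10_check_char(const std::string& first_nine) {
    if (first_nine.size() != 9) return '\0';
    int sum = 0;
    for (std::size_t i = 0; i < 9; ++i) {
        const unsigned char c = static_cast<unsigned char>(first_nine[i]);
        if (!std::isdigit(c)) return '\0';
        sum += (10 - static_cast<int>(i)) * (c - '0');  // weights 10, 9, ..., 2
    }
    const int check = (11 - sum % 11) % 11;
    return check == 10 ? 'X' : static_cast<char>('0' + check);
}

// True if a 13-digit barcode starts with 978 or 979, the prefixes used for
// books. Only the length and prefix are checked here, not the check digit.
bool is_book_barcode(const std::string& digits) {
    return digits.size() == 13 &&
           (digits.compare(0, 3, "978") == 0 || digits.compare(0, 3, "979") == 0);
}
```

For a spine number that is one digit short, calling isbn10_check_char on the nine digits shows whether "X" really is the missing character, or whether a digit simply got worn off the spine.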

 

That's it.  So far I'm up to just short of 3000 books scanned, and I'm in the middle of the letter "S" in the biggest of the 4 libraries.  I've added more than 2 thousand books to the program this month (wow).

Some things I've noticed in the process...

  • We have a lot of books.
  • As far as the scanning process goes, books fall into roughly five categories:
    1. Those with ISBN numbers on their UPC code - this includes most graphic novels, hard cover books and trade books.  Many new paperback books have ISBN13 numbers on their UPC code, but some still don't.
    2. Those with ISBN numbers in a bar code on the inside front cover - essentially this category contains all paperback books from some time in the early 1990s to the present.
    3. Those without ISBN numbers in a bar code on the inside front cover, but with UPC codes that include the ISBN number.  Essentially this category contains all books from the mid 1980s to the early 1990s.
    4. Those without a UPC code on the book, but with an ISBN number (or sometimes a SBN number) inside the book.  This includes most (but not all) books from the 1970s.
    5. Those without ISBN numbers at all.  This includes most books before the early 1970s (yeah, I've got paperback books that date from the mid 1960s).
  • For books that have ISBN numbers, the amount of information available about the book depends highly on how new it is.  For those that post-date Amazon and Barnes&Noble (the primary data sources for Book Collector), the information available is quite good (including reasonably accurate book cover images).  For older books, the information available is spotty, usually dependent on the information that 3rd party sellers provide to the various online retailers.
  • We have an awful lot of books.

I'm only beginning the process of taming the book collection, and I've not even started thinking about dealing with how to maintain the library going forward (but I've got some ideas).  As I said, my workflow above could be improved (for instance, I should use a laptop and take the computer to the library to deal with the "to be scanned manually" pile instead of schlepping the pile back and forth).

But working through the piles has absolutely been an enjoyable process - I've also appreciated the opportunity to re-discover old friends, which is always a good thing.
