Engineering Windows 7

Welcome to our blog dedicated to the engineering of Microsoft Windows 7

What we do with a bug report?

This has been a busy couple of days for a few of us on the team as we had a report of a bug in Windows 7. The specifics of the issue are probably not as important as a discussion over how we will manage these types of situations down the road and so it seems like a good time to provide some context and illustrate our process, using this recent example.

This week a report on a blog described a crashing issue in Windows 7. The steps to reproduce the crash were pretty easy: run chkdsk /r on a non-system drive, and the machine would crash after consuming system memory. Because it was easy to “reproduce”, the reports of this issue spread quickly. Subsequent posts and the comments across the posts indicated that the issue seemed to have been reproduced by others. That is, the two characteristics of the report were seen: (a) consumption of lots of memory and (b) crashing.

Pretty quickly, I started getting a lot of mail personally on the report. Like many of you, the first thing I did was try it out. And as you might imagine I could not reproduce the crash, though I did see the memory usage. I tried it on another machine and saw the same behavior. In both cases the machine functioned normally during and after the chkdsk. As I frequently do, I answered most of the mail I received and started asking people for steps to reproduce the crash and to share system dump files. The memory usage did not worry me quite as much as the crash. I began having a number of interesting mail threads, but we didn’t have any leads on a repro case nor did we have a crash dump to work with.

Of course I was not the first Microsoft person to see this. The file system team immediately began to look into the issue. They too were unable to reproduce the crash and from their perspective the memory usage was by design and was a specific Windows 7 change for this scenario (the /r flag grabs an exclusive lock and repairs a disk and so our assumption is you’d really like the disk to be fixed before you do more stuff on the machine, an assumption validated by several subsequent third party blog posts on this topic). We cast the net further and continued looking for crash dumps and reports. As described below we have quite a few tools at our disposal.

While we continued to investigate, the mail I was getting was escalating in tone and more importantly one of the people I responded to mentioned our email exchange in a blog post. So in my effort to have a normal email dialog I ended up in the thick of the discussion. As I have done quite routinely during the development of Windows 7, I added a comment on the original blog (and the blog where this particular email friend was commenting) outlining the steps we are taking and the information we knew to date. Interestingly (though not unfortunately) just posting the comment drew even more attention to the issue being raised. I personally love being a member of the broader community and enjoy being a casual contributor even when it seems to cause a bit of a stir.

It is worth just describing the internal process that goes on when we receive a report of a crashing issue. Years ago we had one of two reactions. Either we would just throw up our arms and surrender as we had no hope of finding the bug, or we would drop everything and start putting people on airplanes with terminal debuggers in the hopes of finding a reproducible case. Neither of these is particularly effective and the latter, while very heroic sounding, does not yield results commensurate with effort. Most importantly while there might be a crash, we had no idea if that was the only instance or if lots more people were seeing or would see the crash. We were working without any data to inform our decisions.

With the internet and telemetry built into our products (not just Windows 7) we now have a much clearer view of the overall health of the software. So when we first hear a report of a crash we check to see if we’re seeing the crash happen on the millions of machines that are out there. This helps us in aggregate, but of course does not help us if a crash is specific to one configuration. However, a crash that is specific to one configuration will still show up if there is any statistically relevant sampling of machines, and given the size of the user base this is almost certain to be the case. We’re able to, for example, query the call stacks of all crashes reported to see if a particular program is on the stack.
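To make the idea of querying call stacks concrete, here is a minimal sketch. It is not the actual telemetry tooling, which this post does not describe: the `CrashReport` record, its fields, the `reports_with_module` helper, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CrashReport:
    """Hypothetical crash record; the real telemetry format is not described here."""
    report_id: str
    stack: list[str]  # module names on the call stack, innermost frame first


def reports_with_module(reports, module):
    """Return the reports whose call stack contains the given module (case-insensitive)."""
    wanted = module.lower()
    return [r for r in reports if any(frame.lower() == wanted for frame in r.stack)]


# Made-up sample data, for illustration only.
sample = [
    CrashReport("r1", ["ntfs.sys", "ntoskrnl.exe"]),
    CrashReport("r2", ["addin.dll", "winword.exe"]),
    CrashReport("r3", ["CHKDSK.EXE", "ntdll.dll"]),
]

hits = reports_with_module(sample, "chkdsk.exe")
print([r.report_id for r in hits])
```

An empty result from a query like this across a large population of reports is what makes it reasonable to say a crash is not showing up broadly, even if it cannot rule out a rare configuration.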

We have a number of tools at our disposal if we are seeing a crash in telemetry. You might have even seen these at work if you crash. We can increase (with consent) the amount of data asked for. We can put up a knowledge base article as a response to a crash (and you are notified in the Windows 7 Action Center). We can even say “hey call us”. As crazy as that one might sound, sometimes that is what can help. If a crashing issue in an already shipping product suddenly appears then something changed: a new hardware device, new device driver, or other software likely caused the crash to appear far more frequently. Often a simple confirmation of what changed helps us to diagnose the issue. I remember one of the first times we saw this was when Word unexpectedly started crashing for people one day. We hadn’t changed anything. It turned out a new version of a popular add-in had been released and the crash was happening in the add-in, but of course end-users only saw Word crashing. We quickly put up instructions to remove the add-in while in parallel working with the ISV to push out a fix. This ability to see the changing landscape, diagnose, and respond to a problem has radically changed how we think of issues in the product.

We are constantly investigating both new and frequently occurring issues (including crashes, hangs, device not found, setup failures, potential security issues, and so on). In fact we probably look into on the order of hundreds of issues in any given month as we work with our enterprise and OEM customers (and therefore hardware partners, ISVs, etc.). Often we find that issues are resolved by code changes outside core Windows code (such as with drivers, firmware, or ISV code). This isn’t about dodging responsibility but helping to fix things at the root cause. And we also make many code changes in Windows, which are seen as monthly updates, hotfixes, and then service pack rollups. The vast majority of things we fix are not applicable broadly and hence not released with immediate urgency—if something is ever broadly applicable we will make the call to release it broadly. It is very important for everyone to understand how seriously we take the responsibility of making sure there are no critical issues impacting a broad set of customers, while also balancing the volume of changes we push out broadly.

To be specific about the investigation around the chkdsk utility, let’s look at how we dove into this over the past couple of days. We first looked through our crash telemetry (both at the user level and “blue screen” level) and found no reported crashes of chkdsk. We of course looked through our existing reports of issues that came up during the development of Windows 7, but we didn’t see anything at all there. We queried the call stacks of existing reported crashes (of all kinds, since this was reported) and we did not find any crashes with chkdsk.exe running while crashing. We then began automated test runs on a broad set of machines; these ran overnight and continued for 2 days. We also saw reports related to a specific hardware configuration, so we set up over 40 machines based on variants of that chipset, driver, and firmware and ran those tests. We were not hitting any crashes (as mentioned, the memory usage was already understood). Because some were saying the machines were non-responsive we also looked for that in manual tests and didn’t see anything. We also broadened this with a global request to Microsoft folks to try things out (we have quite a few unique configs when you think of all of our offices around the world) and so we had several hundred more test runs going. We also had reports of the crash happening when running without a pagefile. That could be the case, but it would not be an issue with this utility, as any program that requests more memory than is physically available would cause things to tip over, and this configuration is not recommended for general purpose use (and this appears to be the common thread on the small number of non-reproducible crashes). Folks interested might read Mark’s blog on the topic of pagefiles in general. While we did not identify anything of note, that does not rule out the possibility of a problem, but at this point the chances of any broad issue are extremely small.
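The pagefile point comes down to simple arithmetic. As a rough sketch (the `can_commit` function and all figures are hypothetical, and the model deliberately ignores details of real memory management), the memory a system can commit is approximately physical RAM plus pagefile size, so removing the pagefile lowers the ceiling for every program, not just chkdsk:

```python
def can_commit(request_gb, committed_gb, ram_gb, pagefile_gb):
    """Simplified model: a request succeeds only if total commit stays within RAM + pagefile."""
    commit_limit_gb = ram_gb + pagefile_gb
    return committed_gb + request_gb <= commit_limit_gb


# Illustrative numbers only: 4 GB of RAM, 3 GB already committed,
# and a program asks for 2 GB more.
print(can_commit(2, 3, ram_gb=4, pagefile_gb=4))  # with a pagefile
print(can_commit(2, 3, ram_gb=4, pagefile_gb=0))  # without a pagefile
```

In this toy model the same request fits when a pagefile is present and fails when it is absent, regardless of which program makes it.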

In the meantime, we continue to look through external blogs, forums and other reports of crashes to see if we can identify any reproducible cases of this. While we don’t contact everyone, we do contact people if the forum and report indicate this has a good chance of yield. In all fairness, it probably doesn’t help us when there’s a lot of “smoke” while we’re trying to find the fire. We had a lot of “showstopper” comments piling on but not a lot of additional data including a lack of a reproducible case or a crash dump.

This type of work will continue until we have satisfied ourselves that we have systematically ruled out a crash or defined the circumstances where a crash can happen. Because this is a hardware/software related issue we will also invite input from various IHVs on the topic. In this case, because it is disk related we can’t rule out the possibility that in fact the disk was either failing or about to fail and the excessive use of the disk during a /r repair would in fact generate a failure. And while the code is designed to handle these failures (as you can imagine) there is the possibility that the specific failure is itself not handled well. In fact, in our lab (running tests continuously for a few days) we had one failure in this regard and the crash was in the firmware of the controller for the disk. Obviously we’ll continue to investigate this particular issue.

I did want folks to know just how seriously we take these issues. Sometimes blogs and comments get very excited. When I see something like “showstopper” it gets my attention, but it also doesn’t help us to have a constructive and rational investigation. Large software projects are by nature extremely complex. They often have issues that are dependent on the environment and configuration. And as we know, as deterministic as software is supposed to be, sometimes issues don’t reproduce. We have a pretty clear process on how we investigate reports and we focus on making sure Windows remains healthy even in the face of a changing landscape. With this post, I wanted to offer a view into some specifics but also into the general issue of sounding alarms.

It is always cool to find a bug in software. Whether it is an ATM, movie ticket machine, or Windows, we all feel a certain sense of pride in identifying something that doesn’t work like we think it should. Windows is a product of a lot of people, not just those of us at Microsoft. When something isn’t as it should be we work with a broad set of partners to make sure we can effectively work through the issue. I hope folks recognize how seriously we take this responsibility. We all know we’re going to keep looking at issues, and we will have issues in the future that will require us to change the code to maintain the level of quality we know everyone expects of Windows.

--Steven

  • Problem signature:

    Problem Event Name: BlueScreen
    OS Version: 6.1.7600.2.0.0.256.48
    Locale ID: 1033

    Additional information about the problem:
    BCCode: 4e
    BCP1: 00000099
    BCP2: 00033786
    BCP3: 00000000
    BCP4: 0003D086
    OS Version: 6_1_7600
    Service Pack: 0_0
    Product: 256_1

    Files that help describe the problem:
    C:\Windows\Minidump\092511-14055-01.dmp
    C:\Users\hemali.chauhan\AppData\Local\Temp\WER-55411-0.sysdata.xml

    Read our privacy statement online:
    go.microsoft.com/fwlink

    If the online privacy statement is not available, please read our privacy statement offline:
    C:\Windows\system32\en-US\erofflps.txt

    How can I solve this bug in Windows 7 Pro?

  • I'd like to report a bug concerning keyboard layout handling, where can I do this?

    [My lowercase dotted name @gmail.com]

  • Since RAM is limited, and CHKDSK keeps running after it has used all the available RAM, there should be an option to limit how much RAM CHKDSK uses.

  • The fix for this obvious BUG (it's not by design like MS would like you to believe) is to simply use another disk utility. One such OS that has utilities that actually work is called Linux. Simply set your disk to offline mode, fire up a VMWare instance with Linux in it (or install Linux and forget WinHose even exists), add the physical HD, and then proceed to run whatever disk utilities you need. Problem solved.

  • "the /r flag grabs an exclusive lock and repairs a disk and so our assumption is you’d really like the disk to be fixed before you do more stuff on the machine, an assumption validated by several subsequent third party blog posts on this topic"

    Frankly, this is an asinine assumption.

    There are nine hard drives in this system, the one I am using chkdsk /r on can be tied up for three or four hours for a scan, but the rest of my system cannot and should not. There is no reason for chkdsk to be using 30GiB of memory; it's not going to make anything finish faster (the scan finishes in almost exactly the same amount of time no matter how much memory is in the system), it just bogs down the rest of the system for no reason.

    Yes, I can manually force the process to low priority, but there is still a perceptible slow down when starting and performing other tasks.

    There might have been some vague, almost plausible, rationale for chkdsk to use all available memory if this behavior was limited to consumer OSes, but it's not. This is a 24/7 system running Server 2008 R2 Enterprise. It hasn't been restarted in six months, and is always doing something important: chkdsk behaves identically in this OS as it does in Windows 7.

    Yes, I can use other tools to verify the integrity of my disks, but why should I have to?

    If this is truly by design, it's poorly designed. There is no reason a verification scan on one storage drive need impact the rest of my system in any way.
