Welcome to our blog dedicated to the engineering of Microsoft Windows 7
This has been a busy couple of days for a few of us on the team as we had a report of a bug in Windows 7. The specifics of the issue are probably not as important as a discussion over how we will manage these types of situations down the road and so it seems like a good time to provide some context and illustrate our process, using this recent example.
This week a report on a blog described a crashing issue in Windows 7. The steps to reproduce the crash were pretty easy: run chkdsk /r on a non-system drive and the machine would crash after consuming system memory. Because it was easy to “reproduce”, the reports of this issue spread quickly. Subsequent posts and the comments across the posts indicated that the issue seemed to have been reproduced by others; that is, the two characteristics of the report were seen: (a) consumption of lots of memory and (b) crashing.
Pretty quickly, I started getting a lot of mail personally on the report. Like many of you, the first thing I did was try it out. And as you might imagine I did not reproduce both issues, though I did see the memory usage. I tried it on another machine and saw the same behavior. In both cases the machine functioned normally during and after the chkdsk. As I frequently do, I answered most of the mail I received and started asking people for steps to reproduce the crash and to share system dump files. The memory usage did not worry me quite as much as the crash. I began having a number of interesting mail threads, but we didn’t have any leads on a repro case nor did we have a crash dump to work with.
Of course I was not the first Microsoft person to see this. The file system team immediately began to look into the issue. They too were unable to reproduce the crash and from their perspective the memory usage was by design and was a specific Windows 7 change for this scenario (the /r flag grabs an exclusive lock and repairs a disk and so our assumption is you’d really like the disk to be fixed before you do more stuff on the machine, an assumption validated by several subsequent third party blog posts on this topic). We cast the net further and continued looking for crash dumps and reports. As described below we have quite a few tools at our disposal.
While we continued to investigate, the mail I was getting was escalating in tone and more importantly one of the people I responded to mentioned our email exchange in a blog post. So in my effort to have a normal email dialog I ended up in the thick of the discussion. As I have done quite routinely during the development of Windows 7, I added a comment on the original blog (and the blog where this particular email friend was commenting) outlining the steps we are taking and the information we knew to date. Interestingly (though not unfortunately) just posting the comment drew even more attention to the issue being raised. I personally love being a member of the broader community and enjoy being a casual contributor even when it seems to cause a bit of a stir.
It is worth just describing the internal process that goes on when we receive a report of a crashing issue. Years ago we had one of two reactions. Either we would just throw up our arms and surrender as we had no hope of finding the bug, or we would drop everything and start putting people on airplanes with terminal debuggers in the hopes of finding a reproducible case. Neither of these is particularly effective and the latter, while very heroic sounding, does not yield results commensurate with effort. Most importantly while there might be a crash, we had no idea if that was the only instance or if lots more people were seeing or would see the crash. We were working without any data to inform our decisions.
With the internet and telemetry built into our products (not just Windows 7) we now have a much clearer view of the overall health of the software. So when we first hear a report of a crash we check to see if we’re seeing the crash happen on the millions of machines that are out there. This helps us in aggregate, but of course does not help us if a crash is one specific configuration. However, a crash that is one specific configuration will still show up if there is any statistically relevant sampling of machines and given the size of the user base this is almost certain to be the case. We’re able to, for example, query the call stacks of all crashes reported to see if a particular program is on the stack.
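To make that last idea concrete, here is a minimal sketch in Python of what such a query could look like. The CrashReport record, the "module!function" frame format, and the sample data are all hypothetical and chosen for illustration; this is not the actual telemetry schema or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashReport:
    """Hypothetical, simplified crash record (not a real telemetry schema)."""
    machine_id: str
    call_stack: tuple  # frames written as "module!function", innermost first


def crashes_involving(reports, module_name):
    """Return the reports whose call stack has at least one frame in module_name."""
    wanted = module_name.lower()
    return [
        report for report in reports
        if any(frame.split("!", 1)[0].lower() == wanted
               for frame in report.call_stack)
    ]


def machines_affected(reports, module_name):
    """Count distinct machines that reported a crash involving module_name."""
    return len({r.machine_id for r in crashes_involving(reports, module_name)})


# Made-up sample data, for illustration only.
SAMPLE_REPORTS = [
    CrashReport("m1", ("addin.dll!Render", "winword.exe!Main")),
    CrashReport("m2", ("addin.dll!Render", "winword.exe!Main")),
    CrashReport("m2", ("ntfs.sys!Read", "explorer.exe!Main")),
]
```

With records shaped like this, asking whether chkdsk.exe ever appears on a crashing stack is a single call to crashes_involving, and machines_affected gives a rough sense of how widespread a crash is rather than how loudly it is being reported.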
We have a number of tools at our disposal if we are seeing a crash in telemetry. You might have even seen these at work if you crash. We can increase (with consent) the amount of data asked for. We can put up a knowledge base article as a response to a crash (and you are notified in the Windows 7 Action Center). We can even say “hey call us”. As crazy as that one might sound, sometimes that is what can help. If a crashing issue in an already shipping product suddenly appears then something changed: a new hardware device, new device driver, or other software likely caused the crash to appear far more frequently. Often a simple confirmation of what changed helps us to diagnose the issue. I remember one of the first times we saw this was when one day unexpectedly Word started crashing for people. We hadn’t changed anything. It turned out a new version of a popular add-in had been released and the crash was happening in the add-in, but of course end-users only saw Word crashing. We quickly put up instructions to remove the add-in while in parallel working with the ISV to push out a fix. This ability to see the changing landscape, diagnose, and respond to a problem has radically changed how we think of issues in the product.
We are constantly investigating both new and frequently occurring issues (including crashes, hangs, device not found, setup failures, potential security issues, and so on). In fact we probably look into on the order of hundreds of issues in any given month as we work with our enterprise and OEM customers (and therefore hardware partners, ISVs, etc.). Often we find that issues are resolved by code changes outside core Windows code (such as with drivers, firmware, or ISV code). This isn’t about dodging responsibility but helping to fix things at the root cause. And we also make many code changes in Windows, which are seen as monthly updates, hotfixes, and then service pack rollups. The vast majority of things we fix are not applicable broadly and hence not released with immediate urgency—if something is ever broadly applicable we will make the call to release it broadly. It is very important for everyone to understand how seriously we take the responsibility of making sure there are no critical issues impacting a broad set of customers, while also balancing the volume of changes we push out broadly.
To be specific about the investigation around the chkdsk utility, let’s look at how we dove into this over the past couple of days. We first looked through our crash telemetry (both at the user level and “blue screen” level) and found no reported crashes of chkdsk. We of course looked through our existing reports of issues that came up during the development of Windows 7, but we didn’t see anything at all there. We queried the call stacks of existing reported crashes (of all kinds, since this was reported) and we did not find any crashes with chkdsk.exe running while crashing. We then began automated test runs on a broad set of machines; these ran overnight and continued for 2 days. We also saw reports related to a specific hardware configuration, so we set up over 40 machines based on variants of that chipset, driver, and firmware and ran those tests. We were not hitting any crashes (as mentioned, the memory usage was already understood). Because some were saying the machines were non-responsive we also looked for that in manual tests and didn’t see anything. We also broadened this into a global request for Microsoft folks to try things out (we have quite a few unique configs when you think of all of our offices around the world) and so we had several hundred more test runs going. We also had reports of the crash happening when running without a pagefile. That could be the case, but that would not be an issue with this utility, as any program that requests more memory than is physically available would cause things to tip over, and this configuration is not recommended for general purpose use (and this appears to be the common thread on the small number of non-reproducible crashes). Folks interested might read Mark’s blog on the topic of pagefiles in general. While we did not identify anything of note, that does not rule out the possibility of a problem, but at this point the chances of any broad issue are extremely small.
In the meantime, we continue to look through external blogs, forums and other reports of crashes to see if we can identify any reproducible cases of this. While we don’t contact everyone, we do contact people if the forum and report indicate this has a good chance of yield. In all fairness, it probably doesn’t help us when there’s a lot of “smoke” while we’re trying to find the fire. We had a lot of “showstopper” comments piling on but not a lot of additional data including a lack of a reproducible case or a crash dump.
This type of work will continue until we have satisfied ourselves that we have systematically ruled out a crash or defined the circumstances where a crash can happen. Because this is a hardware/software related issue we will also invite input from various IHVs on the topic. In this case, because it is disk related we can’t rule out the possibility that in fact the disk was either failing or about to fail and the excessive use of the disk during a /r repair would in fact generate a failure. And while the code is designed to handle these failures (as you can imagine) there is the possibility that the specific failure is itself not handled well. In fact, in our lab (running tests continuously for a few days) we had one failure in this regard and the crash was in the firmware of the controller for the disk. Obviously we’ll continue to investigate this particular issue.
I did want folks to know just how seriously we take these issues. Sometimes blogs and comments get very excited. When I see something like “showstopper” it gets my attention, but it also doesn’t help us to have a constructive and rational investigation. Large software projects are by nature extremely complex. They often have issues that are dependent on the environment and configuration. And as we know, as deterministic as software is supposed to be, sometimes issues don’t reproduce. We have a pretty clear process on how we investigate reports and we focus on making sure Windows remains healthy even in the face of a changing landscape. With this post, I wanted to offer a view into some specifics but also into the general issue of sounding alarms.
It is always cool to find a bug in software. Whether it is an ATM, movie ticket machine, or Windows we all feel a certain sense of pride in identifying something that doesn’t work like we think it should. Windows is a product of a lot of people, not just those of us at Microsoft. When something isn’t as it should be we work with a broad set of partners to make sure we can effectively work through the issue. I hope folks recognize how seriously we take this responsibility. We all know we’re going to keep looking at issues, and we will have issues in the future that will require us to change the code to maintain the level of quality we know everyone expects of Windows.
it must be a fairly tedious process to filter out information that's actually useful with regard to the actual issue at hand from the interwebz. case in point being that most reports i've seen on the issue illustrate that most users are unaware that /r eats RAM for breakfast by design.
thanks for the followup on the topic. indeed i was also not able to reproduce the "showstopper" bug. i completely agree about the complexity of the whole ecosystem of windows. the topic got a lot of unnecessary attention in the blogs when no one was able to reproduce the crash. People are so obsessed with win7 that any small issue creates a lot of noise. anyways, looking forward to the GA.
Excellent article. I believe that the 66 articles in this blog (and I hope there will be many more to come) would make an excellent book about how to build a commercial operating system.
The transparency that your team and Microsoft have shown in this and all of the Windows 7 related blogs has been phenomenal.
I have to agree with RonV (and many others) that most of the articles here are very interesting.
Here I have some suggestions and wishes for further topics which I think are widely discussed among Windows users.
1. There is a widespread belief that the Windows OS performs best when it is freshly installed but then gets slower with time. Can you confirm that with the data you have collected? Have you done some work concerning that issue? If yes, why is that? Is the registry really a factor in that matter?
2. How much do background processes like updaters influence my PC's performance? Is it reasonable to deactivate third party services with msconfig or does Windows handle them in a way that makes them only use resources if they are really active?
As far as the crash is concerned, I did not have any major crash on the RC, but sometimes the computer hangs on the welcome screen and doesn't move. Then we need to boot the computer in safe mode. That works, and guess what: without making any changes, if we restart the computer it boots.
Having received my fair share of bug reports from crazies and liars, I don't envy you having to wade through all the junk a company of your size must receive by the bucketload.
Excellent explanation..Thanks Steven..
We like Win7 a lot..it's really stable & good.
Thank you for this blog. It is most helpful to those of us who need to explain to users that some of the stories about problems may not be widespread or reproducible.
The various Windows teams seem to be all agile, however the shell team (which btw needs to be fired and re-formed) I think just sits on a bug or denies it's a bug in the first place and most of the time fails to understand what the reporter is exactly describing conveniently ignoring the issue. XP will continue to dominate the Windows marketshare until the XP shell and Windows Explorer are restored completely with all destroyed features.
I kind of envy the infrastructure Microsoft has created for tracking not-reproducible bugs down...
About "performance". Windows performance degrades over time because Windows is not user-friendly. I say this because it would be very easy to help the user improve things, but Windows does not do it.
While I don't have access to MS's telemetry data I know that a Windows system slows down over time for 4 general reasons:
- Hard Drive has < 50% of free capacity, sometimes even less than 20% which is a showstopper. Windows should suggest at 50% point to upgrade system's hard drive.
- %TEMP% directories have tens of thousands of files and folders. These are buried deep in the folder hierarchy, hidden and not accessible to normal users. Users generally dismiss any Disk Cleanup dialogs should one appear. Windows must proactively kill files in the Temp folders after, say, 1 month and after 10 computer restarts.
- Lots of different auto starting utilities from Adobe and others. Windows should include a "Performance Tuning Wizard", which automatically starts 3 months after Windows installation, knows about the top 500 useless applications in the Notification Area and just suggests that the user get rid of them all with one click. After that Adobe will be only happy to know that its software started to produce much less carbon dioxide than before.
- Shell Extensions and Internet Explorer Add-ons (like Google Toolbar and Sun Java runtime). If Internet Explorer is not a Microsoft’s sacrifice being made to persuade everybody that MS is not a monopoly then MS should be aware that IE is perceivably slower than competing browsers because the average user has all kinds of possible IE add-ons installed. Users typically install these and never uninstall anything because they don’t understand that it’s the reason for IE performance degradation.
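The temp-folder rule suggested in the list above (remove files only after about a month and at least 10 restarts) could be sketched roughly as follows. This is only an illustration in Python: the function name, the thresholds as parameters, and the idea of passing in a list of restart times are all assumptions, and the sketch only selects candidate files without deleting anything.

```python
from datetime import datetime, timedelta


def stale_temp_files(files, restart_times, now, max_age_days=30, min_restarts=10):
    """Select temp files that satisfy both conditions of the suggested rule.

    files:         iterable of (path, last_modified) pairs
    restart_times: datetimes at which the machine restarted (hypothetical input)
    now:           the current time, passed in so the rule is easy to test
    """
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for path, last_modified in files:
        # Condition 1: the file has not been touched for max_age_days.
        old_enough = last_modified < cutoff
        # Condition 2: the machine has restarted often enough since then.
        restarts_since = sum(1 for t in restart_times if t > last_modified)
        if old_enough and restarts_since >= min_restarts:
            stale.append(path)
    return stale
```

Requiring both conditions is the conservative choice: a machine that has been switched off for a month would not lose files that an installer might still expect to find on the next boot.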
There are many other areas where Windows could detect problems and automatically fix them. For instance, Windows knows when antivirus software is not installed. So Microsoft Windows could easily warn poor users that they have more than 1 antivirus installed.
Many posts ago Steven gave a link to a forum and threads about ini files, the Registry and similar things. There was one conclusion there: we have created the Registry and we will use it, even if it has limits. EOT.
There is one general problem visible in such thinking. The NT-based win32 ecosystem is (too) big. Developers are afraid of changes because of application compatibility.
When you add to it more and more "corporate friendly" solutions and ideas (like the infamous DRM), you will see that this platform has lost its freshness and (in my opinion) asking whether this or that solution is good doesn't make sense.
Additionally - for many people XP is good enough and known enough. And they will use it up to the end (max. 2014).
What does this mean? In my opinion, nothing good.
What could MS do? Windows NG: small, with a micro-kernel architecture, a lot of virtualization and partial win32 compatibility. Without the Registry and similar things.
As in the past, another excellent explanation of a process. The issue with Chkdsk came to my attention first as a forum entry by Chris123NT on GeekSmack.net. I think Chris had good intentions posting it.
Your writeup has done much to help clear the subsequent confusion. But I don't see a clearly stated conclusion: is this a bug or is it a feature? In other words, the program is working as designed, right? Crashing is not common, so therefore, nothing to worry about for the average user?
So why didn't MS get the one crash report from the original person that saw his machine crash?
It begs the question: Is MS getting all the telemetry it thinks it is?
Of course, that guy could have had the option turned off. Curious to know...
@Vyacheslav Lanovets, very well said. You've pinned some very accurate reasons why Windows shows performance degradation over time.
My Start menu on Windows 7 RC includes about 95+ items/programs but the "All Programs" menu now comes up completely empty. This happened in the beta too after I installed more apps. If you search a bit, you'll realize internet forums are littered with this problem. I'm forced to use a different launcher app to launch my programs instead of the revolutionary-touted Start search. Is the new Start menu not robust enough to display an infinite number of installed programs under "All Programs" or is it the shell team this time again that is notorious for the extremely shabby work it has done on Windows since the Vista release? In all of these years of using Windows, I've never found installed programs missing the launcher except in these recent OSes.
@sroussey: so why didn't MS get a crash report from the original person?
Because he had the page file turned off. The crash dump is written to the page file when the system crashes.