Welcome to our blog dedicated to the engineering of Microsoft Windows 7
This has been a busy couple of days for a few of us on the team as we had a report of a bug in Windows 7. The specifics of the issue are probably not as important as a discussion over how we will manage these types of situations down the road and so it seems like a good time to provide some context and illustrate our process, using this recent example.
This week a report on a blog described a crashing issue in Windows 7. The steps to reproduce the crash were pretty easy: run chkdsk /r on a non-system drive, and the machine would crash after consuming system memory. Because it was easy to “reproduce”, the reports of this issue spread quickly. Subsequent posts and the comments across the posts indicated that the issue seemed to have been reproduced by others; that is, the two characteristics of the report were seen: (a) consumption of lots of memory and (b) crashing.
Pretty quickly, I started getting a lot of mail personally on the report. Like many of you, the first thing I did was try it out. And as you might imagine I did not reproduce both issues, though I did see the memory usage. I tried it on another machine and saw the same behavior. In both cases the machine functioned normally during and after the chkdsk. As I frequently do, I answered most of the mail I received and started asking people for steps to reproduce the crash and to share system dump files. The memory usage did not worry me quite as much as the crash. I began having a number of interesting mail threads, but we didn’t have any leads on a repro case nor did we have a crash dump to work with.
Of course I was not the first Microsoft person to see this. The file system team immediately began to look into the issue. They too were unable to reproduce the crash and from their perspective the memory usage was by design and was a specific Windows 7 change for this scenario (the /r flag grabs an exclusive lock and repairs a disk and so our assumption is you’d really like the disk to be fixed before you do more stuff on the machine, an assumption validated by several subsequent third party blog posts on this topic). We cast the net further and continued looking for crash dumps and reports. As described below we have quite a few tools at our disposal.
While we continued to investigate, the mail I was getting was escalating in tone and more importantly one of the people I responded to mentioned our email exchange in a blog post. So in my effort to have a normal email dialog I ended up in the thick of the discussion. As I have done quite routinely during the development of Windows 7, I added a comment on the original blog (and the blog where this particular email friend was commenting) outlining the steps we are taking and the information we knew to date. Interestingly (though not unfortunately) just posting the comment drew even more attention to the issue being raised. I personally love being a member of the broader community and enjoy being a casual contributor even when it seems to cause a bit of a stir.
It is worth just describing the internal process that goes on when we receive a report of a crashing issue. Years ago we had one of two reactions. Either we would just throw up our arms and surrender as we had no hope of finding the bug, or we would drop everything and start putting people on airplanes with terminal debuggers in the hopes of finding a reproducible case. Neither of these is particularly effective and the latter, while very heroic sounding, does not yield results commensurate with effort. Most importantly while there might be a crash, we had no idea if that was the only instance or if lots more people were seeing or would see the crash. We were working without any data to inform our decisions.
With the internet and telemetry built into our products (not just Windows 7) we now have a much clearer view of the overall health of the software. So when we first hear a report of a crash we check to see if we’re seeing the crash happen on the millions of machines that are out there. This helps us in aggregate, but of course does not help us if a crash is one specific configuration. However, a crash that is one specific configuration will still show up if there is any statistically relevant sampling of machines and given the size of the user base this is almost certain to be the case. We’re able to, for example, query the call stacks of all crashes reported to see if a particular program is on the stack.
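As a rough illustration of the kind of call-stack query described above, here is a minimal Python sketch. The report structure, field names, and sample data are all hypothetical and are not the actual telemetry schema or tooling; the sketch only shows the idea of filtering crash reports by whether a given module appears on the stack.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CrashReport:
    # Hypothetical record; the real telemetry schema is not described in this post.
    report_id: str
    stack_modules: List[str]  # module names on the call stack, innermost first


def reports_with_module(reports: Iterable[CrashReport], module: str) -> List[CrashReport]:
    """Return the reports whose call stack contains the given module (case-insensitive)."""
    wanted = module.lower()
    return [r for r in reports if any(m.lower() == wanted for m in r.stack_modules)]


# Made-up sample data for illustration only.
sample = [
    CrashReport("r1", ["ntfs.sys", "ntoskrnl.exe"]),
    CrashReport("r2", ["addin.dll", "WINWORD.EXE"]),
    CrashReport("r3", ["win32k.sys", "explorer.exe"]),
]

matches = reports_with_module(sample, "winword.exe")
print([r.report_id for r in matches])
```

An empty result from a query like this does not prove a crash never happens; it only says that none of the reports collected so far have that module on the stack.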
We have a number of tools at our disposal if we are seeing a crash in telemetry. You might have even seen these at work if you crash. We can increase (with consent) the amount of data asked for. We can put up a knowledge base article as a response to a crash (and you are notified in the Windows 7 Action Center). We can even say “hey call us”. As crazy as that one might sound, sometimes that is what can help. If a crashing issue in an already shipping product suddenly appears then something changed: a new hardware device, new device driver, or other software likely caused the crash to appear far more frequently. Often a simple confirmation of what changed helps us to diagnose the issue. I remember one of the first times we saw this was when one day unexpectedly Word started crashing for people. We hadn’t changed anything. It turned out a new version of a popular add-in had been released and the crash was happening in the add-in, but of course end-users only saw Word crashing. We quickly put up instructions to remove the add-in while in parallel working with the ISV to push out a fix. This ability to see the changing landscape, diagnose, and respond to a problem has radically changed how we think of issues in the product.
We are constantly investigating both new and frequently occurring issues (including crashes, hangs, device not found, setup failures, potential security issues, and so on). In fact we probably look into on the order of hundreds of issues in any given month as we work with our enterprise and OEM customers (and therefore hardware partners, ISVs, etc.). Often we find that issues are resolved by code changes outside core Windows code (such as with drivers, firmware, or ISV code). This isn’t about dodging responsibility but helping to fix things at the root cause. And we also make many code changes in Windows, which are seen as monthly updates, hotfixes, and then service pack rollups. The vast majority of things we fix are not applicable broadly and hence not released with immediate urgency—if something is ever broadly applicable we will make the call to release it broadly. It is very important for everyone to understand how seriously we take the responsibility of making sure there are no critical issues impacting a broad set of customers, while also balancing the volume of changes we push out broadly.
To be specific about the investigation around the chkdsk utility, let’s look at how we dove into this over the past couple of days. We first looked through our crash telemetry (both at the user level and “blue screen” level) and found no reported crashes of chkdsk. We of course looked through our existing reports of issues that came up during the development of Windows 7, but we didn’t see anything at all there. We queried the call stacks of existing reported crashes (of all kinds, since this was reported) and we did not find any crashes with chkdsk.exe running while crashing. We then began automated test runs on a broad set of machines; these ran overnight and continued for 2 days. We also saw reports related to a specific hardware configuration, so we set up over 40 machines based on variants of that chipset, driver, and firmware and ran those tests. We were not hitting any crashes (as mentioned, the memory usage was already understood). Because some were saying the machines were non-responsive we also looked for that in manual tests and didn’t see anything. We also broadened this to request globally to Microsoft folks to try things out (we have quite a few unique configs when you think of all of our offices around the world) and so we had several hundred more test runs going.
We also had reports of the crash happening when running without a pagefile. That could be the case, but that would not be an issue with this utility, as any program that requests more memory than is physically available would cause things to tip over, and this configuration is not recommended for general purpose use (and this appears to be the common thread on the small number of non-reproducible crashes). Folks interested might read Mark’s blog on the topic of pagefiles in general. While we did not identify anything of note, that does not rule out the possibility of a problem, but at this point the chances of any broad issue are extremely small.
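To make the pagefile point a bit more concrete, here is a minimal Python sketch. It assumes a simplified model in which the system commit limit is roughly physical memory plus total pagefile size; the function name and all of the numbers are hypothetical, chosen only for illustration.

```python
def fits_commit_limit(physical_mb: int, pagefile_mb: int,
                      committed_mb: int, request_mb: int) -> bool:
    """Simplified model: a request fits if total committed memory stays within
    physical memory plus pagefile size (an approximation of the commit limit)."""
    commit_limit_mb = physical_mb + pagefile_mb
    return committed_mb + request_mb <= commit_limit_mb


# Hypothetical machine: 3 GB of RAM, 1 GB already committed, a 2.5 GB request.
with_pagefile = fits_commit_limit(3072, 4096, 1024, 2560)
without_pagefile = fits_commit_limit(3072, 0, 1024, 2560)
print(with_pagefile, without_pagefile)
```

Under this model the same request fits when a pagefile is present and fails when it is not, which is why a no-pagefile configuration can tip over with any memory-hungry program and not just this utility.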
In the meantime, we continue to look through external blogs, forums and other reports of crashes to see if we can identify any reproducible cases of this. While we don’t contact everyone, we do contact people if the forum and report indicate this has a good chance of yield. In all fairness, it probably doesn’t help us when there’s a lot of “smoke” while we’re trying to find the fire. We had a lot of “showstopper” comments piling on but not a lot of additional data including a lack of a reproducible case or a crash dump.
This type of work will continue until we have satisfied ourselves that we have systematically ruled out a crash or defined the circumstances where a crash can happen. Because this is a hardware/software related issue we will also invite input from various IHVs on the topic. In this case, because it is disk related we can’t rule out the possibility that in fact the disk was either failing or about to fail and the excessive use of the disk during a /r repair would in fact generate a failure. And while the code is designed to handle these failures (as you can imagine) there is the possibility that the specific failure is itself not handled well. In fact, in our lab (running tests continuously for a few days) we had one failure in this regard and the crash was in the firmware of the controller for the disk. Obviously we’ll continue to investigate this particular issue.
I did want folks to know just how seriously we take these issues. Sometimes blogs and comments get very excited. When I see something like “showstopper” it gets my attention, but it also doesn’t help us to have a constructive and rational investigation. Large software projects are by nature extremely complex. They often have issues that are dependent on the environment and configuration. And as we know, as deterministic as software is supposed to be, sometimes issues don’t reproduce. We have a pretty clear process on how we investigate reports and we focus on making sure Windows remains healthy even in the face of a changing landscape. With this post, I wanted to offer a view into some specifics but also into the general issue of sounding alarms.
It is always cool to find a bug in software. Whether it is an ATM, movie ticket machine, or Windows we all feel a certain sense of pride in identifying something that doesn’t work like we think it should. Windows is a product of a lot of people, not just those of us at Microsoft. When something isn’t as it should be we work with a broad set of partners to make sure we can effectively work through the issue. I hope folks recognize how seriously we take this responsibility. We all know we’re going to keep looking at issues, and we will have issues in the future that will require us to change the code to maintain the level of quality we know everyone expects of Windows.
Hm, I saw the chkdsk memory leak on my main machine and on Virtual PC. There was no emergency shutdown, BSOD, or anything else, just a memory leak. So why doesn't chkdsk use so much memory in Windows Vista?
I believe that Microsoft Windows 7 has fewer bugs and problems than the previous versions. Maybe it was just an impression or maybe it was true, but I like it a lot more.
Like wguimb, I too don't see a conclusion, and yet I'm curious... is the issue resolved now? Are you still investigating? Have you found the cause, and if so - what is/was it? HDD firmware? Chkdsk code base? Other parts of the Windows code base? AV or another kind of third party software interfering somehow?
I have two physical HDDs - one with WinXP and one with Win7 RC installed, both NTFS formatted. Dual-boot between systems.
1. Load Win7.
2. In the Properties of the second hard drive (the one with WinXP), start Check Disk with both checkboxes ticked.
3. In Task Manager, see Explorer's memory usage grow to 1.7 GB (out of 3).
4. When the surface-testing stage starts, decide it is taking too much time and click Cancel.
5. Wait more than 30 minutes; see that Explorer's memory is NOT released.
6. Decide that is quite enough and restart the system using the Start menu.
7. In the boot menu, select WinXP.
8. The system does NOT load!!!
9. Restart in Win7. The pre-boot disk check runs, finds many ACL errors(!!!), and fixes them. Ugh.
10. Restart in WinXP - OK.
Is this enough to count as a shipstopper bug???
That's a great article, but one question remains: how to report bugs in Windows in the first place? (I have two bugs in Windows 7 to report). Thanks!
I think one of the best ways is to use a contact form over here.
The reason is pretty simple: all such requests go to the blog owners' email. Email at Microsoft has great power: once someone gets it, they can forward it to anyone else inside Microsoft (that is, escalate it) quite easily.
It's great to hear what you do with bug reports, but I wish that you would publish a page where people could post bugs that they have found.
I love Windows 7 but there is a "bug" that I've experienced in BETA, RC1 and now RTM.
If a Windows 7 machine goes to sleep or hibernation there is an issue with the Task bar.
After the "sleep" occurs, when I hover over open programs, the small preview window does not show a preview of the window; instead it's blank.
You have to actually open the window all the way. Then minimize the window and then hover your mouse over the task bar. The small preview will now show.
This happens on the only two machines I've been able to install Windows 7 RTM on. Both machines are using the latest drivers for all items. The laptop is using ATI graphics and the desktop is using an nVidia GTX280 SSCE.
Hopefully this gets resolved soon. It's not so much an issue with the desktop because I prefer not to put it to sleep because I do a bunch of downloading on it. My laptop on the other hand is a different story.
I've been thinking about this a lot over the last few years, years during which I have been using Microsoft products at work and had no effective bug reporting mechanism. I can tell my IT help desk, but I don't suppose they ever pass on what I have said. I've also come across the case where we have to be subscribers to even be allowed to report problems. Paying to report bugs is rather galling.
At home I've been using Open Source products a lot. I think that one reason is the bug reporting process. I get to use something like Bugzilla, file my screen dumps and test files - and I can go back later and see what the progress is. Even if there is none, the reason why is always explained. I can also search and see what other people have reported and what work-arounds may have been suggested. No-one denies that bugs can occur, and a clear path to resolution is always visible (Mozilla do keep security vulnerabilities 'in camera', of course)
I realise that the dominant supplier may have issues with volumes of reports from such a system - in FLOSS perhaps 40% are unfounded or repetitious, with a less sophisticated audience the level would be higher - but it cannot be beyond the wit of someone to find a way to offer the advantages of visibility and explanation without the problems of volume.
Thank you for this post and indeed the blog, I'm finding it very interesting. I have submitted
I have what I feel is an important feature request for Windows 7, but am not sure how best to submit it. It's not a bug.
Feature request: tabbed browsing in Windows Explorer
Tabs in Windows Explorer. It seems obvious when you think about it! :)
UI glitch found in many years ago, starting from vista to win7 now, like
or win7 only glitch
The response said the issues were sent to the related team, but the glitches are still found in Vista SP2/Win 7 RTM. Do the teams really review these glitches?
I've had the same problems. I've even had instances where I've purchased Office 2007, it wouldn't install properly, and I couldn't get support. Personally, I consider that to be failure to deliver the product (which is, admittedly, a fairly common software phenomenon). Obviously, I was quite disappointed. Installation support should be included with the software; otherwise, users should be able to return opened software. Wouldn't you agree?
Consequently, I've been implementing OpenOffice on many of the machines in our office (where possible). So far, I like Windows 7 A LOT. But the support model leaves a lot to be desired...
@cvpsmith -- Office 2007 provides 90 days free support in the US (different markets vary on this) and it is via telephone or email. It is particularly designed to assist with installation. Please see http://office.microsoft.com/en-us/FX102751391033.aspx for options.
Vyacheslav Lanovets: Dude, I sure hope you are not a troll.
1. You do realise that Windows has a Low Disk Space Notification, right? And that the Numero Uno question that gets asked about this is how to TURN IT OFF?
2. Preemptively wiping out temp files sounds like a really good idea, right up to the point the headlines scream that the no-good Microsoft is deleting files without user knowledge or authorisation.
3. Ooh, now I can just envision the lawsuits coming out of this one. Microsoft's Anticompetitive Moves; Denies Competitor Applications From Running.
4. And MSFT is responsible for poor user behaviour how, exactly?
There are a lot of reasons why Windows starts slowing down; user perception, fragmented Registry, unclean uninstallations of apps, even VSS for that matter. And of course, six months down the line, you would have upgraded apps, patched the OS, and generally done a whole lot more *useful* stuff which nevertheless degrades performance.
Get rid of antivirus and instantly get back 5-10% of performance. If you don't mind being more careful about what you do and where you go.
I get frequent crashes in Windows 7 RTM x64 on my PC. I am using an Asus P6T Deluxe v2 with 6x2GB DDR3 RAM and 2 Seagate 1.5TB hard disks set up in a RAID1 (mirror) configuration.
The frequent crashes force my PC to rebuild the RAID1 volume!
Sorry, I forgot the error code for each blue screen, but I very much hope that Microsoft can release some patches as soon as possible.
Thanks, that's a useful article.