WER Services

This blog focuses on technologies that support Windows Error Reporting and Windows Quality data.
Microsoft PDC 2009

I had a chance to present a lunch session at Microsoft PDC 2009 this year!

http://microsoftpdc.com/Sessions/CL33 – This talk starts with creating a better customer experience around software failures using the Application Restart and Recovery (ARR) API, moves on to downloading and debugging mini-dumps sent in by customers, and ends with creating customer responses that point to updates and fixes.

Steven Sinofsky highlighted the importance of Windows Error Reporting and how Microsoft uses this technology to build Windows in his keynote at PDC 2009.

Some of his talk was summarized in this CNET article:

“Sinofsky is talking about the different mechanisms Microsoft uses from Windows Error Reporting, or Watson, to its Software Quality Monitor. Sinofsky notes that the monitoring tools require the user's permission in the final versions of Windows.”

http://news.cnet.com/8301-13860_3-10400476-56.html?part=rss&subj=news&tag=2547-1_3-0-5

Here is Sinofsky’s keynote: http://microsoftpdc.com/Sessions/KEY02 (00:15:32, starts with Telemetry)

Other blog entries:

http://blogs.pcmag.com/miller/2009/11/pdc09_sinofsky_talks_about_cre.php

http://blogs.msdn.com/e7/archive/2009/05/11/OurNextEngineeringMilestone2.aspx

!Analyze – Automatic Root Cause Analysis

Meet the two engineers behind the !Analyze Windows debugger extension!

http://channel9.msdn.com/posts/Charles/David-Grant-and-Ryan-Kivett-Analyze-Automatic-Root-Cause-Analysis/

!Analyze is an automatic root cause analysis tool for software failures. For years, it has provided insight to engineers both inside and outside of Microsoft. It is a key enabling technology behind numerous higher-level feedback systems, including Windows Error Reporting and Watson.
!Analyze runs millions of times each day, producing actionable results from reliability telemetry data sent to Microsoft. Ordinary debugging tools report the file and function where a failure ended. !Analyze pinpoints where the failure started.
How does it work, exactly? What's the story behind !Analyze?
Meet two of the Software Developers behind !Analyze, David Grant and Ryan Kivett. They share with us how !Analyze works and its history, and provide a glimpse into its potential future. Tune in.

Using the web services to delete product mappings

In our June release we added web services for deleting mapped product(s), deleting mapped files from a product and deleting a mapped file from multiple products.

The following code sample shows how to delete mapped products.

IMPORTANT: Please use and/or test this code very carefully as this delete cannot be undone. 

The first part of the code logs in to the web service and gets the encrypted ticket. The second part deletes the mapped products. The web service for deleting mapped products should be called with a POST parameter named "mappedproductid" containing a comma-separated list of mapped product IDs.

You can get the mapped product IDs from the web service that returns the list of products.

    // Requires: using System; using System.Collections.Specialized;
    //           using System.Net; using System.Security; using System.Text;

    string baseUrl = "https://winqual.microsoft.com";
    string userName = "your winqual username";
    string password = "your winqual password";

    //
    // login
    //
    string loginUrl = string.Format(
        "{0}/services/Authentication/Authentication.svc/BasicTicket",
        baseUrl);
    // XML-escape the credentials so characters such as & or < do not break the SOAP envelope
    string request = string.Format(
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
        + "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\""
        + " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\""
        + " xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
        + "<soap:Body>"
        + "<GetBasicTicket xmlns=\"https://winqual.microsoft.com/Services/Authentication/\">"
        + "<userName>{0}</userName><password>{1}</password>"
        + "</GetBasicTicket></soap:Body></soap:Envelope>",
        SecurityElement.Escape(userName),
        SecurityElement.Escape(password));
    WebClient loginClient = new WebClient();
    loginClient.Headers.Add(HttpRequestHeader.ContentType, "text/xml");
    loginClient.Headers.Add(
        "SOAPAction",
        "https://winqual.microsoft.com/Services/Authentication/IBasicTicket/GetBasicTicket");
    string loginResponse = loginClient.UploadString(loginUrl, request);

    //
    // parse the response to get the encrypted ticket out; a failed login
    // (bad pwd, bad username or some other error) returns no ticket text
    //
    const string openTag = "<GetBasicTicketResult>";
    int startingIndex = loginResponse.IndexOf(openTag);
    int endingIndex = loginResponse.IndexOf("</GetBasicTicketResult>");
    if (startingIndex < 0 || endingIndex < 0)
    {
        throw new InvalidOperationException("Login failed: no ticket was returned.");
    }
    startingIndex += openTag.Length;
    string encryptedTicket = loginResponse.Substring(startingIndex, endingIndex - startingIndex);

    //
    // comma separated list of mapped product IDs (integers) to delete.
    // IMPORTANT: This delete cannot be undone.
    //
    string mappedProductIDs = "list of comma separated mapped product id's";
    NameValueCollection nameValueCollection = new NameValueCollection(1);
    nameValueCollection.Add("mappedproductid", mappedProductIDs);
    WebClient webClient = new WebClient();
    webClient.Encoding = Encoding.UTF8;
    webClient.Headers.Add("encryptedTicket", encryptedTicket);
    webClient.UploadValues(baseUrl + "/services/wer/user/deletemappedproducts.aspx", nameValueCollection);

This code will delete the mapped products and these products will look grayed out in the mapped product list at https://winqual.microsoft.com/member/wer/user/ManageProductMappings.aspx.

The code for deleting mapped files from a product and for deleting a mapped file from multiple products is similar to the code above.

The relative URL for deleting multiple mapped files from a single product is “/member/wer/user/DeleteMappedFilesForProduct.aspx”. This form requires two POST parameters: “mappedproductid”, the ID of the mapped product, and “mappedfileid”, a comma-separated list of the mapped file IDs to delete.

The relative URL for deleting a single mapped file from multiple products is “/member/wer/user/DeleteMappedFileFromProducts.aspx”. This form also requires two POST parameters: “mappedfileid”, the ID of the mapped file, and “mappedproductid”, a comma-separated list of the mapped product IDs from which to delete that file.
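Since these two forms are only described in prose, here is a minimal sketch of how the POST parameters for the first one might be assembled. The class and method names are hypothetical; the parameter names and the relative URL are the ones quoted above, and the encrypted ticket is assumed to come from the same login step as in the earlier sample. The upload call itself is left as a comment because the delete cannot be undone.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Globalization;

public static class MappedFileDeletion
{
    // Relative URL quoted in the post for deleting several mapped files from one product.
    public const string DeleteFilesForProductPath =
        "/member/wer/user/DeleteMappedFilesForProduct.aspx";

    // Hypothetical helper: builds the two POST parameters the form expects.
    public static NameValueCollection BuildDeleteFilesForProductForm(
        int mappedProductId, IEnumerable<int> mappedFileIds)
    {
        var form = new NameValueCollection(2);
        form.Add("mappedproductid", mappedProductId.ToString(CultureInfo.InvariantCulture));
        form.Add("mappedfileid", string.Join(",", mappedFileIds));
        return form;
    }
}

// Usage (not executed here; IMPORTANT: this delete cannot be undone):
//   var form = MappedFileDeletion.BuildDeleteFilesForProductForm(42, new[] { 7, 8, 9 });
//   webClient.Headers.Add("encryptedTicket", encryptedTicket);
//   webClient.UploadValues(baseUrl + MappedFileDeletion.DeleteFilesForProductPath, form);
```

The second form takes the same two parameter names with the roles swapped: a single “mappedfileid” and a comma-separated “mappedproductid” list.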

The Three Cs of Response Satisfaction

The Windows Error Reporting (WER) platform offers software and hardware companies a way to provide helpful information to customers.  When an application stops working, the WER client that runs on Windows (since XP) detects the event and checks whether a solution exists.  Software and hardware companies can create WER responses through the Windows Quality web portal (https://winqual.microsoft.com).

Windows end-users also have the ability to provide feedback on the quality of the solutions.  By analyzing the feedback provided over the years and performing extensive usability studies, we have found some simple rules that help to create a pleasant experience for end-users.  The three Cs are keeping Customer focus, providing Clear information, and keeping your information Current.

Customer focus – Keep the customer’s experience paramount in creating a response for your customers.  Understanding that the customer sees messages first on the Windows operating system is important in creating a seamless experience. 

Do:

·         Create custom content that guides the customer

·         Acknowledge that the customer was referred by Windows Error Reporting

·         Use clear language and common terms

Don’t:

·         Put the customer on your main homepage

·         Send the customer to a general support forum

 

Clear information – Ensure that information provides clear and concise steps for the end-user.

Do:

·         Provide a direct download link to an update from the response message when possible

·         Provide a customized and dedicated landing page on your company’s website that contains an easily identifiable link to download a fix

·         Provide concise steps if user action is needed during an install

·         Provide clear instructions if a workaround is available

Don’t:

·         Provide misleading links

·         Make the user click on three or more options before they reach an actionable solution

·         Provide non-actionable information

Current information – Keep your responses and landing pages up-to-date!  We regularly review response quality and make decisions to un-publish responses based on current information and customer ratings.

Do:

·         Check links on the landing pages after website updates

·         Monitor the Response Satisfaction report available on the Developer Portal

Don’t:

·         Let content sit for months unchanged

 

If you have questions about creating a good response, you are welcome to use the contact information available on the WER portal. We are always happy to help.

Using the Product Mapping File Upload Web Service

Yesterday we released updates to our web services offerings. These updates include the ability to automate the upload of the product mapping XML, getting a list of products mapped, getting the list of mapped files for a product, deleting a mapped file from a product, deleting mapped products and deleting a mapped file from multiple products.

This blog post lists the code (in C#) required to access our web services for uploading a product mapping file.

The code does 2 things:

  1. Log in to the Winqual portal. Because the authentication web service uses SOAP, the code for accessing it without a proxy is a bit cumbersome.
  2. Use the .NET WebClient class (you can also use the HttpWebRequest and HttpWebResponse classes for this) to upload the product mapping file and get the response.

Code for using the product mapping file upload service (also attached as a CS file):

    // Requires: using System; using System.Net; using System.Security; using System.Text;

    // path to the product mapping XML file
    string filePath = @"c:\upload\CoolApplication.xml";
    string baseUrl = "https://winqual.microsoft.com";
    string userName = "your winqual username";
    string password = "your winqual password";

    //
    // login
    //
    string loginUrl = string.Format(
        "{0}/services/Authentication/Authentication.svc/BasicTicket",
        baseUrl);
    // XML-escape the credentials so characters such as & or < do not break the SOAP envelope
    string request = string.Format(
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
        + "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\""
        + " xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\""
        + " xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
        + "<soap:Body>"
        + "<GetBasicTicket xmlns=\"https://winqual.microsoft.com/Services/Authentication/\">"
        + "<userName>{0}</userName><password>{1}</password>"
        + "</GetBasicTicket></soap:Body></soap:Envelope>",
        SecurityElement.Escape(userName),
        SecurityElement.Escape(password));
    WebClient loginClient = new WebClient();
    loginClient.Headers.Add(HttpRequestHeader.ContentType, "text/xml");
    loginClient.Headers.Add(
        "SOAPAction",
        "https://winqual.microsoft.com/Services/Authentication/IBasicTicket/GetBasicTicket");
    string loginResponse = loginClient.UploadString(loginUrl, request);

    //
    // parse the response to get the encrypted ticket out; a failed login
    // (bad pwd, bad username or some other error) returns no ticket text
    //
    const string openTag = "<GetBasicTicketResult>";
    int startingIndex = loginResponse.IndexOf(openTag);
    int endingIndex = loginResponse.IndexOf("</GetBasicTicketResult>");
    if (startingIndex < 0 || endingIndex < 0)
    {
        throw new InvalidOperationException("Login failed: no ticket was returned.");
    }
    startingIndex += openTag.Length;
    string encryptedTicket = loginResponse.Substring(startingIndex, endingIndex - startingIndex);

    //
    // upload file
    //
    WebClient webClient = new WebClient();
    webClient.Headers.Add("encryptedTicket", encryptedTicket);
    byte[] responseBytes = webClient.UploadFile(
        baseUrl + "/services/wer/user/fileupload.aspx", filePath);

    //
    // the response that the WebClient gets may contain an updated ticket, so get that
    //
    if (webClient.ResponseHeaders["encryptedTicket"] != null)
    {
        encryptedTicket = webClient.ResponseHeaders["encryptedTicket"];
    }

    //
    // get the response
    //
    string fileUploadResponse = Encoding.UTF8.GetString(responseBytes);

    //
    // TODO: Check the response XML for success or error message
    //

    return fileUploadResponse;

The XML response from the authentication service will be similar to the following when the login is successful:

<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
    <s:Body>
        <GetBasicTicketResponse xmlns="https://winqual.microsoft.com/Services/Authentication/">
            <GetBasicTicketResult>encrypted ticket string</GetBasicTicketResult>
        </GetBasicTicketResponse>
    </s:Body>
</s:Envelope>

The XML response from the authentication service in case of a login failure will be similar to the following:

<s:Envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
    <s:Body>
        <GetBasicTicketResponse xmlns="https://winqual.microsoft.com/Services/Authentication/">
            <GetBasicTicketResult a:nil="true" xmlns:a="http://www.w3.org/2001/XMLSchema-instance" />
        </GetBasicTicketResponse>
    </s:Body>
</s:Envelope>
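Given these two response shapes, one way to tell success from failure is to parse the SOAP envelope as XML and look for the nil marker, rather than searching the raw string. The following is only a sketch: the TicketParser class and ExtractTicket method are hypothetical names, and it assumes the service always returns one of the two shapes shown above.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class TicketParser
{
    private static readonly XNamespace Auth =
        "https://winqual.microsoft.com/Services/Authentication/";
    private static readonly XNamespace Xsi =
        "http://www.w3.org/2001/XMLSchema-instance";

    // Returns the encrypted ticket, or null when the response carries no ticket.
    public static string ExtractTicket(string soapResponse)
    {
        XDocument doc = XDocument.Parse(soapResponse);
        XElement result = doc.Descendants(Auth + "GetBasicTicketResult").FirstOrDefault();
        if (result == null)
        {
            return null; // element missing entirely
        }
        if ((string)result.Attribute(Xsi + "nil") == "true")
        {
            return null; // login failure: the result is marked nil
        }
        string ticket = result.Value.Trim();
        return ticket.Length == 0 ? null : ticket;
    }
}
```

A caller would then treat a null return value as "bad username, bad password or some other error" and stop before calling any other web service.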

The XML response when the upload is successful is similar to the following ATOM output:

<?xml version="1.0" encoding="utf-8" ?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:wer="http://schemas.microsoft.com/windowserrorreporting"
      wer:status="ok">
    <title>Feed Success</title>
    <link rel="alternate" type="text/html"
          href="https://winqual.microsoft.com/services/wer/user/fileupload.aspx" />
    <updated>2009-06-12 19:28:28Z</updated>
    <id>Success</id>
    <entry>
        <updated>2009-06-12 19:28:28Z</updated>
        <published>2009-06-12 19:28:28Z</published>
        <title>File uploaded successfully</title>
        <id>File uploaded successfully</id>
        <wer:additionalInformation />
    </entry>
</feed>
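The "check the response XML" TODO in the upload code could be handled along these lines. This is a sketch under assumptions: UploadResponse and IsSuccess are hypothetical names, the response is assumed to be well-formed XML with the wer:status attribute on the feed root as in the sample above, and because the post does not show what a failed upload returns, anything other than "ok" is simply treated as not successful.

```csharp
using System;
using System.Xml.Linq;

public static class UploadResponse
{
    private static readonly XNamespace Wer =
        "http://schemas.microsoft.com/windowserrorreporting";

    // True only when the feed root carries wer:status="ok".
    public static bool IsSuccess(string feedXml)
    {
        XDocument doc = XDocument.Parse(feedXml);
        if (doc.Root == null)
        {
            return false;
        }
        XAttribute status = doc.Root.Attribute(Wer + "status");
        return status != null
            && string.Equals(status.Value, "ok", StringComparison.OrdinalIgnoreCase);
    }
}
```

With the earlier upload sample, this would be called as UploadResponse.IsSuccess(fileUploadResponse) before returning.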

You can check the file upload by going to the website:

https://winqual.microsoft.com/member/wer/user/ManageProductMappings.aspx

Once you sort the table by the Map Date column in descending order, you should see your uploaded product.

You can see the files for the product uploaded by clicking the icon under the File Mappings column.

Windows Error Reporting – Vote for a fix!

If you are a developer reading this blog, you may already know the direct benefits of users submitting error reports. If you are an end-user of an application on Windows, developers are relying on you to help identify solutions to issues that are difficult to reproduce.

I want to use this short blog entry to describe the good that Windows Error Reporting brings to the Windows operating system and ecosystem.  I am including hyperlinks that give a third-party view of the good things that Windows Error Reporting enables for developers on the Windows OS.

The title of this blog posting was inspired by the WindowsITPro.com article below, in which each error report sent to Windows Error Reporting (WER) is a vote for a fix. Although the article was written a few years ago, the theme of voting remains the same. It is not quite like texting a vote into American Idol, but I argue that sending a Windows Error Report is easier and can end up fixing the issue you reported!

Vista WER Settings UI Deciphered

This post is a follow up to the original post I did about WER Settings & UI for Windows 7. While that post outlined the settings UI in Windows 7, this post covers the WER settings UI for Windows Vista.

WER Consent Settings: The various WER consent levels are identical across Windows Vista and Windows 7. Please review this post to get an understanding of the various consent levels first. The rest of this blog post explains how these consent options are exposed in the Windows Vista UI.

The WER setting in Windows Vista is exposed in the ‘Problem Reports & Solutions’ (aka PRS) Control Panel.

image

Clicking the ‘Change settings’ link in PRS opens the basic settings page shown below. These settings configure WER consent for the currently logged-in user only.

image

Clicking on the ‘Advanced Settings’ link on this page opens the, well, advanced settings page shown below.

image

 

Clicking the lower ‘Change Settings’ button brings up a dialog that allows setting consent for all users on the machine. Here is how that dialog looks:

image

Hope this clarifies the various WER settings and how they can be configured on Windows Vista.

Problem Steps Recorder (PSR.exe) + Windows Error Reporting = Another tool to help find solutions to software defects

There is a lot of information on the web about PSR, and one of the best sources is this video on CNET (http://cnettv.cnet.com/2001-1_53-50005144.html), which shows a demo of the tool from an end-user perspective. PSR behaves differently when invoked by WER, as described below.  The key difference is that WER doesn’t collect screenshots.

From an error reporting perspective, we plan on exposing this via the WER Services.  The concept is that if you aren’t able to use standard mini-dumps or heap dumps to find the source of the problem, WER Services will allow the PSR tool to be run (with user consent).

Example of screen-shot shown to user:

image

When we enable an error Bucket to collect PSR data, here is the workflow:

  1. User prompted with a request for additional data
  2. PSR waits until next time the same application is launched
  3. PSR runs silently in the background while user is using the application
  4. PSR captures descriptions of steps user has taken (without screenshots)
  5. IF the same unhandled exception occurs (BucketID)
    1. The memory dump, additional files, and the PSR output are sent in a CAB file through WER
    2. Developers can now download CAB files associated with a particular event that contain the steps a user took leading up to the crash
    3. Developers can try to reproduce the error locally by following these steps
  6. After the PSR steps are sent with the CAB, the PSR tool is disabled for the user

IMPORTANT NOTES:

  • PSR.exe, when invoked by WER, does NOT capture screenshots; see the example output below.
  • Only CAB files that contain PSR steps will be submitted to WER, which means that fewer CABs may be sent for a particular Bucket.
  • When Windows 7 first launches, this feature will only be enabled manually by the team behind wer@microsoft.com.

Other links and blog posts:

http://technet.microsoft.com/en-us/magazine/dd464813.aspx

http://blogs.msdn.com/cjacks/archive/2009/02/25/deciphering-the-command-line-configuration-of-the-windows-7-problem-steps-recorder.aspx

http://blogs.msdn.com/tims/archive/2009/01/12/the-bumper-list-of-windows-7-secrets.aspx

Sample output:

Problem Step 1: (3/27/2009 7:53:51 PM) User right click on "user consent (editable text)" in "Problem Steps Recorder (PSR.exe) + Windows Error Reporting = Another tool to help find solutions to software defects - Windows Live Writer"

Problem Step  2: (3/27/2009 7:53:53 PM) User left click on "Hyperlink... (menu item)"

Problem Step 3: (3/27/2009 7:53:55 PM) User keyboard input in "Insert Hyperlink" [... Enter]

Let There Be Hangs: Part 4 – Hashes and Type Codes and XProc, Oh My!
Hang Bucketing, A Better Way

In the previous post I gave a brief introduction of how the first version of hang reporting was implemented using the existing crash reporting infrastructure.

Eventually (after Windows XP shipped) a new general purpose event reporting and bucketing mechanism was built. In a nutshell, this mechanism provides a very flexible way to report incidents to the WER pipeline.  It supports a custom named event (internally at Microsoft dubbed “generic events”) and up to 10 custom bucketing parameters P1...Pn...P10.

During the development of Windows Vista, we decided to do something about the problematic hang bucketing and leverage the new general purpose reporting to gain a much better client-side classification of hang issues.  Before I explain what bucketing parameters these have, let me discuss a few other new things in Vista:

Another problem with Windows XP hang reporting was that the application process was often hung waiting on an external process.  The report would only include a memory dump of the hung application, and developers were often forced to give up because they could not debug the other process.  (Now, granted, you very often don't need to debug the other process to fix a hang, but that is a discussion for a later blog post.)

To solve this problem, the GetThreadWaitChain API was created.  This API allows the caller to discover the blocking graph (or “wait chain”) for a given thread.  A trivial example: Thread A is waiting to enter a critical section owned by Thread B which is making a blocking SendMessage call to Thread C which is waiting on a mutex owned by Thread D which is running in another process.  The output from the API provides the caller all of this information.
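To make the trivial example concrete, here is a small model of that wait chain as plain data. It does not call GetThreadWaitChain; the WaitNode and WaitChain types, the process IDs and the labels are all made up for illustration, and the real API reports considerably more detail per node.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class WaitNode
{
    public WaitNode(string threadName, int processId, string waitsOn)
    {
        ThreadName = threadName;
        ProcessId = processId;
        WaitsOn = waitsOn;
    }

    public string ThreadName { get; }
    public int ProcessId { get; }
    // What this thread is blocked on; null for the thread at the end of the chain.
    public string WaitsOn { get; }
}

public static class WaitChain
{
    // The chain from the text: A -> B -> C -> D, where D runs in another process.
    public static List<WaitNode> Example()
    {
        return new List<WaitNode>
        {
            new WaitNode("Thread A", 100, "critical section"),
            new WaitNode("Thread B", 100, "SendMessage"),
            new WaitNode("Thread C", 100, "mutex"),
            new WaitNode("Thread D", 200, null),
        };
    }

    // Renders the chain as "Thread A [critical section] -> Thread B [...] -> ...".
    public static string Describe(IEnumerable<WaitNode> chain)
    {
        return string.Join(" -> ", chain.Select(n =>
            n.WaitsOn == null ? n.ThreadName : n.ThreadName + " [" + n.WaitsOn + "]"));
    }

    // True when the chain leaves the process of the first (hung) thread.
    public static bool CrossesProcess(IList<WaitNode> chain)
    {
        return chain.Any(n => n.ProcessId != chain[0].ProcessId);
    }
}
```

In this model, a chain for which CrossesProcess returns true is the case where a dump of the hung application alone would not show the whole picture.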

It's this mechanism that hang reporting uses in Vista to discover not only the wait chain information but to collect memory dumps of external processes.

Back to the bucketing parameters – the hang reporting infrastructure actually reports hangs to two different generic events: AppHangB1 and AppHangXProcB1 (yes, B1 stands for Beta 1.  Don’t ask…).  On Winqual, AppHangB1 is shown as Event Type "Hang" and AppHangXProcB1 is shown as "Hang XProc".  AppHangB1 has 5 parameters (by convention P1 & P2 are typically Application Name & Version):

P1 - Application Name
P2 - Application Version
P3 - Application Timestamp
P4 - Stack Hash
P5 - Type Code

AppHangXProcB1 has those 5 and adds 2 more:

P6 - Waiting On Application Name
P7 - Waiting On Application Version

Most of these parameters are self explanatory.  The two that might not be are P4 and P5:

Stack Hash (P4) - hang reporting traverses the wait chain and for the final thread in the chain before it jumps to a different process, we generate a hash based on the thread’s stack back trace. P4 is a restricted hash which means we chop the MD5 128-bit hash down to 16 bits.  This prevents any wild spraying of buckets and when we studied it we were still achieving ~85% uniqueness.
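The idea of a restricted hash can be sketched as follows. This is not the WER implementation: the post does not say what exactly is fed into MD5 or which 16 bits are kept, so the input format (one string per stack frame) and the choice of the first two digest bytes are assumptions made purely for illustration.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class StackHash
{
    // Hypothetical input: one "module!function" string per frame of the stack back trace.
    public static ushort Restricted16(string[] frames)
    {
        byte[] input = Encoding.UTF8.GetBytes(string.Join("\n", frames));
        using (MD5 md5 = MD5.Create())
        {
            byte[] digest = md5.ComputeHash(input); // 16 bytes = 128 bits
            // Keep only 16 bits; here, the first two bytes of the digest (an assumption).
            return (ushort)((digest[0] << 8) | digest[1]);
        }
    }

    // Formats the 16-bit value as four hex digits.
    public static string Format(ushort hash)
    {
        return hash.ToString("x4");
    }
}
```

Truncating to 16 bits caps the number of possible P4 values at 65,536 for a given application name, version and timestamp, which is what keeps buckets from spraying, at the cost of occasional collisions between unrelated stacks.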

Type Code (P5) - this is a bit field based in part on the WCT_OBJECT_TYPEs found during wait chain traversal with GetThreadWaitChain (e.g., mutex, COM, etc.) and in part on a few other items, such as whether there is a deadlock or whether the hang report came from End Task in Task Manager (more blog fodder).

As rich and wonderful as hang bucketing has become in Windows Vista, there are still edge cases (just as there are in crash bucketing) where a bucket does not uniquely identify a single bug.  In a future post I will discuss bucketing more generally and how we REALLY determine and quantify bugs using WER data at Microsoft.

Let There Be Hangs: Part 2 – WER History 101

Crashes Suck

In the beginning, we needed a way to close the loop with our customers in order to ease the pain felt from software defects (bugs) that caused crashes.  A simple client service was produced that collected crash dumps from Windows desktops and sent them back to Microsoft for analysis. The crash dumps were debugged, bugs exposed, fixes were made, problems went away. Lather, rinse, repeat.  Note: This is how user mode error reporting began.  Interestingly, kernel mode error reporting (bluescreens) had its origin in a roughly parallel timeframe (actually a little earlier) but it began in a completely different group at Microsoft. For the rest of this series of posts I’m talking about the user mode side of WER.

Buckets

An early design assumption was that this client had zero analysis ability and so we needed a simple and deterministic way to organize the data that would allow both the client and the backend servers at Microsoft to speak to each other in a common language.  Oh, and the solution needed to be (*ahem*) scalable. That common language is the unique combination of event parameters which give us the Event ID. Internally at Microsoft these are actually known as bucket IDs – or just buckets.  It’s more natural for me to talk about buckets and I’ll try to be consistent, but just in case, know that “bucket” and “event” are mostly interchangeable in the context of error reporting.

Originally, plain old crashes were bucketed using 5 parameters:

  1. The application’s name
  2. The application’s version
  3. The name of the module containing the instruction that causes the crash exception
  4. The version of that module
  5. The byte offset into that module where that instruction resides

The idea was that crashes with the same set of parameters were caused by the same bug. It turns out that this isn’t always true (we’ll save that for a later post), but in general it works pretty well. Today, this “client” is built as a core component of Windows and is generally referred to as Windows Error Reporting (WER) Services.  In Vista we added 3 more parameters to plain old crash reporting: file link timestamps for both the application and the module (to try to avoid some naming collisions) and the exception code.  And by the way, the client itself has become smarter and in fact does some cursory analysis when deriving bucketing parameters.

Notice that you can tell the explorer crash to the right is from Windows XP because those extra 3 parameters don’t have data.
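The original five-parameter scheme amounts to grouping reports by a composite key. The sketch below illustrates that idea only; the CrashReport and CrashBucketing types, the "|" separator and the hex formatting of the offset are assumptions, not how the WER backend actually stores buckets.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class CrashReport
{
    public string AppName;
    public string AppVersion;
    public string ModuleName;
    public string ModuleVersion;
    public long Offset; // byte offset of the faulting instruction within the module
}

public static class CrashBucketing
{
    // Joins the five original parameters into one key; reports with equal keys share a bucket.
    public static string BucketKey(CrashReport r)
    {
        return string.Join("|", r.AppName, r.AppVersion, r.ModuleName, r.ModuleVersion,
            r.Offset.ToString("x8"));
    }

    // Counts hits per bucket.
    public static Dictionary<string, int> HitCounts(IEnumerable<CrashReport> reports)
    {
        return reports
            .GroupBy(r => BucketKey(r))
            .ToDictionary(g => g.Key, g => g.Count());
    }
}
```

Two reports that differ only in the offset land in different buckets, which is the behavior the five-parameter design intends: the same faulting instruction in the same build is taken as a sign of the same bug.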

< Read Part 1 [Part 2] Read Part 3 >

Let There Be Hangs: Part 3 – The ‘hungapp’ module

Hang Bucketing, v1

On Windows XP, hangs have it rough.  Like a younger sibling, error reporting for hangs has to wear the hand-me-down clothes of crash reporting – it piggybacks on the same 5 fixed bucketing parameters used by crash reporting. However with a hang there is no exception context and so there is no faulting instruction, therefore there is no module name, module version or module offset.  So on XP, hangs really only have 2 effective bucketing parameters (application name and application version).  The other parameters are set where module name is "hungapp", module version is "0.0.0.0", and the offset is 0.

This means that all hangs for a particular version of an application ended up in a single bucket.  Knowing this, you might correctly guess that these buckets have very high hit counts when compared with crash buckets for the same app name/version (though some corruption buckets are not far behind).  When someone runs a generalized query similar to, "what are my highest hitting XP buckets this week?", the top end of the results is often littered with hang buckets.  A naive conclusion would be "Oh my! Hangs are a huge problem and we must focus all our effort to eradicate them!"  While hangs are indeed a big problem, arriving at this conclusion based on bucket hit volume is incorrect.  Also, when the actual failures contained in these hang buckets are studied, it is quickly apparent that they comprise many bugs – not just the ideal 1 (or few). This is a big reason development teams historically struggled to make progress with hang bugs - the data was difficult to sort and measure.
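A toy illustration of why that happens: when three of the five parameters are constants, the bucket key depends only on the application name and version, so hangs with unrelated root causes collapse together. The type and method names below are hypothetical, and the "|" separator is just a convenient way to show the five parameters as one key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class XpHangBucketing
{
    // On XP the module name, module version and offset are fixed placeholders for every hang.
    public static string BucketKey(string appName, string appVersion)
    {
        return string.Join("|", appName, appVersion, "hungapp", "0.0.0.0", "0");
    }

    // Counts how many distinct buckets a set of hangs with different root causes produces.
    // The root cause is deliberately unused: nothing about it reaches the bucket key.
    public static int DistinctBuckets(
        IEnumerable<string> rootCauses, string appName, string appVersion)
    {
        return rootCauses
            .Select(cause => BucketKey(appName, appVersion))
            .Distinct()
            .Count();
    }
}
```

However many different causes are passed in, the result for one application version is a single bucket, which is why hit counts for these buckets say little about any individual bug.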

< Read Part 2 [Part 3] Read Part 4 >

Let There Be Hangs: Part 1 (Not Responding)

Well hang reports anyway…

Those of you signed up on the Winqual site are likely familiar with the "Event ID" and what it (ideally) represents – that is, a basic demarcation of unique software defects.  Additionally you’re probably more familiar with crash-related Event IDs like “Crash 32-bit” and “Crash 64-bit”.  But what about the other event types?  Specifically “Hang” and “Hang XProc”?  And by the way – what’s a crash in module “hungapp”?

In this and later posts I’m going to explain a little bit about error reporting and a little bit more about hangs – what they are, how they’re detected, reported, organized, counted, etc. Before getting into all that, however, we need a (very) brief history lesson.

Ghost Windows

Some of you might remember something like the following scene from the days of Windows NT 4 (and maybe you’ve seen similar scenes in more recent times):

NT4NoGhost

This is a screenshot of CALC.EXE moved across an intentionally hung instance of NOTEPAD.EXE.

Without getting into too much nitty gritty, a thread that creates GUI elements, namely windows, has an implicit contract with the desktop window manager that it will service messages that arrive in its message queue... in a timely fashion. “Servicing” means retrieving and dispatching the messages (aka pumping messages).  There is plenty of content on MSDN that covers the specific APIs and mechanisms used for this and plenty of content elsewhere about the interesting peculiarities of this area.  What "timely fashion" means is somewhat a matter of debate among developers and users. To the window manager, it means 5 seconds by default.  Our user research shows that this is actually quite generous.

If a thread stops pumping messages from its message queue, bad things start to happen. In such cases, the thread is often off busy working on whatever task the user just “asked” it to do (e.g. opening a file or recalculating some total or talking to a web service over the internet) and there’s a good chance that it didn’t update its UI to indicate that it’s off doing this work; it almost certainly hasn’t shown any UI resulting from that work. But there is a worse and more fundamental problem – while it’s away, the messages it isn’t pumping could be mouse, keyboard, or touch input along with paint messages.  Not pumping those messages would result in a window stuck on the user’s screen – the user wouldn’t be able to move, minimize, or close it and if they moved something in front of it, they’d get something like the jaggy mess in the screen shot above.

I say would because beginning with Windows 2000, we added functionality to the window manager to handle these cases.  The concept is actually quite simple – the window manager watches for when a thread has pending input in its queue but hasn’t serviced it for more than 5 seconds. When the situation is detected (aka IsHungAppWindow), the window manager does a presto chango maneuver whereby it hides the window whose thread isn’t pumping input messages and seamlessly shows a replacement window (known as a “ghost” window) with its client area filled with the real window’s last known good client area bits.
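The detection rule described here (pending input that has gone unserviced for more than 5 seconds) can be modeled in a few lines. This is a toy model for illustration, not the window manager's code; the InputQueueWatcher class is hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow. For a real window, the Win32 IsHungAppWindow function answers this question.

```csharp
using System;

public sealed class InputQueueWatcher
{
    // Default threshold mentioned in the text.
    public static readonly TimeSpan HangTimeout = TimeSpan.FromSeconds(5);

    // Time the oldest unserviced input arrived; null when nothing is pending.
    private DateTime? oldestPendingInput;

    // Called when input (mouse, keyboard, ...) is placed in the thread's queue.
    public void OnInputQueued(DateTime now)
    {
        if (oldestPendingInput == null)
        {
            oldestPendingInput = now;
        }
    }

    // Called when the thread retrieves and dispatches its messages.
    public void OnMessagesPumped()
    {
        oldestPendingInput = null;
    }

    // Hung means: input is pending and has waited longer than the timeout.
    public bool IsHung(DateTime now)
    {
        return oldestPendingInput.HasValue
            && now - oldestPendingInput.Value > HangTimeout;
    }
}
```

Note that a thread with an empty queue is never considered hung in this model, no matter how long it has been busy: the rule only fires when there is input waiting.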

While the original thread isn’t responding, the window manager manages both windows in parallel, to the extent that the application doesn’t know this is happening. For example, IsWindowVisible() will actually return true for the hidden/unresponsive window. The only visual difference (initially) to the user is that the text “(Not Responding)” is appended to the ghost window’s title. Using the ghost window, the user can effectively move, minimize, and even close the unresponsive application window. When (if) the thread starts pumping messages again, the original window is re-shown and the ghost window goes away. By the way, this window manager feature is called (obviously) window ghosting - and naturally, we offer an API you should almost never use: DisableProcessWindowsGhosting.

While we were working on Windows XP, we realized that it might be nice to know about these unresponsive applications.  We had already started to collect data about crashes, which were clearly annoying to customers, and some schools of thought (and some research) indicated that unresponsive UI was even more annoying – some might even use words like “infuriating” (I know I do).  So, we decided to wire up a ghost window’s close button to the infrastructure that sent crash reports back to Microsoft… and voila!, hang reporting was born.

[Part 1] Read part 2 >

FAQ

How do I get started with Windows Error Reporting (WER)?

How do I find Windows Error Reporting information on a PC?

How do we match up crash data (Windows Error Reporting signatures) with companies?

What files should I map?

How do we detect application hangs in Windows?

What is a bucket?

What is a “special” exception?

What are the different types of memory dumps?

When will I see data?


Q: How do I get started with Windows Error Reporting (WER)?

A: As an Independent Software Vendor (ISV) who develops software applications for the Windows Operating Systems, you are able to register to receive Software Quality information from the Windows Quality website.

Please follow this link to a page with directions on how to get started! (http://www.microsoft.com/whdc/maintain/StartWER.mspx).


Q: How do I find Windows Error Reporting information on a PC?

A:  Here are the solutions for finding WER data on XP, Vista, and Windows 7.

OS | WER Information
Windows XP | Event Log
Windows Server 2003 | Event Log
Windows Vista | Event Log, “wercon”
Windows Server 2008 | Event Log, “wercon”
Windows 7 | Event Log, Action Center

 

Windows XP / Server 2003

On XP / Server 2003 you can find information about error reports in the Application Log.
When a crash occurs, the bucketing parameters are written to the Application Event log with the event source “Application Error” and event ID 1000.
When the crash is submitted, the bucket ID is written to the Application Log with the same event source and event ID 1001.
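
If you are filtering exported event log records, the two event IDs above can be told apart with a small helper like the one below. This is only a sketch: the function name and the returned descriptions are made up for this example, and it assumes you have already read the event source and event ID out of each record by whatever means you prefer.

```python
from typing import Optional

# Values taken from the description above.
EVENT_SOURCE = "Application Error"
CRASH_PARAMETERS_EVENT_ID = 1000
BUCKET_ID_EVENT_ID = 1001


def describe_wer_event(source: str, event_id: int) -> Optional[str]:
    """Classify an Application Log record, or return None if it is unrelated."""
    if source != EVENT_SOURCE:
        return None
    if event_id == CRASH_PARAMETERS_EVENT_ID:
        return "bucketing parameters written when the crash occurred"
    if event_id == BUCKET_ID_EVENT_ID:
        return "bucket ID written when the crash was submitted"
    return None
```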

Windows Vista / Server 2008


On Vista you can use the “wercon” control panel accessible via:
Start >Control Panel >System and Maintenance >Problem Reports and Solutions >View Problem History
Or
By typing “wercon” in the Search Box at the base of the Start Menu.

Click the “View problem history” link.

Double-click a line item below the Application Name to view details.

Notice the bucketID and the Copy to Clipboard feature.

Windows 7

Using the Action Center:

Click the flag icon in the system tray, and then click Open Action Center.

Expand the Maintenance section to reveal the View System History link.

Click View System History, and then click the View problem reports and responses link.

Or, you can type “error” into the Search box on the Start Menu, select View Problem History, and then select View problem reports and responses.


Q: How do we match up crash data?

A: Here is the general mapping logic we use, based on the crash data gathered by the Windows Error Reporting (WER) client. Notice the difference between versions before Windows Vista and Windows Vista and later.

How do we match up crash signatures with Files that are mapped?

We use four to six basic parameters to match mapped files in the developer portal. Windows XP sends fewer parameters and matches on file name and file version for either applications or modules. With this in mind, if you are still developing for Windows XP, it is a best practice to give each file a unique file name and file version. This practice applies to any other development environment, but it is especially important on Windows XP, as you can see from the parameters below that the Windows XP WER client sends to our service.

Windows XP (Includes Windows Server 2003)

The WER client sends these parameters, which we use to match crashes:

  • Application Name
  • Application Version

OR

  • Module Name
  • Module Version

With Windows XP/Windows Server 2003 crashes we have less data to use for matching crash signatures with files mapped on the portal.

Windows Vista and later (includes Windows Server 2008)

The WER client sends these parameters, which we use to match crashes:

  • Application Name
  • Application Version
  • Application File Link Date

OR

  • Module Name
  • Module Version
  • Module File Link Date

We use the link date to get a more accurate match on the files mapped on the portal.
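
The difference between the two matching schemes can be sketched as follows. This is my own simplified illustration of the logic described above, not the portal's implementation: the FileIdentity type, the function name, the case-insensitive name comparison, and the sample link date format are all assumptions.

```python
from typing import NamedTuple, Optional


class FileIdentity(NamedTuple):
    """Hypothetical identity of an application or module file."""

    name: str
    version: str
    # Link date is only sent by Windows Vista and later clients.
    link_date: Optional[str] = None


def report_matches_mapping(report: FileIdentity, mapped: FileIdentity) -> bool:
    """Return True if a crash report's file matches a mapped file."""
    # Assumption: file names are compared without regard to case.
    if report.name.lower() != mapped.name.lower():
        return False
    if report.version != mapped.version:
        return False
    if report.link_date is None:
        # XP-style report: name and version are all we have to go on.
        return True
    # Vista-style report: the link date must agree as well.
    return report.link_date == mapped.link_date
```

With an XP-style report, two different builds that share a file name and version are indistinguishable, which is why unique names and versions matter so much there.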


Q: What files should I map?

A: Given the way we have implemented mapping logic, the best policy is to map only the files where you own or maintain the source code. If you ship 3rd party or open source modules with your software where you do not own or maintain the source, then you will not be able to take action on these error reports, and this information will just be added noise and complexity.

Even if you don’t map 3rd party modules you will still see crash reports where any module crashed your application.

NOTE: We have mapping collision detection that will preclude you from mapping certain files.

Example:

Product Name: FabrikamReportingApplication

Product Binary files:

File Name | File Version | Description | Should I map?
FabrikamReportGUI.exe | 1.0.0.0 | User interface for my reporting product | Yes
FabrikamWebserviceDataImport.dll | 1.0.0.1 | Library used to import and transform data from web services | Yes
ThirdPartyChartControl.dll | 5.6.1.3 | 3rd party DLL I’ve licensed to ship with my report product | No
OpenSourceDataVisualizer.exe | 4.5.3.1 | An open source application that helps with visualizing multiple data sources (I don’t maintain code for this) | No
OpenSourceGridControl.dll | 2.3.1.4 | An open source library where I’ve incorporated the source code into my build and maintain changes and updates for this source | Yes

With the mapping scenario above you will still see crashes where ThirdPartyChartControl.dll caused a crash in the FabrikamReportGUI.exe application.


Q: What is a bucket?

A: A bucket uses a set of parameters gathered by the WER client to describe a WER Event.  For crash events the bucketing parameters are Application Name, Application Version, Application Build Date, Module Name, Module Version, Module Build Date, Exception Code, and Code Offset.

There are also Generic Buckets that are made up of up to 10 parameters.
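
One way to picture a crash bucket is as a key built from the eight parameters listed above: two reports land in the same bucket exactly when all eight values agree. The sketch below models that idea. The class and function names are hypothetical, and storing every parameter as a string is an assumption made to keep the example short.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CrashBucketKey:
    """The crash bucketing parameters named above, used as a hashable key."""

    application_name: str
    application_version: str
    application_build_date: str
    module_name: str
    module_version: str
    module_build_date: str
    exception_code: str
    code_offset: str


def count_hits(reports: Iterable[CrashBucketKey]) -> Counter:
    """Count how many reports fall into each bucket."""
    return Counter(reports)
```

Because the dataclass is frozen, equal parameter sets hash to the same key, so counting hits per bucket is a one-liner.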


Q: What is a “special” exception?

A: A special exception is a grouping of two WER event types: Buffer Overruns (BEX), where the exception code = C0000409, and Data Execution Prevention (DEP), where the exception code = C0000005. These two event types are related to the security of software and can also be viewed in the Security Alerts section of the website. In order for the WER client to capture BEX events, you need to use the /GS compiler option when compiling your software.


Q: What are the different types of memory dumps?

A: Below is a table that contains brief descriptions of the different types of dump files that can appear in a .CAB file.

Extension | Description
.mdmp | Minidump (default dump collection type: stacks and loaded modules)
.hdmp | Minidump (.mdmp) plus heap data
.kdmp | On-demand kernel mode minidump (snapshot of the kernel mode portion of a given user mode process). Only sent if WCT (Wait Chain Traversal) added it, or if it was explicitly requested for a given ibucket by a user.
memory.dmp | Old style (<=XPSP2) heap dump. This is not actually a debugging dump file; instead, it’s just a blob of read-only pages.
atk.dmp | Application termination kernel (taken when an exited process takes more than 10 seconds to actually terminate)
hu.dmp | Heap user
mu.dmp | Mini user
odk.dmp | On demand kernel
axhu.dmp | Arbitrary x-process heap user
axmu.dmp | Arbitrary x-process mini user
wxhu.dmp | WCT x-process heap user (WCT is Wait Chain Traversal)
wxmu.dmp | WCT x-process mini user
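
If you unpack a lot of cabs, a small lookup keyed on the file name ending can label each dump for you. The sketch below is just one way to do it and assumes the entries in the table are file name suffixes; the describe_dump function is a hypothetical helper. Because some suffixes overlap (for example, axhu.dmp also ends in hu.dmp), it picks the longest matching suffix.

```python
from typing import Optional

# Suffixes and short descriptions taken from the table above.
DUMP_TYPES = {
    ".mdmp": "minidump",
    ".hdmp": "minidump plus heap data",
    ".kdmp": "on-demand kernel mode minidump",
    "memory.dmp": "old style heap dump",
    "atk.dmp": "application termination kernel",
    "hu.dmp": "heap user",
    "mu.dmp": "mini user",
    "odk.dmp": "on demand kernel",
    "axhu.dmp": "arbitrary x-process heap user",
    "axmu.dmp": "arbitrary x-process mini user",
    "wxhu.dmp": "wct x-process heap user",
    "wxmu.dmp": "wct x-process mini user",
}


def describe_dump(file_name: str) -> Optional[str]:
    """Return the description for a dump file, or None if it is not recognized."""
    lowered = file_name.lower()
    matches = [suffix for suffix in DUMP_TYPES if lowered.endswith(suffix)]
    if not matches:
        return None
    # Prefer the most specific (longest) suffix when several match.
    return DUMP_TYPES[max(matches, key=len)]
```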

 


Q: When will I see data?

A: After mapping files, we will start to pull any existing data we have for crashes. Generally you will see hit data within a day or so. Users who set up Vista and Windows 7 with recommended settings (~80%) will automatically send us the hit data without any prompting, so you can see hits without cabs.

By default we collect 10 cab (minidump) files per event. Because a user still has to click the “send additional information” button for us to get the crash dump, it can take some time to get cab files.

Once we receive cab files for an event, you will generally be able to see these cabs within a few hours of us receiving them.

An overview of WER consent settings and corresponding UI behavior

This post gives a high level overview of the various consent settings WER supports and the corresponding experience a user has when opted into any of those settings. Before going into further details, I want to clarify a couple of terms used here:

  • WER Client – The binaries that ship with the Windows OS and report problems to Microsoft.
  • WER Service – The backend service hosted by Microsoft to which the WER client reports problems.

Windows users have three different levels to choose from when opting into reporting errors to Microsoft. These levels are:

  • Level 1 – If a user chooses this option, Windows does not report problems to Microsoft automatically. Instead, every time a fault (like a crash or hang) occurs on the user’s machine, a dialog is presented to the user, to get explicit consent to report the fault to Microsoft.
  • Level 2 (the default recommended by Windows; more details about defaults later in the post) – For users choosing this option, Windows automatically reports the problem signature (aka WER event parameters) to the Microsoft WER Service. However, if the backend WER Service requests additional information from the machine (like mini dumps or log files), a dialog is presented to the user to get explicit consent before uploading the additional information.
  • Level 3 – For users choosing this option, Windows automatically reports the WER event parameters to the backend WER Service, and if the WER Service requests additional information that does not contain any personally identifiable information (PII), then that information is uploaded to the service automatically. An example of such information is a mini dump, which has an extremely low probability of containing PII. However, if the information requested by the WER Service has the potential to contain PII (for example, a heap dump), a dialog is presented to the user to get explicit consent before uploading the additional information.

Note that the three consent levels only come into play if WER is enabled for the user/machine. If WER is disabled, then the user is never even prompted to report problems.
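
The three levels boil down to two decisions: what to do with the event parameters, and what to do with any additional data the service asks for. Here is a minimal sketch of that decision table as I have described it above. It is not the WER client's code; the Action enum and both function names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    NOT_REPORTED = "WER disabled, nothing is reported"
    ASK_USER = "show a consent dialog first"
    SEND_AUTOMATICALLY = "send without prompting"


def _check_level(level: int) -> None:
    if level not in (1, 2, 3):
        raise ValueError("consent level must be 1, 2, or 3")


def parameters_action(wer_enabled: bool, level: int) -> Action:
    """Decide how the WER event parameters are handled."""
    _check_level(level)
    if not wer_enabled:
        return Action.NOT_REPORTED
    # Only level 1 asks before sending the problem signature.
    return Action.ASK_USER if level == 1 else Action.SEND_AUTOMATICALLY


def additional_data_action(wer_enabled: bool, level: int, may_contain_pii: bool) -> Action:
    """Decide how additional data requested by the WER Service is handled."""
    _check_level(level)
    if not wer_enabled:
        return Action.NOT_REPORTED
    # Only level 3 uploads automatically, and only data considered free of PII.
    if level == 3 and not may_contain_pii:
        return Action.SEND_AUTOMATICALLY
    return Action.ASK_USER
```

The key property is visible in the second function: there is no combination of settings under which data that may contain PII is sent without asking.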

Following is a summary of the typical Windows 7 user experience for each of the consent levels (Note: Windows 7 is still in Beta and the UI is still susceptible to changes):

  • User experience when WER is disabled: Error reporting does not kick in. A simple dialog informing the user about the crash is shown.

  • User experience when WER consent level is set to 1:

      Step 1: On detecting a crash, the WER client presents UI to get explicit user consent to report WER event parameters to Microsoft’s WER backend service.

      Step 2: On getting consent in step 1, the WER client starts reporting parameters to Microsoft.

      Step 3: If the WER Service requests additional data (like heap dumps, log files, etc.), the WER client starts collecting the information.

      Step 4: After completing data collection, the WER client asks the user for explicit consent to upload data to Microsoft. If the user agrees, the WER client uploads the data. If the user clicks ‘Cancel’, the data is discarded and not sent to Microsoft.

  • User experience when consent level is set to 2:

      Step 1: On detecting a crash, the WER client automatically starts reporting WER event parameters to Microsoft’s WER Service.

      Step 2: If the WER Service requests additional data (like mini dumps, heap dumps, log files, etc.), the WER client starts collecting the information.

      Step 3: After completing data collection, the WER client asks the user for explicit consent to upload data to Microsoft. If the user agrees, the WER client uploads the data.

  • User experience when consent level is set to 3:

      Step 1: On detecting a crash, the WER client automatically starts reporting WER event parameters to Microsoft’s WER backend service.

      Step 2: If the WER Service requests additional data (like mini dumps, heap dumps, log files, etc.), the WER client starts collecting the information.

      Step 3: If the data collected is considered safe (i.e. not containing PII, like a minidump), the WER client uploads that data automatically without asking for explicit user consent. If, however, the data isn't considered safe (like heap dumps, log files, etc.), the WER client asks the user for explicit consent to upload data to the WER Service. If the user clicks ‘Send Information’, WER uploads the data.

Note: The WER client never automatically uploads data that may contain personally identifiable information (like heap dumps). Explicit user consent is always needed to upload such information. This design decision has been made to respect the privacy of Windows users.

So how does a user configure/change WER settings?

The first place where a user chooses to opt into WER is the Windows Out of the Box Experience (aka Windows OOBE). The first time a new Windows 7 machine boots up, an OOBE screen is presented to the user, which opts the machine into one of two WER consent levels depending on the user’s selection.

If the user does not choose “recommended settings” in OOBE, another “up-sell” dialog is presented to the user in the Windows Troubleshooting Control Panel.

In this dialog, if the user clicks ‘Yes’, his consent is bumped up to level 2. Note, however, that if the user has already set his consent level higher than 2, it is not brought down.

The OOBE and the up-sell dialog are presented to the user only once in the entire lifetime of his Windows 7 installation. If the user wants to change WER settings thereafter, he can do so by accessing the settings page via the Action Center Control Panel.

This page allows setting the consent only for the specific user interacting with this page. However, if an administrator wants to configure settings for all users on a machine, he can do so by clicking on “Change report settings for all users” and making the same choice in that dialog.

Hope this information helps you understand the various WER settings, the corresponding Windows UI behavior for each of the settings, and exactly how to configure/edit these settings.

Happy Error Reporting!

Windows 7 Beta Data – Windows 7 Release Candidate Data

Today we are releasing a report showing companies their Windows 7 beta crash data, and now their Windows 7 Release Candidate crash data! This report’s focus is to help companies see how well their applications are performing on the Windows 7 beta. The report has two sections: one highlights the top user-mode issues and the other highlights the top kernel-mode issues. The data in these two sections is slightly different, so I will explain what each data point is trying to communicate.

Navigate to: Windows Error Reporting > Software > My Reports

And select the Windows7BetaDataFailures report (being retired soon) or the Windows7_RC_Failures report.

Note: Download !analyze to help in debugging the problems shown in this report.

User-mode:  What is a user-mode failure?

A failure is chosen based on analysis of the cab by !analyze, a debugger extension that is part of the Debugging Tools For Windows.  A Failure is a unique combination of Symbol, Problem Class and Exception code.  The failure may aggregate multiple Buckets (EventID’s).  The results of the analysis can be dependent on the availability of symbols (PDB files) for the modules related to the failure.

Note: Because Microsoft does not have private symbols for debugging these memory dump files, failures are based on the best available analysis of the dumps. To get the most accurate picture of a failure, debug with private symbols, which will quickly show the precise origin of the failure.

Rank | Shows the overall rank of this particular failure across all of Windows 7 within the specific Event Type listed below.
Product Name | Shows the name of the product detected in this failure.
Symbol | Shows the symbol (Module/Function) that caused the failure detected via !analyze.
Problem Class | Shows a general description of the nature of the problem.
Exception Code | Shows the OS Exception Code.
Regression Percentage | Shows the percentage of time this failure has occurred on Windows 7 versus previous versions of Windows. (100% Regression percentage = Detected only on Windows 7)
Module Version Range | Shows the versions of a file where this failure was detected via !analyze.
View Cabs | Shows a list of cabs that contain memory dumps related to this specific failure.
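
To show how a failure can aggregate several buckets, here is a small sketch that groups analyzed buckets by the (Symbol, Problem Class, Exception Code) combination described above. The AnalyzedBucket record and both function names are hypothetical. The regression percentage formula is my reading of the column description (Windows 7 hits as a share of all hits), not a published definition.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class AnalyzedBucket(NamedTuple):
    """Hypothetical record: one bucket plus the result of analyzing its cab."""

    bucket_id: int
    symbol: str
    problem_class: str
    exception_code: str


FailureKey = Tuple[str, str, str]


def group_into_failures(buckets: Iterable[AnalyzedBucket]) -> Dict[FailureKey, Set[int]]:
    """Map each unique (symbol, problem class, exception code) to its bucket IDs."""
    failures: Dict[FailureKey, Set[int]] = defaultdict(set)
    for bucket in buckets:
        key = (bucket.symbol, bucket.problem_class, bucket.exception_code)
        failures[key].add(bucket.bucket_id)
    return dict(failures)


def regression_percentage(windows7_hits: int, earlier_hits: int) -> float:
    """Assumed formula: share of hits seen on Windows 7, as a percentage."""
    total = windows7_hits + earlier_hits
    if total == 0:
        raise ValueError("no hits recorded for this failure")
    return 100.0 * windows7_hits / total
```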

Kernel-mode: What is a kernel-mode bucket?

In kernel mode, the debugger walks the stack of the crashing thread and determines where to place the blame for the crash. Typically, a crash bucket name is derived from the debugger's choice of bugcheck stopcode + the blamed driver + the function name (if any, symbols are needed for this) + the function offset. All crashes assigned to a given bucket name can usually be considered as failing for the same reason.

Rank | Shows where the crash bucket ranks in the general Win7 crash population
Product Name | Shows the product detected in this hardware bucket, where available
Bucket Name | Combination of bugcheck stopcode, failing driver, failing function, and function offset. For example: 0xD1_tdx!TdxEventReceiveConnection+2d3
File Name | Name of the driver chosen as the cause of the crash via !analyze
Version | Shows the version of the file that was detected via !analyze
Hits | Shows the number of hits the crash bucket has seen in the past 30 days
Regression | Shows if this bucket exists only in Windows 7 or also on previous versions (True = Detected only on Windows 7)
Driver Version Range | Shows the range of versions where this particular bucket has been detected.
View Cabs | Shows a list of cabs that contain memory dumps related to this specific kernel bucket.
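
The bucket name format is easiest to see by building one. The sketch below assembles a name from the four pieces described above and reproduces the example from the table. The function name is hypothetical, and the form used when no function name is available (driver plus offset only) is an assumption, since the post only shows the full form.

```python
from typing import Optional


def kernel_bucket_name(stop_code: int, driver: str, function: Optional[str], offset: int) -> str:
    """Compose a kernel-mode bucket name from its parts."""
    # The function name is only available when symbols are present.
    # Assumption: without it, the driver name alone is used as the location.
    location = f"{driver}!{function}" if function else driver
    # Stopcode in uppercase hex, offset in lowercase hex, as in the example.
    return f"0x{stop_code:X}_{location}+{offset:x}"
```

For example, kernel_bucket_name(0xD1, "tdx", "TdxEventReceiveConnection", 0x2D3) produces the bucket name shown in the table.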