
    Best Practice - <GeneratePublisherEvidence> in ASPNET.CONFIG


    Best Practice Recommendation

    Add the following to your ASPNET.CONFIG or APP.CONFIG file:

    <?xml version="1.0" encoding="utf-8"?>
    <configuration>
        <runtime>
            <generatePublisherEvidence enabled="false"/>
        </runtime>
    </configuration>

    Note that the ASPNET.CONFIG file is located in the Framework directory for the version of the Framework you are using.  For example, for a 64-bit ASP.NET application it would be:

    c:\Windows\Microsoft.NET\Framework64\v2.0.50727

    For a 32-bit application it would be:

    c:\Windows\Microsoft.NET\Framework\v2.0.50727
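
    As a side note, if you want to confirm which Framework directory a given .NET process is actually using, the runtime can tell you. A minimal C# sketch (illustrative only):

    using System;
    using System.Runtime.InteropServices;

    class ShowFrameworkDir
    {
        static void Main()
        {
            // Prints the framework directory for the running process -
            // the same directory that holds the matching ASPNET.CONFIG.
            Console.WriteLine(RuntimeEnvironment.GetRuntimeDirectory());
        }
    }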

    Details

    I have seen this a bunch of times while onsite.  The problem goes something like this:

    When I restart my ASP.NET application the initial page load is very slow.  Sometimes upwards of 30+ seconds. 

    Many people just blame this on “.NET Startup” costs, but there is no out-of-the-box reason that an ASP.NET application should take that long to load.  Some applications do real work on startup, which can slow things down, but there are other causes as well.  A common cause that I have seen often recently is Certificate Revocation List (CRL) checking when generating Publisher Evidence for Code Access Security (CAS).

    A little background – CAS is a feature in .NET that allows you to have more granular control over what code can execute in your process.  Basically there are 3 parts:

    1. Evidence – Information that a module/code presents to the runtime.  This can be where the module was loaded from, the hash of the binary, the strong name, and, importantly for this case, the Authenticode signature that identifies a module’s publisher.
    2. Permission Set – A group of permissions to grant code (access to the file system, access to AD, access to the registry).
    3. Code Group – The evidence is used to establish membership in a code group.  Permission sets are granted to a code group.

    So when a module loads, it presents a bunch of evidence to the CLR and the CLR validates it.  One type of evidence is the “publisher” of the module.  This evidence is validated by looking at the Authenticode signature, which involves a certificate.  When validating the certificate, the OS walks the certificate chain and tries to download the Certificate Revocation List from a server on the internet.  This is where the slowdown occurs.
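
    If you are curious what evidence the CLR actually gathered for an assembly, you can enumerate it yourself. A minimal C# sketch (illustrative only - publisher evidence will only show up for Authenticode-signed assemblies, and gathering it is exactly what triggers the CRL check described above):

    using System;
    using System.Reflection;

    class ShowEvidence
    {
        static void Main()
        {
            // Enumerate the evidence objects (Zone, Url, Hash, Publisher, ...)
            // that the CLR associated with this assembly.
            foreach (object item in Assembly.GetExecutingAssembly().Evidence)
            {
                Console.WriteLine(item.GetType().Name);
            }
        }
    }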

    A lot of servers do not have access to make calls out to the internet.  It is either explicitly blocked, the server might be on a secure network, or a proxy server might require credentials to gain access.  If the DNS/network returns quickly with a failure, the check moves on, but if the DNS/network is slow or does not respond at all, the request has to time out.

    This can occur for multiple modules, because this evidence is created for each module that is loaded.  Once a given CRL has been probed for and the probe has failed, it will not be rechecked.  However, different certificates have different CRLs.  For instance, a VeriSign certificate may have one CRL URL while a Microsoft certificate will have a different one.

    Since this probe can slow things down, it is best to avoid it if you do not need it.  For .NET, the only reason you would need it is if you are applying Code Access Security policy based on the module’s publisher.  Because this can cause slowdowns and you do not need to incur the penalty, you can simply disable the generation of publisher evidence when your module is loaded.  To do so, use the <generatePublisherEvidence> element in your application configuration: set the enabled attribute to false and you will avoid all of this.

    Now for ASP.NET applications it was not immediately obvious how to do this.  It turns out that you cannot add this to an application’s Web.config, but you can add it to the ASPNET.CONFIG file in the Framework directory.  For other applications, just add the element to the APP.CONFIG file.

    In closing, there are several blog entries that do a great job of demonstrating how this will show up in the debugger, along with other details on CRL issues and workarounds.

    We are highlighting this as the first in a series of general best practice recommendations.

    NOTE – If you have .NET 2.0 RTM you will need this hotfix - http://support.microsoft.com/default.aspx/kb/936707


    Memory Based Recycling in IIS 6.0


    Customers frequently ask questions regarding the recycling options for Application Pools in IIS.

    Several of those options are self-explanatory, whereas others need a bit of analysis.

    I’m going to focus on the Memory Recycling options, which allow IIS to monitor worker processes and recycle them based on configured memory limits.

    Recycling application pools is not necessarily a sign of problems in the application being served. Memory fragmentation and other natural degradation cannot be avoided and recycling ensures that the applications are periodically cleaned up for enhanced resilience and performance.

    However, recycling can also be a way to work around issues that cannot be easily tackled. For instance, consider the scenario in which an application uses a third party component, which has a memory leak. In this case, the solution is to obtain a new version of the component with the problem resolved. If this is not an option, and the component can run for a long period of time without compromising the stability of the server, then memory based recycling can be a mitigating solution for the memory leak.

    Even when a solution to a problem is identified, but might take time to implement, memory-based recycling can provide a temporary workaround, until a permanent solution is in place.

    As mentioned before, memory fragmentation is another problem that can affect the stability of an application, causing out-of-memory errors when there seems to be enough memory to satisfy the allocation requests.

    In any case, setting memory limits on application pools can be an effective way to contain any unforeseen situation in which an application that “behaves well” goes haywire. The advantage of setting memory limits on well behaved applications is that in the unlikely case something goes wrong, there is a trace left in the Event Viewer, providing a first place where research of the issue can start.

    When to configure Memory Recycling

    In most scenarios, recycling based on a schedule should be sufficient in order to “refresh” the worker processes at specific points in time. Note that the periodic recycle is the default, with a period of 29 hours (1740 minutes). This can be an inconvenience, since each recycle would occur at different times of a day, eventually occurring during peak times.

    If you have determined that you have to recycle your application pool based on memory threshold, it implies that you have established a baseline for your application and that you know your application’s memory usage patterns. This is a very important assumption, since in order to properly configure memory thresholds you need to understand how the application is using memory, and when it is appropriate to recycle the application based on that usage.

    Configuration Options

    There are two options, which can be used together, that can be configured for application pool recycling:


    Figure 1 - Application Pool properties to configure Memory Recycling

    Maximum Virtual Memory

    This setting sets a threshold limit on the Virtual Memory. This is the memory (Virtual Address Space) that the application has used plus the memory it has reserved but not committed. To understand how the application uses this type of memory, you can monitor it by means of the Process – Virtual Bytes counter in Performance Monitor.

    For instance, if you receive out of memory errors, but less than 800MB are reported as consumed, it is often a sign of memory fragmentation.

    Maximum Used Memory

    This setting sets a threshold limit on the Used Memory. This is the application’s private memory, the non-shared portion of the application’s memory. You can use the Process – Private Bytes counter in Performance Monitor to understand how the application uses this memory.
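
    To get a feel for both counters before choosing thresholds, you can also read them from code. A minimal C# sketch, assuming a single worker process instance named w3wp (with several worker processes the instances appear as w3wp#1, w3wp#2, and so on):

    using System;
    using System.Diagnostics;

    class WorkerProcessMemory
    {
        static void Main()
        {
            // The same counters you would watch in Performance Monitor.
            PerformanceCounter virtualBytes = new PerformanceCounter("Process", "Virtual Bytes", "w3wp");
            PerformanceCounter privateBytes = new PerformanceCounter("Process", "Private Bytes", "w3wp");

            Console.WriteLine("Virtual Bytes: {0:N0}", virtualBytes.NextValue());
            Console.WriteLine("Private Bytes: {0:N0}", privateBytes.NextValue());
        }
    }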

    In the scenarios mentioned before, this setting would be used when you have detected a memory leak which you cannot avoid (or is not cost-effective to correct). This setting puts a “cap” on how much memory the application is “allowed” to leak before the application is restarted.

    Recycle Event Logging

    Depending on the configuration of the web server and application pools, every time a process is recycled, an event may be logged in the System log.

    By default, only certain recycle events are logged, depending on the cause for the recycle. Timed and Memory based recycling events are logged, whereas all other events are not.

    This setting is managed by the LogEventOnRecycle metabase property for application pools. This property is a bit flag; each bit indicates a reason for a recycle. Turning on a bit instructs IIS to log that particular recycle event. The following table shows the available values for the flag:

    Flag                          Value  Description
    AppPoolRecycleTime            1      The worker process is recycled after a specified elapsed time.
    AppPoolRecycleRequests        2      The worker process is recycled after a specified number of requests.
    AppPoolRecycleSchedule        4      The worker process is recycled at specified times.
    AppPoolRecycleMemory          8      The worker process is recycled once a specified amount of used or virtual memory, expressed in megabytes, is in use.
    AppPoolRecycleIsapiUnhealthy  16     The worker process is recycled if IIS finds that an ISAPI is unhealthy.
    AppPoolRecycleOnDemand        32     The worker process is recycled on demand by an administrator.
    AppPoolRecycleConfigChange    64     The worker process is recycled after configuration changes are made.
    AppPoolRecyclePrivateMemory   128    The worker process is recycled when private memory reaches a specified amount.

    As mentioned before, only some of the events are configured to be logged by default. The default value is 137 (1 + 8 + 128).

    It’s best to configure all the events to be logged (a value of 255). This will ensure that you understand the reasons why your application is being recycled. For information on how to configure the LogEventOnRecycle metabase property, see the support article at http://support.microsoft.com/kb/332088.
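
    If you prefer to script it, the property can also be set through the IIS ADSI provider. A minimal C# sketch (assumptions: a pool named DefaultAppPool, IIS 6.0 metabase access, and a reference to System.DirectoryServices):

    using System.DirectoryServices;

    class LogAllRecycleEvents
    {
        static void Main()
        {
            using (DirectoryEntry pool = new DirectoryEntry("IIS://localhost/W3SVC/AppPools/DefaultAppPool"))
            {
                // 255 = all eight recycle-event bits from the table above.
                pool.Properties["LogEventOnRecycle"].Value = 255;
                pool.CommitChanges();
            }
        }
    }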

    Below are the events logged because the memory limit thresholds have been reached:


    Figure 2 - The application pool reached the Used Memory Threshold


    Figure 3 - The application pool reached the Virtual Memory threshold

    Note that the article lists Event ID 1177 for Used (Private) Memory recycling, but in Event Viewer it shows up as Event ID 1117.

    Determining a threshold

    Unlike other settings, memory recycling is probably the setting that requires the most analysis, since the memory of a system may behave differently depending on various factors, such as the processor architecture, running applications, usage patterns, /3GB switch, implementation of Web Gardens, etc.

    In order to maximize service uptime, IIS 6.0 does Overlapped recycling by default. This means that when a worker process is due for a recycle, a new process is spawned and only when this new process is ready to start processing requests does the recycle of the old process actually occur. With overlapped mode, there will be two processes running at one point in time. This is one of the reasons why it is very important to understand the memory usage patterns of the application. If the application has a large memory footprint at startup, having two processes running concurrently could starve the system’s memory. For example, in a 32 bit Windows 2003 system with 4GB memory, Non-Paged Pool should remain over 285MB, Paged Pool over 330MB~360MB and System Free PTE should remain above 10,000. Additionally, Available Memory should not be lower than 50MB (these are approximate values and may be different for each system[1]). In a case in which recycling an application based on memory would cause these thresholds to be surpassed, Non-Overlapped Recycling Mode could help mitigate the situation, although it would impact the uptime of the application. In this type of recycling, the worker process is terminated first before spawning the new worker process.

    In general, if the application uses X MB of memory, and it’s configured to recycle when it reaches 50% over the normal consumption (1.5 * X MB), you will want to ensure that the system is able to support 2.5 * X MB during the recycle without suffering from system memory starvation. Consider also the need to determine what type of memory recycling option is needed. Applications that use large amounts of memory to store application data or allocate and de-allocate memory frequently might benefit from having Maximum Virtual Memory caps, whereas applications that have heavy memory requirements (e.g. a large application level cache), or suffer from memory leaks could benefit from Maximum Memory Used caps.

    The documentation suggests setting the Virtual Memory threshold as high as 70% of the system’s memory, and the Used Memory as high as 60% of the system’s memory. However, for recycling purposes, and considering that during the recycle two processes must run concurrently, these settings could prove to be a bit aggressive. As an estimate, for a 32 bit server with 4 GB of RAM, the Virtual Memory should be set to some value between 1.2 GB and 1.5 GB, whereas the Private bytes should be around 0.8 GB to 1 GB. These numbers assume that the application is the only one in the system. Of course, these numbers are quick rules of thumb and do not apply to every case. Different applications have very different memory usage patterns.

    For a more accurate estimation, you should monitor your application for a period long enough to capture information about its memory usage (private and virtual bytes) during the most common scenarios and stress levels. For example, if your application is used consistently on an everyday basis, a couple of weeks’ worth of data should be enough. If your application has a monthly process and each week of the month has different usage (users enter information at the beginning of the month and heavy reporting activity occurs at the end of the month), then a longer period may be appropriate. In addition, it is important to remember that data may be skewed if we monitor the application during the holidays (when traffic is only 10% of the expected load), or during the release of that long-awaited product (when traffic is expected to go as high as 500% of typical usage).

    Problems associated with recycling

    Although recycling is a mechanism that may enhance application stability and reliability, it comes with a price tag. Too little recycling can cause problems, but too much of a good thing is not good either.

    Non-overlapped recycles

    As discussed above, there are times when overlapped recycling may cause problems. Besides the memory conditions described above, if the application creates or instantiates objects for which only one instance can exist at a particular time for the whole system (like a named kernel object), or sets exclusive locks on files, then overlapped recycle may not be an option. It is very important to understand that non-overlapped recycles may cause users to see error messages during the recycling process. In situations like this, it is very important to limit recycles to a minimum.

    Moreover, if the time it takes for a process to terminate is too long, it may cause noticeably long outages of the application while it is recycling. In this case, it is very important to set an appropriate application pool shutdown timeout. This timeout defines how long IIS will wait for a worker process to terminate normally before forcing it to terminate. It is configured using the ShutdownTimeLimit metabase property (http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/1652e79e-21f9-4e89-bc4b-c13f894a0cfe.mspx?mfr=true).

    To configure an application pool to use non-overlapped recycles, set the application pool’s metabase property DisallowOverlappingRotation to true. For more information on this property, see the metabase property reference at http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/1652e79e-21f9-4e89-bc4b-c13f894a0cfe.mspx?mfr=true.
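
    Both are ordinary metabase properties, so they can be set the same way as LogEventOnRecycle above. A hedged C# sketch using the same ADSI approach (the pool name and the 90-second timeout are examples only):

    using System.DirectoryServices;

    class NonOverlappedRecycling
    {
        static void Main()
        {
            using (DirectoryEntry pool = new DirectoryEntry("IIS://localhost/W3SVC/AppPools/DefaultAppPool"))
            {
                pool.Properties["DisallowOverlappingRotation"].Value = true; // non-overlapped recycles
                pool.Properties["ShutdownTimeLimit"].Value = 90;             // seconds before forced termination
                pool.CommitChanges();
            }
        }
    }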

    Session state

    Stateful applications can be implemented using a variety of choices of where to store the session information. For ASP.NET, some of these are cookie-based, in-process, Session State Server, and SQL Server. For classic ASP, the list is more limited. These implementation decisions have an impact on whether to use Web Gardens and how to implement Web Farms. They may also determine the effect that application pool recycling has on the application.

    In the case of in-memory (in-process) sessions, the session information is kept in the worker process’ memory space. When a recycle is requested, a new worker process is started and the session information in the recycled process is lost. This affects every active session in that application, which is yet another reason why the number of recycles should be minimized.
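
    If sessions must survive recycles, moving session state out of the worker process is the usual answer for ASP.NET. A hedged web.config sketch (the ASP.NET State Service must be running; the connection string shown is the default local one):

    <configuration>
        <system.web>
            <sessionState mode="StateServer"
                          stateConnectionString="tcpip=loopback:42424"
                          timeout="20" />
        </system.web>
    </configuration>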

    Application Startup

    Recycling an application is an expensive task for the server. Process creation and termination are required to start the new worker process and bring down the old one. In addition, any application level cache and other data in memory are lost and need to be reloaded. Depending on the application, these activities can significantly reduce the performance of the application, providing a degraded user experience.

    Incidentally, another post in this blog discusses another of the reasons application startup may be slow. In many cases this information will help speed up the startup of your application. If you haven’t done it yet, check it out at http://blogs.msdn.com/pfedev/archive/2008/11/26/best-practice-generatepublisherevidence-in-aspnet-config.aspx.

    Memory Usage checks

    When an application is configured to recycle based on memory, it is up to the worker process to perform the checks. These checks are done every minute. If the application has surpassed the memory caps at the time of the check, then a recycle is requested.

    Bearing this behavior in mind, consider a case in which an application behaves well most of the time, but under certain conditions behaves really badly.  By misbehavior I mean very aggressive memory consumption during a very short period of time.  In this case, the application can exhaust the system’s memory before the worker process checks memory consumption and any recycle can kick in.  This can lead to problems that are difficult to troubleshoot and limit the effectiveness of memory based recycling.

    Conclusion

    The following points summarize good practices when using memory based recycling:

    • Configure IIS so it logs all recycle events in the System Log.
    • Use memory recycling only if you understand the application’s memory usage patterns.
    • Use memory based recycling as a temporary workaround until permanent fixes can be found.
    • Remember that memory based recycling may not work as expected when the application consumes very large amounts of memory in a very short period of time (less than a minute).
    • Closely monitor recycles and ensure that they are not too many (reducing the performance and/or availability of the application), or too few (allowing memory consumption of the application to create too much memory pressure on the server).
    • Use overlapped recycling when possible.

    Thanks,

    Santiago

    Santiago Canepa is an Argentinean relocated to the US, working as a Development PFE. He started out writing custom software in Clipper and worked for the Argentinean IRS (DGI) while completing his BS in Information Systems Engineering. He worked as a consultant both in Argentina and the US on Microsoft technologies, and led a development team before joining Microsoft. He has a passion for languages (don’t engage in a discussion about English or Spanish syntax – you’ve been warned), loves a wide variety of music and enjoys spending time with his husband and daughter. He can usually be found online at ungodly hours in front of his computer(s).


    [1] Vital Signs for PFE – Microsoft Services University


    How to reset your Flip UltraHD Video camcorder

    I recently ran into a problem with my Flip UltraHD Video camcorder where it would not turn on.  Unlike other camcorders in the Flip family, there is no microscopic reset button anywhere on the device. After e-mailing support, I received...

    AssemblyResolve Event and VJSharpCodeProvider


    This one kept me up late the other night.  We had an issue where ASP.NET was recompiling a page and we were getting an exception:

    System.Exception:

    Error loading VJSharpCodeProvider, Version=2.0.0.0, Culture=neutral,PublicKeyToken=b03f5f7f11d50a3a

    This was interesting for two reasons:

    1. The application does not use J#.
    2. This is more subtle but nagged at the back of my mind - why was this thrown as a System.Exception and not a System.IO.FileNotFoundException or something more explicit, since the framework typically throws specific exceptions?

    Some additional background - this problem was only occurring during acceptance testing and did not occur in dev.  They were able to load up the page and successfully run through the application.  Then after stressing the application they would eventually start getting these J# errors and other Dynamic compilation errors from ASP.NET.  This continued until they restarted IIS and cleared the Temporary ASP.NET folder.

    They did not have the J# runtime installed on the system in dev or test, and this was completely expected because they were not using J#.  I got a look at the application and verified this.

    Side Note - I have been doing product support for all 6 years that I have been working at Microsoft.  For those of you out there doing similar stuff, I am sure you know this, and if not, take heed - always verify things when working to resolve issues like this.  It is not that anyone out there is intentionally hiding things, but people frequently are unaware, and when you ask, "Is such and such the case?" they may very well answer incorrectly.  So verify and double-check.  Part of having someone from support come in is to bring a fresh perspective that analyzes the problem from the outside, and that can frequently be invaluable.

    So there is no J#.....Then why is this even trying to load?

    I am glad you asked, because I have an answer.  Deep in the bowels of the ASP.NET compile engine and the CodeDOM, all of the "code providers" are enumerated to see if they are valid.  Part of this test is done by trying to load their assemblies.  The default code providers are:

    • C#
    • VB.NET
    • JavaScript
    • J#
    • MC++

    You can actually see these defaults documented in the machine.config.comments file in your %windir%\Microsoft.net\framework\v2.0.50727\config folder:

    <compilers>
        <compiler language="c#;cs;csharp" extension=".cs" type="Microsoft.CSharp.CSharpCodeProvider, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" />
        <compiler language="vb;vbs;visualbasic;vbscript" extension=".vb" type="Microsoft.VisualBasic.VBCodeProvider, System, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089" />
        <compiler language="js;jscript;javascript" extension=".js" type="Microsoft.JScript.JScriptCodeProvider, Microsoft.JScript, Version=8.0.1100.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" />
        <compiler language="vj#;vjs;vjsharp" extension=".jsl;.java" type="Microsoft.VJSharp.VJSharpCodeProvider, VJSharpCodeProvider, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" />
        <compiler language="c++;mc;cpp" extension=".h" type="Microsoft.VisualC.CppCodeProvider, CppCodeProvider, Version=8.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a" />
    </compilers>

    Those are the 5 that will get searched by default, and you can add additional code providers in your config file (you cannot remove these 5 default ones as far as I can tell).

    All of this said, it would seem that every application that does a dynamic compile must be looking for this J# provider.  So I went to do a quick test.  I created a simple ASPX page that did a dynamic compile.  Before I ran the page I went to the Fusion Log Viewer (http://msdn2.microsoft.com/en-us/library/e74a18c4(VS.80).aspx).  This tool allows me to see what happens during binds.  I clicked "Settings...", checked "Log Binding Failures to Disk" and exited the application.  Then I did an IIS reset to make sure that the setting was picked up and browsed to my page.  After the page came up I reopened the Fusion Log Viewer and sure enough there was a bind failure for the VJSharpCodeProvider.  If you follow these same steps, assuming you do not have the J# runtime installed, you should see the same.

    So I was seeing the same failure to load the VJSharpCodeProvider, but I did not get any errors and the dynamic compilation did not fail.  This is similar to what they saw - it works at first and then later breaks.  Reviewing briefly: we know that the VJSharpCodeProvider is not on their system, but the failure to load this file is normal, so why does it start becoming a problem?

    Well, this is where their ResolveEventHandler (http://msdn2.microsoft.com/en-us/library/system.resolveeventhandler.aspx) comes in.  We found one event log entry, where they log the callstack, in which a ResolveEventHandler was on top; they had subscribed to the AssemblyResolve event - http://msdn2.microsoft.com/en-us/library/system.appdomain.assemblyresolve.aspx.  From there we began looking into the code.

    First, some background - the AssemblyResolve event is part of the assembly loading that .NET does.  After the runtime has searched for and failed to find an assembly, it raises this event.  The application can then explicitly load the assembly (typically using LoadFrom or something similar) and return a reference to it.  This is essentially a way to redirect the application to an assembly.  Once we got the code, we found that they filtered out a couple of specific exceptions and then, for any assembly they did not have a location for, they just threw a System.Exception.  Ahhhhh, that explains the nagging question about why the exception type was System.Exception and not something else.

    We took a look back at the AssemblyResolve event, and the docs have a glaring oversight - what do you do when you receive a request for an assembly that you do not know the location of?  The answer turns out to be: return null.  I verified this internally, and someone in the community has been nice enough to add this note in the "Community Content" section of the docs - http://msdn2.microsoft.com/en-us/library/system.appdomain.assemblyresolve.aspx
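
    For illustration, here is what a well-behaved handler might look like in C# - a minimal sketch where the "MyCompany" prefix and the shared-binaries path are hypothetical stand-ins for whatever the handler actually knows how to locate:

    using System;
    using System.IO;
    using System.Reflection;

    static class AssemblyRedirector
    {
        public static void Install()
        {
            AppDomain.CurrentDomain.AssemblyResolve += new ResolveEventHandler(OnAssemblyResolve);
        }

        private static Assembly OnAssemblyResolve(object sender, ResolveEventArgs args)
        {
            string name = new AssemblyName(args.Name).Name;
            string candidate = Path.Combine(@"D:\SharedBinaries", name + ".dll");

            // Only redirect assemblies we actually know about.
            if (name.StartsWith("MyCompany.") && File.Exists(candidate))
            {
                return Assembly.LoadFrom(candidate);
            }

            // Anything else (like the VJSharpCodeProvider probe) is not ours to
            // resolve.  Returning null lets the CLR treat it as a normal bind failure.
            return null;
        }
    }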

    In conclusion: when the application is first launched, their code has not executed yet, so the ResolveEventHandler has not been hooked up to the AssemblyResolve event.  Later in the run of the application, when a recompile occurs and the VJSharpCodeProvider is reprobed, the exception is thrown; the ASP.NET compile code is not expecting an exception, and this leads to the other errors.

    Once we fixed up the handler to return NULL the application worked.  It sailed through testing and I got to go to bed!!  I hope that this helps someone else out.

    Thanks,
    Zach


    All the Ways to Capture a Dump...


    Frequently when troubleshooting a problem we capture a memory dump.  Memory dumps are a great tool because they are a complete snapshot of what a process is doing at the time the dump is captured.  It can be a great forensic tool.  There are however an abundance of tools to capture memory dumps (*.dmp).  In talking with some new members on our team we were reviewing the dump capture techniques and the benefits/drawbacks.  If you have your own thoughts please feel free to comment.

    Memory dumps are captured for two categories of problems. The first is a hang dump, captured when the application is non-responsive or when you just want a snapshot of the current state of the application. The second is a crash dump, captured when exceptions occur. Depending on the purpose, the actions we instruct the debugger to take differ for the two types of dump.

    For hang dumps, we instruct the debugger to attach to the process (if no debugger is attached) and initiate the dump immediately. For crash dumps, we attach a debugger to the process first, set it up to monitor the type of exceptions we are interested in, and generate a memory dump when such an exception is raised. Below we discuss the different ways to generate memory dumps.

    First let's take a look at some of the more specific scenarios for capturing a dump file (these are grouped for the most part into crash or hang based on what they relate to; there are definitely a lot more variants, but these are some basic ones that we can use to compare the different debuggers):

    1. Crash
      1. Create a dump file when an application crashes (an exception has occurred that causes the process to terminate)
      2. Creating a dump file from an application that fails during startup
    2. Hang
      1. Create a dump file when an application hangs (stops responding but does not actually crash)
      2. Creating a dump file while an application is running normally 
    3. First Chance - Creating a dump file when an application encounters any exception during the run of the application
    4. Smaller Dump - Shrinking an existing dump file to reduce the amount of information in the dump file

    For each of these scenarios how do the debuggers measure up:

    Debuggers/Scenarios        Crash  Hang  First Chance  Smaller Dump
    DebugDiag                  Yes    Yes   Yes           No
    ADPlus                     Yes    Yes   Yes           No
    CDB/WinDbg                 Yes    Yes   Yes           Yes
    Dr. Watson                 Yes    Yes   No            No
    UserDump                   Yes    Yes   Yes           No
    NTSD                       Yes    Yes   Yes           Yes
    Visual Studio              Yes    Yes   Yes           No
    Vista Crash Dump Features  Yes    Yes   No            No

    Next let’s take a quick look at how you might use each one and then we will explore a bit more:

     

    ADPlus
        Crash: adplus -crash -quiet -o c:\dumps -pn <processname.exe>
        Hang:  adplus -hang -quiet -o c:\dumps -pn <processname.exe>

    Dr. Watson
        Crash: see Additional Details below
        Hang:  see Additional Details below

    CDB and WinDbg
        Crash: .dump /ma c:\w3wp.dmp
        Hang:  .dump /ma c:\w3wp.dmp

    UserDump
        Crash: see Additional Details below
        Hang:  userdump PID

    DebugDiag
        Crash: create a Crash Rule with the wizard
        Hang:  right click on the process in Processes and select Full Memory Dump

    MiniDumpWriteDump
        Crash: see Additional Details below
        Hang:  see Additional Details below

    Visual Studio
        Crash: Debug | Save Dump As...
        Hang:  Debug | Save Dump As...

    There are many ways to capture a memory dump as you can see above. Depending on the nature of the problem, a hang or crash dump can be captured using different tools.

    For a quick and easy hang dump in a production environment where installing tools might be an issue, the NTSD debugger distributed with Windows XP/Server 2003 can be the easiest option.

    However, we frequently recommend that our customers have the Debugging Tools for Windows and DebugDiag available on their servers.  ADPlus has a very low footprint and is great for capturing memory dumps.  DebugDiag is super easy and great for setting up rules to generate dumps.  Unless there is a special case, DebugDiag is the way to go.

    Additional Details

    Below are some details for most of the tools that we outlined above and when they might come in handy.

    External Tools

    DebugDiag.exe from Debug Diagnostic Tool

    Where do I get it?

    DebugDiag is another free downloadable tool from Microsoft. To download, search for “Debug Diagnostic Tool v1.1” from http://www.microsoft.com/downloads or specifically http://www.microsoft.com/downloads/details.aspx?FamilyID=28bd5941-c458-46f1-b24d-f60151d875a3&displaylang=en

    How do I use it?

    Hang - On the “Processes” tab, right-click the process and select “Create Full User Dump”. The memory dump will be located under the Logs\Misc folder of the DebugDiag installation folder.

    Crash - Select the Rules tab. Click the “Add Rule” button. Select “Crash” as the rule type. Click “Next”. Select a target type, for instance “A Specific Process”. Click “Next” through the remaining screens to finish. DebugDiag will monitor for crashes, and once a memory dump is captured it will show the count in the “UserDump” column of the Rules panel.

    DebugDiag supports multiple crash rules to monitor multiple processes. It also supports many other configurations. For more information, please refer to the help file that is installed with DebugDiag.

    Also see the DebugDiag blog for additional information – http://blogs.msdn.com/debugdiag/

    Debugging Tools for Windows (WinDBG, CDB, ADPlus)

    Where do I get it?

    This set of tools is downloadable for free from http://www.microsoft.com/whdc/devtools/debugging/default.mspx

    How do I use it?

    Assuming it will be installed to C:\Debuggers folder, we will use that folder for examples.

    Catch a hang dump using ADPLUS.vbs

    Adplus.vbs is a VBScript that makes it easy to use the CDB debugger. To capture a hang dump to the c:\dumps folder from a running instance of NOTEPAD.exe that has process ID 1806, type:

    C:\debuggers>adplus -hang -quiet -o c:\dumps -p 1806

    To capture hang dumps to c:\dumps for all instances of NOTEPAD.exe, type:

    C:\debuggers>adplus -hang -quiet -o c:\dumps -pn notepad.exe

    Catch a crash dump using ADPLUS.vbs

    To catch a crash dump from a running instance of NOTEPAD with process ID 1806, type:

    C:\debuggers>adplus -crash -quiet -o c:\dumps –p 1806

    To catch crash dumps on all instances of NOTEPAD.exe, type:

    C:\debuggers>adplus -crash -quiet -o c:\dumps -pn notepad.exe

    Note that after starting adplus, it will quickly exit, since all it does is set up the CDB debugger to do the work. After the adplus command window exits, you will notice a CDB command window come up (minimized by default). Just wait until CDB finishes; the command window will close.

    Catch a crash dump on any crashing process using CDB.exe

    Since the adplus utility can only catch crash dumps for a targeted process, it cannot be used system-wide to capture a crash dump for any crashing process. If we also want to overcome the limitations of Dr. Watson, we can use the CDB debugger for this purpose. To do this, type:

    C:\debuggers> cdb -iaec "-c \".dump /u /ma c:\dumps\av.dmp;q\""

    This will configure the CDB debugger as the default crash handler via the AeDebug registry key. You can verify the setting by browsing to the registry key:

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug

    And see these two values:

    Value name: Auto Value data: 1

    Value name: Debugger Value data: "c:\debuggers\cdb.exe" -p %ld -e %ld -g -c ".dump /ma /u c:\av.dmp;q"

    Except for a different debugger, this configuration is identical to the configuration described below for NTSD.exe. Since NTSD.exe does not have a feature to configure the AeDebug registry key, we have to set it manually in that scenario.

    The advantage of using CDB as system wide crash handler is that it can capture crash dumps for any process on Windows XP/Server 2003/Vista.

    There are a lot more features to discuss about the Debugging Tools for Windows and there are a lot of resources out there.

    Before we leave this topic, another cool feature of both CDB and WinDbg is that you can load a dump file into them for analysis, and from there generate a smaller dump file if you want to share the callstacks or something basic with another person, via e-mail for instance.  You can generate a minidump using this command while debugging the dump:

    .dump /m c:\smallerDump.dmp

    This can be very helpful for providing a bit of information without having to transfer a huge DMP file.

    UserDump.exe

    Crash Dumps and hang dumps can be taken with the User Dump tool.  For all of the details and download information check out http://support.microsoft.com/kb/241215

    Dumps and Development Time

    Now when you are developing there are a couple of cases where you want to look at dump generation.

    Visual Studio

    Where do I get it?

    You purchase it.

    How do I use it?

    To generate dump files you need to have the C++ tools installed, which will enable you to do native debugging.  Once you enable native debugging, then whenever you have “broken” into the debugger you can go to the Debug menu and select Save Dump As…, which allows you to save a dump file.  This can be very helpful on a developer’s machine when they encounter a problem in an application they are debugging and want to save it for later analysis.  Also, I have used this to capture dumps of Visual Studio itself when it is acting weird.  Just pop open a second instance of VS, attach to the first and take a memory dump.

    MiniDumpWriteDump

    Where do I get it?

    This is an API that you would have to build into your application to enable your application to generate a dump file.

    How do I use it?

    This is an API to generate dumps, so it cannot be used directly but has to be coded into a tool or application.  Customers will often bake this into their native unhandled exception filter (registered with SetUnhandledExceptionFilter).  A dump can be taken with MiniDumpWriteDump using the following code:

    // Compile as Unicode and link against dbghelp.lib.
    #include <windows.h>
    #include <strsafe.h>
    #include <dbghelp.h>
    #include <shellapi.h>
    #include <shlobj.h>
    
    int GenerateDump(EXCEPTION_POINTERS* pExceptionPointers)
    {
        BOOL bMiniDumpSuccessful;
        WCHAR szPath[MAX_PATH]; 
        WCHAR szFileName[MAX_PATH]; 
        WCHAR* szAppName = L"AppName";
        WCHAR* szVersion = L"v1.0";
        DWORD dwBufferSize = MAX_PATH;
        HANDLE hDumpFile;
        SYSTEMTIME stLocalTime;
        MINIDUMP_EXCEPTION_INFORMATION ExpParam;
    
        // Build a unique dump file name under %TEMP%\AppName.
        GetLocalTime( &stLocalTime );
        GetTempPath( dwBufferSize, szPath );
        StringCchPrintf( szFileName, MAX_PATH, L"%s%s", szPath, szAppName );
        CreateDirectory( szFileName, NULL );
    
        StringCchPrintf( szFileName, MAX_PATH, L"%s%s\\%s-%04d%02d%02d-%02d%02d%02d-%ld-%ld.dmp", 
                   szPath, szAppName, szVersion, 
                   stLocalTime.wYear, stLocalTime.wMonth, stLocalTime.wDay, 
                   stLocalTime.wHour, stLocalTime.wMinute, stLocalTime.wSecond, 
                   GetCurrentProcessId(), GetCurrentThreadId());
    
        hDumpFile = CreateFile(szFileName, GENERIC_READ|GENERIC_WRITE, 
                    FILE_SHARE_WRITE|FILE_SHARE_READ, 0, CREATE_ALWAYS, 0, 0);
    
        // Pass the exception record so the dump captures the faulting context.
        ExpParam.ThreadId = GetCurrentThreadId();
        ExpParam.ExceptionPointers = pExceptionPointers;
        ExpParam.ClientPointers = TRUE;
    
        bMiniDumpSuccessful = MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), 
                        hDumpFile, MiniDumpWithDataSegs, &ExpParam, NULL, NULL);
    
        CloseHandle(hDumpFile);
    
        return EXCEPTION_EXECUTE_HANDLER;
    }

    From http://msdn2.microsoft.com/en-us/library/bb204861.aspx
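
    To actually produce dumps from crashes, GenerateDump is typically registered as the process's unhandled exception filter. A hypothetical wiring (the filter name here is illustrative, not part of the original sample):

    LONG WINAPI TopLevelFilter(EXCEPTION_POINTERS* pExceptionPointers)
    {
        // Write the dump, then let the process terminate.
        return GenerateDump(pExceptionPointers);
    }

    // Early in startup, e.g. at the top of main() or wWinMain():
    //     SetUnhandledExceptionFilter(TopLevelFilter);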

    Existing tools for Windows XP/Server 2003

    Dr. Watson

    Where do I get it?

    It is found at %SystemRoot%\system32\drwtsn32.exe and is distributed with Windows XP/Server 2003 and earlier versions.

    How do I use it?

    Configure Dr. Watson by typing:

    c:\windows\system32>drwtsn32

    You can configure the location of the memory dump (the default is C:\Documents and Settings\All Users\Application Data\Microsoft\Dr Watson\user.dmp) and the type of memory dump (for dump analysis purposes, select “Full” for Crash Dump Type; the default type is Mini). The problem with Dr. Watson is that it overwrites the user.dmp file. If you have more than one process crash, or the same process crashes for different reasons, you have to make sure to move the user.dmp file before it is overwritten.

    NTSD.exe

    Where do I get it?

    It is found in the %SystemRoot%\System32 folder and is a debugger distributed with Windows XP/Server 2003 and earlier versions.

    How do I use it?

    It’s a command line debugger and can be used to capture crash dumps as well as hang dumps.

    Crash Dump - To set up NTSD as the system-wide crash handler, add two values under the AeDebug registry key:

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\AeDebug

    Value name: Auto Value data: 1

    Value name: Debugger Value data: "ntsd.exe" -p %ld -e %ld -g -c ".dump /ma /u c:\av.dmp;q"

    Make sure to include the quotes in Value data.

    Upon a process crash, the OS will invoke NTSD.exe, which will in turn attach to the process and capture a memory dump by executing the .dump debugger command.

    If you are curious what the command line means, the “-c” switch is for the debug command to be executed by NTSD debugger. The command “.dump” is for generating the memory dump. The switch “/ma” is for mini dump with all data streams. The “/u” switch is for the debugger to use a unique file name based on timestamp. The “;” delimits the first command. The second command “q” is for debugger to terminate the process and exit.

    Hang Dump - NTSD is a full featured debugger. It can be used to capture a hang dump given the process ID. For example, if NOTEPAD.exe is running with process ID 1806 (process ID can be obtained from task manager), you can type this command to capture a hang dump:

    C:\>ntsd -c ".dump /ma /u c:\hang.dmp;qd" -p 1806

    If you don’t want to type the whole command line again and again, you can create a batch file, HangDump.bat, with this command:

    ntsd -c ".dump /ma /u c:\hang.dmp;qd" -p %1

    To use the batch file, type:

    C:\>HangDump 1806

    Generating dumps with Vista

    Vista does not ship with Dr. Watson or NTSD. Instead, Vista has “Problem Reports and Solutions” which is located under Control Panel->System and Maintenance.

    Capture a hang dump in Vista

    Inside Task Manager, select the process, right-click and choose “Create Dump File”.


    Capture a crash dump in Vista

    For a process crash, Windows Error Reporting (WER) will try to analyze the crash first and then display a message box offering to close the program (where BADAPP is the crashing application).


    When you click “Close program”, the process will exit. By default, no crash dump file is kept on the local system. However, if you wish to retain one, you can set the ForceQueue registry value to 1 under:

    HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting
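
    Expressed as a .reg file, that change looks like this (it sets exactly the value described above):

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting]
    "ForceQueue"=dword:00000001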

    Dump files from processes running in the system context or elevated are located at:

    %ALLUSERSPROFILE%\Microsoft\Windows\WER\[ReportQueue | ReportArchive]

    Dump files from other processes are located at:

    %LOCALAPPDATA%\Microsoft\Windows\WER\[ReportQueue | ReportArchive]

    Starting with Vista SP1 and Windows Server 2008, more features for crash dumps will be supported. For more information, see Collecting User-Mode Dumps at http://msdn2.microsoft.com/en-us/library/bb787181(VS.85).aspx
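
    Among other things, that article documents the LocalDumps registry key those versions add. As a hedged sketch of the idea (key and value names per the linked documentation; a DumpType of 2 requests a full dump):

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps]
    "DumpType"=dword:00000002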

    Credits

    A lot of people contributed thoughts, comments, and work to this, so I just want to give them credit:

    Linkai Yu who did the bulk of the work. (I will introduce you to him in a subsequent post!)
    Michael Wiley, Pierre Donyegro, David Wu, Greg Varveris, Sean Thompson and Aaron Barth who contributed and helped review.


    Troubleshooting Event ID 5010 — IIS Application Pool Availability


    With this event you will get a message stating, “A process serving application pool '%1' failed to respond to a ping. The process id was '%2'.”  When Event ID 5010 appears in your system log with WAS as the source, you can be assured of two things.  The first is that a worker process failed to respond to a ping from the Windows Process Activation Service (WAS).  The second is that WAS attempted to shut down the offending process.  Generally, this means that either the worker process hung or it did not have enough resources to respond to the ping request.  The resolution in the documentation for this event is listed as:

    Diagnose an unresponsive worker process

    A worker process that fails to respond may exhibit one of the following symptoms:

    • Ping failure.
    • Communication failure with WAS.
    • Worker process startup time-out.
    • Worker process shutdown time-out.
    • Worker process orphan action failure.

     

    There are some links provided as well that give guidance on troubleshooting memory leaks, long response times, and high CPU usage in the worker process.   In each of those 3 cases, there is a deterministic approach with which you can take a memory dump and investigate the issue.  However, in practice there are times when the worker process may simply hang without displaying any of those behaviors.  So the question becomes: how do we know when the worker process is hung?  The answer is, WAS knows, so we just have to ask WAS.

    I’m going to go through the steps of finding the breakpoints and getting the needed information because the methodology can be used elsewhere.  For example, though the steps at the bottom of the page work for IIS 7, 7.5, and 8, I haven’t tested them with IIS 6 or below.  If you’re currently experiencing this issue and just want to know how to get the memory dump of the Worker Process, feel free to skip down to the bottom.

    When working with an issue like this, our first step is always to clarify what we know and what we need to know.

    What we know:

    • WAS knows when a worker process hangs
    • WAS knows enough about the worker process to shut it down if it appears hung

     

    What we need to know:

    • How to tell when WAS thinks a worker process is hung
    • How to identify the hung process
    • How to create a dump of the worker process before it’s terminated by WAS

     

    To get to the answers we need, we’ll investigate the WAS process with WinDbg.  Our goal will be to find a method call that signifies a failed ping, set a breakpoint on it, then identify the associated worker process at the time of failure.  We’ll start by attaching the debugger to the WAS process and reloading symbols.  Symbols are important here because we’ll need them to investigate the method calls.  Once we have the debugger attached, we can take a look at the loaded modules in an effort to make some guesses at where we should look.  We’ll do this by running lm, for list modules.  We can append 1m and sm to only list the module names and sort them alphabetically.

    0:000> lm 1m sm
    ADVAPI32
    bcryptPrimitives
    clbcatq
    combase
    CRYPT32
    CRYPTBASE
    CRYPTSP
    DPAPI
    GDI32
    HTTPAPI
    IISRES
    iisutil
    iisw3adm
    KERNEL32
    KERNELBASE
    ktmw32
    logoncli
    mlang
    MSASN1
    msvcrt
    nativerd
    NSI
    ntdll
    ntmarta
    ole32
    OLEAUT32
    pcwum
    profapi
    RPCRT4
    rsaenh
    sechost
    Secur32
    SSPICLI
    svchost
    USER32
    USERENV
    w3ctrlps
    W3TP
    WS2_32
    XmlLite

    Now that we have the modules, we can make some guesses as to which ones might be related to processing the failed ping.  It seems likely that the responsible module will either be one of the IIS modules or a module with w3 in the name.  Next, in the debugger, we can search those modules for method calls that look related.  We can use the x command to search for the methods using wildcards.

    0:000> x *iis*!*ping*
    000007fb`45efb2d4 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem (<no parameter info>)
    000007fb`45f2c330 iisw3adm!g_aIISWPSiteMappings = <no type information>
    000007fb`45f247dc iisw3adm!AddDomainMapping (<no parameter info>)
    000007fb`45f13fb0 iisw3adm!CONFIG_CS_PATH_MAPPER::SetConfigFileMappings (<no parameter info>)
    000007fb`45f2c4d0 iisw3adm!g_aIISWPGlobalMappings = <no type information>
    000007fb`45efdde4 iisw3adm!MESSAGING_HANDLER::HandlePingReply (<no parameter info>)
    000007fb`45efc59c iisw3adm!PingResponseTimerExpiredCallback (<no parameter info>)
    000007fb`45f24918 iisw3adm!EnsureLSAMapping (<no parameter info>)
    000007fb`45f2c490 iisw3adm!g_aIISULGlobalMappings = <no type information>
    000007fb`45f2c2d0 iisw3adm!aIISULSiteMappings = <no type information>
    000007fb`45f059f4 iisw3adm!PERF_MANAGER::ProcessPerfCounterPingFired (<no parameter info>)
    000007fb`45f2c168 iisw3adm!g_cIISWPSiteMappings = <no type information>
    000007fb`45efb64c iisw3adm!WORKER_PROCESS::CancelPingResponseTimer (<no parameter info>)
    000007fb`45f2d3cc iisw3adm!g_fWASStoppingInProgress = <no type information>
    000007fb`45f06be0 iisw3adm!PerfCountPing (<no parameter info>)
    000007fb`45f31dc0 iisw3adm!_imp_LsaManageSidNameMapping = <no type information>
    000007fb`45ed98a0 iisw3adm!CheckWASIsStopping (<no parameter info>)
    000007fb`45f31eb0 iisw3adm!_imp_CreateFileMappingW = <no type information>
    000007fb`45f31eb8 iisw3adm!_imp_OpenFileMappingW = <no type information>
    000007fb`45f2c164 iisw3adm!cIISULSiteMappings = <no type information>
    000007fb`45f2c484 iisw3adm!g_cIISULGlobalMappings = <no type information>
    000007fb`45efb164 iisw3adm!WORKER_PROCESS::SendPingWorkItem (<no parameter info>)
    000007fb`45efc51c iisw3adm!SendPingTimerCallback (<no parameter info>)
    000007fb`45f31df0 iisw3adm!_imp_CreateFileMappingA = <no type information>
    000007fb`45f31df8 iisw3adm!_imp_OpenFileMappingA = <no type information>
    000007fb`45f2c488 iisw3adm!g_cIISWPGlobalMappings = <no type information>

    The first method call returned looks very promising and is a good place to start.  We can set a breakpoint on that line item, tell the process to continue, then intentionally cause a failed ping request to see if the breakpoint trips.

     

    0:000> bp iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
    0:000> g

    To cause a failed ping request, we can suspend a worker process with a tool such as Process Explorer.


     

    After the time specified in the application pool’s configuration, the WAS process should identify the worker process as hung and attempt to shut it down.  If our breakpoint is correct, the debugger, still attached to the WAS process, should break in.  As it turns out, that is exactly what happens in this case.

    Breakpoint 0 hit
    iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem:
    000007fb`45efb2d4 48895c2410      mov     qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705541a9cd000000

    The breakpoint has been hit, and if we check Process Explorer, we can see the process is still alive.  Further, we know because of the calling convention that we’re dealing with a WORKER_PROCESS object.  At this point, we could absolutely write a script to dump all of the worker processes on the machine, and we would know we got a dump of the hung process.  If, however, we’re working on a machine that has a lot of worker processes, it may be more beneficial to find the actual worker process being referred to and just dump that one.  The WORKER_PROCESS object will almost undoubtedly contain a PID for the worker process, which we could use with a script to dump just that worker process.  We can investigate the current thread and try to find the worker process object.  We can then look to see if there is an offset from the object that points to the PID.

    We’ll start by dumping the call stack.

    0:002> k
    Child-SP          RetAddr           Call Site
    000000cd`a86cfb98 000007fb`45ef7385 iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
    000000cd`a86cfba0 000007fb`45ede8e3 iisw3adm!WORKER_PROCESS::ExecuteWorkItem+0x41
    000000cd`a86cfbf0 000007fb`45ede442 iisw3adm!WORK_ITEM::Execute+0x13
    000000cd`a86cfc30 000007fb`45edb129 iisw3adm!WORK_QUEUE::ProcessWorkItem+0xa2
    000000cd`a86cfc70 000007fb`45edafb7 iisw3adm!WEB_ADMIN_SERVICE::MainWorkerThread+0x35
    000000cd`a86cfcb0 000007fb`45ed9aec iisw3adm!WEB_ADMIN_SERVICE::ExecuteService+0xdf
    000000cd`a86cfcf0 000007fb`4e8f167e iisw3adm!WASServiceThreadProc+0x94
    000000cd`a86cfd60 000007fb`50533501 KERNEL32!BaseThreadInitThunk+0x1a
    000000cd`a86cfd90 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

     

    This gives us an idea of the calling order, but the public symbols we’re using won’t give us local variables, and in this instance they won’t tell us if and what parameters are passed.  The x64 calling convention dictates that in a 64-bit process, the first 4 parameters are stored in the CPU registers RCX, RDX, R8, and R9, in that order.  Any additional parameters are stored on the stack.  If this were a 32-bit process, the parameters would most likely all be on the call stack.  Knowing this, we can start by searching the call stack and the 4 registers mentioned for the PID we identified.  In fact, any of the non-null registers displayed with the r command are probably a good place to look, excluding RIP, which points to code.

    0:002> r
    rax=0000000000000000 rbx=0000000000000000 rcx=000000cda910e520
    rdx=000000cda94159d0 rsi=0000000000000001 rdi=000000cda910e520
    rip=000007fb45efb2d4 rsp=000000cda86cfb98 rbp=000000cda8709ec0
      r8=000000cda86cfb28  r9=000000cda86cfc10 r10=0000000000000000
    r11=0000000000000246 r12=0000000000000000 r13=0000000000000000
    r14=0000000000000000 r15=0000000000000000
    iopl=0         nv up ei pl zr na po nc
    cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
    iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem:
    000007fb`45efb2d4 48895c2410      mov     qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000


    We can see that RAX and RBX are null, but RCX and RDX are not.  We’ll start with the call stack, then move on to the other registers.  For the call stack, we’ll search from 100 bytes below the current stack pointer (RSP) up to the stack base (RBP).  Remember, call stacks start at a higher address and grow downward.  We’re searching for a word value because that’s plenty of space for the PID to be stored in, and our known PID is 8012.

    0:002> s -[1]w @rsp-100 @rbp 0n8012

    This returned no results.  So, we don’t have the PID on the call stack.  Next we’ll search from RCX to 300 Bytes greater than RCX.

    0:002> s -[1]w @rcx L300 0n8012
    0x000000cd`a910e568
    0x000000cd`a910e56c

    In this case we have two results.  We can see how far from RCX the first result is.

    0:002> ?0x000000cd`a910e568-@rcx
    Evaluate expression: 72 = 00000000`00000048

    We get an offset of 72 bytes, which is very close to the start of the object.  We can do a couple things to verify that we can use this offset in the future.  First, we can check to see if we have any symbol information near RCX.

    0:002> dps @rcx
    000000cd`a910e520  000007fb`45ecc2b8 iisw3adm!WORKER_PROCESS::`vftable'
    000000cd`a910e528  000000cd`43525057
    000000cd`a910e530  000000cd`a90f4998
    000000cd`a910e538  000000cd`a90f4998
    000000cd`a910e540  00000000`00000005
    000000cd`a910e548  000000cd`a9390db0
    000000cd`a910e550  00000000`00000006
    000000cd`a910e558  000000cd`a90f4930
    000000cd`a910e560  000000cd`a90f4b40
    000000cd`a910e568  00001f4c`00001f4c
    000000cd`a910e570  00000000`0000031c
    000000cd`a910e578  00000000`00000000
    000000cd`a910e580  000000cd`a7f7ed20
    000000cd`a910e588  00000000`00000001
    000000cd`a910e590  000000cd`00000000
    000000cd`a910e598  00000000`00000000

     

    It seems RCX is a pointer to a WORKER_PROCESS virtual function table, which means it is essentially the address of the WORKER_PROCESS object.  It also means our offset of RCX+72 is probably correct and could be used consistently.  We can be sure with one last check: we let the debugger continue, create and pause another worker process, let the debugger break in at our earlier breakpoint, then validate that RCX+72 points to the new PID.

With the worker process paused, we note our new PID is 4356.  In hex that comes to 0x1104, so that’s what we’re looking for.

    0:002> g
    Breakpoint 0 hit
    iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem:
    000007fb`45efb2d4 48895c2410      mov     qword ptr [rsp+10h],rbx ss:000000cd`a86cfba8=705d41a9cd000000
    0:002> dw @rcx+0n72 l1
    000000cd`a910e568  1104

     

    Clearly, our offset works as it points to the correct new PID.  This means we can use our breakpoint and this offset to create a script to dump the hung worker process.  Below, I’ve listed the steps to script this with DebugDiag, but it could easily be adapted to work with ADPlus or even CDB and WinDbg directly.

(On a side note, review of the disassembly for PingResponseTimerExpiredWorkItem showed that it did not take any explicit parameters.  RCX was not a parameter in the usual sense; as a C++ member function call, it likely carries the implicit this pointer in RCX, which is why it pointed at the WORKER_PROCESS object.  It’s a common enough register that it is always good to check, and it definitely paid off in this instance.)

     

Steps to get a memory dump of the faulting worker process using DebugDiag when a 5010 event is raised.

    The assumption is made that the production machine with the issue does not have internet access as this is a common practice.  If it’s not the case and internet access is available, you can complete Step 1 on the production machine.

     

    Step 1: Get Symbols for WAS Service

    1. Open DebugDiag on a production server
    2. Navigate to the Processes tab
    3. Sort by the Service Column Descending and identify the line containing the process for the WAS service
    4. Right click the WAS service and choose create full memory dump
    5. Zip the dump and copy it to a machine with external internet access
    6. On the new machine use the LoadSymbols script found here: http://blogs.msdn.com/b/puneetgupta/archive/2010/05/14/debugdiag-script-to-load-all-symbols-in-a-dump-file.aspx to collect symbols for the WAS service
    7. On the analysis tab of DebugDiag, highlight the line for LoadSymbols
    8. Click Add Data File
    9. Navigate to the dump from the prod server and click ok
    10. Now click Start Analysis
    11. By default, DebugDiag will save the symbol files to c:\symcache.  Copy the contents of that folder to the same path on the production machine.  (If you’ve changed the symbol settings in DebugDiag previously, you must make sure you’re pointing to the Microsoft symbol server and make note of your local path)

     

    Step 2: Set up DebugDiag on Production machine

    1. On the Rules tab, click Add Rule and select a Crash Rule
    2. Select “A specific NT Service” and click Next
    3. Highlight the WAS service and click Next
    4. Click Breakpoints then Add Breakpoint
    5. In Breakpoint Expression enter: iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
    6. Under Action Type select Custom
    7. Add the following script in the window that pops up:
' Read RCX from the thread that hit the breakpoint; it holds the
' address of the WORKER_PROCESS object
Dim w3wpProcessId
rcx = CausingThread.Register("rcx")
' The PID lives at offset 72 (0x48) into the object, as found above
w3wpProcessId = Debugger.ReadDWORD(rcx + 72)
WriteToLog "w3wp process id = " & w3wpProcessId
' Dump the hung worker process itself (not the WAS process we're attached to)
Debugger.CreateDumpForProcessID w3wpProcessId, "Failed Ping", False

     

8. Set the Action Limit to 5 then click OK
9. Click Save & Close
10. Click Next all the way through.  You can specify the location of the dump file on one of the screens before you click Finish.
11. Check the log for svchost under c:\program files\DebugDiag and make sure it shows that the breakpoint was set.  If it was not set, try restarting the Debug Diagnostic Service.  The log should have something like the following:

     

    [9/30/2012 2:46:53 PM] Initializing control script
    [9/30/2012 2:46:53 PM] Clearing any existing breakpoints
    [9/30/2012 2:46:53 PM]
    [9/30/2012 2:46:53 PM] Attempting to set unmanaged breakpoint at iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem
    [9/30/2012 2:46:53 PM] bpID = 0
[9/30/2012 2:46:53 PM] Current Breakpoint List(BL)
      0 e 000007fb`45efb2d4     0001 (0001)  0:**** iisw3adm!WORKER_PROCESS::PingResponseTimerExpiredWorkItem

     

    You can test the above set up by suspending a worker process with Process Explorer and making sure a memory dump is created.  That’s all there is to it.  Of course, when you have the memory dump, that’s just the beginning as you will have to debug that dump to identify the cause of the hang.

     

     

    -Jarrod

  • PFE - Developer Notes for the Field

    $(SolutionDir) and MSBUILD

    • 2 Comments

A while back I was working with a customer that was moving from doing all of their builds with the devenv command line to using MSBUILD.  However, we ran into a hiccup that did not make sense at first.  Basically, when they tried to build their solution using MSBUILD, some of the custom build steps were failing because $(SolutionDir) was not defined.

After thinking about it a bit, it became clear that MSBUILD does not define the solution-level properties, such as $(SolutionDir), that Visual Studio does.  So where did that leave us?

    There are a couple of basic approaches:

1. Create a Custom Action (this would involve custom code) that walks up the directory structure, finds the SLN file, and then uses that directory to define SolutionDir.
2. Use the command line option /Property:SolutionDir=<sln Dir> to set the value (sketched below).  This way it is controlled in one place.
3. Change all of the pathing to be relative.  In most groups it is relatively infrequent to move projects up and down the directory hierarchy.
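As a concrete sketch of option 2 (the paths and file names here are illustrative, not from the customer’s build), you can pass the property when invoking MSBUILD:

    msbuild.exe MyProject.csproj /Property:SolutionDir=C:\Source\MySolution\

Note the trailing backslash, since Visual Studio defines $(SolutionDir) with one.  A related trick is to give the property a fallback default inside the project file itself, so the command line stays clean:

    <!-- Hypothetical fallback: only takes effect when SolutionDir was not
         supplied, i.e. when the project is built directly by MSBUILD -->
    <PropertyGroup>
      <SolutionDir Condition=" '$(SolutionDir)' == '' ">..\</SolutionDir>
    </PropertyGroup>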

Just thought I would share in case this comes up for someone else.

    Zach

  • PFE - Developer Notes for the Field

    Whodunit: Who threw the message box, and why?

    • 1 Comments

My name is Brad and I’ve been on the PFE team here at Microsoft for many years. Suffice it to say, I’m overdue for contributing to the team blog. I’ve seen lots of interesting (and not-so-interesting) issues with customers all over the world in my time at PFE. What follows is an issue I worked earlier this year. For me, the most interesting part of this issue was not so much finding root cause as it was the process of discovering who was behind root cause.

     

    The Problem

    It all started with a customer who reported their ASP.NET application had an OutOfMemory issue. These kinds of issues are not at all uncommon in the .NET world, and the trick usually comes down to finding what object(s) are rooted so that the .NET Garbage Collector can’t reclaim the memory associated with said object(s).

     

     Getting data from the problem

    They sent me a dump of the problematic application pool, mentioning that they dumped it two hours after they received notice about the OutOfMemoryException (OOM). My initial thought on this was that they had recycled the process and then obtained a dump of the fresh new process. This would obviously be no good since the w3wp instance exhibiting the OOM was gone, and the dump instead represented a new process instance with no memory pressure. Unfortunately, this wouldn’t be the first time I’d had this problem.

    However, when I received the dump, I was pleasantly surprised to see that they had dumped the correct process instance. It was over 1GB in size, representing a 32-bit process with significant memory pressure. And when I looked at the length of time the process had been alive, that verified this was the same instance that threw the OOM.


    Windows Server 2003 Version 3790 (Service Pack 2) MP (4 procs) Free x86 compatible
    Product: Server, suite: TerminalServer SingleUserTS
    Machine Name:
    Debug session time: Thu Feb 25 15:07:20.000 2010 (UTC - 5:00)
    System Uptime: 194 days 18:11:06.875 Process Uptime: 0 days 15:03:28.000

     

    Analyzing the data, Part 1

    After searching through the dump I found the problematic object that led to the OOM. And after talking with the right folks at the customer, we pieced back together how the problem arose in addition to a resolution to this problem.

    So one question was answered, but another one remained: How did this process manage to stay alive for a full two hours after getting the OOM? I’ve been debugging issues like this since .NET 1.0 was released, and I can tell you that this wasn’t a normal set of circumstances.

    The thread that threw the OOM turned out to be the same thread that answered the question.

    # ChildEBP RetAddr
    00 0c3aedb8 7c82775b ntdll!KiFastSystemCallRet
    01 0c3aedbc 773d7a4b ntdll!NtRaiseHardError+0xc
    02 0c3aee18 773b8377 user32!ServiceMessageBox+0x145
    03 0c3aef74 7739eec9 user32!MessageBoxWorker+0x13e
    04 0c3aefcc 773d7d0d user32!MessageBoxTimeoutW+0x7a
    05 0c3af000 773c42c8 user32!MessageBoxTimeoutA+0x9c
    06 0c3af020 773c42a4 user32!MessageBoxExA+0x1b
    07 0c3af03c 7c34c224 user32!MessageBoxA+0x45
    08 0c3af070 7c348e6c msvcr71!__crtMessageBoxA+0xf4
    09 0c3af294 7c34cf83 msvcr71!_NMSG_WRITE+0x12e
    0a 0c3af2cc 7a09bea7 msvcr71!abort+0x7
    0b 0c3af2d8 77e761b7 mscorwks!InternalUnhandledExceptionFilter+0x16
    0c 0c3af530 77e792a3 kernel32!UnhandledExceptionFilter+0x12a
    0d 0c3af538 77e61ac1 kernel32!BaseThreadStart+0x4a
    0e 0c3af560 7c828752 kernel32!_except_handler3+0x61
    0f 0c3af584 7c828723 ntdll!ExecuteHandler2+0x26
    10 0c3af62c 7c82863c ntdll!ExecuteHandler+0x24
    11 0c3af90c 77e4bee7 ntdll!RtlRaiseException+0x3d
    12 0c3af96c 78158e89 kernel32!RaiseException+0x53
    13 0c3af9a4 7a14fa70 msvcr80!_CxxThrowException+0x46
    14 0c3af9b8 7a108013 mscorwks!ThrowOutOfMemory+0x24
    15 0c3afae4 7a109f7d mscorwks!Thread::RaiseCrossContextException+0x408
    16 0c3afb98 79fd878b mscorwks!Thread::DoADCallBack+0x2a2
    17 0c3afbb4 79e983fb mscorwks!ManagedThreadBase_DispatchInner+0x35
    18 0c3afc48 79e98321 mscorwks!ManagedThreadBase_DispatchMiddle+0xb1
    19 0c3afc84 79e984ad mscorwks!ManagedThreadBase_DispatchOuter+0x6d

    <clipped for brevity>

    As seen in the stack above, after the OOM was thrown in frame 0x14, a message box was thrown in frame 0x8. A message box will keep the process alive until someone clicks OK, Cancel, etc. and the message box goes away. In short, message boxes in server-side processes are never a good thing since they will hang your application!

    Once again, we have answered one question, but another question remains: the process stayed alive for two hours after it threw the OOM because a message box was thrown, but who threw the message box, and why?

     

    Analyzing the data, Part 2

    As is common when finding a message box that’s popped up in a process, I wanted to see what the message box said. From MSDN, we learn that the second parameter to user32!MessageBoxExA is the text of the message, and the third parameter is the text of the caption in the message box. Using the kb command, we can retrieve these parameters, and then dump them to find the values:

    0:023> kbn7
    # ChildEBP RetAddr Args to Child
    00 0c3aedb8 7c82775b 773d7a4b 50000018 00000004 ntdll!KiFastSystemCallRet
    01 0c3aedbc 773d7a4b 50000018 00000004 00000003 ntdll!NtRaiseHardError+0xc
    02 0c3aee18 773b8377 0dbec7a8 0bcda228 00012010 user32!ServiceMessageBox+0x145
    03 0c3aef74 7739eec9 0c3aef80 00000028 00000000 user32!MessageBoxWorker+0x13e
    04 0c3aefcc 773d7d0d 00000000 0dbec7a8 0bcda228 user32!MessageBoxTimeoutW+0x7a
    05 0c3af000 773c42c8 00000000 0c3af0a4 7c37f480 user32!MessageBoxTimeoutA+0x9c
    06 0c3af020 773c42a4 00000000 0c3af0a4 7c37f480 user32!MessageBoxExA+0x1b

    0:023> da 7c37f480
    7c37f480 "Microsoft Visual C++ Runtime Lib"
    7c37f4a0 "rary"

    0:023> da 0c3af0a4
    0c3af0a4 "Runtime Error!..Program: c:\wind"
    0c3af0c4 "ows\system32\inetsrv\w3wp.exe..."
    0c3af0e4 ".This application has requested "
    0c3af104 "the Runtime to terminate it in a"
    0c3af124 "n unusual way..Please contact th"
    0c3af144 "e application's support team for"
    0c3af164 " more information..."

    I’ve seen this message before, and it can have different underlying causes. The bottom line – at this point in the troubleshooting phase – is that it isn’t a custom message box thrown carelessly by application code. The trick now is to determine how the message box was thrown.

    I made a few unusual observations about the call stack in question.

    - First, there are two versions of the C++ runtime on the stack (7.1 and 8.0). This isn’t common.

- Secondly, the sequence of events as told by the stack seems very unusual. It appears the .NET Framework throws an OOM (frame 0x14). Then eventually when the underlying OS handles the exception in frame 0xe, it somehow goes back to BaseThreadStart in frame 0xd where the default unhandled exception filter (UEF) is called in frame 0xc. From there we wind up back in the .NET Framework’s UEF in frame 0xb, which appears to call the 7.1 CRT’s abort() function in frame 0xa. Finally, a message box results.
    What a wild ride!!

    - Finally… wait, the .NET Framework appears to throw a message box (if you observe frames 0xb through 0x8 of the stack)??? This is something I never would have predicted!

    Fortunately, I’d been fooled by such an appearance before, so I knew to take a look at the raw stack first, before I went reading through the .NET source to see if mscorwks!InternalUnhandledExceptionFilter really does call for a message box to be thrown. When I looked at the raw stack around the frames 0xb through 0x8, this is what I found in between all the frames displayed from a kb command:

    0:023> dds 0c3af070 L90
    0c3af070 0c3af208
    0c3af074 7c348e6c msvcr71!_NMSG_WRITE+0x12e
    0c3af078 0c3af0a4
    0c3af07c 7c37f480 msvcr71!`string'
    <clip>
    0c3af26c 77e66ebd kernel32!VirtualQueryEx+0x1d
    0c3af270 000d0f48
    0c3af274 0000003a
    0c3af278 0008b788
    0c3af27c 02403a0a
    0c3af280 0000003a
    0c3af284 013af29c*** WARNING: Unable to verify checksum for ThirdParty.dll
    *** ERROR: Symbol file could not be found. Defaulted to export symbols for ThirdParty.dll -
    ThirdParty!FooClass::m_bVar+0x9cc8
    0c3af288 0c3af2c4
    0c3af28c 7c349600 msvcr71!_beginthreadex+0x7
    0c3af290 90a99269
    0c3af294 0c3af2c4
    0c3af298 7c34cf83 msvcr71!abort+0x7 <—frame 0xa of the kn callstack above
    0c3af29c 0000000a
    0c3af2a0 00000000
    0c3af2a4 0c3af558
    0c3af2a8 7a3b2000 mscorwks!COMUnhandledExceptionFilter
    0c3af2ac 0c3af2a0
    0c3af2b0 00000103
    0c3af2b4 0c3af520
    0c3af2b8 7c34240d msvcr71!_except_handler3
    0c3af2bc 7c380ea0 msvcr71!type_info::`vftable'+0x94
    0c3af2c0 ffffffff
    0c3af2c4 0c3af2d8
    0c3af2c8 7c35f0a8 msvcr71!__CxxUnhandledExceptionFilter+0x2b
    0c3af2cc 77ecb7c0 kernel32!gcsUEFInfo
    0c3af2d0 7a09bea7 mscorwks!InternalUnhandledExceptionFilter+0x16
    <—frame 0xb of the kn callstack above

    Aha! So there is, in fact, someone else in between the .NET Framework and the call to throw the message box.

Now, let’s say you still want some kind of proof that the .NET Framework doesn’t make the call to msvcr71!abort(), which results in a call to show the message box. I had my doubts that mscorwks.dll had the 7.1 CRT as a dependency. On a tip from fellow PFE Zach Kramer, you can prove this by running dumpbin /imports on the binary (dumpbin is an old but very handy utility that still ships with Visual Studio). I could always obtain the binary by asking my customer for their mscorwks.dll, but it’s much easier to just use psscor2!savemodule.

0:023> lmvm mscorwks
    start end module name
    79e70000 7a400000 mscorwks
    Loaded symbol image file: mscorwks.dll

    0:023> !savemodule 79e70000 F:\modules\mscorwks.dll
    5 sections in file at 79e70000
    Successfully saved file: F:\Modules\mscorwks.dll

    Then from a command window:

    f:\Modules>dumpbin /imports mscorwks.dll >Imports.txt

     

Searching Imports.txt for the string ‘msvcr7’ comes up empty. In fact, the build of mscorwks used by my customer’s application depends on msvcr80.dll. This makes sense when you look at frames 0x14 and 0x13 in our call stack.

What if we employ the same technique on this ThirdParty.dll – will it show the assembly depends on the 7.1 CRT?  Scanning the output of dumpbin /imports on ThirdParty.dll shows the following:

    Microsoft (R) COFF/PE Dumper Version 9.00.30729.01
    Copyright (C) Microsoft Corporation. All rights reserved.

    Dump of file ThirdParty.dll
    File Type: DLL

      Section contains the following imports:
       MSVCR71.dll
            1000E154 Import Address Table
            10010554 Import Name Table 
               0 time date stamp
               0 Index of first forwarder reference

    <clip>

    So we know this ThirdParty.dll has the 7.1 CRT as a dependency, but that alone doesn’t prove this component throws the message box. And according to my customer, the vendor of ThirdParty.dll had been approached many times in the past regarding these “phantom message boxes” hanging production applications. But the vendor had denied any involvement without explicit proof. So the presence of their dll on the stack next to a call to msvcr71!abort() might not suffice when I confronted them with this issue. I felt I still had some work ahead of me.

     

    Deeper dive

    Before continuing, let’s do a quick review of what we know and don’t know:

1. The process hung because a message box was thrown. Ironically, this kept the process alive so that a dump could be taken, and from this dump we learned the root cause of the OOM.

    2. The message box text and caption indicate it was thrown as a result of some systematic process – not due to some “rogue code” in the customer’s application.

    3. Contrary to the appearance of the kb command, the message box was not thrown by the .NET Framework.

4. Based on the placement of ThirdParty.dll in the raw stack, the fact that ThirdParty.dll has the 7.1 CRT as a dependency, and the fact that the stack shows the message box was thrown directly by the CRT, we’re likely to make the most progress by trying to rule out ThirdParty.dll as the culprit (or alternatively, prove that it is the culprit).

    5. From the call stack, it appears the .NET Framework had already instructed the OS to handle the exception. Why does it appear that the .NET Framework got a second go-around at handling the exception, and how do we connect the dots from the .NET Framework, to this ThirdParty.dll, to the 7.1 CRT’s call to abort()?

    6. Why does msvcr71!abort() throw a message box – was it explicitly instructed to do this by someone?

    First, let’s tackle the question about connecting the dots seen in the call stack frames. There’s a pretty clear and concise explanation for this in an MSDN Magazine article from 2008.

    Unhandled Exception Processing In The CLR

    When an exception goes unhandled and the OS invokes the topmost [Unhandled Exception Filter], it will end up invoking the CLR's UEF callback. When this happens, the CLR will behave like a good citizen and will first chain back to the UEF callback that was registered prior to it. Again, if the original UEF callback returns indicating that it has handled the exception, then the CLR won't trigger its unhandled exception processing

    In other words, ThirdParty.dll had registered its UEF before the .NET Framework. And its UEF took the default road of calling abort()  and throwing a message box. After ThirdParty.dll registered its UEF, the .NET Framework then registered its UEF callback. But it didn’t want to be rude and step over the UEF that ThirdParty.dll had registered first, so it chains back to it. Therefore, the result of msvcr71!abort() being called is due to the UEF registered by ThirdParty.dll.

    Next, let’s tackle the question about msvcr71!abort() throwing a message box. Since this ThirdParty.dll was using the 7.1 build of the CRT, let’s look in MSDN for the information on msvcr71!abort():

    abort determines the destination of the message based on the type of application that called the routine. Console applications always receive the message through stderr. In a single or multithreaded Windows application, abort calls the Windows MessageBox API to create a message box to display the message with an OK button. When the user clicks OK, the program aborts immediately.

    To influence the behavior of abort(), simply call _set_error_mode in the DllMain so that it doesn’t exercise the default behavior of throwing a message box. MSDN’s documentation on _set_error_mode states you can use _OUT_TO_STDERR for the lone parameter, and this will avoid the message box when abort() is called.

    A colleague of mine, Senior Escalation Engineer Bret Bentzinger, offered to write some sample code that would load ThirdParty.dll and test this proposed resolution of passing _OUT_TO_STDERR to _set_error_mode. Doing this confirmed that no message box was thrown, and the thread exits without hanging the process.

     

    In the end, this problem of throwing a message box from the CRT’s call to abort() isn’t a new one. This issue has been around for ages. But it was my first opportunity to drive such an issue and see it through to a resolution. My customer got a fix from the vendor, and we were able to help the vendor write better, more stable code. It was a win-win situation!

  • PFE - Developer Notes for the Field

    Pivoting ASP.NET event log error messages

Unless you’ve been hiding under the proverbial rock, you’ve probably seen the recent Pivot hoopla.  If you’re not familiar with it, it’s a way to visualize a large amount of data in a nice filterable format.  The nice thing about it is that...
  • PFE - Developer Notes for the Field

    Using DebugDiag's LeakTrack with ProcDump's Reflected Process Dumps

    • 0 Comments

    DebugDiag and Procdump are two of my favorite tools.  They're both incredibly useful and I rely on them heavily to gather MemoryDumps when debugging production issues.  They have a myriad of overlapping features, but they each have (at least) one feature that makes them stand apart from each other:

    • DebugDiag has the LeakTrack feature.  This lets users inject a module that tracks callstacks for allocations and stores them to heuristically identify the cause of native memory leaks in an application.  I use the feature whenever I have a native memory leak and find it indispensable.
    • ProcDump has the ability to reflect a process before creating a memory dump.  This substantially reduces the amount of time the process is interrupted when a memory dump is taken.  When dealing with production issues, this is an extremely efficacious feature.

    The catch is, DebugDiag currently can't create reflected process dumps, and ProcDump doesn't have a way to inject the LeakTrack dll to track allocations.  We can get around this by working with both tools.  Assuming both tools are installed on the server with a leaky application, the process is as follows:

    1. Identify the application that has a native memory leak: http://blogs.microsoft.co.il/blogs/sasha/archive/2008/07/13/is-it-a-managed-or-a-native-memory-leak.aspx
    2. Use DebugDiag to inject the LeakTrack dll and begin tracking allocations
    3. Set up ProcDump to create a dump of the reflected process when a memory threshold is breached

Step 1 is easy and documented elsewhere.  Step 2 is easy if we have direct GUI access to the machine.  We can simply go to the Processes tab in DebugDiag, right click the process, and choose "Monitor for Leaks."  For the sake of argument, let's say we want to script the whole process.  We can do that by scripting DebugDiag and ProcDump to do the individual tasks we've set out for them.  Once we have the PID of the troubled process, we can use a script to inject the LeakTrack dll into the process.  The following vbs script will, when used by DebugDiag, start the native leak tracking process:

' Runs when DbgHost attaches to the target process
Sub Debugger_OnInitialBreakpoint()
   Debugger.Write "Injecting LeakTrack" & vbLf
   ' Inject the LeakTrack dll to start recording allocation callstacks
   Debugger.InjectLeakTrack
   Debugger.Write "Detaching" & vbLf
   ' Detach so the process keeps running with LeakTrack loaded
   Debugger.Detach
End Sub

The above code can be saved into a vbs file.  For this post we'll assume it was saved to c:\scripts\InjectLeakTrack.vbs.  We'll also assume our PID is 1234.

With the PID known and the script created, we can launch DebugDiag from a command line as such (if you installed DebugDiag elsewhere, you'll have to adjust the install location):

    C:\PROGRA~1\DEBUGD~1\DbgHost.exe -script "c:\scripts\InjectLeakTrack.vbs" -attach 1234

This will cause DbgHost, the debugger process DebugDiag uses, to attach to the process with PID 1234 and run the script we provided.  The script simply injects the LeakTrack dll and detaches the debugger.  Now that LeakTrack is injected, we can launch ProcDump and tell it to reflect the process and create a memory dump when we've crossed a memory threshold.  Let's say we know our application has leaked if it ever crosses 600MB.  We can kick off ProcDump as follows (options come before the PID and dump path):

    procdump -accepteula -r -ma -m 600 1234 C:\Dumps\Over600.dmp

    ProcDump will then monitor the process and create a reflected process dump when the committed bytes of the process breach 600 MB.  This memory dump can be analyzed by DebugDiag on the Advanced Analysis tab, and the offending callstacks can be identified.

     

    -Jarrod

     

     

  • PFE - Developer Notes for the Field

    Debugging Large ViewState

    • 1 Comments

    This week I have been working with a customer that had pretty large ViewStates that were getting pushed up and down between the client and the server.  The application was moving about 200+ KB of ViewState between the client and server.  This is something we picked out quickly as a scale issue.

    Now Tess has a great blog on why this is a problem from a memory perspective and provides some background on the issue:

    ASP.NET Memory – Identifying pages with high ViewState
    http://blogs.msdn.com/tess/archive/2008/09/09/asp-net-memory-identifying-pages-with-high-viewstate.aspx

In this customer’s case we were seeing multi-second times for downloading a page from the server and sending the response back, so this was also having an impact on performance even before it became a memory problem.  They would also run into the memory issues once load ramped up.

    The customer posed the question - “Well how do I know what is in my ViewState and what is causing the bloat?”

This is a fair question and we referred to Tess’s blog but hit some snags that I wanted to comment on in case other people hit these. (read her blog first for background if you have questions)

    Where is the ViewState? 

To get started we downloaded ViewState Decoder (2.2) but we could not point it directly at the page due to permissions issues.  Therefore we went to pull the ViewState from the “View Source” of the page when viewing in IE.  A problem arose: as we navigated the page we would see the page sizes increase in the IIS logs (SC-BYTES value), but the page remained the same size when we did “View Source”.  Basically, if we saved the “View Source”, the page was identical even though a number of postbacks had occurred.  When I opened the IIS log I saw each request, and the log reported that SC-BYTES (Bytes Sent) was increasing (almost double the initial page request), so something was not adding up.

In this application there is one page with multiple tabs, and as we move through the tabs the page grows, but as we said that was not reflected in the “View Source.”  It turns out, after talking with the customer, that they are using the UpdatePanel from ASP.NET AJAX and each of the postbacks is an update to that panel.  These updates are not reflected in the “View Source.”  To capture this easily we returned to an old favorite – Fiddler from http://www.fiddlertool.com/fiddler/

    From there we were able to get the ViewState out.  Because this was being sent back to AJAX the ViewState showed up like this:

    |hiddenField|__VIEWSTATE|/wE
    <CLIP>
    EA==|9168|hiddenField|__VIEWSTATEENCRYPTED  

You will notice that these are not normal tags.  Basically the ViewState is contained between the “|” delimiters.  We grabbed this bit and dropped it into the ViewState Decoder but got an error that the ViewState was invalid:

    There was an error decoding the ViewState string: The serialized data is invalid

    Why is the ViewState string invalid if it worked?

    We puzzled over this for a bit and then I noticed what you may already have noticed in the list of hidden fields:

    __VIEWSTATEENCRYPTED

It turns out that this indicates that the ViewState is encrypted.  So it was understandable that the tool could not easily decode this via my cut and paste of the ViewState.  We decided to turn off ViewState encryption for this troubleshooting since the application was just in test.  Basically we added ViewStateEncryptionMode=”None” to the Page directive for this test.  Then BINGO!  We were able to decode the ViewState and get some rich information.
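For reference, the temporary change amounts to a single attribute on the page directive (the rest of the directive shown here is illustrative, not from the customer’s page):

    <%@ Page Language="C#" CodeFile="Default.aspx.cs" Inherits="_Default" ViewStateEncryptionMode="None" %>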

    Here are some links on encryption:

    @ Page Directive
    http://msdn.microsoft.com/en-us/library/ydy4x04a.aspx

    ViewStateEncryptionMode
    http://msdn.microsoft.com/en-us/library/system.web.ui.viewstateencryptionmode.aspx

For ViewStateEncryptionMode the default is Auto, which means that if any of your controls requests encryption using the following API, the ViewState will get encrypted:

RegisterRequiresViewStateEncryption
    http://msdn.microsoft.com/en-us/library/system.web.ui.page.registerrequiresviewstateencryption.aspx

    All That Data, There Must be an Easier Way!

Now as we scrolled through all of the data in the output we could see what is stored in the ViewState, but tying it back to a particular control is not simple.  Also, since there was so much stored in this ViewState it was impossible to easily know which controls were the worst offenders.  So we moved to one last place – built-in ASP.NET tracing.  We enabled the tracing because there is a VERY helpful column that gets produced - ViewState Size bytes.  This is produced for each control that is rendered and makes short work of identifying the worst offenders.  You will get output like this:

    Control ID    Type                                       Render Size bytes    ViewState Size bytes    ControlState Size bytes
    Control1      System.Web.UI.WebControls.DropDownList    0                    1084                    0
    Control2      System.Web.UI.WebControls.DropDownList    0                    1084                    0
    Control3      System.Web.UI.WebControls.DropDownList    0                    1084                    0

So there we had it!  Control names and their ViewState size.  We were able to quickly see which controls were the worst offenders and start removing the unneeded stuff and fixing the application.
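If you want to try this yourself, tracing can be turned on per page with Trace=”true” in the page directive, or for the whole site in web.config; a minimal sketch of the latter (tune the attributes to your needs):

    <configuration>
      <system.web>
        <!-- pageOutput="true" appends the trace tables, including the
             "ViewState Size bytes" column, to the bottom of each page -->
        <trace enabled="true" pageOutput="true" localOnly="true" />
      </system.web>
    </configuration>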

    Reading ASP.NET Trace Information
    http://msdn.microsoft.com/en-us/library/kthye016(VS.80).aspx

I hope this adds some additional information to this discussion of troubleshooting ViewState.

    Have a great day!

    Zach

  • PFE - Developer Notes for the Field

    Passing a Managed Function Pointer to Unmanaged Code

    • 0 Comments

I was working with a customer a while back and they had a situation where they wanted to be able to register a managed callback function with a native API.  This is not a big problem, and the sample that I extended did this already.  The question that arose was – how do I pass a managed object to the native API so that it can pass that object to the callback function?  Then I need to be able to determine the type of the managed object, because it is possible that any number of object types could get passed to this callback function.  I took the code sample that is provided on MSDN and extended it a bit to handle this.  Here is the original sample - http://msdn.microsoft.com/en-us/library/367eeye0.aspx

The first comment here is that this is some DANGEROUS territory.  It is really easy to corrupt things and get messed up.  For those that are used to managed code, this is very much C++ territory where you have all the power, and with that power comes all of the rope that you need to hang yourself.

As I mentioned above, the customer wanted to be able to pass different types of objects in the lpUserData parameter so the callback can handle different types of data.  The way to do this is with a GCHandle.  Basically this gives you a native handle to the object.  You pass that around on the native side, and when it arrives on the managed side you can “trade it in” for the managed object.

    So let’s say you have some sort of managed class.  In this case I called it “callbackData” and you want to pass it off. 

    callbackData^ cd = gcnew callbackData();

    First you need to get a “pointer” to the object that can be passed off:

// Allocate a handle that keeps the managed object reachable; keep gch2
// around so you can call gch2.Free() later, or the handle (and the object
// it roots) will leak
GCHandle gch2 = GCHandle::Alloc(cd);
IntPtr ip2 = GCHandle::ToIntPtr(gch2);
int obj = ip2.ToInt32();  // truncating to int is only safe in a 32-bit process

You need to keep the GCHandle around so you can free it up later; it is very possible to leak GCHandles.  In addition, GCHandles are roots for CLR objects, so you are rooting any memory referenced by the object you have the GCHandle for, and the GC will not release any of that memory.  This, however, is not as bad as pinning the memory.  The GCHandle is a reference to the object that can be used later to retrieve the address of the actual object when needed.

From the GCHandle you can convert to an IntPtr, and from the IntPtr you can get an int that you can pass around in the lpUserData parameter.  You could also just pass the IntPtr directly, but I wanted to demonstrate taking it all the way down to the int.

    Then you can pass that value as part of an LPARAM or whatever you want to the native code. 

    int answer = TakesCallback(cb, 243, 257, obj);

    That will get passed to your callback that uses your delegate to call into your managed function.  The native code might do something simple like this:

    typedef int (__stdcall *ANSWERCB)(int, int, int);
    static ANSWERCB cb;
     
    int TakesCallback(ANSWERCB fp, int n, int m, int ptr) {
       cb = fp;
       if (cb) {
          printf_s("[unmanaged] got callback address (%d), calling it...\n", cb);
          return cb(n, m, ptr);
       }
       printf_s("[unmanaged] unregistering callback");
       return 0;
    }

Inside your callback function you simply get the object back:

    public delegate int GetTheAnswerDelegate(int, int, int);
     
    int GetNumber(int n, int m, int ptr) {
       Console::WriteLine("[managed] callback!");
       static int x = 0;
       ++x;
     
   IntPtr ip2(ptr);
   // Trade the native handle value back in for the managed object
   GCHandle val = GCHandle::FromIntPtr(ip2);
   System::Object ^obj = val.Target;
       Console::WriteLine("Type - " + obj->GetType()->ToString() );
     
       return n + m + x;
    }

Once you have the object you can cast it or do whatever you need to move forward.  This approach allows you to bridge the managed and native worlds when you have a native function that makes a callback.  I will attach a complete sample for you to play with if you are interested, and let me know if there are any questions.  There are other ways to approach this, I am sure, but this seemed to work pretty well, and as long as you manage your GCHandles and callback references you should be good to go!

    Thanks,

    Zach

    REFERENCES

  • PFE - Developer Notes for the Field

    .NET Framework 2.0 Service Pack 2

    • 1 Comments

I have worked with several customers that want .NET Framework 2.0 SP2 but do not want to require their customers to install .NET Framework 3.5 w/ SP1, or do not want to do so themselves.

Well, due to the way it was packaged, up until this point that was the only way to get it.  However, as of just the other week they finally released a standalone installer for .NET Framework 2.0 Service Pack 2:

    http://www.microsoft.com/downloads/details.aspx?FamilyID=5b2c0358-915b-4eb5-9b1d-10e506da9d0f&displaylang=en

    This is great news and will make the deployment a bit easier for people on Windows XP and Windows Server 2003.  For Vista and Windows Server 2008 you still have to get .NET Framework 3.5 SP1 installed to get .NET Framework 2.0 SP2.

    Finally, make sure that you also install the following update - http://support.microsoft.com/kb/959209.  This update fixes some of the known issues that have been found with the service packs since their release.

    Have a great day and happy installing!

    Zach

  • PFE - Developer Notes for the Field

    Best Practice – WCF and Exceptions

    • 1 Comments

    Best Practice Recommendation

In your WCF service, never let an exception propagate outside the service boundary without managing it.  There are two alternatives:

• Either you manage the exception inside the service boundary and never propagate it outside
• Or you convert the .NET typed exception into a FaultException in your exception manager before propagating it

    Details

Most developers know how to handle exceptions in .NET code, and a good strategy can consist of letting an exception bubble up if it cannot be handled correctly; ultimately the exception will be managed by one of the .NET Framework default handlers, which can, depending on the situation, close the application.

WCF manages things slightly differently.  If the developer lets an exception bubble up to the WCF default handler, it will convert this .NET typed exception (which has no representation in the SOA world) into a FaultException (which is represented in the SOA world as a SOAP fault, http://www.w3.org/TR/soap12-part1/) and then propagate this exception to the client.  Up to this point you think everything is OK, and you are right: it’s legitimate that the client application receives the exception and manages it.  However, there are two problems here:

• One is minor: by default you don’t control the conversion, and you may want to send a custom FaultException to the client application, something not too generic.
• The WCF default handler will also fault the channel, and this is far more problematic, because if the client reuses the proxy (which is a quite common approach after managing the exception), it will fail.  It’s quite easy in .NET to work around this issue just by testing whether the proxy is faulted in the exception manager, but all languages may not offer these facilities, and I remind you that by essence SOA is very open.

For those of you who want more information, this blog is very good: http://weblogs.asp.net/pglavich/archive/2008/10/16/wcf-ierrorhandler-and-propagating-faults.aspx.  From a general point of view, the implementation of IErrorHandler is a popular approach to handle this issue, so from your favorite search engine (www.live.com) just search for IErrorHandler.
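To make the second alternative concrete, here is a minimal sketch (the contract, service, and fault type names are illustrative, not from this post):

    using System;
    using System.Runtime.Serialization;
    using System.ServiceModel;

    [DataContract]
    public class OrderFault
    {
        [DataMember]
        public string Message { get; set; }
    }

    [ServiceContract]
    public interface IOrderService
    {
        [OperationContract]
        [FaultContract(typeof(OrderFault))]   // advertise the fault in the WSDL
        void PlaceOrder(int orderId);
    }

    public class OrderService : IOrderService
    {
        public void PlaceOrder(int orderId)
        {
            try
            {
                // ... business logic that may throw ...
            }
            catch (InvalidOperationException ex)
            {
                // Convert the CLR exception into a typed SOAP fault.  A declared
                // fault does not fault the channel, so the client can keep using
                // its proxy after handling the error.
                throw new FaultException<OrderFault>(
                    new OrderFault { Message = ex.Message },
                    "Order processing failed.");
            }
        }
    }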

    Contributed by Fabrice Aubert

  • PFE - Developer Notes for the Field

    Debugging Internet Explorer Security Warnings

    • 2 Comments

    My name is Norman and I’ve been working with customers the past few years debugging a variety of problems, but maintaining a focus on Internet Explorer. This was an interesting issue I ran into with a customer the other day.

    ===================

I know no one has ever been in a situation where you typed credit card information into a “secure” SSL web site and hit the Submit button, only to see the security warning below.

[Screenshot: Internet Explorer security warning dialog]

Personally speaking I’d cancel my order and never shop there again. This, though, is a pretty common scenario as of Internet Explorer 6 and is meant to help you: the page is mixing secure (https) and nonsecure (http) content. It is not telling you there is an error. Think of it as a warning sign of things to come.

    If you are lucky enough to have to debug this, it is pretty easy to do assuming you have some debug tools handy. Just grab a dump file or attach to the process (iexplore.exe) using your favorite debug tools.

    Steps

    Please note the register and hex values will vary.

    1. Switch to thread 0

This wasn’t a wild guess, although you could dump out all threads and look for a LaunchDlg function; rather, dialogs are shown on the main UI thread, which is thread 0 in a Windows application.

    0:013> ~0s
    eax=00600650 ebx=00000000 ecx=00422dc0 edx=7c90eb01 esi=0062d298 edi=00000001
    eip=7c90eb94 esp=001399f4 ebp=00139a28 iopl=0 nv up ei pl zr na pe nc
    cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000246
    ntdll!KiFastSystemCallRet:
    7c90eb94 c3 ret

    2. Show the full callstack

    0:000> kb100
    ChildEBP RetAddr Args to Child
    001399f0 7e419418 7e42dba8 00220324 00000001 ntdll!KiFastSystemCallRet
    00139a28 7e42593f 001503fa 00220324 00000010 USER32!NtUserWaitMessage+0xc
    00139a50 7e425981 771b0000 772431c0 001103d6 USER32!InternalDialogBox+0xd0
    00139a70 7e42559e 771b0000 772431c0 001103d6 USER32!DialogBoxIndirectParamAorW+0x37
    00139a94 77fa9eb1 771b0000 0000049b 001103d6 USER32!DialogBoxParamW+0x3f
    00139ab4 7722f51a 771b0000 0000049b 001103d6 SHLWAPI!DialogBoxParamWrapW+0x36
    0013a52c 7722dce4 001103d6 0013a55c 0000049b WININET!LaunchDlg+0x6c1
    0013a578 7dd35417 001103d6 00000000 7722f673 WININET!InternetErrorDlg+0x34f
    0013a5c0 7deaea5f 00000000 0013a678 00000001 mshtml!CMarkup::ValidateSecureUrl+0xf3
    0013e780 7de92e8b 7dcd0709 02d589e0 10a9ffa1 mshtml!CObjectElement::CreateObject+0x48d
    0013e784 7dcd0709 02d589e0 10a9ffa1 00000000 mshtml!CHtmObjectParseCtx::Execute+0x8
    0013e7d0 7dc9cf87 02d58ac0 02d589e0 7dcc4bad mshtml!CHtmParse::Execute+0x41
    0013e7dc 7dcc4bad 7dcc4bcb 10a9ffa1 02d589e0 mshtml!CHtmPost::Broadcast+0xd
    0013e898 7dcb4c7b 10a9ffa1 02d589e0 02d402d0 mshtml!CHtmPost::Exec+0x32f
    0013e8b0 7dcb4c20 10a9ffa1 02d402d0 02d589e0 mshtml!CHtmPost::Run+0x12
    0013e8c0 7dcb505f 02d402d0 10a9ffa1 02d589e0 mshtml!PostManExecute+0x51
    0013e8d8 7dcb4fe2 02d589e0 00000001 7dcb4038 mshtml!PostManResume+0x71
    0013e8e4 7dcb4038 02d58a50 02d589e0 0013e928 mshtml!CHtmPost::OnDwnChanCallback+0xc
    0013e8f4 7dc9cb7d 02d58a50 00000000 00000000 mshtml!CDwnChan::OnMethodCall+0x19
    0013e928 7dc98977 0013eac4 7dc98892 00000000 mshtml!GlobalWndOnMethodCall+0x66
    0013ea5c 7e418734 001103d8 00008002 00000000 mshtml!GlobalWndProc+0x1e2
    0013ea88 7e418816 7dc98892 001103d8 00008002 USER32!InternalCallWinProc+0x28
    0013eaf0 7e4189cd 00000000 7dc98892 001103d8 USER32!UserCallWinProcCheckWow+0x150
    0013eb50 7e418a10 0013eb90 00000000 0013eb78 USER32!DispatchMessageWorker+0x306
    0013eb60 75f9d795 0013eb90 00000000 00162f90 USER32!DispatchMessageW+0xf
    0013eb78 75fa51da 0013eb90 0013ee98 00000000 BROWSEUI!TimedDispatchMessage+0x33
    0013edd8 75fa534d 00162d68 0013ee98 00162d68 BROWSEUI!BrowserThreadProc+0x336
    0013ee6c 75fa5615 00162d68 00162d68 00000000 BROWSEUI!BrowserProtectedThreadProc+0x50
    0013fef0 7e318d1e 00162d68 00000000 00000000 BROWSEUI!SHOpenFolderWindow+0x22c
    0013ff10 00402372 001523ba 000207b4 00094310 SHDOCVW!IEWinMain+0x129
    0013ff60 00402444 00400000 00000000 001523ba iexplore!WinMainT+0x2de
    0013ffc0 7c816fd7 00094310 0079f0e0 7ffd7000 iexplore!_ModuleEntry+0x99
    0013fff0 00000000 00402451 00000000 78746341 kernel32!BaseProcessStart+0x23

    3. Dump out the second parameter to mshtml!CMarkup::ValidateSecureUrl

    0:000> du 0013a678
    0013a678 "http://download.macromedia.com/p"
    0013a6b8 "ub/shockwave/cabs/flash/swflash."
    0013a6f8 "cab#version=6,0,79,0"

    For more information on what ValidateSecureUrl does, please click here.

    4. Update the code

So in my customer’s case it wasn’t their custom code. It was really just a reference to a Macromedia object. The easy fix for my customer was to change the “http” to “https”, and in the eyes of their Internet customers the site was secure again.

  • PFE - Developer Notes for the Field

    A Few TFS 2008 Pre & Post Installation things you need to know

    • 1 Comments

    The PFE Dev team would like to welcome Micheal Learned to the blog.  Here is a bit about Mike:

My name is Micheal Learned, and I’ve been working for Microsoft for over a year now with some of our Premier customers across the US, helping them support a variety of .NET related systems and tools.  I’m doing a lot of work currently with VSTS/TFS 2008, helping customers get up and running with new installations, migrations, and general knowledge sharing around this product.  I also enjoy working with web applications, IIS, and just researching various new Microsoft related technologies in general.  In my free time I enjoy sports (I watch now, I used to like to play), and spending time with my son Nicholas (7) and daughter Katelyn (3), generally just letting them wear me down playing various kids’ stuff.

Enough about me, here is the good stuff:

    I’ve been spending a lot of time recently working with customers on TFS and VSTS 2008. Some of my customers have required some support around getting a clean install, or troubleshooting some post-install issues, so I thought I would post a few of the common issues I’ve been seeing.

    First of all, for any pre-install or post-install issues, I highly recommend you run the TFS Best Practices Analyzer, which is available as a “power tool” download here.

    The analyzer can run a system test, and generate a report indicating the health of your existing system, or a report identifying possible issues with your environment configuration ahead of a TFS installation (pre-install). The information is useful, and generates links to help files that can help you solve various issues. It helped me through a range of issues while troubleshooting an installation recently, including IIS permissions, registry information, etc. The idea is that more and more data collection and reporting points are being added to this tool, so you should keep an eye out for future releases of the tool as well.

Before doing any new installation it is imperative you download the latest version of the TFS 2008 installation guide available here. I can’t overstate the importance of following the guide step by step with any new installation, and although this version of TFS is a smoother install than the previous version (2005), it is still important you pay close attention to setting up the proper accounts and their respective network permissions. It may seem a little cumbersome, but you’ll want the TFS services and various accounts running with the least sufficient permissions as a security best practice, and it is well documented in the guide.

    Enough about general stuff, let’s discuss some specific points:

    “Red X of Death”

If after installation, your client machines are seeing a red “X” on either their documents or reports folders in Team Explorer, then they are encountering a permissions issue on SharePoint (if documents folder), or the Reporting Services layer (if reports folder). I’ve seen this to be quite common, and it should just be pointed out that you should configure proper permissions for your Active Directory users at each of those tiers. Visual Studio Magazine Online has a nice article documenting this here http://visualstudiomagazine.com/columns/article.aspx?editorialsid=2742 .

    “Firewall Ports”

    If you are running TFS, and have a firewall in play, you will need specific ports for the various TFS components to communicate properly. The ports are well documented in the TFS guide under the “Security Issues for Team Foundation Server” section, and there is some automation with new installations that will configure your windows firewall ports for you during install. If you need to configure them post-install, it is just a matter of setting up the firewall exceptions manually.

“TFS Security”

TFS security deserves its own blog entry, but I want to just mention a few quick items since TFS security can be a common stumbling block initially. The TFSSetup account (as described in the TFS guide) is by default an administrator on the TFS server. You’ll need to configure permissions on several tiers of TFS for your other users and groups. Specifically, configure permissions at the overall TFS server level, TFS project level, Reporting Services level, and SharePoint level. It may seem cumbersome at first, but just right click your server node in Team Explorer, and it becomes a point and click exercise. Right click the project node to configure project level permissions. As a best practice for manageability you will want to simply use Active Directory groups to drive this, and if you have a need, you can get very granular when setting up permissions, new roles, etc. Also there is a power toy available that gives you a singular GUI view for managing permissions all in one place that you can download here http://www.codeplex.com/Wiki/View.aspx?ProjectName=TFSAdmin .

    “Just remember TFS is multi-tier”

Finally I would just point out to keep in mind that TFS 2008 is a distributed system with multiple tiers. Open IIS or SQL Server to poke around and look at the databases, directories, permissions, etc. to familiarize yourself with the architecture (do not directly edit the database – look, don’t touch!).

Many customers of course are often “viewing” their TFS server from Visual Studio, and if you see issues with connecting to a TFS project, connecting to a TFS server, or issues with the various views on top of reports or SharePoint documents, you should keep in mind that the underlying reasons probably lie in an IIS site being down, SharePoint configuration, an improper permission, a firewall port, etc. In a nutshell, focus on permissions and configuration settings at each tier, per the TFS install guide, and your issues are likely to be solved!

    Mike Learned --Premier Field Engineer .NET

    Resources

  • PFE - Developer Notes for the Field

    Windows Server 2003 Service Pack 2 and 24-bit Color TS Connections

    • 4 Comments

Okay this is a really weird one!  When you install Service Pack 2 on Windows Server 2003 you can no longer connect using MSTSC via Terminal Services (TS)/Remote Desktop (RDP) with a 24-bit color setting.  For the customer I was working with this meant that the applications they are hosting on those servers did not look good at all over this downgraded connection.  For them it was a blocker to upgrading to SP2.  This has gotten more important recently because Service Pack 1 for Windows Server 2003 is about to be out of support.

So a hunt ensued for how to get this working.  It turns out that there was a bug in Service Pack 2 that disabled some of these higher color depths.  The bug was corrected in KB 942610, although it was not apparent that it applied to my problem since the description was different.  However, it turns out that the same underlying issue was causing both problems.  A couple of other changes needed to happen and then everything was working.  Here are the steps that I came up with to take a Win2K3 SP2 box and enable 24-bit color connections:

    1. Install the following hotfix - http://support.microsoft.com/kb/942610
2. Update the following registry key to enable the fix (if you would rather script this change, see the command after these steps):

      Registry subkey: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Terminal Server
      Registry entry: AllowHigherColorDepth
      Type: REG_DWORD
      Value: 1

  Note that I found the hotfix does add this registry key but it had the value set to 0 so it disabled the fix.  I had to change it to 1.  Also, typical warnings about editing the registry apply.  If you go crazy you can toast your box so back up stuff first!  For those that are sticklers here is the official warning:

  Important This section, method, or task contains steps that tell you how to modify the registry. However, serious problems might occur if you modify the registry incorrectly. Therefore, make sure that you follow these steps carefully. For added protection, back up the registry before you modify it. Then, you can restore the registry if a problem occurs. For more information about how to back up and restore the registry, click the following article number to view the article in the Microsoft Knowledge Base:

  322756 How to back up and restore the registry in Windows

3. Ensure the TS box is accepting 24-bit connections
  1. Go to Start->Run
  2. Type “tscc.msc” and hit enter
  3. Click “Connections”
  4. Right click on “Rdp-tcp” and select properties
  5. Click the “Client Settings” tab
  6. Ensure that “Limit Maximum Color Depth” is set to 24-bit
4. Restart the server to ensure everything is set.
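If you would rather script the registry change in step 2 than use the registry editor, a one-liner like this should work (the usual registry-editing warnings still apply):

    reg add "HKLM\SYSTEM\CurrentControlSet\Control\Terminal Server" /v AllowHigherColorDepth /t REG_DWORD /d 1 /f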

    Hope this helps!

    Thanks,
    Zach

  • PFE - Developer Notes for the Field

    Combining NetMon CAP Files

    • 1 Comments

Today I was working on a problem that we thought was network related.  That meant it was time to try out some of those NetMon skills.  NetMon is the Microsoft network packet capture and analysis tool, similar to WireShark/Ethereal, and the current 3.2 version can be downloaded for free from here - http://www.microsoft.com/downloads/details.aspx?FamilyID=f4db40af-1e08-4a21-a26b-ec2f4dc4190d&DisplayLang=en

For the longest time I and a lot of my colleagues used WireShark/Ethereal to do network captures.  However NetMon has come a long way and I really am a big fan of it now.

Back to my problem today.  We were dealing with a web server that has a decent amount of traffic and a problem that occurred intermittently.  I first started just capturing with the UI but quickly saw the memory grow and the box slow down a bit.  I did not need to see the traffic going by; I just needed to do a capture.  This led me to the NMCAP tool that ships with NetMon.  This is a basic command line tool that allows you to do captures.

So I started capturing, and by default if you use the *.chn extension on the file name you specify, it will create 20 MB CAP files with your network data.  This was great.  I could easily filter out the ones I did not want while I awaited the problem.
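For reference, kicking off a chained capture looks something like this (the adapter wildcard and file name are illustrative):

    nmcap /network * /capture /file server.chn:20M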

Finally the problem reproduced and now I was ready to analyze.  I opened the trace closest to when the problem occurred and began working through it.  I quickly found some conversations that I was interested in, but – where was the beginning?  Where was the end?

    The answer to this – In some of the other traces!

Now I had 3, 4, 5, 6 different traces open and that was no good.  I wanted to be able to combine the 6 or 10 traces that I was interested in.  It turns out that the same NMCAP tool can easily do this.  All you have to do is use the /InputCapture param.  You end up with a command line like this:

nmcap /InputCapture Trace1.cap Trace2.cap Trace3.cap /capture /file Trace.cap:500M

    This was a life saver so I thought I would pass it along!  Enjoy and have a great weekend.

    Thanks,

    Zach

  • PFE - Developer Notes for the Field

    Best Practice - Workflow and Anonymous Delegates

    • 2 Comments

    Best Practice Recommendation

In your Workflow application (more exactly, in the host of your workflow application) never use an anonymous delegate like this:

                      AutoResetEvent waitHandle = new AutoResetEvent(false);
                      .
                      .
                      .
    
                      workflowRuntime.WorkflowTerminated += delegate(object sender, WorkflowTerminatedEventArgs e)
                      {
                                waitHandle.Set();
                      };

    Details

The code above is very common and works correctly... at least most of the time.  Actually, when you do something like this, after a few iterations (if you always keep the same instance of the Workflow runtime, as should be the case) you will notice that the number of AutoResetEvent objects always increases, until eventually an Out Of Memory exception is raised.

If you take a debugger like WinDbg and display the chain of references for one particular instance of the AutoResetEvent class (for example with the SOS !gcroot command), you will get output similar to the following:

    DOMAIN(00000000002FF7E0):HANDLE(WeakSh):c1580:Root:  00000000029126e8(System.Workflow.Runtime.WorkflowRuntime)->
      000000000296e500(System.EventHandler`1[[System.Workflow.Runtime.WorkflowTerminatedEventArgs, System.Workflow.Runtime]])->
      0000000002966970(System.Object[])->  00000000029570e0(System.EventHandler`1[[System.Workflow.Runtime.WorkflowTerminatedEventArgs, System.Workflow.Runtime]])->
      0000000002957038(TestLeakWorkflow.Program+<>c__DisplayClass2)->  0000000002957050(System.Threading.AutoResetEvent)

If you now dump the instance of TestLeakWorkflow.Program+<>c__DisplayClass2 you will get the following output:

    MT                    Field            Offset        Type               VT         Attr                 Value               Name
    000007fef70142d0      4000003          8             AutoResetEvent     0          instance             0000000002957050    waitHandle

    What does this tell us?  The class TestLeakWorkflow.Program+<>c__DisplayClass2 contains one field of type AutoResetEvent.  In our case this field holds a strong reference to the AutoResetEvent object, so that object is still rooted and remains in memory.

    Why is that?  The purpose of this entry is not to give details about anonymous delegates, but basically there is nothing magic about them.  For the delegate to access variables that are not in its scope (look closely: the waitHandle object is not declared inside the delegate, yet the delegate can access it), a mechanism is needed.  For that, the compiler creates an intermediate class with one field per captured variable, and the anonymous delegate becomes a member function of that class; it can then easily access the fields of the class.
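
    To make this concrete, here is a rough, hand-written approximation of what the compiler generates for the first fragment above (the class and method names are illustrative; the real generated class is the <>c__DisplayClass2 we saw in the debugger):

        // Approximation of the compiler-generated closure class.
        sealed class DisplayClass
        {
            // One field per captured variable; this is what keeps the
            // AutoResetEvent alive as long as the class instance is alive.
            public AutoResetEvent waitHandle;

            // The anonymous delegate becomes a member method so it can
            // reach the captured variable through "this".
            public void WorkflowTerminatedHandler(object sender, WorkflowTerminatedEventArgs e)
            {
                waitHandle.Set();
            }
        }

        // The subscription then effectively becomes:
        //   var closure = new DisplayClass { waitHandle = waitHandle };
        //   workflowRuntime.WorkflowTerminated += closure.WorkflowTerminatedHandler;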

    What if you use hundreds of objects in the anonymous delegate, objects which are normally not in the scope of the delegate?  Well, you will get an intermediate class with hundreds of fields, and each instance of that class will reference the real objects...

    Why is that a problem?  Again, look closely: how can you get rid of this instance?  You would have to remove the reference to the delegate, but you cannot (not with the syntax used above).  That means the intermediate class, and more problematically the captured objects, are still referenced and remain in memory.

    To better understand the problem, let's now study the fragment below.  It is a fairly classical piece of code that you will find on multiple blogs (in fact, it is from one of those blogs that we got the code that was deployed in production - the code on which I spent hours of debugging to understand why the portal was dying after several hours):

            WorkflowInstance instance = workflowRuntime.CreateWorkflow(typeof(MyASPNetSequencialWorkFlow));
    
            instance.Start();
    
            workflowRuntime.WorkflowCompleted += delegate(object o, WorkflowCompletedEventArgs e1)
            {
                if (e1.WorkflowInstance.InstanceId == instance.InstanceId)
                { ...

    Yes, you got it: in this case it is the workflow instance that will stay referenced and remain in memory.  Considering that in a web application host (as well as in a Windows service host) the WorkflowRuntime is a long-lived object, you will for sure end up with an Out Of Memory exception.

    What to do, then?  Well, code like the example below is definitely better:

                        EventHandler<WorkflowTerminatedEventArgs> terminatedHandler = null;
                        EventHandler<WorkflowCompletedEventArgs> completedHandler = null;
    
                        terminatedHandler = delegate(object sender, WorkflowTerminatedEventArgs e)
                        {
                            if (instance.InstanceId == e.WorkflowInstance.InstanceId)
                            {
                                Console.WriteLine(e.Exception.Message);
                                workflowRuntime.WorkflowCompleted -= completedHandler;
                                workflowRuntime.WorkflowTerminated -= terminatedHandler;
                                waitHandle.Set();
                            }
                        };
    
                        workflowRuntime.WorkflowTerminated += terminatedHandler;
    
                        completedHandler = delegate(object sender, WorkflowCompletedEventArgs e)
                        {
                            if (instance.InstanceId == e.WorkflowInstance.InstanceId)
                            {
                                WorkflowInstance b = instance;
                                workflowRuntime.WorkflowCompleted -= completedHandler;
                                workflowRuntime.WorkflowTerminated -= terminatedHandler;
                                waitHandle.Set();
                            }
                        };

                        workflowRuntime.WorkflowCompleted += completedHandler;

    You again have to be prudent: you must remove all the handlers.  If in the end the WorkflowCompleted handler is executed, the WorkflowTerminated handler will never be executed (and vice versa), hence all the handlers are removed in both delegates.

    Thanks,
    Fabrice

  • PFE - Developer Notes for the Field

    Investigate High CPU for a Process: Part 1

    • 0 Comments

    Hello World!  This is my first-ever blog post and I hope there are more to come.  My name is Leo Leong and I am one among many in the field who go out to help when there is a dev issue.  In many ways, I've always thought that we are like the "mighty mouse" of our field, "Here I come to save the day...".  Well, it's not always like that, but that's how I'd like to think of us :)

    Alright, enough said; let's get back on track with the real deal.  The real world has proven to be a much more complicated arena, but for simplicity's sake I've created an academic sample here to illustrate what we are going to accomplish: finding out what a process is doing when you see it is taking up much of your system's CPU.  Imagine the following scenario:

    I have developed a web site and whenever I browse to a page, it pegs my machine's CPU.  What is it doing?

    Basically, there are quite a few tools that come to mind, but I will introduce two of them in this first part because they are platform (x86/x64) and .NET version agnostic.  But more importantly, we can't use Visual Studio!

    The first tool is called Process Explorer.  You can find more details and the download here.  It is a very GUI-driven tool and does a good job of presenting what you might be interested in within any given process.  Here's what it looks like when I looked at the stack traces for the worker process's (w3wp.exe) threads, and specifically the thread with the highest CPU:

    [Screenshot: HighCPU_Part1_ProcExp - Process Explorer thread stack view]

    Note, you will have to configure symbols to be able to see the function names shown in the above screenshot.  To do that, simply go to Options | Configure Symbols... and put in a symbols path such as the following:

    [Screenshot: HighCPU_Part1_ProcExp_Symbols - the Configure Symbols dialog]
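
    In case the screenshot is not available: a typical symbols path (assuming a local downstream cache at c:\symbols and the public Microsoft symbol server - adjust both to your environment) looks like this:

        srv*c:\symbols*http://msdl.microsoft.com/download/symbols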

    I've made a number of assumptions here: that you understand what a process, a thread, and a stack are, and that you are aware of what managed and native code is.  But regardless, the essence of this tool is that it gives you a stack trace of any thread within a process, which makes it a very powerful tool.

    So, we are halfway through this problem and have discovered what the stack looked like for the thread in question causing the high CPU.  Because this is a .NET application (more accurately, ASP.NET), we need something more to dig in further.  Hence, we bring out the next tool, called WinDbg.  More details and the download can be found here.  This is pretty much the de facto tool used within Microsoft to debug these kinds of issues, and you'll hear about it a lot from us.  Explaining all the details of debugging a managed application is yet again outside the scope of this post, but here's the command you can use to get the stack traces for all managed threads:

    ~*e!clrstack

    When you attach the debugger (Windbg) to the worker process and issue the above command, this is roughly what it will look like:

    [Screenshot: HighCPU_Part1_Windbg - WinDbg output of ~*e!clrstack]

    The command used above is part of the managed debugging extension called SOS, and you will need to load it in the debugger before you can debug any managed application.  In case you are not already familiar with it, try the following to load SOS:

    .loadby sos mscorwks

    So, what can we conclude so far?  Well, it is safe to say the function HighCPULoop_Click is the root cause of the problem here.  A for loop that goes from 0 to int.MaxValue/2, which is about 1 billion, is a lot to loop through!!  Once again, this is only an academic sample that I've whipped up in my spare time to represent a high-CPU scenario.  There are lots of reasons why a good for loop can go bad :)
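
    For reference, the offending handler in my sample looks something like this (a hypothetical sketch; only the loop bound mentioned above is taken from the real sample):

        // Academic sample: a button click handler that pegs one CPU core.
        protected void HighCPULoop_Click(object sender, EventArgs e)
        {
            long total = 0;

            // int.MaxValue / 2 is roughly 1 billion iterations.
            for (int i = 0; i < int.MaxValue / 2; i++)
            {
                total += i;
            }
        }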

  • PFE - Developer Notes for the Field

    WinDBG and Hangs When Debugging Managed Dumps

    • 0 Comments

    Because this is my first post on this blog, let me introduce myself.  My name is John Allen.  I’ve been on the PFE team for 5 years and with Microsoft for 9 years, and all of those years I’ve been debugging and troubleshooting all kinds of customer applications.  I focus on all developer technologies and products.

    One of the main things that I and the others on the PFE - Developer team do is capture and analyze memory dump files.  A majority of the time we are debugging .NET applications, and we have found that sometimes the debugger will hang while trying to analyze a dump.  There are a couple of key characteristics:

    1. This is a 100% CPU hang and the debugger does not recover - if your machine has more than one CPU, the debugger will only be fully utilizing one of them.
    2. The hang occurs when loading symbols for a managed assembly.
    3. The application that was dumped is .NET Framework 2.0 or greater.

    We will talk more about symbol loading in a subsequent post but here is a workaround for this problem:

    1. Create an empty SymSrv.ini file in your Debuggers Directory (Default - C:\Program Files\Debugging Tools for Windows).  For more information check out the Symsrv.ini documentation on MSDN - http://msdn2.microsoft.com/en-us/library/ms681416.aspx
    2. Then open that file and add the following:
    [exclusions]
    System.Windows.Forms.pdb
    System.pdb
    System.Web.pdb

    After you have done this, try loading the dump file again and see if the hang occurs again.  If this does not resolve your problem, then run the following command:

    0:020> lm m *_ni
    start    end      module name
    637a0000 63d02000 System_Xml_ni (deferred)
    64890000 6498a000 System_Configuration_ni (deferred)
    65140000 657a6000 System_Data_ni (deferred)
    65f20000 66ac6000 System_Web_ni (deferred)
    698f0000 69ad0000 System_Web_Services_ni (deferred)
    790c0000 79b90000 mscorlib_ni (deferred)
    7a440000 7ac06000 System_ni (deferred)

    This lists all of the native image assemblies - the assemblies that have been NGened and are loaded from the NGen cache.  Then, for each assembly that is listed, add it to the [exclusions] list in the SymSrv.ini file.  Be sure to remove the _ni, replace it with .pdb, and replace the remaining underscores with dots.  For instance, for the above output the [exclusions] list would look like:

    [exclusions]
    System.Xml.pdb
    System.Configuration.pdb
    System.Data.pdb
    System.Web.pdb
    System.Web.Services.pdb
    mscorlib.pdb
    System.pdb

    Finally, if you are looking for additional information on managed debugging you should check out Tess's blog - http://blogs.msdn.com/tess/

    Thanks,
    John Allen


  • PFE - Developer Notes for the Field

    Exploratory Testing in Visual Studio 2012

    • 3 Comments

    With the recent release of Visual Studio 2012, there are many new features and updates to explore.  One of the new features is Exploratory Testing, which allows you to test your software without a pre-defined test or script.  It’s perfect for simply “exploring” your application to see what issues you may encounter.  Exploratory Testing comes with the Test Professional, Premium, and Ultimate editions of VS 2012.

    Technically, Exploratory Testing is part of Microsoft Test Manager (MTM).  In VS2010, one could achieve exploratory testing by filing an exploratory bug in Test Runner.  From there, users/testers could execute unscripted actions until a bug was found.  But in VS2012, exploratory testing is pulled into MTM along with your other tests, plans, suites, tracks, work items, etc.  You can learn about some of the other new features in MTM 2012 in MSDN Magazine.  In order to use Exploratory Testing in MTM, you must be connected to Team Foundation Server (TFS) 2012.  If you don’t have access to a TFS 2012 instance, you can try out TFS in the cloud for free, for a limited time, at http://tfspreview.com.  The instructions at that site are simple and include everything you need to create an account and project, upload source, automate builds, etc.

    NOTE: At the time of this writing, no information was available as to how long tfspreview.com would be available (or at least, available for free).  While the end goal of TFS Azure is to have all the features of TFS, you may find that not all features of TFS - such as Reporting Services, AD Federation, etc. - are available at tfspreview.com.


    With exploratory testing, you can literally browse through your application at your leisure, until you find a bug or some other part of the software that should be considered for change – editing text in a label, background color change, adding or removing a feature, etc.  And the whole time you’re “exploring” the application, MTM records your actions.  By default, actions you execute in MTM or Office applications are not recorded, though you can change this in your test plan properties (Testing Center, Plan, Properties).  Exploratory testing in MTM allows you to do the following:

    • Pause and resume recording of your exploratory test case
    • Enter comments and screenshots for your exploratory test.  This can help clarify recorded activities in your test.
    • Create a bug
    • Generate a Test Case for the bug you just created


    Exploratory testing can be used on web applications and Windows applications, and if it’s a client/server application, some steps on the server side can also be recorded.  More details on this, as well as the steps to run an exploratory test, can be found at http://msdn.microsoft.com/en-us/library/tfs/hh191621(v=vs.110).aspx.  While I have no desire to copy what can already be found on MSDN, I also know that sometimes links change, content is moved, etc.  Therefore, I’ll repeat some of those steps in this blog entry.

    1. Open MTM.
      a. You can perform exploratory testing with no existing work items.  To do this, simply right-click a Test Plan or Test Suite and select Explore.
      b. If you do have an existing work item, you can perform exploratory testing on a specific work item by going to Test, Do Exploratory Testing.  Then right-click on your work item of choice and select Explore.


    2. When you select Explore, the Exploratory Testing window opens, waiting for you to Start your test.  Once clicked, MTM records your actions.


    3. You can pause or end the test, and this pauses or ends recording of your actions. At any time you can enter comments and/or add screen captures in the Exploratory Testing window.  Additionally, if you run into a bug you can click the Create Bug button during recording, and fill out details of your newfound bug.  When you save your new bug, you’ll have the option to create a Test Case for the bug you just created.


    These are just the steps to get you started, and there are many other options for saving Test Cases, assigning & editing bugs, and more.  Exploratory testing is a great option when you don’t have a defined test, a defined script as to how to navigate directly to a specific bug, or even when you don’t have a specific idea of what problem you seek to uncover in your application. 

  • PFE - Developer Notes for the Field

    PowerShell: Restoring a whole heap of SQL Databases

    PowerShell is one of those things that falls into my “other duties as assigned” repertoire.  It’s something that I’ve used for years to get things done but it’s not often I encounter a Dev at a customer that has worked with it much.  In my...
  • PFE - Developer Notes for the Field

    ASP.NET and Unit Tests

    • 1 Comment

    I was onsite the other day and the customer wanted to use Visual Studio 2005 to auto generate the Unit Test stubs for their ASP.NET application.  They have a lot of rules tied up in the ASP.NET application project.  When we tried to generate the Unit tests we kept getting the following error:

    Source Code cannot be sufficiently parsed for code generation.  Please try to compile your project and fix any issues.

    Well, needless to say the project compiled fine (and had been for a while) and ran fine.  We could not find any issues to fix.  My first thought was that this was a limitation because of the way they created their ASP.NET project.  So we got some details:

    1. Originally Created in VS.NET 2003
    2. Migrated to VS 2005
    3. They did not convert to the new Page Model with App_Code and such.

    After doing some digging, I found out that the Unit Test generation for some reason expects the project to have at least an empty "Settings."  So we opened the properties on the project and clicked Settings.  Sure enough, there was nothing there.  So we clicked the link to create the "Settings" and got the grid that would enable us to add settings.  We just left it blank and rebuilt.  At that point we retried the generation of the Unit Tests and everything went smoothly.

    Thanks,
    Zach

  • PFE - Developer Notes for the Field

    SN.EXE and Empty Temp Files in %TEMP%

    • 0 Comments

    If you have a build server and are doing delay signing, this is probably of interest to you.  When delay signing, the final step is to run the following command as a post-build step:

    sn -R myAssembly.dll sgKey.snk

    I have seen build setups that output all binaries to one folder and then run a loop across all the DLLs executing this command.  That way everything is fixed up and ready for the devs to run, and you do not have to worry about adding a post-build step to each project (alternatively, you can add one common script to each project).  The release builds of course get fully signed, but for daily/rolling builds this works just fine.  Until you check out your temp directory.
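
    A minimal sketch of such a loop in a batch file (the bin folder and the key file name are illustrative - adjust them to your build layout):

        rem Re-sign every delay-signed DLL in the output folder.
        for %%f in (bin\*.dll) do sn -R "%%f" sgKey.snk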

    It turns out for each call to SN.EXE a TMP file is generated.  These TMP files look like this:

    C:\temp>dir
    Volume in drive C has no label.
    Volume Serial Number is 1450-ABCF

    Directory of C:\temp

    12/11/2007  12:39 PM    <DIR>          .
    12/11/2007  12:39 PM    <DIR>          ..
    12/11/2007  11:17 AM                 0 Tmp917.tmp
    12/11/2007  11:17 AM                 0 Tmp91C.tmp
    12/11/2007  11:17 AM                 0 Tmp921.tmp

    You probably noticed that these are 0-byte files, which means space is not an issue, but if you have a ton of them they can slow your hard drive down.  Also, the actual SN.EXE process can start slowing down.

    If you have 100 DLLs and are doing 30 builds a day, that is 3,000 temp files a day.  Now, let’s say you build debug x86, debug x64, release x86, and release x64 - that is now up to 12,000.  After a couple of days you can imagine how many files pile up.

    I was working on this and we found out that SN.EXE is not actually the problem here (thanks to Scot Brennecke).  The issue is in the Crypto APIs that SN.EXE uses; these APIs create the TMP files and never clean them up.  It turns out that there is a hotfix for this if you are using Windows Server 2003:

    On a Windows Server 2003-based client computer, the system does not delete a temporary file that is created when an application calls the "CryptQueryObject" function
    http://support.microsoft.com/kb/931908

    After applying this hotfix, the temp files were no longer created and life was happy.  I just thought I would share, because the connection between this hotfix and the problem was not immediately obvious.

    Have a great day!

    Zach
