Welcome to MSDN Blogs Sign in | Join | Help

MSAA, UIA brief explanation

MSAA, UIA brief explanation

 

中文链接:

http://eparg.spaces.live.com/blog/cns!59BFC22C0E7E1A76!4008.entry

 

 

What is MSAA:

http://msdn.microsoft.com/en-us/library/ms971310.aspx

 

MSAA is the same as IAccessible Interface.

 

The interface is originally designed for disabled people. While in most cases, it is used as an implementation method of UI Automation. However, this interface is not rich enough to fit into all aspects of UI Automation, as it only provides the name, type, positions of the target elements. It does not expose other functions like ClassName, Orientation, nor the operation supports like Window Move, Selection, to say nothing of Event Support.

 

As a result, Microsoft believes there is a good reason to provide dedicated interfaces for UI Automations:

 

IAccessibleEx

IRawElementProviderSimple:

 

http://msdn.microsoft.com/en-us/library/dd561900(VS.85).aspx

 

In MSDN category, the two interfaces belong to Windows Automation API, not MSAA.

 

The first interface IAccessibleEx, sounds like MSAA’s IAccessible, but it is just a proxy for the Automation Client to obtain the IRawElementProviderSimple. The implementation of pattern and property are in IRawElementProviderSimple. IAccessibleEx lives for only two reasons:

 

1.      Provide the MSAA client a familiar way to use the new UI Automation API. The MSAA client is able to query for IAccessibleEx interface and consume the new feature without using the new style of UIA API. The client code can still work in COM style.

2.      Provide the UI control programmer the MSAA way to implement and expose new UI Automation feature for the control. The programmer just needs to implement the IAccessibleEx interface for the control in COM style, and the pattern and property UI Automation features are accessible from new UIA APIs.

 

http://msdn.microsoft.com/en-us/library/dd561882(VS.85).aspx

 

IAccessibleEx is just a bridge. In Win7, the implementation for the bridge is in UIAutomationCore.dll.

 

To use UI Automation, the client could choose to use UIA API directly, or, use the old style, which is obtaining IAccessible, and QueryService (not QueryInterface because the system needs to do some hook) for IAccessibleEx, and then just use it or continue to query for IRawElementProviderSimple.

 

After getting the IRawElementProviderSimple, it is very easy to consume the patterns and providers:

 

http://msdn.microsoft.com/en-us/library/ms746691.aspx

 

Please note, the AutomationProvider could be implemented in either UI Server or the Client side. To implement in UI Server, the control could derive from IRawElementProviderSimple and return the client request like this:

 

if ((m.Msg == WM_GETOBJECT) && (m.LParam.ToInt32() ==
                AutomationInteropProvider.RootObjectId))
            {               
                m.Result = AutomationInteropProvider.ReturnRawElementProvider(
                        this.Handle, m.WParam, m.LParam,
                        (IRawElementProviderSimple)this);             
                return;
            }

 

The client implementation benefits the old Win32 control as we do not need to modify any existing implementations for the control. For client side provider, it is also called as Provider/Automaton Proxy. In client implementation, the ProviderOptions property should return ProviderOptions.ClientSideProvider.

 

For Win32 and WinForm default implementations, they are in client side. The assembly is UIAutomationClientSIdeProvider.DLL. The ProxyHwnd.GetElementProperty’s implementation is very typical:

 

Getting UIAotmationID by sending WM_GETCONTROLNAME message

Getting Name property by reading MSAA’s Name property first, calling GetWindowLong and GetWindowText API the second and trying PostMessage for last.

 

I just found an old Automation Paper created in 2004:

http://msdn.microsoft.com/en-us/library/ms996405.aspx

 

The author Brian McMaster, who is a Test Architect now, educated people to use WM_GETCONTROLNAME message to get control ID, take care of the hierarchy of UI control, and watch out for the case that window owner and window parent are different, in the year when UIA was absence.

 

With above understanding, it is quite easy to implement the server side for WinForm:

 

1.      Override WndProc by calling AutomationInteropProvider.ReturnRawElementProvider to return WM_GETOBJECT message.

2.      Implement IRawElementProviderSimple and return ProviderOptions.ServerSideProvider.

3.      Implement the providers like ISelectionProvider.

 

The full code is available in the links above.

 

For UIAutomation client, besides the IAccessibleEx I talked above, there are two kinds of UIAutomation API, managed and unmanaged.

 

The managed API is in System.Windows.Automation namespace. The class name is AutomationElement. The functions are very direct to use.

The unmanaged API is a group of functions whose name starts with UIA.

 

In current design, the managed API uses unmanaged API internally, while the provider used in the unmanaged API are implemented in managed code. The relationship is a bit complex. And there are some features work fine in unmanaged API but may fail in managed API because of bug. In Win8, we may merge the implementation to reduce the maintenance cost.

 

The user is not allowed to add (extend) new property and pattern in UIA. You cannot register your new pattern provider or property. You cannot even use the functions from the interface directly:

 

http://msdn.microsoft.com/en-us/library/system.windows.automation.provider.automationinteropprovider.hostproviderfromhandle.aspx

 

"The interface returned by this method can only be passed back to UI Automation. Attempting to call a method on the interface will raise an exception"

 

If you really want to do it, follow the steps to get IAccessibleEx, and do your own QueryInterface.

 

Above explanation is for Win32 and WinForm. For WPF, it uses AutomationPeer to implement UIAutomation.

 

The communication between client and server for native UIAutomation implementation (AutomationPeer and ReturnRawElementProvider) is named pipe. The client provider with MSAA is by DCOM and Windows Message.

 

Additional links:

 

UI Automation Community Promise Specification

http://www.microsoft.com/downloads/details.aspx?familyid=A1FE1066-BF4F-44FC-834B-676B311E83A2&displaylang=en

 

UI Automation of a WPF Custom Control

http://msdn.microsoft.com/en-us/library/cc165614.aspx

 

The followings two docs are created by Microsoft, but I cannot find the link from MSDN. They explains UIA’s design goal, IAccessibleEx and IRawElementProviderSimple’s relationships very well.

 

[added on 9/24/2009]
Clientside provider information doc:
 
 
sxe ld UIAutomationClientsideProviders*
[/added]
 

http://www.accessinteropalliance.org/docs/Introducing_IAccessibleEx.doc

http://www.accessinteropalliance.org/docs/What_is_UIAutomation.doc

Posted by lixiong | 1 Comments

Section 3: UI Automation execution covers test hole which normal test cannot

Section 3: UI Automation execution covers test hole which normal test cannot

 

In previous sections, I discussed the UI Automation preparation and development. I want to deliver the message that UI Automation is not test repeating happens automatically. It is unique. UI Automation execution does not only compensate human flaw and costs, but also cover the test path that other testing cannot.

 

The benefits of UI Automation are obvious. Testers do not need to repeat the boring tasks. Regressions are found earlier. No human mistakes and the test results are accountable. It is cool but not limited above.

 

1.      There are bugs that can be easily found by Automation, but difficult by manual

 

Real case one:

 

Our project is WPF based. The feature is something like Window explorer. It contains the address bar and when the user inputs the path, the application loads the data from that path.

We have a test case to start the application, type some path, and check the data loaded. In manual testing, it is perfect. It works every time. We developed the Automation case to perform the same flow. The Automation crashes the target application when it executes the first time.

 

The root cause is about threading. When the application starts, it creates a worker thread to perform some initialization task. In manual testing, as the user type speed is not so fast, the worker thread exists safely before the application starts the data loading from the path. UI Automation makes the whole execution pretty fast.  The application starts to load the data before the worker thread exists, so the initialization does not complete and leads to data corruption with crashe. When the root cause is identified, we start the application and paste the path to load instead of type, and the crashes happens manually.

 

Real case two:

 

In our UI Automation, we wrap the Automation task in test helper classes. We have create, delete, rename, move, add group, lock/unlock and other tasks. With Automation, we can easily generate flows to execute these tasks, with specified pattern or total random.

 

The auto generated test flow find two special bugs. It generates a random test case like

 

a) Create userA

b) Add userA groupB

c) Add userA groupB

d) Add userA groupB

 

The test execution fails at step d)! The problem only happens when you try to perform the same add group task the 3rd time! The root cause is, adding user to the same group more than once leads error. The error flag is not handled correctly in step c), thus the problem happens in step d).

 

In manual testing, 99.99% tester stops at step c). Nobody will test by “adding to the same group” once, once again, and again and again.

 

Another similar issue is, object move task works fine in 99% cases, except move the object from current location to grand parent.

 

Add an example from One Note team:

http://blogs.msdn.com/johnguin/archive/2008/09/08/investigating-an-automation-failure.aspx 

 

The UI Automation execution pattern is different from human. It is unique at detecting threading issue, randomization and strange combinations.

 

2.      There are test path and coverage can be easily reached by Automation, but difficult by manual

 

We have a test pass to measure the memory usage by running daily task for the UI application 24*7. Are you going to hire a monkey?

 

Stress testing is a typical example for test path which can only be covered by Automation.

 

Another real case is permutation. Our application is designed to handle different types of object. However, it is difficult to predict that all the types are known for us. The only way is to go through every folder and sub folder, to load every objects in the whole organization. With Automation, it is very simple to do it recursively. Eventually, we find a bug which corrupts the whole navigation feature. We made an incorrect assumption that the indicator between child and parent in path goes with a static pattern. It turns out the pattern is configurable. I cannot image if the product gets shipped with this kind of bug.

 

Summary:

 

I skip other obvious benefits Automation can bring. If the three topics give a clear definition of what UI Automation is, what it brings and what you need to prepare, it is unnecessary to discuss when to apply manual and when to use automation. Triage by yourself.

 

Posted by lixiong | 2 Comments
Filed under: , ,

Section 2: UI Automation development is an important, unique, and effective bug finding process

Section 2: UI Automation development is an important, unique, and effective bug finding process

 

If you think UI Automation is no more than repeating the manual testing automatically, you are completely wrong.

 

Many people expect to record the manual testing, and reply. They think tester just designs the test case once, records it once, then all the boring repeating jobs are auto! They try all the best to get there and they believe reducing cost is the only benefit Automation can bring. No, they are wrong. First of all, please check my previous post. This kind of recording/reply Automation is not a good Automation. It is not flexible. It cannot handle various inputs to verify. It cannot suit when resolution or window starting position changes. This kind of UI Automation is a typical failure. If your assumption on UI Automation is above, UI Automation will definitely hurt you.

 

In this post and the next, I will continue to discuss the meaning of UI Automation. I will focus on UIAutomation development here. Next I will focus on execution.

 

UI Automation development (before executing) is a unique bug finding process. It is NOT a combination of any existing testing mechanism plus automation. It is NOT derived from manual testing. It is unique, high effective and great coverage test mechanism which can compensate all the other existing test methodologies.

 

1.      Accessibility and localization bugs are exposed naturally when developing the UI Automation.

 

The first step for any UI Automation is to access the target UI elements. The tester needs to find a proper way to probe, identify the target UI element, and then read the information from UI to verify. If the tester cannot find AutomationID, AutomationName, and shortcut key to execute UI task, bugs are found! Most of the accessibility issues are exposed naturally in this stage.

 

A good UI Automation should work well for localized build. The tester needs to write code to work with product resource file and resource key. If some string is hard coded in product, the bug is also exposed naturally.

 

Actually, Accessibility is very serous for Microsoft products. If an Automation name is not well defined, it is a S2 bug, the same as major function break. In early stage, our dev team did not know UI Accessibility requirements very well. The test team could help a lot in this stage to prevent hundreds of S1/S2 bugs.

 

2.      The tester needs to solve the same problem as developer does. More people working on the same problem means more coverage. (This is not limited in UI Automation. It applies to all kinds of Automation tests)

 

In manual test, the tester starts the UI and plays with it. Most of the functions are visible and the tester will be influenced by the UI. The tester will try different patterns and different data to break it. Probably the tester will pre-define the test cases. However, these cases are usually based on scenarios, patterns, boundaries and other test metric.

 

It is great. It is important and valid test scenarios. We should continue to invest and apply these tests. However, UI Automation development brings another test strategy which compensates a lot.

 

In the manual tests, or for the pre-defined test cases, the think pattern is usually: can the product meet this kind of corner requirement? UI Automation development brings another think pattern: how can the problem be solved and all the requirements be met correctly?

 

To make a good and flexible UI Automation case, the tester cannot rely on static input/output. The tester has to solve the same problem just as the developer does. When the tester starts to work on the solution, the tester faces the same set of issues the developer needs to handle with. It includes the corner cases, exception/error handing and assumptions. It makes the tester understand the problem well, to really think about what a good solution can be. It helps test coverage because the tester is suffered with the same problem, and it also kills invalid test cases. In this process, the tester may want to proactively review product code, so it leads to more benefits:

 

l         Effective code review and design review with specified goal in mind. It is the most effective way to find bugs.

l         Complex measurement like performance can be easily covered in early stage from review.

l         It helps the communication between dev and test

 

Put notepad as an example. Notepad reorganizes ANSI/UNICODE/UTF8/UTF16 formats automatically. UI Automation for this case should operate notepad to open different kinds of file, and check the string showing in the window. How could the test know the expected string? The tester has to write the code to reorganize formats just like the notepad.exe does. (Of course you can generate string first, and save to different formats. But you know what I mean)

 

Microsoft defines a special field in bug database to identify how bug is found. “Automation development” is a very frequently used value for me.

 

3.      The tester has to triage the severity for UI Automation test tasks. It makes the important features tested first.

 

You can play around for everywhere in manual testing. However, you do not have the resources to create Automation to cover all the features. You have to put important thing first when creating the Automation.

 

Managers may like this. Manager cares about the priority and severity. They can bear tons of S3/S4 bugs but they hate surprise in major functions. However, testers trend to show how strong they are to find corner issue no matter how corner it is!

 

With UI Automation, the manager can assign challenging tasks for ICs. ICs are happy to take the challenging and interesting tasks. Manager is happy because testing goes with severity. ICs are happy because they are finding bugs in a challenging and effective way!

 

In previous post, I discussed the meaning of being able to create UI Automation before start. In this post, you see the meaning of creating UI Automation before executing. UI Automation can give you so many even before it starts.

Posted by lixiong | 2 Comments
Filed under: , ,

Section 1: UI Automation means more to get a test result

My understanding of UI Automation

 

Section 1:

UI Automation means more to get a test result

 

Section 2:

UI Automation development is an important, unique, and effective bug finding process

 

Section 3:

UI Automation execution covers test hole which normal test cannot cover

 

 

Section 1: UI Automation means more to get a test result

 

UI Automation is an interesting topic. Before discussing the complexity, benefits, coverage, test holes, maintenances costs and other controversial factors involved in UI Automation, the meaning when a tester can create a successful UIAutomation case is somewhat different.

 

How UI Automation works? How the textbox gets filled, button gets clicked and window gets opened and closed automatically. It seems quite a magic but a successful UIAutomation is not limited on just making it work. Successful UIAutomation means at least but not limited:

 

1.    Stable. A succeed case should always succeed no matter it executes in a powerful x64 machine, or a slow x86 VM. It should give the same result when the initial window position changes. It can test localized build.

2.    Performant. UIAutomation should runs faster than manual execution because it is controlled by machine! It is very bad if the test sleeps for a specified time between each UI operation.

3.    Flexible. The input and expected output should be configurable. If the test executes against large input, it should still generate the same pass/fail result. The execution time should be liner.

 

Given a Visual Studio 2008, how can you create such an UI Automation case to open notepad, type ABC, click file -> save menu, and type a filename in the dialog?

 

If you do not know Windows Message quite well, the most direct way is to use SendInput to simulate keyboard/mouse input. It works but the problem is:

 

1.    You have to calculate the Window position. The initial window position changes and different resolution/DPI settings will be a nightmare.

2.    You have to Sleep between each SendInput. For notepad, there is a timeframe between clicking File menu item and showing the sub menu. If test case does not wait, the clicking on save menu may be missed.

 

Window Message is a silver bullet? It works for most of the time, but the new problem is more difficult to solve:

 

1.    If the target application is .NET WinForm, the HWND tree may change at runtime. If you hide a tree control, and show it, the Win32 HWND for the tree control is likely to change. WinForm also creates some transparent HWND child which may break the Automation.

2.    There are a lot of traps to work with Windows Message. Suppose you want to set focus on the textbox of the target application, just calling SetFocus and passing the textbox HWND? No, please read:

"The SetFocus function sets the keyboard focus to the specified window. The window must be attached to the calling thread's message queue."
http://msdn.microsoft.com/en-us/library/ms646312(VS.85).aspx


3.    The SendMessage API returns only when the target application finishes the handling. However, it does not solve the sync issue between tester app and testee app completely.

4.    What if the target application is WPF based??????!!!!!!

 

Let me stop showing how difficult it is. (It is indeed very difficult). Microsoft combines SendInput, Windows Message and the Accessibility support for UI element to create internal UI Automation tests.

 

For who do not know Accessibility support, please check:

http://msdn.microsoft.com/en-us/library/ms696165.aspx

 

Generally speaking, UIAutomation is just an inter-process communication. Like every other normal application, inter-process means:

 

1.    How and where the communication happens? Windows Message, Socket, RPC, Named Pipe, Memory mapping file, WCF or even DLL Injection? (I will talk WCF and DLL Injection for UIAutomation)

2.    Synchronization. How the test client knows a UI instruction finishes on target side, so that it can send the next instruction? How the test client distinguishes an initialization window and an initialized window so that it will block the button clicking before the button gets visible?

 

I can talk more on details later. This is how our UI Automation Framework makes it (at least but not limited):

 

1.     Most of the UI Automation is DCOM invocation. UI element exposes IAccessible COM interface. WPF UI element also supports it. Check the following for more info:
http://msdn.microsoft.com/en-us/library/ms747327.aspx


2.     SendInput, Windows Message and notification are widely used for special cases.

3.     Window Hook and WinEvents are used to detect UI status change

4.     WM_NULL, probing processor usage and thread condition of target application are used for synchronization.


Even with the support of UI Automation Framework, creating good UI Automation case is still a skillful task. You need to know where to use raw device input API and where to leverage the COM invocation API. How you pickup the correct synchronization mechanism? How to probe the target UI element effectively, using index, ID, or name? Starting from the root desktop Window or from a parent window? What if the target application shows MessageBox whose parent is desktop not the application main window? What if the target application uses async way to creates dynamic UI?

 

If you can design a scalable web application, it means you know programming language, threading locks, HTTP protocol, and probably database very well. If you can design a good UI Automation case, the meaning is not limited to the test result you can get. It means:

 

1.    You know programming well, .NET Framework, multiple threading, C#...

2.    You know Windows mechanism (Win32) well

3.    You probably know how COM/DCOM works. There is no mystery for you any more.

4.    You know how application fails and you know debugging well. (Debugging UI Automation is quite difficult than debugging your own EXE, and I promise the Automation will fail again and again before you know how to work it out)

5.    At last, you know your test target well. Well, I will cover this in coming topics.

 

When the testers are armed with above skill sets, it will be quite easy to break other’s application. And it will be a solid tech base for them to develop other skills. When you know how to make your UI Automation work, you know how a typical developer works with code, how often the assumpations are incorrectly made, and why application fails so that you know how to break it.

 

 

PS:

If you know WPF Snoop, you see how DLL Injection can work for UI Automation. In prototype stage of my current project, I tried to create a new Automation Engine combing with DllInject and WCF. I inject a DLL into WFP application, and the DLL hosts a WCF server inside that. It opens a door for me to execute arbitrary code on target process. Then I can use WPF TreeWalker and reflection to do what ever I want.

Posted by lixiong | 2 Comments
Filed under: , ,

The hierarchy in flat memory: Heap and Stack

The hierarchy in flat memory: Heap and Stack

 

This section discusses Heap, related heap corruption/memory leak, and how to use pageheap to troubleshoot.

 

Heap is designed for efficiency use of the flat memory space

 

In Chinese version, it discusses why we need heap, and how it is built on the top of flat memory management. Then I go through different scenarios in detail. Here I just brief some of the points, and put the time to discuss the real cases.

 

Due to the nature of the heap, it pains when we meet the following issues:

 

1.      Heap use after free

2.      Heap use underflow and overflow

3.      Double free

4.      Multiple thread uses.

 

Pageheap is a OS built facility to enable debugging trace of heap manager. Please refer to:

 

How to use Pageheap.exe in Windows XP and Windows 2000

http://support.microsoft.com/kb/286470/en-us

 

Pageheap.exe download is available at:

 

http://www.heijoy.com/debugdoc/pageheap.zip

http://blogs.msdn.com/lixiong/attachment/2792912.ashx

 

A good resource is:

 

Debug Tutorial Part 3: The Heap

http://www.codeproject.com/debug/cdbntsd3.asp

 

Look at the following code, compile it in release mode:

 

char *p=(char*)malloc(1024);

    p[1024]=1;

 

It overwrites 1 byte. In release mode, it does not crash. However, if we enable pageheap with the following command:

 

C:\Debuggers\pageheap>pageheap /enable mytest.exe /full

 

C:\Debuggers\pageheap>pageheap

mytest.exe: page heap enabled with flags (full traces )

 

Rerun it with pageheap enabled, the application crashes. However, if we change the code a little:

 

char *p=(char*)malloc(1023);

    p[1023]=1;

 

Does it crashes even the pageheap is enabled?

 

It does not crashes even if pageheap is enabled with default setting. To debug such issue, we need to use /unaligned switch.

 

A similar case is the following code:

 

char *p=new char[1023];

    p[-1]='c';

 

To debug it, we need to use /backwards switch.

 

Let’s perform other tests on above code. If we compile in debug mode, even with pageheap enabled, do them crash? Based on my test, they do not crash no matter what switch we use. Do you know why?

 

It is due to CRT debug heap. The debug version of CRT allocates extra memory for trace use at the end of normal block. The extra 1 byte overwriting just occurs on the extra memory, thus the crashes does not happen. This is a case that debug version does not really help debug.

 

Another sample is double free. Let’s check the following code:

 

    char *p=(char*)malloc(1023);

    free(p);

    free(p);

 

Then try to test with the following conditions:

 

1.      Disable pageheap, test debug build and release build.

2.      Enable pageheap, test debug build and release build.

 

You should observe different behaviors. What’s the reason?

 

It is also due to debug CRT version. When CRT detects double free, it uses own way to report.

 

 

Besides heap corruption, another issue is heap fragmentation.

 

Heap fragmentation is often caused by one of the following two reasons

 

1. Small heap memory blocks that are leaked (allocated but never freed) over time

2. Mixing long lived small allocations with short lived long allocations

 

Both of these reasons can prevent the NT heap manager from using free memory efficiently since they are spread as small fragments that cannot be used as a single large allocation

 

For detailed info, please refer to:

The Windows XP Low Fragmentation Heap Algorithm Feature Is Available for Windows 2000

http://support.microsoft.com/?id=816542

 

For a vivid analysis, please refer to:

 

.NET Memory usage - A restaurant analogy

http://blogs.msdn.com/tess/archive/2006/09/06/742568.aspx

 

 

Another important use of pageheap is memory allocation trace. When enables trace function, heap manager records the callstack when heap operation occurs. It allows us to find out the recent callstacks of the heap operation when debugging heap issue. Look at the following sample:

 

char * getmem()

{

    return new char[100];

}

 

void free1(char *p)

{

    delete p;

}

 

void free2(char *p)

{

    delete [] p;

}

 

int main(int, char*)

{

    char *c=getmem();

    free1(c);

    free2(c);

    return 0;

}

Enable pageheap with trace, run the application in windbg:

 

0:000> g

 

 

===========================================================

VERIFIER STOP 00000007: pid 0x1324: block already freed

 

  015B1000 : Heap handle

  003F5858 : Heap block

  00000064 : Block size

  00000000 :

===========================================================

 

(1324.538): Break instruction exception - code 80000003 (first chance)

eax=00000000 ebx=015b1001 ecx=7c81b863 edx=0012fa7f esi=00000064 edi=00000000

eip=7c822583 esp=0012fbe8 ebp=0012fbf4 iopl=0         nv up ei pl nz na pe nc

cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202

ntdll!DbgBreakPoint:

7c822583 cc               int     3

 

With pageheap enabled, when heap manager detects issue, it triggers break point exception to stop the debugger. It also dumps detailed information in debugger such as block already freed. With kb command, we can list the callstack when the second free occurs:

 

0:000> kb

ChildEBP RetAddr  Args to Child             

0012fbe4 7c85079b 015b1000 0012fc94 0012fc70 ntdll!DbgBreakPoint

0012fbf4 7c87204b 00000007 7c8722f8 015b1000 ntdll!RtlpPageHeapStop+0x72

0012fc70 7c873305 015b1000 00000004 003f5858 ntdll!RtlpDphReportCorruptedBlock+0x11e

0012fca0 7c8734c3 015b1000 003f0000 01001002 ntdll!RtlpDphNormalHeapFree+0x32

0012fcf8 7c8766b9 015b0000 01001002 003f5858 ntdll!RtlpDebugPageHeapFree+0x146

0012fd60 7c860386 015b0000 01001002 003f5858 ntdll!RtlDebugFreeHeap+0x1ed

0012fe38 7c81d77d 015b0000 01001002 003f5858 ntdll!RtlFreeHeapSlowly+0x37

0012ff1c 78134c3b 015b0000 01001002 003f5858 ntdll!RtlFreeHeap+0x11a

0012ff68 00401016 003f5858 003f5858 00000064 MSVCR80!free+0xcd

0012ff7c 00401198 00000001 003f57e8 003f3628 win32!main+0x16 [d:\xiongli\today\win32\win32\win32.cpp @ 77]

0012ffc0 77e523cd 00000000 00000000 7ffde000 win32!__tmainCRTStartup+0x10f

0012fff0 00000000 004012e1 00000000 78746341 kernel32!BaseProcessStart+0x23

 

The return address is 00401016 , thus Free occurs in the previous line of 00401016 . The problematic heap address is 0x3f5858 , with !heap command, we can get the saved callstack of the recent heap operation:

 

0:000> !heap -p -a 0x3f5858

    address 003f5858 found in

    _HEAP @ 3f0000

   in HEAP_ENTRY: Size : Prev Flags - UserPtr UserSize - state

        3f5830: 0014 : N/A  [N/A] - 3f5858 (70) - (free DelayedFree)

        Trace: 004f

        7c860386 ntdll!RtlFreeHeapSlowly+0x00000037

        7c81d77d ntdll!RtlFreeHeap+0x0000011a

        78134c3b MSVCR80!free+0x000000cd

        401010 win32!main+0x00000010

        77e523cd kernel32!BaseProcessStart+0x00000023

 

Based on above saved callstack, at the previous line of 0x401010, a Free call already occurred. 00401016 and 00401010 nears, let’s check what they are:

 

0:000> uf 00401010

win32!main [d:\xiongli\today\win32\win32\win32.cpp @ 74]:

   74 00401000 56               push    esi

   75 00401001 6a64             push    0x64

   75 00401003 e824000000       call    win32!operator new[] (0040102c)

   75 00401008 8bf0             mov     esi,eax

   76 0040100a 56               push    esi

   76 0040100b e828000000       call    win32!operator delete (00401038)

   77 00401010 56               push    esi

   77 00401011 e81c000000       call    win32!operator delete[] (00401032)

   77 00401016 83c40c           add     esp,0xc

   78 00401019 33c0             xor     eax,eax

   78 0040101b 5e               pop     esi

   79 0040101c c3               ret

 

Based on above information, the double free is due to a delete call and a delete[] call. The corresponding source is in line 74. We can also check the 0x3f5858 address:

 

0:000> dd 0x3f5848

003f5848  7c88c580 0025a5f0 00412920 dcbaaaa9

003f5858  f0f0f0f0 f0f0f0f0 f0f0f0f0 f0f0f0f0

003f5868  f0f0f0f0 f0f0f0f0 f0f0f0f0 f0f0f0f0

003f5878  f0f0f0f0 f0f0f0f0 f0f0f0f0 f0f0f0f0

003f5888  f0f0f0f0 f0f0f0f0 f0f0f0f0 f0f0f0f0

003f5898  f0f0f0f0 f0f0f0f0 f0f0f0f0 f0f0f0f0

003f58a8  f0f0f0f0 f0f0f0f0 f0f0f0f0 f0f0f0f0

003f58b8  f0f0f0f0 a0a0a0a0 a0a0a0a0 00000000

 

Here the dcba is a flag. The address before the flag is used to save the callstack:

 

0:000> dds 00412920

00412920  00000000

00412924  00000001

00412928  0005004f

0041292c  7c860386 ntdll!RtlFreeHeapSlowly+0x37

00412930  7c81d77d ntdll!RtlFreeHeap+0x11a

00412934  78134c3b MSVCR80!free+0xcd

00412938  00401010 win32!main+0x10

0041293c  77e523cd kernel32!BaseProcessStart+0x23

 

This flag is useful for troubleshoot memory leak. The leaked memory is usually allocated in the same place. When there are a lot of leaked memories, there should be a lot of leaked heap pointers. Since every heap pointer contains the flag, by searching the flag, we can get corresponding callstack. If some callstack occurs frequently, usually the callstack is related to the leaked memory. The real case is:

 

Why out of memory when only 300MB memory is allocated.

 

The malloc call fails in customer application when total memory occupation is only 300MB. By checking the dump file, the issue is caused by heap fragmentation.

 

After enabling pageheap and captured the dump again, I used the following command to search dcba flag:

 

0:044> s -w 0 L?60030000      0xdcba

00115e9e  dcba 0000 0000 ef98 0012 893d 0047 efc8  ..........=.G...

19b90fe6  dcba cfe8 02d8 afe8 2ca3 cfe8 02d8 b22a  .........,....*.

19b92fe6  dcba cfe8 1a52 8fe8 1dff cfe8 1af6 f44f  ....R.........O.

19b9cfce  dcba efd0 23d8 cfd0 1c58 8fd0 15ac c0c0  .....#..X.......

2b06efe6  dcba cfe8 02d8 8fe8 258b cfe8 02d8 a6d2  .........%......

2b074fce  dcba 2fd0 1c0f afd0 1c4d dfd0 0e69 c0c0  .../....M...i...

2e860fe6  dcba afe8 02d8 2fe8 2ef3 afe8 02d8 0a0b  ......./........

2e868fce  dcba afd0 0881 2fd0 2e92 afd0 0881 c0c0  ......./........

 

Based on the search result, I use the following command to print the callstack randomly:

 

0:044> dds poi(19b92fe6  -6)

005bba0c  005cbe90

005bba10  00031c49

005bba14  00122ddb

005bba18  77fa8468 ntdll!RtlpDebugPageHeapAllocate+0x2f7

005bba1c  77faa27a ntdll!RtlDebugAllocateHeap+0x2d

005bba20  77f60e22 ntdll!RtlAllocateHeapSlowly+0x41

005bba24  77f46f5c ntdll!RtlAllocateHeap+0xe3a

005bba28  0046b404 Customer_App+0x6b404

005bba2c  0046b426 Customer_App+0x6b426

005bba30  00427612 Customer_App+0x27612

 

0:044> dds poi(19b9cfce  -6)

005bba0c  005cbe90

005bba10  00031c49

005bba14  00122ddb

005bba18  77fa8468 ntdll!RtlpDebugPageHeapAllocate+0x2f7

005bba1c  77faa27a ntdll!RtlDebugAllocateHeap+0x2d

005bba20  77f60e22 ntdll!RtlAllocateHeapSlowly+0x41

005bba24  77f46f5c ntdll!RtlAllocateHeap+0xe3a

005b8024  0046b404 Customer_App+0x6b404

005b8028  0046b426 Customer_App+0x6b426

005b802c  00427a82 Customer_App+0x27a82

 

0:044> dds poi(2b06efe6  -6)

005bba0c  005cbe90

005bba10  00031c49

005bba14  00122ddb

005bba18  77fa8468 ntdll!RtlpDebugPageHeapAllocate+0x2f7

005bba1c  77faa27a ntdll!RtlDebugAllocateHeap+0x2d

005bba20  77f60e22 ntdll!RtlAllocateHeapSlowly+0x41

005bba24  77f46f5c ntdll!RtlAllocateHeap+0xe3a

005bd5d4  0046b404 Customer_App+0x6b404

005bd5d8  0046b426 Customer_App+0x6b426

005bd5dc  00427612 Customer_App+0x27612

 

In normal condition, the callstack to allow memory should be random. However, above analysis shows that most of the heap pointers are allocated by the same callstack. The callstack is likely the root cause. By matching with PDB, I got the function name, and the customer confirmed the leak in that function.

 

 

Posted by lixiong | 1 Comments

Exception and Debug Event, the feedback from OS

Exception and Debug Event, the feedback from OS

 

This section will firstly brief exception related tech, and then use example to demonstrate how to use exception to troubleshoot effectively.

 

Exception Brief

 

Exception is a mechanism to control code’s execution flow. In normal situation, the code executes consequently, like the following:

 

*p=11;

printf(“%d”,*p);

 

It should print 11. But how about if p points to an invalid memory address? Then the line to assign value to *p will trigger access violation exception, and the following line to print may not execute any more.

 

For applications, if the behavior does not follow the expectation, exception is likely a direct cause because this is the most obvious and common way that changes the execution flow. In most cases, troubleshooting problem is just the same meaning of troubleshooting exception.

 

In original Chinese version, I discuss how the OS plays an important role on exception handling and dispatching. Also I brief how different programming language leverages the SEH to support the exception handing mechanism. I will skip such introduction here because the following two articles cover them all:

 

A Crash Course on the Depths of Win32™ Structured Exception Handling

http://www.microsoft.com/msj/0197/Exception/Exception.aspx

 

RaiseException

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/base/raiseexception.asp

 

Case study, how to let C++ dump the callstack as the C#

 

For the application created by C# or Java, when exception occurs, they are able to dump the call stack where the exception comes from. However, for C++, we have to use debugger to get the callstack. Now the customer wants to achieve stack dump in C++. Any good idea?

 

My solution is to use SEH, due to that local variable’s destructor will be executed during stack unwind when exception occurs. The sample code worked fine in VC6+Win2k3 platform. However, when I retry the sample, the same code behaves strangely in VC2005 + Win2k3 SP1. If I compile in debug mode, it works fine. However, in release mode, the application quits silently. For the whole story, I saved in my MSN blog (English), please refer to:

 

SEH,DEP, Compiler,FS:[0] and PE format

http://eparg.spaces.msn.com/blog/cns!59BFC22C0E7E1A76!712.entry

 

 

Case study, Why Dr. Watson cannot save the dump file.

 

Problem Description:

 

The customer reports their VC application crashes randomly. To obtain detailed info, the customer registers Dr. Watson so that when exception occurs next time, we can get the dump file. However, when the problem reoccurs, Dr. Watson saves nothing.

 

Background Info:

 

In Chinese version, I provided brief info about what dump file is, and what info we can find in dump. Related info can be found at:

 

Description of the Dr. Watson for Windows (Drwtsn32.exe) Tool

http://support.microsoft.com/?id=308538

 

Specifying the Debugger for Unhandled User Mode Exceptions

http://support.microsoft.com/?id=121434

 

INFO: Choosing the Debugger That the System Will Spawn

http://support.microsoft.com/?id=103861

 

Generally speaking, by setting the AeDebug registry key, we can lunch the debugger when application crashes. If we choose Dr. Watson as the debugger, the default behavior is generating the dump file.

 

Problem Analysis:

 

Back to the case, the customer fails get the dump file, possible causes:

 

1.      The Dr. Watson’s bug. It works abnormally.

2.      The customer’s application does not crash, it just exits like calling ExitProcess.

 

To perform test against point 1, I provided the following sample code for testing:

 

int *p=0;

    *p=0;

 

With above code, Dr. Watson captured the dump file successfully on the customer side. So Dr. Watson works fine. It seems that the crash exclaimed by the customer is not really caused by unhandled exception. Maybe the customer calls ExitProcess unexpectedly. Thus during information capturing, we should not limited in unhandled exception. What we need to check is how the process disappears, maybe normal quit, maybe unhandled exception.

 

One possible way to figure out is to run the application in windbg. However, manual operation is troublesome. It would be nice if there is some automatic way. Windows provides a registry, which allows an application starts under debugger. With this setting, when the specified process starts, OS starts the debugger firstly, and pass in the target process and command line to debugger, then debugger starts the target process to debug. This option is very useful especially when we cannot start the process manually, like Windows Service, which starts ahead the user logon:

 

How to debug Windows services

http://support.microsoft.com/?kbid=824344

 

Some malicious software uses this way to attach silent process. This method is also called IFEO (Image File Execution Option) hijacking in China.

 

In windbg folder, there is a script called adplus.vbs. We can use it to launch windbg to obtain the dump file. Here we will use the script:

 

How to use ADPlus to troubleshoot "hangs" and "crashes"

http://support.microsoft.com/kb/286350/EN-US/

 

Use adplus /? to obtain detailed info.

 

The Actions:

 

With above analysis, the detailed actions are:

 

1.      In customer’s machine, create the key named by the problematic process under Image File Execution Options

2.      Under the key, create a string value called Debugger.

3.      Set the value to Debugger= C:\Debuggers\autodump.bat

4.      Edit the C:\Debuggers\autodump.bat as the following:

 

cscript.exe C:\Debuggers\adplus.vbs -crash -o C:\dumps -quiet -sc %1

 

Based on above setting, when the application starts, the OS launches cscript.exe to execute the adplus.vbs script. The –sc switch in adplus.vbs specify the target process path, -crash means we will monitor for application’s quit, -o specifies the dump output folder, -quiet disables prompt. We can use notepad.exe as test to check if dump is generated when notepad.exe quits.

 

Based on above setting, when the problem reoccurs, we get two dump files in c:\dumps folder, called:

 

PID-0__Spawned0__1st_chance_Process_Shut_Down__full_178C_DateTime_0928.dmp

PID-0__Spawned0__2nd_chance_CPlusPlusEH__full_178C_2006-06-21_DateTime_0928.dmp

 

Pay attention to the second filename. The name indicates the 2nd chance C++ exception does happen. Open the dump in windbg, check the callstack, it shows that the customer throws some C++ exception in code, but forgets to capture that. By adding corresponding catch block, the issue gets fixed.

 

The solution is nice, but why Dr. Watson cannot get the dump?

 

The Dr. Watson’s behavior still confuses me. Since it is unhandled exception, why Dr.Watson cannot capture the dump file? Firstly I created two different applications to double verify the behavior of Dr. Watson:

 

int _tmain(int argc, _TCHAR* argv[])

{

    throw 1;

    return 0;

}

int _tmain(int argc, _TCHAR* argv[])

{

    int *p=0;

    *p=0;

    return 0;

}

 

For the first one, Dr. Watson does not save the dump. For the second, Dr. Watson saves the dump. It looks like the behavior is related to the exception type.

 

Recall the detailed crash behavior for above two applications when the Auto key is set to 0 under AeDebug. On my side, the message boxes for crash are:

 

 

---------------------------

Microsoft Visual C++ Debug Library

---------------------------

Debug Error!

 

Program: d:\xiongli\today\exceptioninject\debug\exceptioninject.exe

 

This application has requested the Runtime to terminate it in an unusual way.

Please contact the application's support team for more information.

(Press Retry to debug the application)

---------------------------

Abort   Retry   Ignore  

---------------------------

 

 

---------------------------

exceptioninject.exe - Application Error

---------------------------

The instruction at "0x00411908" referenced memory at "0x00000000". The memory could not be "written".

 

Click on OK to terminate the program

Click on CANCEL to debug the program

---------------------------

OK   Cancel  

---------------------------

 

The behaviors are totally different! And the behavior is related to the compilation mode.

 

SetUnhandledExceptionFilter API is used to modify the default unhandled exception handler. Here, when C++ initialize the CRT, it passes CRT’s implementation (msvcrt!CxxUnhandledExceptionFilter). When unhandled exception occurs, the function checks the exception code. If it is a C++ exception, it shows up the first dialog, otherwise it bypass it to the default handler (ernel32!UnhandledExceptionFilter) provided by the OS. For the 1st situation, the callstack is:

 

USER32!MessageBoxA

MSVCR80D!__crtMessageBoxA

MSVCR80D!__crtMessageWindowA

MSVCR80D!_VCrtDbgReportA

MSVCR80D!_CrtDbgReportV

MSVCR80D!_CrtDbgReport

MSVCR80D!_NMSG_WRITE

MSVCR80D!abort

MSVCR80D!terminate

MSVCR80D!__CxxUnhandledExceptionFilter

kernel32!UnhandledExceptionFilter

MSVCR80D!_XcptFilter

 

For the second, it is

 

ntdll!KiFastSystemCallRet

ntdll!ZwRaiseHardError+0xc

kernel32!UnhandledExceptionFilter+0x4b4

release_crash!_XcptFilter+0x2e

release_crash!mainCRTStartup+0x1aa

release_crash!_except_handler3+0x61

ntdll!ExecuteHandler2+0x26

ntdll!ExecuteHandler+0x24

ntdll!KiUserExceptionDispatcher+0xe

release_crash!main+0x28

release_crash!mainCRTStartup+0x170

kernel32!BaseProcessStart+0x23

 

For detailed info, please refer to:

 

SetUnhandledExceptionFilter

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/base/setunhandledexceptionfilter.asp

 

UnhandledExceptionFilter

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/debug/base/unhandledexceptionfilter.asp

 

Does above analysis help explain the Dr. Watson’s behavior? To be honest, I do not think so. I think it is due to Dr. Watons’s special handling on different exception types. The detailed research can be found at:

 

http://eparg.spaces.msn.com/blog/cns!59BFC22C0E7E1A76!1213.entry

 

Debug Event – communication between the OS and the debugger

 

Notification, also called Debug Event, it is a mechanism for OS to notify debugger when some thing happens. Similar as exception handing, OS dispatches the notification when some thing happens if the debugger is attached. Unlike exception, the notification can only be monitored by the debugger, not the target process. Also, there is no 1st chance and 2nd chance differences. In windbg’s help file, all the notifications are listed in the Controlling Exceptions and Events topic. Common notifications are DLL loading, unloading, thread creation and existing.

 

With exception and notification, we can capture the key for issue.

 

Case study, VB6’s version.

 

Customer’s VB6 application is not able to open data file created by Access 2003 in developer machine. It works fine for data created by Access 97. In other machines, both Access 2003 and 97 work fine.

 

They way to think is direct. Since it occurs in a specified machine, it means the issue is about the environment, not the code. Since it is about Access version, it should be related to the DAO’s version. By checking the modules loaded by the EXE, I found dao350.dll was loaded instead of dao360.dll. The next step is to figure out why dao350.dll gets loaded instead of dao360.

 

DAO is a COM component. It is likely created by COM API. A simple way is to trace the execution of the COM API like CoCreateInstanceEx with wt command, like I did in ShellExecute case. However, if we really try that, the wt command may execute for a whole day. It would be better if we can find a more workable way. Since we will trace until the library loading, why not set breakpoint at LoadLibrary to check how the dao350.dll gets loaded?

 

It is a very good way to set breakpoint on LoadLibrary because:

 

1.      DLL loading is not necessary through LoadLibrary. Native API like ntdll!LdrLoadDLL may load the module directly.

2.      If there are hundreds of DLLs to be loaded, breaking into LoadLibrary is troublesome, even if we can set conditional breakpoint to filder.

 

The better way is to leverage notification. During module load, OS sends notification to the debugger. In Windbg, we can use wide char to match and filter the DLL filename. It is easy to operate. Firstly, use “sxe ld:dao*.dll” command intercept the module load notification. When the filename is dao*.dll, the debugger breaks. (For windbg detailed usage, we will cover in next sections). The result in debugger is:

 

0:008> sxe ld:dao*.dll

 

ModLoad: 1b740000 1b7c8000   C:\Program Files\Common Files\Microsoft Shared\DAO\DAO360.DLL

eax=00000001 ebx=00000000 ecx=0013e301 edx=00000000 esi=7ffdf000 edi=20000000

eip=7c82ed54 esp=0013e300 ebp=0013e344 iopl=0         nv up ei pl zr na po nc

cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000246

ntdll!KiFastSystemCallRet:

7c82ed54 c3               ret

 

ntdll!KiFastSystemCallRet                            

ntdll!NtMapViewOfSection                             

ntdll!LdrpMapViewOfDllSection                        

ntdll!LdrpMapDll                                     

ntdll!LdrpLoadDll                                     

ntdll!LdrLoadDll                                     

0013e9c4 776ab4d0 0013ea40 00000000 00000008 kernel32!LoadLibraryExW

ole32!CClassCache::CDllPathEntry::LoadDll            

ole32!CClassCache::CDllPathEntry::Create_rl          

ole32!CClassCache::CClassEntry::CreateDllClassEntry_rl

ole32!CClassCache::GetClassObjectActivator           

ole32!CClassCache::GetClassObject                    

ole32!CServerContextActivator::GetClassObject        

ole32!ActivationPropertiesIn::DelegateGetClassObject 

ole32!CApartmentActivator::GetClassObject            

ole32!CProcessActivator::GCOCallback                 

ole32!CProcessActivator::AttemptActivation           

ole32!CProcessActivator::ActivateByContext           

ole32!CProcessActivator::GetClassObject              

ole32!ActivationPropertiesIn::DelegateGetClassObject 

ole32!CClientContextActivator::GetClassObject        

ole32!ActivationPropertiesIn::DelegateGetClassObject 

ole32!ICoGetClassObject                              

ole32!CComActivator::DoGetClassObject                

ole32!CoGetClassObject                               

VB6!VBCoGetClassObject                               

VB6!_DBErrCreateDao36DBEngine

 

By checking the parameter of the LoadLibraryExW, it shows

 

0:000> du 0013ea40

0013ea40  "C:\Program Files\Common Files\Mi"

0013ea80  "crosoft Shared\DAO\DAO360.DLL"

 

 

With above information, we see:

 

1.      DAO360 is not created by CoCreateInstanceEx. Instead it is created by CoGetClassObject. If we trace CoCreateInstanceEx, it wastes time.

2.      COM invocation starts from VB6!_DBErrCreateDao36DBEngine function. We should check the function in detail.

 

With previous DLL hell’s lesson, here the first thing is to check VB6.EXE’s version since the function resides in VB6. Compared with normal condition, the workable module version is 6.00.9782, while the problematic one is 6.00.8176. By installation of VS6 SP6, the issue gets fixed.

 

Discussions:

 

(In Chinese version, I discussed how to analysis the dump even if the dump is not captured at the first place when exception happened. I have to skip here.)

 

 

Exit proactively for unhandled exception

 

In some situation, the developer exits the application proactively when unhandled exception occurs, instead of waiting for the OS to terminate it. COM+, ASP.NET use this kind of tech. A Chinese C2C software called taobao wangwang (also named ali wangwang) uses this kind of tech too. The benefits are:

 

1.      We can define the UI for the crash.

2.      We can save the unhandled exception info for postpone analysis.

3.      To avoid the interference of the debugger, guarantee the immediate recycle, and try the necessary rescue operation like restarting the process.

 

It is easy to implement. One way is to use the __try and __except clause. The other way is to use SetUnhandledExceptionFilter API. For the study of taobao wangwang, please refer to:

 

http://eparg.spaces.msn.com/blog/cns!59BFC22C0E7E1A76!817.entry (Chinese)

 

Based on my analysis, taobao uses SetUnhandledExceptionFilter to capture unhandled exception, and use MiniDumpWriteDump API to capture the dump proactively.

 

With this tech, the debugger is hard to get the dump for crash directly. Some additional configuration and windbg command is necessary

 

How To Obtain a Userdump When COM+ Failfasts

http://support.microsoft.com/?id=287643

 

How to find the faulting stack in a process dump file that COM+ obtains

http://support.microsoft.com/?id=317317

 

How to troubleshoot UnhandledExceptionFilter

 

Based on MSDN, UnahandledExceptionFilter will be invoked only if the debugger is not attached. Thus we can use UnahandledExceptionFilter to bypass the trace of debugger, to protect some sensitive code. To avoid debugger’s check, there are two ways at least:

 

1.      The target uses IsDebuggerPresent API to check if the debugger is attached. If so, it refuses to execute the sensitive code.

2.      Put the sensitive code to a function, and register the function as UnHandledExceptionFilter. To execute the sensitive code, just trigger an exception manually. Due to the design of exception handling, it avoids the debugger’s trace.

 

For the first way is easy to by pass. Look at the implementation of IsDebuggerPresent:

 

:000> uf kernel32!IsDebuggerPresent

kernel32!IsDebuggerPresent:

  281 77e64860 64a118000000     mov     eax,fs:[00000018]

  282 77e64866 8b4030           mov     eax,[eax+0x30]

  282 77e64869 0fb64002         movzx   eax,byte ptr [eax+0x2]

  283 77e6486d c3               ret

 

IsDebuggerPresent checks the flag in FS register. (FS:[18]]:30 saves PEB of current process). In debugger, we can change any of the register easily. Here we just need to change value of [[FS:[18]]:30]:2 to 0 to cheat IsDebuggerPresent to return false.

 

For the second way, changing [[FS:[18]]:30]:2 does not work because the judgment is based on the result of a kernel call. However, it does not mean impossible. Kwan Kyun Kim provides a way to cheat:

 

How to debug UnhandleExceptionHandler

http://eparg.spaces.msn.com/blog/cns!59BFC22C0E7E1A76!1208.entry

 

Next I will discuss memory, including Heap, Stack, and the lovely heap corruption and pageheap.

 

 

Posted by lixiong | 1 Comments

Chapter Two, Assemnly code, Exception, Memory, Synchronization and Debugger

Chapter Two, Assemnly code, Exception, Memory, Synchronization and Debugger

 

(To save time, I will not translate the full Chinese content. I will skip some knowledge introduction)

 

This chapter covers usermode debugging related knowledge and tools, including assembly language, exception, memory layout, CRT, handle/CriticalSection/Thread Switch/Windbg…

 

It focuses on how to use the knowledge and tools for effective debugging. For the knowledge itself, the following two books cover them in detail:

 

Programming Applications for Microsoft Windows

Debugging Applications for Windows

 

The following sections use windbg to demo the debugging. If you want to setup windbg to follow the demo, please go to Section 4, which discusses windbg.

 

Section 1:Assembly code, the minimal unit for CPU execution

 

Assembly code is the minimal unit for CPU execution. In following situations, assembly code analysis is necessary:

 

1.      There is no problem in uplevel code, like C++ and C#. However, the execution result is incorrect. The compiler and CPU are suspicious.

2.      When the source code is out of hand. Like a Windows API behaves strangely.

3.      When application crashes, like Access Violation. The first hand information in debugger is the assembly op code that accesses memory.

 

Intel Architecture Manual volume 1,2,3

http://www.intel.com/design/pentium4/manuals/index_new.htm

 

Case Study, Analysis the VC compiler optimization

 

Problem Description:

 

The customer is developing a performance sensitive application. He wants to know how the VC compiler optimizes the following code:

 

nt hgt=4;

         int wid=7;

         for (i=0; i<hgt; i++)

             for (j=0; j<wid; j++)

                  A[i*wid+j] = exp(-(i*i+j*j));

 

The direct way is to check the binary code generated by the compiler. Here is my analysis. You may check on your side first, and compare with mine.

 

My analysis:

 

My analysis is based on VC6, default release mode settings. (The customer was using VC6 at that time. Now VC6 is out of the product life cycle and Microsoft does not support it any more).

 

int hgt=4;

int wid=7;

24:       for (i=0; i<hgt; i++)

0040107A   xor         ebp,ebp

0040107C   lea         edi,[esp+10h]

25:        for (j=0; j<wid; j++)

26:            A[i*wid+j] = exp(-(i*i+j*j));

00401080   mov         ebx,ebp

00401082   xor         esi,esi

// The result of i*i is saved in ebx

00401084   imul        ebx,ebp

00401087   mov         eax,esi

// Only one imul occurs in every inner loop (j*j)

00401089   imul        eax,esi    

// Use the saved i*i in ebx directly. !!Optimized!!

0040108C   add         eax,ebx 

0040108E   neg         eax              

00401090   push        eax

00401091   call        @ILT+0(exp) (00401005)

00401096   add         esp,4                   

// Save the result back to A[]. The addr of current offset in A[] is saved in edi

00401099   mov         dword ptr [edi],eax    

0040109B   inc         esi

// Simply add edi by 4. Does not calculate with i*wid. Imul is never used. !!Optimized!!

0040109C   add         edi,4

0040109F   cmp         esi,7

004010A2   jl          main+17h (00401087)

004010A4   inc         ebp

004010A5   cmp         ebp,4

004010A8   jl          main+10h (00401080)

 

The optimization is:

 

1.      Since i*i result does not change in inner loop, the compiler caches the result of i*i in ebx register.

2.      For A[i*wid+j] data fetch, only j changes in inner loop, and j increases one each time. Since A is an int array, the next fetch address increases 1*sizeof(int), which is 4. The compiler caches i*wid+j into EDI, and uses “add edi,4” to optimize the fetch address calculation.

3.      The compiler saves temp variable in register to avoid memory access.

 

Can you do a better job than the compiler by writing the assembly code manually?

 

Case Study: VC2003 compiler’s bug. C++ application works in debug, crashes in release

 

The compiler’s bug does exist. If you test the following code in VS2003, it crashes in release mode:

 

// The following code crashes/abnormal in release build when "whole program optimizations /GL"

// is set. The bug is fixed in VS2005

 

#include <string>

#pragma warning( push )

#pragma warning( disable : 4702 )  // unreachable code in <vector>

#include <vector>

#pragma warning( pop )

#include <algorithm>

#include <iostream>

 

//vcsig

// T = float, U = std::cstring

template <typename T, typename U>  T func_template( const U & u )

{

    std::cout<<u<<std::endl;

    const char* str=u.c_str();

    printf(str);

    return static_cast<T>(0);

}

 

void crash_in_release()

{

    std::vector<std::string>  vStr;

 

    vStr.push_back("1.0");

    vStr.push_back("0.0");

    vStr.push_back("4.4");

 

    std::vector<float>    vDest( vStr.size(), 0.0 );

 

    std::vector<std::string>::iterator _First=vStr.begin();

    std::vector<std::string>::iterator _Last=vStr.end();

    std::vector<float>::iterator _Dest=vDest.begin();

 

    std::transform( _First,_Last,_Dest, func_template<float,std::string> );  

 

     _First=vStr.begin();

     _Last=vStr.end();

     _Dest=vDest.begin();

 

 

    for (; _First != _Last; ++_First, ++_Dest)

         *_Dest =  func_template<float,std::string>(*_First);

}

 

 

int main(int, char*)

{

    getchar();

    crash_in_release();

    return 0;

}

 

The compiler settings are:

1.      Disable precompiler header.

2.      Use /O2 /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_MBCS" /FD /EHsc /ML /GS /Fo"Release/" /Fd"Release/vc70.pdb" /W4 /nologo /c /Wp64 /Zi /TP

 

Trace assembly execution:

 

Based on bellow initial analysis, the problem is likely a compiler bug:

 

1.      Besides cout and printf, there is no system related API call any more. All the operations are based on registers and memory. The environment and system factor are not involved here.

2.      The code looks good. If we adjust the place of std::transform, to put it behind the for loop, the problem does not occur any more.

3.      The compiler setting matters.

 

In the code, both std::transfor and the for loop do the same vector translation by calling func_template. Comparing the difference may help. I set breakpoint in the entry of main, and use ctrl_alt_D to go into assembly code mode. Trace the execution step by step, found:

 

In the implementation of STL in VS2003, the std::transform uses the following code to invoke the translator function:

 

*_Dest = _Func(*_First);

The compiler handles the call with the following:

 

EAX = 0012FEA8 EBX = 0037138C ECX = 003712BC EDX = 00371338 ESI = 00371338 EDI = 003712B0 EIP = 00402228 ESP = 0012FE70 EBP = 0012FEA8 EFL = 00000297

   388:            *_Dest = _Func(*_First);

00402228 push        esi 

00402229 call        dword ptr [esp+28h]

0040222D fstp        dword ptr [edi]

 

The parameter is saved in ESI, which is used to call func_template. As above, std::transform passes the parameter to func_template by pushing into stack.

 

While in the for loop, the compiler handles the
 
*_Dest =  func_template<float,std::string>(*_First);

With the following

 

EAX = 003712B0 EBX = 00371338 ECX = 003712BC EDX = 00000000 ESI = 00371338 EDI = 0037138C EIP = 00401242 ESP = 0012FE98 EBP = 003712B0 EFL = 00000297

    37:              *_Dest =  func_template<float,std::string>(*_First);

00401240 mov         ebx,esi

00401242 call        func_template <float,std::basic_string<char,std::char_traits<char>,std::allocator<char> > > (4021A0h)

00401247 fstp        dword ptr [ebp]

 

As above, in for loop, the parameter uses mov opcode to save the parameter into ebx, and then passes to func_template.

 

At last, let’s check how func_template handles the parameter passed in:

 

004021A0 push        esi 

004021A1 push        edi 

    16:      std::cout<<u<<std::endl;

004021A2 push        ebx 

004021A3 push        offset std::cout (414170h)

004021A8 call        std::operator<<<char,std::char_traits<char>,std::allocator<char> > (402280h)

 

Here it pushes ebx into stack, and then invoke std::cout, without reading the stack for parameter passed in. It means func_template, the callee believes the parameter should be passed from register. However, in transform function, the caller, the parapeter is passed by stack. This mismatched handling causes the crashes.

 

But why the problem only occurs in release mode, not debug mode? You can analysis by the same way.

 

Case Study, how the notorious DLL Hell causes Server Unavailable for ASP.NET

 

The ASP.NET web site reports Server Unavailable for any page request. When the w3wp.exe host process starts, it crashes immediately when handling the request. With further analysis in debugger, the source for the crashes is a null pointer access. However, from the callstack, every module forms from w3wp.exe and .NET Framework. It does not reach the customer’s code. By checking with relative code, the null pointer is passed from the caller to the callee. The problem occurs when the callee uses the parameter. Obviously, next we should check why the caller does not pass correct parameter to the callee.

 

The strange thing is, the pointer in caller’s function is initialized already. It is a valid pointer in the caller just before the call function. It is also pushed to the stack correctly. Why the caller gets an invalid pointer by fetching it from the stack? With single step trace with the execution, the problem is, the caller puts the parameter in callee’s [ebp+8], while the callee fetches the parameter from [ebp+c]. Is it similar with the previous case? The murderer is not compiler this time, it is the DLL version. The caller and callee resides in two different DLLs. The caller’s DLL version is .NET Framework 1.1 version, while the callee’s DLL version is .NET Framework 1.1 SP1. In CLR1.1, the callee accepts 4 parameters, while the new version changes it to accept 5 parameters. The caller passes the parameter as the old style, while the callee fetches it with new style. A typical DLL hell problem. After reinstalling the CLR 1.1 SP1 to match the DLL version, and reboot, it works fine.

 

Why the DLL files versions is different, possible causes:

 

1.      After the .NET Framework 1.1 SP1 is installed, the customer does not reboot it. Some DLL in use is locked and pending to update after the reboot.

2.      The web server is a node of Application Center cluster. The cluster sever will sync the file versions in all the nodes. If we just apply .NET Framework 1.1 SP1 in a single node, the file version may be rolled back by the application center:

 

PRB: Application Center Cluster Members Are Automatically Synchronized After Rebooting
http://support.microsoft.com/kb/282278/en-us

 

Discussion:

 

Release compilation is always faster than debugging?

 

Run the following code in release and debug mode, compare the performance. The debug version will be faster than release version. Why?

 

long nSize = 200;

         char* pSource = (char *)malloc(nSize+1);

         char* pDest = (char *)malloc(nSize+1);

         memset(pSource, 'a', nSize);

         pSource[nSize] = '\0';

         DWORD dwStart = GetTickCount();

         for(int i=0; i<5000000; i++)

         {

             strcpy(pDest, pSource);

         }

         DWORD dwEnd = GetTickCount();

         printf("%d", dwEnd-dwStart);

 

Write your own strcpy function, compare the performance with the default version shipped in VS. Can you do faster than the default version?

 

In above samples, the decisive factors are:

 

1.      In 32 bits chips, try to move a DWORD once instead of 4 times byte move. Pay attention to the align of the 4bytes boundary.

2.      Here the strcpy is called many times. Using inline version reduces the high cost of the call opcode.

 

 Next I will discuss section2, exception handling.

 

Posted by lixiong | 0 Comments

How the Chinese UI in SharePoint turns to English randomly

How the Chinese UI in SharePoint turns to English randomly.

 

Problem Description:

 

Windows SharePoint Portal is a web application that runs on .NET Framework, based on ASP.NET. The Administrator is able to configure the UI language, like English or Chinese. Someday a customer complained that their SharePoint could not show Chinese any more. The detailed behaviors were:

 

1.      The configuration is set to Chinese.

2.      The application works fine in the past year.

3.      In most cases, the Chinese UI shows fine.

4.      Occasionally, the UI shows in English.

5.      It will turn back to Chinese automatically after some time.

6.      If we restart IIS service when the UI is in English, it may turn back immediatlly.

 

Troubleshooting:

 

After basic check, the configurations are fine. With some initial debugging I found:

 

1.      When the problem occurs, the customer’s worker thread UI culture is zh-cn, which is under the expectation.

2.      The implementation for the UI language is based on the ResourceManager.GetString method:
ResourceManager.GetString
http://msdn2.microsoft.com/en-us/system.resources.resourcemanager.getstring.aspx

3.      The satellites assemblies for UI are installed correctly.

 

Based on above information, the GetString CLR API call should return the Chinese resource. However, when the problem occurs, it returns English resource. Thus, the problem turns out to be, why the CLR API call behaves incorrectly. The difficult part is:

 

1.      We cannot reproduce the problem simplify and consistently.

2.      The problem occurs in the customer’s production server only. In test server, it does not occur. We need remote debugging on the production server.

3.      The problem is caused by the CLR code, not the customer’s code. We cannot add trace to save execution log to check like what we did in the 3rd case.

 

At the same time, there were several similar issues reported in Europe. In Japan, it was more sericious and more than 10 sites got the same issue without solution for several weeks.

 

Since the only clue is GetString method, the troubleshooting plan is straight forward: trace the execution of the GetString method. Before jumping into remote debugging, firstly I performed research on the logic of GetString method:

 

1.      When the UI resources are loaded from satellites assemblies, they are saved in an internal HashTable.

2.      The GetString checks if the resource requested is in the HashTable, if it is, go to step 5.

3.      If it is not there, GetString asks ResourceManager to load the resource from satellites assemblies based on current thread UI Culture

4.      ResourceManager reads the resources from satellites assemblies and saved into the HashTable

5.      GetString returns the resource from the HashTable for the request

6.      If any problem occurs in above steps, like Satellites assemblies were not found, it falls back to use UI neutral (English by default) resource.

7.      When the HashTable is initialized and resources are loaded, the ResourceManager is out of the story. GetString always returns the resource found in HashTable directly.

 

Based on above logic, it is likely that the ResourceManager fails to load the satellite assemblies. So it uses neutral resource instead. Based on this info, I used Filmon to check the file operations on the satellites files. However, Filemon shows all the file operations on resource file succeeded.

 

Since I tried everything I could think of, I had to use remote debugging to check. After the communication with customer, they allowed me to connect to the production server between 10 pm to 8 am in the next morning. To avoid the obstruction from client request and our test, I used the following steps:

 

1.      In IIS setting, I restrict the client IP for our test client only.

2.      Set the timeout to infinite to avoid timeout when debugger hangs the server.

3.      Restart IIS every time before debugging.

4.      Use the client browser to refresh the page, and trace the execution in debugger.

 

IP restriction is used to avoid obstruction from other clients. Restarting IIS is necessary because the HashTable initializes only once per life time. After several days hard working, found:

 

1.      It fails when loading the satellites assemblies file.

2.      During the satellites loading, there are two first chance CLR exception.

3.      When the problem occurs, there are some string codepath executed. In local test, we do not see such codepath.

 

Based on above results, I verified the code in detail specifically. Combing with the wt command and exception monitoring, I captured some dump files during the strange codepath. After research, I found an important information: the customer installed .NET Framework 1.1 SP1. These codepath were new in SP1.

 

The full story is the following. In CLR 1.1, ResourceManager starts to probe the satellites assembly after checking the UI culture for current thread. In SP1, a new feature is added. Besides the UI culture setting, we can also use configuration file to guide the probing as well. For ASP.NET and Sharing Point, the configuration file is web.config. However, the web.config file is very important, and normal anonymous web user does not get the privilege to open it. When the code tries to check the settings in web.config, if the access denied, an CLR exception will be thrown. Due to the exception, the ResourceManager believes that something goes wrong in resource probing, and it falls back immediately to use neutral resource, hence the English shows instead of Chinese.

 

In other prospective, the developer did not expect that most users are now allowed to tough the application configuration file. (web.config here) in web applications. The web situation is not handled specially, thus it breaks a lot of web applications.

 

After the user restarts IIS, if the first request comes from an authorized user, who has the privilege to access the web.config file, the Chinese resource will be loaded fine and the problem does not occur. If the first uses cannot open the web.config file, the problem happens, and it keeps there until IIS restarting because the HashTable only initializes once per life. For the problem, the workaround is to set web.config for Everyone Read Allow. This workaround solves the SPS urgent issue globally.

 

Modifying the access privilege for web.config may cause security issue. Asking the developer to fix and applying a hotfix is the final solution:

 

FIX: Your application cannot load resources from a satellite assembly if the impersonated user account does not have permissions to access the application .config file in the .NET Framework 1.1 Service Pack 1

http://support.microsoft.com/?id=894092

 

Lesson:

 

I reviewed the troubleshooting history after the solution. In fact, I was quite near to the solution in my initial checks. I used Filemon to check the satellite assemblies file operations, but I did not go further to check for the other important files. I opened the Filemon result, search Access Denied, got more than 30 plances, all in web.config file. Meanwhile, I should have awarded that some recent released patch was suspicious because the problem did not occur in past 1 year, but flooded in several weeks globally.

 

Clear thought, good analysis and effective debugging is very good. However, if we can think out of the box sometime, we can do better.

 

End for the chapter.

 

The 1st chapter demonstrated 4 cases. When faceing with a problem next, if you think a bit more before action, you got what I want to say. There is no troubleshooting template. Instead, it matters about the way of think, how you use your brain, and some of the luck.

 

Next chapter, I will start to discuss some important knowledge for effective troubleshooting, including exception handling, memory, windbg, and……

 

 

Posted by lixiong | 1 Comments

A boring, but typical, difficult case ---- ASP.NET session lost

A boring, but typical, difficult case, ASP.NET session lost

 

Problem Description

 

It was an large ASP.NET project. In testing environment, everything is fine. In production environment, when the load went up, an unhandled NullReferenceException occurred randomly. Based on the analysis on code, it occurred when trying to access an Session Object, which should be already set somewhere. The issue only occurred 3 times in the past half year, and in 3 different pages.

 

The Strategy

 

If the problem occurs a bit frequently, we may get a chance to gather some common pattern, like some special user action, or special data input. But it occurred only 3 times, almost no chance for further test. We must make very comprehensive and solid action plan so that we captured enough info when the problem occurs again.

 

The way to think is quite straight forward. Learn the detail about session. Summarize all the possibilities and capture info accordingly.

 

For the Session in detail, please refer to:

 

Underpinnings of the Session State Implementation in ASP.NET

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnaspp/html/ASPNetSessionState.asp

 

With full understanding of ASP.NET session, the issue should fall into one of the following possibilities:

 

1.      It is a load-balance environment, but we are using In-Proc session, or the machine-keys are not synced in these services. The customer confirmed there was only one server, with In-Proc session. Thus this is not the case.

2.      The simplest situation is that all the users’ session get lost when the problem happens. Such situation is usually caused by appDomain recycle or IIS crashes. We can check performance monitor (appDomain Restart counter) or Event log (for crash) to check.

3.      It is a bit difficult if the problem only impacts some user(s), not all. To verify with log, in Session_Start event, we can save the session creation time and current session id into a local log file. Meanwhile, we will save another test value into the session. When the problem occurs, in the ASP.NET global error handling function, we get the session id for current user, and compare with the local log. If the session is just created, it is likely a client side problem like cookie lost.

4.      If the session is not just created, then let’s check if the test value we saved inSession_Start is lost either. If the test value gets lost as well, we are sure that some code in server side clears the whole session, probably Session.Clear.

5.      If the session is not just created, and the test value is still there, the issue should be caused by the last modification on the null session value which causes the NullReferenceException. In such situation, we have to log all the Session operations on that session object in server side to trace.

 

As above, the different behaviors are tightly bound to different potential causes. As a summary, we need to figure which situation below it belongs to:

 

1.      All the session values get lost for all the users.

2.      All the session values get lost for a specified user.

3.      Some session values get lost for a specified user, while other session values are still there.

 

Then we can decide the next step for further narrow down, like:

 

1.      Check appDomain Recycle or Crash for situation 1.

2.      Checking session creation time for situation 2.

3.      Search Session.Clear function call in all the code, or use debugger to set condition breakpoint during execution for situation 3.

 

Action and Root Cause

 

Based on above analysis, we started working on the code to implement the log mechanism. The customer also added further log to trace some important function invocations.

 

After one week waiting, the problem gets reproduced. In log file, we found:

 

1.      Some session values get lost for a specified user, while other session values are still there

2.      Based on the session creation time, the session was alive for a long while.

3.      From the execution trace in log file, the problematic session value is indeed cleared by server side code. The design is, after a workflow cycle, sessions are cleared in a central place. After that, the page will be redirected to a special page for initialization job for next workflow cycle. The log shows that, there is only clearance job to clear the session, no initialization job to reset the session.

 

With above information, we started careful code review for the joint place. I found the redirection is performed by client side javascript. It is a typical fault for web application. Since javascript depends on client browser behavior very much, it is better to use http 302 for redirection instead of client script. It is also a must to use server side code to verify everything that relies on client.

 

HTTP Status code definition

http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

 

With above change, the problem gets resolved. Session lost is not an interesting case, but it is very typical:

 

1.      The issue behaves simple, straight forward, but hard to reproduce.

2.      The root cause resides quite earlier than the problem behavior. We cannot get very useful info if we only check when the problem appears out. We have to collection information proactively.

3.      Fancy conditional breakpoint in windbg are cool, but log is the last reliable friend for really tough cases. Deep understanding of the product and carefully analysis on the problem help effective log.

 

 

Next I will discuss, How the Chinese UI in SharePoint turns to English randomly.

 

Posted by lixiong | 0 Comments

How a single ShellExecute call opens a TXT file, with another GIF open unexpectedly

Problem Description:

 

A customer called in, complained angrily. “A call to ShellExecute API, passing in a local txt file path, it opens a GIF file occasionally with the txt file! Definitely!” The customer was sure that the parameter was correct, and the return value for ShellExecute indicated the call succeeded.

 

Initial thinking

 

I thought it for 2 mins, told the customer.” Impossible, if you are doing right”. The only possible cause is the parameter is a batch file, or the system is cracked. And even if this is the case, the behavior should be consistent, not occasionally.

 

I was wrong. After checking the screen-shot, and reproduced the problem locally with the project received from the customer, I have to believe a single call to ShellExecute opens two files sometime.

 

In below analysis, you will find out how it occurs. But the important thing here is, believe the truth, not the experience. If I told the customer that ShellExecute API came from NT4, so robust, the only possibility was some anti-Virus software, I would loss the chance to find out the truth.

 

Go deeper:

 

The application is a traditional MFC Dialog Application. There is an HTMLView, which renders a local HTML file. There is an IMG tag with a local GIF in the HTML. When right clicking on the GIF, customer-defined context menus shows. When a menu item is clicked, the menu item Message Handler calls ShellExecute to open a local TXT file. The TXT extension is bound to UltraEdit, thus the file will be opened by UltraEdit. When the problem occurs, UltraEdit opens TXT file, but it also opens an additional file in binary mode. The additional file is the GIF file where we click the right mouse.

 

The customer uses the following code to pop up the context menu, which replaces the default WebControl context menu:

 

if(pMsg->message==WM_RBUTTONDOWN)

{CMenu menu;

         menu.LoadMenu(IDR_MENU_MSG_OPEN);

         CMenu *pMenu = menu.GetSubMenu(0);

         if (pMenu)

         {

             CPoint pt;

             GetCursorPos(&pt);

             pMenu->TrackPopupMenu(TPM_LEFTALIGN, pt.x, pt.y, this);

         }

         return FALSE;}

 

Ok, the additional GIF is related to the HTMLView, but what’s the steps to troubleshoot?

 

I thought the problem like this. Since ShellExecute opens the GIF file, it means somehow the GIF is related to the ShellExecute call. The ShellExecute should know the GIF file’s path, otherwise how it opens that. Some action plans I can try:

 

1.      Use Windbg, set bp on ShellExecute to debug. Checking the parameters and step-in to trace how the GIF gets opened.

2.      Perform further research on the customer’s code to narrow down the problem.

 

With some try, I gave up the 1st way. First, the file open is done in UltraEdit, not ShellExecute. It is crazy to set bp on both ShellExecute and UltraEdit. Second, I cannot reproduce the issue consistently.

 

Think about the code. First, ShellExecute is called in menu event (WM_COMMAND) handler function. Let’s try to isolate the relationship between ShellExecute and Windows Message. I created a Timer. In Timer function, I invoke the menu event handler directly. Based on such test, the issue does not occur any more. Thus I know the problem should be related to the context menu. So the next is to check the context menu related code, the PreTranslateMessage.

 

The customer uses PreTranslateMessage to intercept the WM_RBUTTONDOWN message, and then shows the context menu. Pay attention, the return value is False!!! MSND describes the function like the following:

 

CWnd::PreTranslateMessage

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclib/html/_mfc_cwnd.3a3a.pretranslatemessage.asp

 

False means the message should be considered not handled, and should be postpone to next handling chain. While the real case is, the customer’s code handles the message already to show the context menu. Thus I changed the return value to True, and the issue stops.

 

Never ignore any tiny evidence and resource you have. Try to figure out the cause and result. When the clue is identified, stick to it.

 

The real problem just begins.

 

I am lucky. It did not take me a lot of time and I got the solution. However, it is too early to stop. Changing from False to True causes new problem, and above analysis does not convince me why a single ShellExecute opens 2 files.

 

The new problem is, in the menu item handler, the customer uses some HTMLDocument properties to check which HTML tag the customer operates on, like <img> tag or <div> tag. After changing the return value to True, above function does not work any more. It is easy to explain. Since HTMLView does not continue to handle mouse message, it is reasonable that customer action is not visiable in HTMLView.

 

The confusing part is still about the two open files. PreTranslateMessage’s different return values cause two different behaviors. A good way to analysis is to compare the executions. First I diassembled ShellExecute, but found it was too complex to follow. A better way is to use wt command in Windbg to monitor the ShellExecute code-flow and callstack. After some research, I found ShellExecute depends on DDE to open the target file. The DDE depends on Windows message. So I set conditional breakpoint on PostMessageW/SendMessageW. Every time the two functions trigger, I print the message parameters and callstacks in windbg.

 

At last, I got the callstack when the problem occurs. Are you able to figure out the root cause from below callstack?

 

USER32!PostMessageW 

ole32!CDropTarget::Drop

ole32!CDragOperation::CompleteDrop

ole32!DoDragDrop    

mshtml!CLayout::DoDrag

mshtml!CElement::DragElement           

mshtml!CImgElement::HandleCaptureMessageForImage

mshtml!CElementCapture::CallCaptureFunction

mshtml!CDoc::PumpMessage               

mshtml!CDoc::OnMouseMessage            

mshtml!CDoc::OnWindowMessage           

mshtml!CServer::WndProc                

USER32!InternalCallWinProc             

USER32!UserCallWinProcCheckWow         

USER32!DispatchMessageWorker           

USER32!DispatchMessageW                

SHELL32!SHProcessMessagesUntilEventEx  

SHELL32!CShellExecute::_PostDDEExecute 

SHELL32!CShellExecute::_DDEExecute     

SHELL32!CShellExecute::_TryExecDDE     

SHELL32!CShellExecute::_TryInvokeApplication

SHELL32!CShellExecute::ExecuteNormal   

SHELL32!ShellExecuteNormal             

SHELL32!ShellExecuteExW                

SHELL32!ShellExecuteExA                

SHELL32!ShellExecuteA 

 

From the CDropTarget::Drop function, we can guess how the gif gets opened. The strange thing is how the ShellExecute involves mshtml? If we recall the problem, we missed a very very important clue:

 

Since the PreTranslateMessage return false, which means the message should flow to HTMLView, why the IE default context menu does not show up!

 

Based on the callstack, the story is:

 

1.      The user click the right mouse button down (not release), and the system generates WM_RBUTTONDOWN windows message.

2.      In PreTranslateMessage, the code pops up the context menu, and the application blocks in the TrackPopMenu function call.

3.      The TrackPopMenu displays the menu, waiting for the customer’s choose.

4.      The customer releases right button and click menu item.

5.      After menu item is clicked, TrackPopMenu call returns, and the system generates WM_COMMAND in posted message queue.

6.      PreTranslateMessage return false, thus WM_RBUTTON flows to HTMLView. Here the system generates an additional WM_MOUSEMOVE message to sent message queue.

7.      The WM_RBUTTONDOWN message on HTMLView control brings the same effect as the customer clicks the GIF but not release. The IE default context menu shows when receiving WM_RBUTTONUP. Since the pop up menu blocks WM_RBUTTONUP message, the IE context menu does not show up.

8.      The application returns back to message pump. The Sent Message prioritize Posted Message, thus WM_COMMAND gets executed firstly. ShellExecute gets called.

9.      ShellExecute opens the TXT file and brings the UltraEdit window to foreground.

10.   ShellExecute uses DDE for postpone communication with UltraEdit, like querying the result. Since DDE depends on Windows Message, and ShellExecute is an UI independent API, the ShellExecute has to maintain dedicated message pump. This is what SHProcessMessagesUntilEventEx does.

11.   Due to the message pump in ShellExecute, WM_MOUSEMOVE in step6 gets dispatched. From the callstack, the message(s) go(es) to HTMLView.

12.   As described in step7, the GIF is in captured status due to the lack of WM_RBUTTONUP message. The foreground window is UltraEdit, and the WM_MOUSEMOVE message comes. These unexpected messages cause HTMLView generates Drag operation incorrectly, just like using mouse to drop the GIF from HTMLView to UltraEdit. GIF gets opened.

 

There is another question. Drag involves three steps, mouse down, mouse move and mouse up. Here we do not see the source of mouse move. The WebControl implementation is too complex here and I did not go further. I just guess, due to this uncertainness, the problem occurs randomly. Hope you can make the story 100% complete J

 

In step6, I am not sure why an additional WM_MOUSEMOVE message is generated, but this is how it works in Windows XP.  Anyone can tell me why we have this design?

 

/****************************************************************

Answer comes from theoldnewthing again, see the 1st comment. Put it here 

http://blogs.msdn.com/oldnewthing/archive/2003/10/01/55108.aspx

*****************************************************************/

 

 

The problem gets reproduced in IE6. Due to some change in IE7, we may not be able to reproduce with IE7.

 

I feel I am a binary machine, Windows + MFC, executes the code strictly, responses to the Windows message, even this case is full of randomness. Do not blind faith, respect the CPU and use the binary way to think. A single bit, true or false, is the main clue for this case.

 

The case gets solved. With root cause identified, it is easy to find the right solution to show the context menu in WebBrowser control:

 

How to disable the default pop-up menu for CHtmlView in Visual C++

http://support.microsoft.com/?id=236312

 

Lessons:

 

1.      Trust evidence, not experience.

2.      Be sensitive for any tiny clue.

3.      Use comparing for effective analysis.

4.      Use the binary way to think and understand computer program.

 

Further Discussion

 

MSDN is the most trustable resource?

 

Not only the experience, but also MSDN, is less trustable than evidence. After careful testing, identified there is a bug in MSDN and requested a knowledge base article to explain:

 

Description of a documentation error in the "Assembly.Load Method (Byte[])" topic in the .NET Framework Class Library online documentation

http://support.microsoft.com/kb/915589/en-us

 

The first line is:

 

“The "Assembly.Load Method (Byte[])" topic in the Microsoft .NET Framework Class Library online documentation contains an error

 

You dare to say the CPU is broken?

 

This case was handled by a US CPR in COM+ team. There were two load-balanced main frame PC, with 64 CPU each. The software and hardware configurations are the same. One PC works fine while the MSDTC crashes in the other one occasionally. Based on the engineer’s debugging, he said to customer, “I suspect the CPU #17 is broken, switch to a new one as a try”. The customer then involved Intel engineer, skeptically. After hot switch, the problem does not occur any more. I am thinking, if someday I see eax is non-zero after the execution of xor eax,eax, dare I say the CPU is broken? Look:

 

There's an awful lot of overclocking out there

http://blogs.msdn.com/oldnewthing/archive/2005/04/12/407562.aspx

 

Meanwhile, some rootkit is able to cheat debugger now. And sometimes, some new OS feature causes some interesting behavior:

 

VS 2003 crashes when pushing EDI: (Chinese)

http://eparg.spaces.msn.com/Blog/cns!1pnPgEC6RF6WtiSBWIHdc5qQ!379.entry

 

SEH,DEP, Compiler,FS:[0], LOAD_CONFIG and PE format

http://eparg.spaces.msn.com/blog/cns!59BFC22C0E7E1A76!712.entry

 

It is not easy to figure out the truth and the lie…

 

DWORD and file length:

 

Keep sensitive on some digits helps clue finding. The customer created a file class. The class causes strange behavior when the file length exceeds 4GB. Guess the cause immediately!

 

4GB is the maximum of DWORD. If the customer uses DWORD to represent the length, then…

 

To solve the issue, the customer should use GetFileSizeEx, not GetFileSize. In fact, this is not a “stupid” error that for entry level programmers. Look at the following kb:

 

FIX: Generation 1 garbage collections and generation 2 garbage collections occur much more frequently on computers that have 4 GB or more of physical memory in the .NET Framework 1.1

http://support.microsoft.com/kb/893360/en-us

 

 

(there are still some content in original Chinese paper. Lack of time, skip here)

 

 

Next, I will discuss a boring, but typical, difficult case, ASP.NET session lost.

Posted by lixiong | 2 Comments

Hopeless performance. The ADO.NET 2.0 is slower than ADO 1.1

1.0            Warm up.

 

Firstly my mentor shared the following question with me:

 

Why the image gets reversed in mirror horizontally, but vertically.

 

I asked many people about this question, but seldom gave me the accurate answer in 3 mins. Some interesting answers here:

 

1.      People’s eye is horizontal not vertical. (Also cited in some “interview bible”)

2.      If we change the mirror by 90 degree, then the result changes

3.      We live in the Northern Hemisphere

 

From the tech part, the knowledge involved here is only mirror reflection, far simple than Windows memory management. This case just shows, besides the knowledge itself, problem solving needs the way of clear thinking.

 

1.1 Hopeless performance. The ADO.NET 2.0 is slower than ADO 1.1!

 

Problem Description:

 

Based on the following article, the customer decides to upgrade to CLR2 for performance:

 

DataSet and DataTable in ADO.NET 2.0

http://msdn.microsoft.com/msdnmag/issues/05/11/DataPoints/default.aspx

 

But based the customer’s test, the performance get worse with CLR2.

 

Look at the code, quite simple:

 

         OracleConnection conn = new OracleConnection();

         conn.ConnectionString = "...";        

         conn.Open();

         OracleCommand cmd = new OracleCommand();

         cmd.Connection = conn;

         OracleDataAdapter dap = new SOracleDataAdapter("select * from mytesttable", conn);           

         DataTable dt = new DataTable();

         DateTime start = System.DateTime.Now;

         dap.Fill(dt);

         TimeSpan span = DateTime.Now - start;      

         conn.Close();

         Console.WriteLine(span.ToString());

         Console.WriteLine("The Columns.Count is" +dt.Columns.Count.ToString());

         Console.WriteLine("The Rows.Count is"+dt.Rows.Count.ToString());

The data used for test is quite simple either. 250000 rows, 4 fields. By checking the result of span.ToString, ADO.NET 2 is 1 time slower than ADO1.1. dap.Fill method takes 6 seconds in CLR2, comparing with 3 seconds in CLR1.

 

Hopeless

 

How to handle it?

 

When I finished the test locally, my feeing is hopeless only. Look at what I could do:

 

1.      The backend datatable is quite simple. 4 fields are int, no index, no primiary key, no foreigen key. In other words, the problem does not relate to data schema, purely client side problem.

2.      The code is very simple. There is nothing we could further fine tune.

3.      The test machines gets both CLR1 and CLR2 installed. The test and comparing are done in the same machine. The hardware and software environment is the same. The only difference is the CLR version, which is the customer’s concern.

 

Hopeless, no way to try. I have to confess that “No idea. ADO.NET 2 dev did a bad thing! Sorry”. And also say “BTW, the MSDN paper is based on CLR2 beta2, not accurate.”

 

Change the Role

 

The good thing is that I did not give up my trying. After careful research, I found:

 

1.      The sample code in MSDN does do a better than CLR 1.1. However, the schema of the data is a bit complex, while the customers is simple

2.      The performance is worse in above sample. However, in other cases, the ADO.NET 2.0 does a better job to gain 10 times better result in the same customer’s project.

 

Meanwhile, when discussing with my colleague, they do not care about the percentage of how worse it is. They concern about why the customer wants to operate on 250000 rows of data at the same time. After careful discussion, got the following points:

 

1.      The performance is 3 seconds worse with 250000 rows, 8 microseconds per row. From the tech part, db reading operation needs to send request to DB client engine firstly. The DB client engine pushes the request to DB server, and then fill the result back to client application. In the whole chain, any little change is likely to be enlarged, based on butterfly effect. In other words, if we look at percentage, the performance is indeed 100% worse. However, if we look at the delay of seconds, and compare with the data amount, we get further understanding on this problem.

2.      The customer’s code reads 250000 rows at the same time. What kind of situation that the dev wants to write code like this? Such kind of code is likely to run in performance sensitive situation? Let’s cacuate, 250000 rows* 4 fileds*20 bytes each=19MB data. If each request involves 3 seconds delay, plus 20MB memory cost, and heavy network operations, the design is bad. In normal condition, such kind of code should be used to initialize some global data, which may be put in cache. And this kind of code should only run once per application’s life. In other words, the performance hit in real situation is really visible?

3.      We did not evaluate all the data operations. The customer only tests the performance of fetching data. We do not consider how to use the data later. If ADO.NET 2 sacrifice the performance on loading, to benefit the performance in other places like iteration or deleting, we gain better overall performance.

 

Based on above analysis, I decide to ask the customer what the real situation it is. Meanwhile, I used the following method to seek the root cause:

 

Troubleshooting:

 

First I use reflector to check the implementation on DataAdapter.Fill. (I am not CPR and I cannot see the source code)

 

Reflector for .NET

http://www.aisto.com/roeder/dotnet/

 

I found the DataAdapter.Fill method can be divided in the following two parts:

 

1.      Use DataReader to read data

2.      Construct DataTable and fill the data in.

 

It means the DataAdapter.Fill just combine DataReader.Read and DataTable.Insert. So I can create my own app to use the two functions to replace DataAdapter.Fill for test:

 

static void TestReader()

         {

             OracleConnection conn = new OracleConnection();

             conn.ConnectionString = "";