A very unlikely performance-testing scenario
I was recently asked by a client to help with a stress test of an enterprise solution built from several software components supplied by different vendors. It turned out to be considerably more challenging than it first appeared, and here is the whole story (problem and solution).
As stated above, our goal was to run a performance test on an enterprise solution built from different software components, running on Windows Server boxes. After the initial analysis, and once the test had been designed, it became apparent that the client wanted, and needed, detailed data on several number-crunching routines that had been developed without any logging or auditing requirements. The client also wanted detailed data on some specific operations, which left us unable to get enough information just by profiling data from external sources such as log files.
So what exactly did we need? We needed performance data from several number-crunching operations executed by third-party software components, which the client used as a black-box API (in this case, plain old 32-bit unmanaged DLLs). To be more specific, we needed performance data for each of the individual number-crunching routines called by a single software component (as shown in the diagram below).
Given this picture, our challenge was simple to state: get performance data for specific API calls for which there was simply no profiling data. So how do we do it? Can we track it from the calling side? Unfortunately not, as the calling side did not have profiling data at the granularity required to track each individual number-crunching API call either.
This led to what seemed an inevitable outcome: instrument (i.e. change) the calling code and rebuild! The real problem was that the client didn’t have an effective source control system, so there was no easy and reliable way to match the source code to the production build, and testing against a different source set would just take us down a path we really didn’t want to travel. So here we were, at the edge of a potentially unsolvable problem, unless we could somehow instrument the code at runtime…
Instrumenting applications at runtime
After a lot of juggling, we eventually decided that, given this very specific and unlikely scenario, it was worth spending a little time on the idea of instrumenting the application at runtime, and so we did, by resorting to the Detours library from Microsoft Research.
What exactly does this library provide? To put it simply, it provides an easy way to create a detour at runtime. By detour we mean a runtime change in the binding to a specific exported DLL function, redirecting the program to another function with the same signature that did not exist in the original build.
The Detours library allowed us to create a simple DLL that exposed the same exported functions as the DLL we wished to instrument, and then intercept calls and redirect them through our new module, which works as a conceptual reverse proxy. With this in place, we could simply collect the profiling information in our new reverse-proxy-like DLL and forward the requests to the original component. For the sake of clarity, the picture below illustrates the detour schema.
But how does this magic work? Detours uses what is called a trampoline technique: it overwrites the first few bytes of the original exported function with a jump to whatever alternative entry point we define. That function performs whatever actions it wants and then, to reach the original code, executes the overwritten bytes (which Detours saved beforehand) and jumps to the rest of the original function, just as shown in the picture below (based on the original Detours documentation).
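To make the byte patching a bit more concrete, here is a minimal sketch that simulates it on a plain byte buffer instead of live code. It assumes 32-bit x86, where such a jump is typically a 5-byte `jmp rel32` (opcode 0xE9 followed by a displacement relative to the next instruction). The names `EncodeJmpRel32` and `InstallJump` and all the addresses are made up for illustration and are not part of Detours. The sketch also assumes the first five bytes of the target are whole, position-independent instructions; a real implementation has to disassemble the prologue to find a safe cut point.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Size of an x86 near jump: opcode 0xE9 plus a 32-bit displacement.
constexpr std::uint32_t kJmpSize = 5;

// Encode a "jmp rel32" located at address `from` that lands on address `to`.
// The displacement is measured from the instruction that follows the jump.
std::array<std::uint8_t, kJmpSize> EncodeJmpRel32(std::uint32_t from, std::uint32_t to) {
    const std::uint32_t rel = to - (from + kJmpSize);  // wraps modulo 2^32 for backward jumps
    std::array<std::uint8_t, kJmpSize> bytes{};
    bytes[0] = 0xE9;
    for (int i = 0; i < 4; ++i) {
        bytes[1 + i] = static_cast<std::uint8_t>(rel >> (8 * i));  // little-endian
    }
    return bytes;
}

// Simulated patch (hypothetical helper): `code` is a copy of the target function's
// bytes, assumed to live at `codeAddr`. Returns the trampoline bytes that would be
// placed at `trampolineAddr`.
std::vector<std::uint8_t> InstallJump(std::vector<std::uint8_t>& code,
                                      std::uint32_t codeAddr,
                                      std::uint32_t detourAddr,
                                      std::uint32_t trampolineAddr) {
    assert(code.size() >= kJmpSize);
    // 1. Save the bytes that are about to be overwritten.
    std::vector<std::uint8_t> trampoline(code.begin(), code.begin() + kJmpSize);
    // 2. After the saved bytes, jump back into the target just past the patch.
    const auto back = EncodeJmpRel32(trampolineAddr + kJmpSize, codeAddr + kJmpSize);
    trampoline.insert(trampoline.end(), back.begin(), back.end());
    // 3. Overwrite the start of the target with a jump to the detour.
    const auto forward = EncodeJmpRel32(codeAddr, detourAddr);
    std::copy(forward.begin(), forward.end(), code.begin());
    return trampoline;
}
```

Calling the trampoline therefore behaves like calling the unpatched function: it runs the saved prologue and then continues in the original code right after the inserted jump.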
Having seen this magic, we were now in a position where we could create a DLL that acted as a reverse proxy to another DLL, but one problem remained: how could we get this new DLL loaded into the address space of the existing processes?
The answer to this question took the form of a 10-line utility that created a remote thread in the running process and allocated virtual memory (in the same running process) to store the function argument to the remote thread’s entry point.
But how does this load the instrumented DLL into the process address space? Well, it is almost black magic, but basically we point the remote thread’s entry point to the address of the LoadLibrary API call. LoadLibrary lives in Kernel32.dll, which in practice is mapped at the same base address in every process of the same bitness during a given boot session, so the address we resolve in our own process is also valid in the target. The remote thread’s routine is therefore LoadLibrary itself, which reads its argument (the DLL path) from the memory we allocated in the target process and hence loads the new instrumented DLL (sample code shown below, with error handling omitted for simplicity).
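// Kernel32.dll is already loaded in our own process, so resolve LoadLibraryA there.
// lpszPath is the ANSI path of the DLL to inject; dwProcessId identifies the target process.
HMODULE hKernel32DLL = ::GetModuleHandleW(L"Kernel32");
LPTHREAD_START_ROUTINE lpfnLoadLibrary = (LPTHREAD_START_ROUTINE) ::GetProcAddress(hKernel32DLL, "LoadLibraryA");
HANDLE hProcess = ::OpenProcess(PROCESS_ALL_ACCESS, FALSE, dwProcessId);
// Copy the DLL path (including its terminating NUL) into the target process.
SIZE_T dwBytes = 0;
size_t nSize = strlen(lpszPath);
void * pLibRemote = ::VirtualAllocEx(hProcess, NULL, nSize + 1, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
BOOL bFlag = ::WriteProcessMemory(hProcess, pLibRemote, (void*) lpszPath, nSize + 1, &dwBytes);
// The remote thread runs LoadLibraryA with the remote copy of the path as its argument.
HANDLE hThread = ::CreateRemoteThread(hProcess, NULL, 0, lpfnLoadLibrary,
    pLibRemote, CREATE_SUSPENDED, NULL);
DWORD dwRet = ::ResumeThread(hThread);
// Wait for LoadLibraryA to return before releasing the remote buffer.
::WaitForSingleObject(hThread, INFINITE);
::VirtualFreeEx(hProcess, pLibRemote, 0, MEM_RELEASE);
::CloseHandle(hThread);
::CloseHandle(hProcess);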
Now that the reverse-proxy-like DLL is loaded into the process we wish to instrument, all that is required is to ensure that the DLL responds to the DLL attach/detach events by executing the code that creates/releases the detour (sample code shown below, with error handling omitted for simplicity).
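// OriginalFunctionAddress holds the address of the function to intercept; once the
// attach is committed it points at the trampoline. DetourAddress is our replacement.
__declspec(dllexport) BOOL WINAPI DllMain(
    HINSTANCE hinstDLL, // handle to the DLL module
    DWORD fdwReason, // reason for calling function
    LPVOID lpvReserved // reserved
)
{
    if (fdwReason == DLL_PROCESS_ATTACH) {
        DetourRestoreAfterWith();
        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourAttach(&(PVOID&)OriginalFunctionAddress, DetourAddress);
        DetourTransactionCommit();
    }
    else if (fdwReason == DLL_PROCESS_DETACH) {
        // Detaching must also happen inside a transaction.
        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourDetach(&(PVOID&)OriginalFunctionAddress, DetourAddress);
        DetourTransactionCommit();
    }
    return TRUE;
}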
So finally, we just need to clarify exactly how the detour is implemented and what we gain from it. Essentially, the detour overwrites the initial bytes of the function we wish to intercept, forcing a jump to our own function, where we log our profiling data. To return to the original code, we execute the overwritten bytes, which were saved beforehand at a specific address known as the trampoline function, and then jump to the first instruction after the overwritten bytes in the original DLL, just as shown in the picture below (based on the original Detours documentation).
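As a rough illustration of what such a detour function can look like, the sketch below times each call and forwards it through a function pointer. Everything in it is hypothetical: `Crunch` stands in for an exported number-crunching routine, `Timed_Crunch` is the replacement, and `Real_Crunch` plays the role of the pointer that would be handed to DetourAttach (after which it would point at the trampoline). To keep the sketch portable and runnable, `Real_Crunch` is simply initialised to `Crunch`, so no actual patching takes place here. In a real detour DLL, the replacement must match the target’s signature and calling convention exactly.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for a third-party number-crunching export.
double Crunch(const double* values, int count) {
    double sum = 0.0;
    for (int i = 0; i < count; ++i) {
        sum += values[i];
    }
    return sum;
}

// Pointer used to reach the original code. With Detours this is the variable
// passed to DetourAttach; in this sketch it just points at Crunch.
static double (*Real_Crunch)(const double*, int) = Crunch;

// Profiling counters, atomic because the host process may call from many threads.
static std::atomic<std::uint64_t> g_calls{0};
static std::atomic<std::uint64_t> g_totalNanos{0};

// The detour: same signature as the target, measures the call and forwards it.
double Timed_Crunch(const double* values, int count) {
    assert(Real_Crunch != nullptr);
    const auto start = std::chrono::steady_clock::now();
    const double result = Real_Crunch(values, count);  // original code does the work
    const auto elapsed = std::chrono::steady_clock::now() - start;

    g_calls.fetch_add(1);
    g_totalNanos.fetch_add(static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count()));
    return result;
}
```

The caller sees exactly the same result as before; the only side effect is the updated counters, which can be written to a log at detach time or on a timer.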
Final considerations
As shown in this post, the Detours library (along with other similar libraries out there, such as http://easyhook.codeplex.com) allows us to instrument applications at runtime, changing their inherent behavior without any re-coding, re-testing or re-deploying. Despite this awesome outcome, it should be noted that there are many limitations and drawbacks to using such a solution, namely:
Having set all these warnings in ink, it must also be said that when facing dire problems we can always try less conventional solutions, and this one, in this case, did the job! Thanks, Microsoft Research!
Final note: Detours can be found at http://research.microsoft.com/en-us/projects/detours/.
vascop