Welcome to MSDN Blogs Sign in | Join | Help

Windows Embedded Challenge ROCKS!

I know I said I'd be blogging on the ce_base team blog now, but I figured my personal blog is a better place for this one as it's more personal in nature.  I just finished two days of serving as a judge for the Windows Embedded Student Challenge.  I had a most wonderful time.  The quality of the work, the innovation, and the energy of the students were simply amazing.  What a diversity of projects from a diverse cross-section of the world!  How energizing to see people take the work I do and build such impressive systems on top of it!  The only disappointing part of the competition was that I didn't have a chance to see all 30 projects.

I also feel somewhat emotional about the event because it was heartbreaking to give some of these teams such disappointments.  Everyone has worked so hard and cares so much.  So to those teams out there who might read this, I want to say that just because you weren't chosen doesn't mean we didn't like your projects, or think that they could become successful products.  In some cases we had to eliminate teams because their projects just weren't 100% ready, or because we felt they didn't adequately describe important aspects of their project during their presentation.  It was clear in many cases that the teams intended to continue working on their projects, in an attempt to build a real commercial product.  I would encourage you all to continue following the dream.  If you want more feedback please do ask for it; we would be happy to help you understand what parts of the project were weak and how to improve on them for the future.

I loved meeting everyone.  Thank you for your hard work, and for sharing it with us.  I wish you all the best of luck, and hope to see you again!  (Maybe next year?)  Stay in touch, and I will see you online.  ;-)

Sue

Posted by sloh | 7 Comments
Filed under:

Windows CE Base Team Blog

We just created a new team blog, the CE base team blog, to cover kernel & file system issues.

http://blogs.msdn.com/ce_base/

I'm going to start posting there instead of here.  I probably won't have a lot of side-posts on this blog; I don't really have the energy for lots of off-topic personal stuff.  So if you are subscribed to this blog, you may want to switch over to that one.  We're hoping that the new blog will get more of our team involved in posting, and provide a more consistent stream of information (as I tend to write in bursts).

Don't worry, I still have a list of things I want to write up "when I get the chance."  See you there,

Sue

Posted by sloh | 0 Comments
Filed under:

Lock Convoys and How to Recognize Them

The problem: When using a particular application on one specific device, users noticed very bad UI response for a period of a few seconds.  After those seconds, UI responsiveness went back to normal.

 

The investigation: I used ActiveSync to drop celog.dll and oscapture.exe onto a standalone device.  I could have hooked the device up to Platform Builder and run Remote Kernel Tracker over KITL, but in this case the device I had was configured to run without KITL.  I just didn’t bother reconfiguring it. 

 

I made a .lnk file to start oscapture, and a second .lnk file to trigger it (run a second instance of oscapture with the “-c” parameter).  I dropped these onto the Start menu of the device.  I ran oscapture, and played around with the errant application until the UI slowdown occurred.  As quickly as I could (due to the UI slowdown it took a few seconds) I brought up the start menu and clicked the link to trigger oscapture.  OScapture saved its data to a file.  I then hooked up ActiveSync again, and copied the log file off the device.  Actually now that I think of it, I probably could have just connected Remote Kernel Tracker to the device over ActiveSync, but you know I’m not that smart.

 

Here is what I saw when opening up the file in Remote Kernel Tracker:

(scroll down if you can't see anything)

 

 

The main thread of pimg.exe is taking all the CPU time, and a few other threads (notably a couple in telshell.exe, which is responsible for the Smartphone UI) took little blips of time here and there, contending over a critical section.  I know they are contending because CeLog only logs critical section entry and exit events when there is contention.  Critical sections are acquired and released far too often to log uncontended operations. You don’t see any critical section exit events in this screen shot, but they’re there; they happen so close to the entry events that the “exit” icons are hidden underneath the “entry” icons.  If I had zoomed in some more, you’d see them resolve into distinct events.  All of these threads are at the same priority, 251 (THREAD_PRIORITY_NORMAL).  I know the thread priorities because I hovered over the thread lines and saw the priorities in the resulting tooltips.  But despite the equal priorities in this case, only one thread, that from pimg.exe, is getting very much CPU time.  It is running to the completion of its quantum.  I know that because I set the “time marker 1 & 2” in the time scale bar – you can see them in the screen shot.  In the bottom right of the Remote Kernel Tracker window, where it says “Marker Time Delta,” is the time between the two markers.  That time is about 100ms, the default thread quantum in Windows CE.  Each time the pimg.exe thread runs for about the same length of time, about 100ms.

 

I immediately recognized this as a lock convoy problem.  Let me describe what that means.  A lock convoy is a situation which occurs when two or more threads at the same priority frequently (several times per quantum) acquire a synchronization object, even if they only hold that object for a very short amount of time.  It happens most often with critical sections, but can occur with mutexes, etc as well.  For a while the threads may go along happily without contending over the object.  But eventually some thread’s quantum will expire while it holds the object, and then the problem begins.  The expired thread (let’s call it Thread A) stops running, and the next thread at that priority level begins.  Soon that thread (let’s call it Thread B) gets to a point where it needs to acquire the object.  It blocks on the object.  The kernel chooses the next thread in the priority-queue.  If there are more threads at that priority which end up trying to acquire the object, they block on the object too.  This continues until the kernel returns to Thread A which owns the object.  That thread begins running again, and soon releases the object.  Here are the two important points.  First, once Thread A releases the object, the kernel chooses a thread that’s blocked waiting for the object (probably Thread B), makes that thread the next owner of the object, and marks it as “runnable.”  Second, Thread A hasn’t expired its quantum yet, so it continues running rather than switching to Thread B.  Since the threads in this scenario acquire the synchronization object frequently, Thread A soon comes back to a point where it needs to acquire the object again.  This time, however, Thread B owns it.  So Thread A blocks on the object, and the kernel again chooses the next thread in the priority-queue to run.  It eventually gets to Thread B, who does its work while owning the object, then releases the object.  The next thread blocked on the object receives ownership, and this cycle continues endlessly until eventually the threads stop acquiring so often.

 

The real performance killer about a lock convoy situation is that every time the threads involved need to acquire the synchronization object, they block and take at least a couple of context switches.  Since the premise of this situation is that the threads acquire the object frequently (several times a quantum), they are running for much less than a quantum between each context switch.  The less time they run between acquires, the more context switches involved – and remember that this is all regardless of how long or short of a time they spend owning the object.

 

The other performance killer in a lock convoy situation is that it dramatically increases the percentage of CPU time non-contending threads at the same priority level get to run.  If all threads at that priority level are contending over the object, what you see is very frequent context switching between them all.  If one or more threads at that level are running continuously without blocking on any objects, those threads get to run for a full quantum each time the contending threads loop on the object.  That is what you see in the screen shot above; the pimg.exe thread is not involved in the lock convoy, but since it’s at the same priority level as all the threads in the convoy, it gets all the CPU time.  In this case, the UI threads only ran very tiny periods once each quantum of the pimg.exe thread execution, so the UI was very slow to respond to user input.  Eventually the pimg.exe thread completed its work, the UI threads completed their convoy, and the UI response time went back to its snappy old self.

 

The solution: Lock convoys are not always easy to solve.  In this case, I don’t know the solution for sure, but I believe they either lowered the priority of the pimg.exe main thread or made it do its work at another time.  While that doesn’t actually get rid of the lock convoy, it speeds it up to the point where it is unnoticeable to the user.

 

Another work-around might have been to make the pimg.exe thread yield frequently, by calling Sleep(0), to allow the other threads to do their work more frequently than once per quantum.  Note, however, that this isn’t a very ideal work-around, because it’d lead to very frequent thread switching.  Context-switching very often is a problem in itself; you spend all your time making the kernel reschedule threads instead of letting those threads run.

 

A different work-around is to reduce contention over the object by allowing it to be shared more often.  For example, suppose the object is a critical section protecting some data.  You don’t want threads to read the data while it’s being written, or write concurrently, etc.  If reads are relatively frequent while writes are relatively rare, then a way to reduce contention over the lock is to convert it to a reader / writer lock.  If writes are frequent, the improvement of using a reader/writer lock is minimal, but otherwise such a change can allow threads to read at the same time without serializing over a lock.

 

Sometimes you can work around the problem by acquiring the lock at some other time.  For example, I note that there’s a thread in filesys.exe involved in the convoy.  So I suspect the critical section that was undergoing contention was inside filesys.exe.  (While critical sections cannot be shared between processes, the non-filesys.exe threads may have used its critical section during calls into filesys.exe APIs.)  Maybe the UI threads were reading from files to draw icons, or reading from the registry, or writing to a log file.  In the read case, perhaps the UI could have kept the objects in a cache so as to avoid entering the file system.  In the write case, perhaps it could queue up file system writes to execute at a later time.  Either solution amounts to caching data and performing operations asynchronously, instead of during time-critical UI processing.

 

While lock convoys are also possible in Windows XP, they’re a little less problematic because of the priority boosting and aging that the Windows XP scheduler does.  If a thread blocks on an object, then when it becomes un-blocked the XP scheduler temporarily boosts its priority a little.  When a thread runs to the completion of its quantum, the XP scheduler ages its priority a little.  In this manner, threads that block often end up at higher priority than threads that run continuously without blocking.  Then the lock convoy isn’t eliminated, but there is less chance of another unrelated thread skunking up the works by running a full quantum each thread switch.

 

A better solution, if you can change all the code that acquires the contentious lock, is to use TryEnterCriticalSection to acquire the lock only when it’s uncontended. 

 

    retries = 0;

    while(!TryEnterCriticalSection(&CritSec)) {

        if( retries++ < SpinLimit ){

            Sleep(0);

            continue;

        }

        EnterCriticalSection(&CritSec);

        break;

    }

 

This code will use TryEnterCriticalSection a number of times, yielding the processor on each failure until it reaches the SpinLimit and then uses regular EnterCriticalSection.  The SpinLimit is necessary in case the critical section is owned by a lower-priority thread which would not run and release the critical section while the current thread is looping.  In that case the EnterCriticalSection call is required to make the kernel invert the priority of the owning thread briefly, so that the higher-priority thread can acquire the critical section.

 

You may expect this code, which basically polls for the critical section, to hurt performance.  But believe it or not, it can greatly improve it.  The difference between this loop and a regular EnterCriticalSection call is that threads do not enter the queue to acquire the critical section if some other thread owns it.  So in the example above, when Thread A releases the critical section, Thread B does not immediately become the owner.  Thread A is free to continue running and even acquire and release the critical section a few more times before Thread B comes back and tries to acquire it again.  Even with a SpinLimit of 2 (= 1 Sleep() call) two threads can recover from critical section contention without creating a convoy.

 

Alas, this solution is unavailable if you cannot change the code which acquires the critical section.

 

To those of you who might ask whether we intend to make changes to Windows CE to eliminate lock convoys, I’m not at liberty to discuss what we may or may not do in future products.  I can only say the standard non-informative spiel: Microsoft is committed to improving the quality of its products and considers all improvements in the context of other improvements we wish to make, to provide our customers the biggest benefits possible in the time available blah blah blah.  (By the way that spiel, while hokey, really is true.)

 

In the meantime I hope I’ve provided you with some information you can use to identify and escape this problem if you encounter it.

 

UPDATE 26 June 06:  Discussion of whether the TryEnter/Sleep solution hurts perf.  I don’t really think the Sleep() will hurt perf at all.  Here is what would happen:

  1. Thread 1 quantum expires while holding the CS (which is what starts a convoy)
  2. Thread 2 does TryEnter, fails, and sleeps
  3. Thread 1 runs again, acquires and releases the CS a bunch of times, and quantum expires while not holding the CS
  4. Thread 2 runs again, does TryEnter again, succeeds.  Now it can acquire and release the CS a bunch of times.

So the sleep causes a thread switch, but it’s a thread switch you’d have taken anyway if you were using regular EnterCriticalSection.  And now that I think more about this, I think your change to hold the CS for a longer period would negatively impact the perf of the TryEnter solution, because it’d increase the chances of expiring your quantum while holding the CS.  Though the impact would still not be bad because in general the threads would run to quantum completion.

Posted by sloh | 2 Comments

Windows CE eHow-tos and Tutorials! WOW!!

I saw a blip of this at MEDC, and just checked it out myself.  How cool!  It's like a DevCon for all to share!

http://msdn.microsoft.com/embedded/getstart/basics/tutorialsce/default.aspx

Most of it is pretty basic stuff, but if you're new to Windows CE, you should check some of these out!  (If you are new to performance tools like Remote Kernel Tracker which I've been posting about, take a look at the Microsoft Windows CE 5.0 Advanced Lab that's on that site.)

Sue

Using RAM over 512MB

Q: I can use only 512MB of RAM from my device. How can I address the remaining RAM ? On one of your posts you said that there is a way to go with more that 512 MB RAM. Can you help me on this issue?

A: Windows CE can't address more than 512MB.  There's no way to make the OS use it the way it uses other RAM.  OEMAddressTable, OEMGetExtensionDRAM and OEMEnumExtensionDRAM are all limited to 512MB.

The only thing you can do is write applications that specifically know about the extra memory.  They can use VirtualAlloc with address 0 to reserve some virtual address space, and then use VirtualCopy with the PAGE_PHYSICAL to map the physical memory above 512MB into the virtual address space.  In this manner the application can use the additional RAM.  But the memory won't be used by the system the way other memory is.  It won't go into the pool that's used for general operation; other applications won't ever get that memory.  Your system could use up all of the first 512MB and then "run out of memory" because it can't use the other 512MB.

If your device needs a lot of resources like this, you might want to look into Windows XP Embedded instead.

Q: Are there any plans to address this problem in future releases of Windows CE?

A: I cannot comment on what will or will not be in future releases.  I can only say, so far I don't think it is a high priority.

 

Recognizing Excessive Thread Switching

Sometimes the source of a performance problem is not that code is too slow, but that threads switch too frequently. When the system switches between threads, the system consumes extra cycles in scheduling activities such as selecting the next thread to run, or switching to the new thread, rather than spending the time executing application code. In general, the less time threads run before switching out, the more total time the system spends scheduling new threads to run.

The default amount of time that a thread runs before switching to another thread is called the Default Thread Quantum, dwDefaultThreadQuantum. For example, if the default thread quantum is 100 milliseconds, then a thread would run continuously for 100 milliseconds before another thread ran, unless preempted by a higher-priority thread.

In a system where threads run to the end of a 100-millisecond quantum, there is one thread switch per 100 milliseconds, or 10 thread switches per second. In contrast, in a system where threads run only 50 microseconds before switching, there is one switch per 50 microseconds, or 20,000 thread switches per second. Switching at such an elevated rate has a dramatic performance impact.

You can use Remote Kernel Tracker to identify these situations. Remote Kernel Tracker depicts thread switches as vertical green lines.

The following screenshot shows an example of an application with excessive thread switching. In this picture, I’ve set two markers in the time scale at the top of the display. The difference in time between the two is displayed in the status bar on the bottom right, so from time marker 1 to time marker 2 is 10 milliseconds. As you can see, there are numerous vertical green lines; each represents a switch from one thread to another.

(scroll down if you can't see anything)

 

You also can see above that the threads seem to be continuously repeating a pattern. The next picture zooms in on that pattern:

(again, scroll down if you can't see anything)

 

As you can see in the preceding picture, the “OpenDeviceKey” thread runs the longest continuous time, about 0.6 milliseconds. The next longest run-time in this screenshot, “RndisSendRndisMessage,” is 0.16 milliseconds. The greatest switching activity, between “UsbInterruptThreadStub” and “DmacIntrThread,” occurs at a rate of one switch approximately every 50 microseconds. One thread switch per 50us is a rate of 20,000 thread switches per second. This means that thread switching occurs at a rate that is about 2000 times higher than the ideal.

Let me describe where these screen shots came from:

The problem: Poor USB throughput.  On particular hardware we expected to get about 50Mbps throughput, but were getting only about 23Mbps.

The investigation: We recorded a Monte Carlo profiling trace, with these results:

HITS PER SYMBOL
Hits Percent Address Module Routine
6557 7.3 0x8012C8FC NK.EXE IDLE_STATE
4643 5.2 0x8010E330 NK.EXE INTERRUPTS_ENABLE
2882 3.2 0x02FEA884 tcpstk.dll tcpxsum
2166 2.4 0x80110894 NK.EXE ObjectCall
2123 2.4 0x03FA37B8 coredll.dll memcpybigblk
2064 2.3 0x8010DB7C NK.EXE PrefetchAbort
1732 1.9 0x80110EC0 NK.EXE ServerCallReturn
1518 1.7 0x801491D4 NK.EXE OEMWriteDebugByte
1468 1.6 0x8011A094 NK.EXE SleepOneMore
1460 1.6 0x8011BCDC NK.EXE WaitOneMore
1418 1.6 0x80119360 NK.EXE KCNextThread
1405 1.6 0x8011C154 NK.EXE SC_WaitForMultiple
1404 1.6 0x8010E00C NK.EXE IRQHandler
1255 1.4 0x8010DEC0 NK.EXE SaveAndReschedule
1212 1.4 0x03F73CA0 coredll.dll xxx_CeLogData
1145 1.3 0x03F68258 coredll.dll xxx_GetTickCount
979 1.1 0x8010EF34 NK.EXE KCALLPROFON
908 1.0 0x8010EFE0 NK.EXE KCALLPROFOFF
865 1.0 0x017A4544 pxa27x_dma.dll InterruptDone

The lines in green are all related to making API calls.  The ones in blue are all involved with thread rescheduling.  I was a little suspicious about the number of API calls and the amount of thread rescheduling that was going on here.  KCNextThread is the primary kernel function which selects the next thread to run, and having that show up 1.6 percent of the time is pretty significant.  The API call time might also be partly blamed on calling into the kernel to reschedule.

Then we looked at a Kernel Tracker trace for the same scenario, and saw the screen shots above.  I realized how much thread switching was going on, and brought it to the attention of the folks who owned the code involved.

The fix: They changed the design so that the work done between the two most active threads (UsbInterruptThreadStub and DmacIntrThread) was all on one thread instead of shared between two.  As a result, the USB throughput jumped from 23Mbps to almost 41Mbps.

Posted by sloh | 1 Comments

Remote Kernel Tracker: More things you should recognize

I have a few more examples of patterns you can recognize when you’re looking at CeLog threading data in Remote Kernel Tracker.  Note: these images are wacky heights because I tweaked them to cut off white space that didn’t contain any CeLog data.

Thread Migration

I realized that in my previous Kernel Tracker post I meant to add a screen shot of thread migration, and forgot.  This is an example of what thread migration looks like in Remote Kernel Tracker.  There’s a raised dashed line which represents that the thread is still running, but inside a different process.

(scroll down if you can't see anything)

Screen Shot 1: Example of what thread migration looks like.

KITL / debug messages

I like this example which hammers home how much work is involved in heavy KITL traffic.  I run a program called enumfiles.exe which enumerates all of the files in the device file system.  The KITL traffic is due to a few different things: mainly the heavy debug messages, and the enumeration of files inside the release directory via RELFSD.  (The release directory file system driver, which mounts the release directory from the desktop PC as \Release on the device.)  In this case I also loaded the program over KITL from my release directory, so some of this traffic is the work copying code onto the device.

Screen Shot 2: For a program that’s just enumerating the file system, that’s a lot of time on the KITL thread.

As you can see, when the enumfiles.exe program starts, a lot of other activity begins too.  The KITL interrupt (25 on this device) and the KITL IST start working overtime.  I like this example because it shows that your code doesn’t always live in the vacuum you might expect.  The system calls you make, or the pure existence of your program, can fire off other activity in the system.  Does the system really behave as you expect?  Tools like CeLog and Remote Kernel Tracker are the way to find out for sure.

Mike Hall uses an older version of my enumfiles program (walktree.exe) when he gives labs to teach people how to use Remote Kernel Tracker, so if you’ve taken any labs from Mike at DevCons or other events, you may have played with this same program yourself!

Flushing

It’s good to know what activity represents side effects of data gathering itself.  Whether you have Remote Kernel Tracker actively connected to a device and collecting data, or you’re running CeLogFlush.exe to save data on a standalone device, the thread that manages the data impacts the system a little itself.  First, Screen Shot 3 shows CeLogFlush.exe flushing periodically.  In this case I specified that CeLogFlush.exe should flush at 3-second intervals, so that more than one flush would show in the same window.

Screen Shot 3: CeLogFlush data flushing at 3 second intervals when data’s not being logged heavily.

As you can see above, when there’s not a lot of activity, flushing can be a pretty fast operation.  The vertical lines showing a lot of thread switching aren’t very wide; the flush is over quickly.

Incidentally, you can see that the switching is between the CeLogFlush.exe thread and the KITL thread – OK now tell me where I was storing the log file?  That’s right, in \Release which required KITL to transport the file data to the release directory on the PC.  Otherwise there’d be less thread switching because it could write to local storage on the device without switching to the KITL thread.

In contrast, Screen Shot 4 shows how much of a load flushing can be on a busy system.  There are a lot more events being logged, so there is more data to flush and flushes take longer.  The RAM buffer that the data is temporarily stored in fills more quickly than 3 seconds, so the flushes are more frequent.

Screen Shot 4: CeLogFlush flushing during heavy logging activity.

For completeness I’m including Screen Shot 5 which shows what flushes look like when you’re using Remote Kernel Tracker to transport the data to your PC, instead of using CeLogFlush.exe.  You can see that RTH.EXE does the flushing, and that it flushes over KITL about once per second.

Screen Shot 5: Remote Kernel Tracker data flushing.

Data Loss

I wrote up a couple of sections about data loss in our recent newly-released CeLog documentation:

·        Analyzing Data Loss with Remote Kernel Tracker

·        Analyzing Data Loss with Readlog

But I could not include screen shots.  Here is a screen shot to show you how to recognize data loss when you’re looking at it in Remote Kernel Tracker.

Screen Shot 6: The signs of data loss.

The three characteristics I describe in the Kernel Tracker article are visible in Screen Shot 5.  First, it pops up a dialog box telling you how many bytes of data have been lost over time in the log.  Second, the interrupt activity appears to stop completely, and one thread appears to be running for a long period.  In this screen shot I didn’t expand explorer.exe to show that thread, but it was a single thread and it had no icons showing events occurring during the thread run-time.  Finally, when activity picks up again, the flush thread (RTH.EXE) is busy working.  All of these characteristics are strong hints that the apparently “inactive” period is really a period of time from which data was lost.

More Tips

While I was poking around building up screen shots for this write-up, I caught one view that surprised me a bit.  Look how much priority inversion is going on here!  This is a database test program spewing lots of debug output.  KITL isn’t something that actually requires “real” real-time, but perhaps I should get someone to examine why we hit priority inversion so regularly.

Screen Shot 7: Debug output leading to lots of priority inversion.

Let it be a lesson that while you’re testing real-time response in your system, you should disable KITL.  And that while you’re running performance tests, for heaven’s sake don’t write a lot of debug messages.

Another little exercise.  On the left and right of Screen Shot 7 are some periods where activity drops dramatically.  Do those periods represent data loss?

The answer is no, because of two clues.  First, SYSINTR 0 is still going strong.  Keep an eye on that whenever you’re running a profiling build. (I’d say all the time but only profiling builds record interrupts.) SYSINTR 0 occurs frequently enough you should always be able to find it.  Second, there are lots of events occurring on the testdb.exe thread during those periods.  There are so many icons drawn on that thread that they don’t show up clearly in this screen shot.  If it was data loss, there’d be no such activity.  Since this particular test program was writing a lot of debug output; the less busy periods probably represent time when the test was doing solid periods of work instead of printing debug messages.  When you’re not running a profiling build, look for events on the thread that was supposedly executing during the quiet period, to see if that really is a period of data loss.

 

Posted by sloh | 1 Comments

Introduction to Remote Kernel Tracker

When you connect to a device with Remote Kernel Tracker, or when you open a log file that has been collected by one of the other CeLog tools, you’ll see a lot of data you may not understand.  At MEDC this past week, a few people asked for a better explanation of what they’re seeing.  So let’s go over it.  Screen Shot 1 shows a sample of what you might typically see when you connect Remote Kernel Tracker to a device:

(scroll down if you can't see anything)

 

Screen Shot 1: Typical idle behavior, no processes expanded.

Someone asked me what all the information in the key means.  I’ll go through them a bit at a time.  First, the thread and process state lines.

     

You can see examples of the process state lines in Screen Shot 1 above.  Each process in the system has a horizontal line.  When one of the threads owned by that process is running, the process line is wide and dark green.  Otherwise the process line is thin and black.  (In case it’s not clear, only one thread ever runs at a time, so only one process ever runs at a time too.)  So in the example above, threads owned by kernel (nk.exe) are running almost all the time.  The vertical lines you see are due to threads in other processes running for very short periods of time.  Remote Kernel Tracker draws vertical lines to represent switches between threads, to make it easier to see which threads ran before and after the thread switch.

Underneath each process are all the threads owned by that process.  Processes can be expanded and collapsed to show and hide the threads, for easy viewing.  There are no thread lines in Screen Shot 1, because none of the processes have been expanded to display their threads yet.  Screen Shot 2 shows the same log, with some of the process lines expanded and with the key hidden for better viewing.    

Screen Shot 2: Typical idle behavior, interrupts and some processes expanded.

In Screen Shot 2 you can see some thread lines too.  Like the process lines, the thread lines are wide and light green whenever the threads are running, and thin when they are not. Most of the threads, except for “0x00000000,” run for such short times that at the zoom level shown in this screen shot, that you can’t see the horizontal thread-run lines, but you can tell that they ran because you can see the vertical thread-switch lines.  On the threads you can also see some other icons representing the actions taken by the threads.  More on these later.

You can also see that interrupts are represented as lines too, with one line for each SYSINTR value to show when interrupts occur.

What you’re seeing in these screen shots is typical idle activity on a device.  The thread with handle “0x00000000” is not really a thread, but the placeholder to represent when the CPU is idle.  Most of the other threads that run now and then are interrupt service threads (ISTs).

Every process will have one thread which has the same name as the process; that is the process’s primary thread, the one which starts running at main() or WinMain() or whatever flavor of entry point the process might use.  All of the other threads in the process are given the name of the thread’s start routine, such as KITLInterruptThread under the process “NK.EXE” (the kernel).

An important detail to note at this point is that some of the features you can see here are due to the fact that you are running a profiling build.  First, to minimize the impact of the logging hooks which record this data, interrupt data is only available on profiling builds.  When the profiler is not present, the “Interrupts” line will not appear.  Second, the thread names are also only available on profiling builds.  Remote Kernel Tracker does not look up thread names from the symbol files in the release directory.  Instead, on profiling builds the CeLog event logging hooks obtain thread names via the special symbol table that is included in ROM for use by the profiler, to represent all the code that is stored in ROM.  The profiler uses that symbol table to look up symbols for profiling hits, and the logging hooks take advantage of the symbol table in order to provide thread names.  Threads from processes and DLLs that are not stored in the “modules” section of ROM will not have names, since they will not be listed in the profiler’s symbol table.  And on non-profiling builds, no threads will have names.

There is also a trick you can use to make the desktop tool “readlog.exe” look up symbols from the release directory and build a new log file containing the thread names on any build and regardless of where the thread’s code is running from.  I don’t quite have documentation to point you to about that yet, but open up a release directory build window and type “readlog -?” on the command line to find out how to use readlog.

The other thread line I haven’t discussed yet is the Thread Migrate line.  In Windows CE, when a thread calls a system API that is implemented by a different process, the thread moves from its owner process into the system process for the duration of the call.  The Thread Migrate line in Remote Kernel Tracker represents the period of time during which the thread is executing inside a process other than its owner.   You can hover over the line to learn which process the thread is executing inside.

In my opinion, this representation of thread migration is not terribly useful.  It helps you see when a thread is running inside a process other than its owner, but it does not help you see the entire set of processes the thread transitions to.  For example if your thread calls an API, and that API calls a different API, your thread may actually transition through multiple processes, but it is not easy to see if or when those transitions occur.  Nor does it help you understand why those transitions took place (which APIs the thread was calling).

Back to the event types in the key.  The main events I pay attention to when I’m looking at a log in Remote Kernel Tracker are a subset of those related to thread scheduling:

       

If you are familiar with the Win32 API set, you probably can guess what these mean.  The critical section events represent when threads call the EnterCriticalSection and LeaveCriticalSection functions, and encounter contention over the critical section object.  Critical sections are used so widely that logging every entry and exit would generate too many events and bog down system performance.  So only those calls which block the current thread or unblock another thread are logged.  The Set Event, Reset Event and Pulse Event events represent actions taken on Win32 “event” objects.  Similarly the Release Semaphore and Release Mutex events represent actions on Win32 semaphore and mutex objects.  The wait event represents a call to the Win32 WaitForSingleObject or WaitForMultipleObjects APIs, regardless of whether the thread blocks on those objects or not.

Finally, the Sleep Event event represents when the thread calls the Win32 Sleep API.  In general I am not a big fan of Sleep, at least non-zero Sleep periods.  Often it represents lazy programming, and results in polling that churns CPU time and drains power.  For example, observe the behavior of the selected thread in Screen Shot 3 below.  This thread is polling at 500ms intervals.

Screen Shot 3: Using Sleep to poll at regular intervals consumes CPU and drains power.

If you hunt around most logs, you’ll find cases where one thread will call the SetEvent API, then another thread will start running, do some work, and then block again in WaitForMultipleObjects on the same event handle.

Screen Shot 4: A waiting thread is awakened when another thread signals an object.

In Screen Shot 4 you can see that the threadsample.exe main thread set an event (the “flag” icon) that the last thread in the same process was waiting for.  In this screen shot I double-clicked on the WaitForMultipleObjects event and the Set Event event to show their data.  This often makes it easier to compare handle values between events.  I know these two events are related because they both apply to handle value “0xA3E0FF96” in this case.

There are a few other events I also look at when I’m examining logs.

         

The interrupt event you’ve already seen above, representing when interrupt service routines (ISRs) execute.  It’s worth mentioning that you can see relationships between interrupt service routines (ISRs) running – the “interrupt” events you see in Kernel Tracker – and their corresponding interrupt service threads (ISTs) waking up to do further interrupt handling.  In Screen Shot 5 below, you can see the ISR for SYSINTR 1 (the timer interrupt, I believe?) run, and then two different ISTs follow up.  You can also see the ISR for SYSINTR 30 (the KITL interrupt) run, and the KITLInterruptThread wake up to do some work.  If you look back at Screen Shot 2 above, you can also see this relationship between the KITL interrupt and KITLInterruptThread when scaled back to a larger zoom level.  Note that no IST runs for SYSINTR 0, which is a no-op. 

Screen Shot 5: The relationship between ISRs and ISTs.

The Invert Priorities event is shown on a thread when the kernel temporarily changes the thread’s priority because it holds a resource that some other higher-priority thread needs.  For example, if a low-priority thread holds a critical section that a high-priority thread tries to enter, the kernel temporarily boosts the low-priority thread to a high priority, to execute until it releases the critical section.  Then the low-priority thread returns to its original priority.  For real-time behavior, you never want real-time threads to block on resources owned by lower-priority threads, since those threads may do an undetermined amount of work before releasing the resources.  Screen Shot 6 is an example of a case where a higher-priority thread ends up boosting the priority of another thread over and over again.    

Screen Shot 6: Recurring cases of priority inversion.

The icons in Screen Shot 6 are a little hard to see – sorry – but what’s going on is that a high-priority IST in the kernel enters a critical section that is held every time by a low-priority thread from testdb.exe.  The kernel thread logs a priority-inversion event representing the priority-boost given to the testdb.exe thread.  Then the kernel thread blocks on the critical section.  The now-high-priority testdb.exe thread runs until it leaves the critical section, then logs a priority-inversion event representing its return to its regular priority.  The kernel thread then begins running again, now that it owns the critical section and is the highest-priority thread.  You can see in this log that the inversion happens regularly in this particular case.

The page fault event represents a costly page fault operation occurring on a thread.  In CE 4.2 and earlier, a page fault event was logged at the beginning of a page fault; in CE 5.0 and later, an event is also logged at the end of the page fault so that you can tell how long the page fault took.  I don’t see this event often, but sometimes I see cases like this, where a thread incurs many page faults:

Screen Shot 7: A thread incurring many page faults.  This log is from CE 5.0 so the fault markers are in begin/end pairs so that you can tell how long they took.

When I see many page faults like this, I run the readlog.exe command-line tool to get a summary of the faults, to determine where they are occurring.  In general the important thing to check is whether the fault address is paged more than once.  Code must be paged once in order to execute at all, but if it is paged more than once, that means the kernel’s page pool is not large enough to hold all the executing code.  In that case the page pool may need to be enlarged.  If data such as memory-mapped files is paged more than once, then the data may not have good page locality, or the system may be in a low-memory state.

You can also log your own events from your code.  Since Remote Kernel Tracker does not have the ability to interpret your data blobs, you may find it useful to log your data using a set of event types that Remote Kernel Tracker does understand:

     

When Remote Kernel Tracker sees one of these event types, it expects the data logged for that event to be an array of the given type.  For example, for the “Raw ulong” type Remote Kernel Tracker interprets the data as an array of unsigned long integers (DWORDs).

     

For the “Raw wchar” type the data is an array of Unicode characters, and Remote Kernel Tracker will display it as a string.

     

To see the data corresponding to one of these event types, hover over the event to see the tooltip, or double-click on the event to get the data in a separate window.  Note also that the data is displayed in the bottom-left status bar when you click on an event.     

Screen Shot 8: Viewing “raw” data in Remote Kernel Tracker.

 

In Summary

I hope I have helped you understand what the tool is showing you, and see some of the things you can use CeLog and Remote Kernel Tracker for.  This is a very versatile tool.  It might take some practice to recognize these relationships between events, but a basic understanding of what thread behavior looks like in Kernel Tracker is the first thing you need when you’re troubleshooting a wide range of performance problems.  Remote Kernel Tracker enables you to track down bugs related to thread interactions that are difficult to track with traditional development tools, including deadlocks, race conditions, excessive context switching or priority problems.

I have some more posts coming in the near future, about more scenarios you can recognize when you get used to Remote Kernel Tracker.

 

Posted by sloh | 9 Comments

CeLog documentation

At long last, a big chunk of work I did to add documentation of CeLog & Remote Kernel Tracker has finally shown up in a form you can view.

http://msdn.microsoft.com/library/en-us/wcedebug5/html/wce50conEventTracking.asp

I did a huge brain dump and tried to get all of my tips & tricks down on paper.  There are descriptions of how to use tools to gather data on standalone devices, how to log your own data and how to customize the tools.  There is a little bit more work to go -- some information really required Kernel Tracker screen shots, and right now we can't include screen shots in our documentation.  We've been discussing other formats to release the information in, but for now I am planning to post a bunch of it on my blog.  I have it mostly written up, but am trying to set up a web share to hold the pictures.  (I don't have any storage space to use with my blog, so I can only post text at the moment, but that should change soon.)

So look for more in the coming days!  I got to show off Remote Kernel Tracker to a lot of folks at MEDC, and got some good questions.  I hope that between the documentation posted above and the material I'll add to my blog soon, you'll get to be as much an expert in using CeLog and Remote Kernel Tracker as I am.

Happy logging!

Sue

Posted by sloh | 0 Comments

How does the OS use the registry?

Q: As an embedded programmer I am new to all windows development. And as such I am ignorant of the registry. It seems to play a crucial part in the initialization of the OS (device drivers, services, etc) Do you know of a good source of information about how the OS (not individual applications) uses the registry?

A: Sorry, I don't have any good sources of information to point you at. The registry is sort of a cross between a database and a file system -- the data is stored hierarchically like a file system, in small values with different types. The OS mostly uses the registry for reading in configuration information that you (or applications) can change. There isn't one big document for all registry information that is used by the system, but I know our documentation team has been making an effort to document all the settings on an area-by-area basis. For example, http://msdn.microsoft.com/library/en-us/wcedata5/html/wce50oriFileSystemsDataStoreRegistrySamples.asp

Most parts of the OS have default registry settings that they use, some of which may be controlled by relatively high-level flags you can set like PRJ_* and IMG*. http://msdn.microsoft.com/library/en-us/wceosdev5/html/wce50conPRJEnvironmentVariables.asp

The OS should be able to boot and run based on these variables, without you having to tweak many (any?) individual settings. Then you can tweak things to modify OS behavior as needed. The hard question I can't answer is what settings you might want to tweak. I guess you will have to find those out as you go.

Sorry I'm not more help than that,

Sue

 

More CE database Q&A

Posting some questions I received from my blog.

1. You mentioned the that the maximum size of an object store is 256MB. Which versions of Windows CE does this apply, Windows Mobile 2003, Pocket PC 2002?

It applies to all versions of Windows CE.

2. Do the limitations of database in the object store apply to databases created in the filesystem?

Databases in the object store cannot exceed CEDB_MAXNUMRECORDS (=64K) records.  Database volumes (files containing databases) cannot exceed 16MB and the databases inside those volumes still cannot exceed CEDB_MAXNUMRECORDS records.

3. Is there a maximum size of a database record in an Object Store? If yes what is it?

4. Is there a maximum size for a field in a database record? If yes what is it?

For both the object store and database volume files, one record cannot have more than CEDB_MAXRECORDSIZE bytes (=128KB), and one property value cannot exceed CEDB_MAXPROPDATASIZE bytes (a little short of 64KB).

Be careful of the CEDB terminology difference compared to other database engines: A CEDB "database volume" is a "database" to other engines, and a CEDB "database" is a "table" to other engines.  The database volumes were created to let you store your data in a file instead of in the object store, so you can think of the object store as just a special-case "default" database volume.

If you are running into these size limitations, you are probably far better off with SQL-CE than CEDB.  Or with a new offering (as of CE 5.0), EDB, which is a trimmed-down version of SQL that presents a programming interface very similar to CEDB for easy porting.  CEDB has very poor performance once you go over ~1000 records.

 

Posted by sloh | 4 Comments
Filed under:

GetTickCount – Truth and Fiction

I was surprised and dismayed to read a recent article in Embedded Systems Programming (http://www.embedded.com/showArticle.jhtml?articleID=159902113) that gets so many things wrong about the GetTickCount API and portrays Windows CE so negatively.

Apparently I’m behind on my reading, since the same author also wrote a previous article on the same subject, to which Mike Hall wrote a rebuttal on his own blog: http://blogs.msdn.com/mikehall/archive/2005/01/28/362498.aspx  Reading the more recent article makes me want to write my own rebuttal.

GetTickCount is a pretty simple API from the caller’s perspective.  Every millisecond the tick count increments, and you can use GetTickCount to retrieve the number of milliseconds since boot.  Since GetTickCount returns only a 32-bit number, after about 49 days, the counter wraps.  This is documented behavior, and to properly use GetTickCount you have to understand that.  Typically GetTickCount is used to time the duration between two events, in which case you’re generally safe if you subtract two values; subtraction is safe in the presence of rollover.  (eg. If you get a tick count of 0xFFFFFF00 before the rollover, and a tick count of 0x200 after the rollover, subtraction gives you get a difference of 0x300 as expected.)  The only time subtraction can get you into hot water is if there’s a chance the time delta will exceed 49 days, because you may end up needing a difference that’s larger than you can represent in 32 bits.  In which case GetTickCount is the wrong API for you.  I guess in that case you would probably need to implement something using GetSystemTime and SystemTimeToFileTime.  Applications that use GetTickCount get into trouble if they are subtracting over such a long time period, or if they are using something besides subtraction with the tick count.  For example,

   if (GetTickCount() > MyTickValue) { ... }

will also get you into trouble.  If I were to guess, that’s where I’d say most applications probably go wrong using this API.

To help catch such errors in applications and drivers, Windows CE does a little thing on debug builds – it initializes the tick count such that it rolls over 3 minutes after boot.  The author talks about our 3-minute rollover as if it’s indicative of a problem in the OS, when really it’s just a meager attempt to help catch bugs in applications.  It’s not much help really, if you ask me, but it might help catch a bug or two.  I’d love to improve on it, but it’s tough to arrange for the timer to roll over at a really useful time for testing your application.  For example you might think we could create an IOCTL you could call, to set the timer at run-time.  But making the timer jump at run-time could mess things up.  Suddenly drivers and applications would think 47 days have passed, maybe network drivers time out, all your appointments fire, who knows…  I’m making things up since I don’t really know how networking or appointments are implemented, but you get the idea.  If you really want to be careful about testing your application or driver that uses GetTickCount, probably the best thing to do is create a wrapper that your code uses to call GetTickCount, and arrange for your wrapper to manipulate the times.  I’d love to see suggestions, and if you can come up with something good we can do in the OS, hey, maybe we will take your advice.

So now I’ll describe how GetTickCount is implemented internally.  First off, the function is technically owned completely by the OEM, though we provide as many implementations as we can manage, so that OEMs can use those.  The implementation varies per CPU and per OAL, but in general, there’s a 32-bit counter, CurMSec, that is incremented once per millisecond with a timer interrupt.  That millisecond timer interrupt is also used for other things, like scheduling threads.

To conserve power, when the kernel has no threads to schedule, the system goes into an idle state (implemented by the OEM’s OEMIdle function) where the timer interrupt is extended to a longer period.  That allows the CPU to spend a longer time in a low-power state.  The idle period is ended when an interrupt fires (the extended timer interrupt or some other interrupt).  When the system leaves the idle state, OEMIdle updates CurMSec with the amount of time that the system was idle, using whatever timer the hardware has.

All the magic is really in the OEM’s implementation of OEMIdle.  If you want some code to look at, see the CE 5.0 help article.

Now, back to the article.  Here are my responses:

  • One of the things that floors me about this article is that it claims that GetTickCount counts downward, not upward.  That is just plain wrong.  I don’t know what gave the author that impression but the counter starts out at 0 and goes up.  If you see any documentation or code that claims otherwise, please tell me and we’ll get it corrected.
  • The author also asks whether the counter “sticks” at the same value after rollover, which it doesn’t.  He seems to imply that it does though.
  • GetTickCount also works just fine on 16-bit and 64-bit CPUs, though the author implies that it doesn’t.
  • The article brings up cases where counters are non-monotonic, where the counter jumps backwards by a few ticks.  If that happens, it means the timer is not implemented correctly by the OEM.  Most likely something in OEMIdle is not right.  We do provide documentation of how to implement a timer (here’s some), standard implementations for different CPUs, and tests to verify that the timer is implemented correctly.  But I’m sure some people would argue that we don’t do enough to help OEMs get this right, and maybe they’re right.  Let’s discuss it.  What else would you like to see?  What trouble have you had?
  • All the other examples of badness that the author uses come from desktop Windows as far as I can tell, and they are problems with applications that use GetTickCount improperly, not with Windows itself.
  • You could argue that this is too complicated, that GetTickCount should just read the timer hardware straight, and scale to milliseconds.  I think the main reason it was done this way was to standardize between hardware with a count-compare type timer and hardware with a count-down timer.  It also avoids doing division to convert to milliseconds, especially 64-bit division since 64-bit timers are common.
  • The author actually suggests using a stopwatch to measure elapsed time during a performance test.  Even if that was accurate enough, it’s a pain and it’s not automatable.  How could you run tests regularly to make sure that nothing got worse?  I fully support the author’s idea that you should loop many times, so that the performance test runs long enough that you could time it with a stopwatch.  But that’s for reasons of repeatability, to get rid of variance.  I don’t believe a stopwatch is the right answer.  For timing code run-times, GetTickCount and QueryPerformanceCounter have satisfied every need I’ve seen so far.

I guess the thing I disliked the most was that the author took a list of implementation complications, application bugs from desktop Windows, incorrect information and “what if’s,” and turned it into a negative portrayal of Windows CE.  Maybe I am just too sensitive about the product I pour so much energy into.  Too many people assume that Microsoft developers don’t care, when the truth is very much the opposite.

Oh man has this discussion gotten long.  Have I ever got just a few words to say about something?  Oh well.  Write back if you have opinions to add or if you think I got any technical details wrong.

Sue

On the lighter side

It is actually pretty funny to read this page in gizoogle: http://sites.gizoogle.com/?url=http://blogs.msdn.com/sloh/

It's really just an updated version of "jive," but it has been a while since I played with those filters so it's fun again.  :-)

Im tha Double O G. and I'm not straight trippin' any raps.  Yeay cuz its a pimp thang!

 

Posted by sloh | 0 Comments
Filed under:

DevCon 2005

Yeay!  I'm going to this year's DevCon, which covers both Mobile and Embedded this year (instead of just Embedded as I've gone to in years past).  Should be a hoot.  I'm signed up as plain staff, not giving any talks (so far!), so I'll be working the booth, Ask the Experts, etc. For more info on DevCon see http://www.medc2005.com/.

Sue

 

Posted by sloh | 0 Comments
Filed under:

Resolving Symbols Manually on Windows CE (ADDRESS --> SYMBOL)

You know you've been there.  You are looking at a callstack, and the one function you want to know the identity of, failed to resolve symbols or has bad symbols.  Dang!  If only you knew what function 0x12345678 refers to, you'd find your bug.

First thing's first: you have to understand the Windows CE virtual memory layout.  You have to understand what slots are, and recognize slot 0 & 1 addresses versus addresses from process slots.  That’s why I blogged about that first.  If you internalized all of those details already, then you’d know that 0x12345678 was inside the process that’s running in the slot starting at 0x12000000.  In fact, it’s at offset 0x00345678 from the start of that .exe.  Heck, you’re almost to the symbol already!

But this shows that the 2 main steps to resolving a symbol are: figuring out what module the address lies within, and figuring out what symbol lies at that offset within the module.  If you need to get all the way down to the line of source code involved, you can figure that out too, by figuring out what line of source code is at that offset within the function.

Figuring out what module it is

To figure out what function or variable is at a particular address, first you need to know what module (DLL or EXE) is loaded into that chunk of RAM.  So first you eyeball the address and decide whether it’s a DLL or an EXE.  Since EXE code & data starts at the bottom of the slot, and DLL code & data starts at the top, it’s pretty easy to figure out.  0x12345678 is an EXE, 0x13987654 is a DLL.

Then you look for the DLL or EXE that is the closest match.  There are a bunch of ways to do this.  You can open up the Modules and Symbols Window, sort by address, and find the module that matches.  If you are looking up the symbol for a function, sort by “Image Address Range.”

Modules and Symbols Window
Module         Image Address Range    Relocated Data Address Range

lmemdebug.dll  0x01AC0000-0x01AC4FFF
kbdmouse.dll   0x01BA0000-0x01BC1FFF
devmgr.dll     0x03EE0000-0x03EE9FFF  0x01FE9000-0x01FE92A4
fsdmgr.dll     0x03F50000-0x03F60FFF  0x01FF7000-0x01FF73F0
coredll.dll    0x03F80000-0x03FD8FFF  0x01FFB000-0x01FFBD64
filesys.exe    0x04010000-0x04041FFF
shell.exe      0x06010000-0x06022FFF
device.exe     0x08010000-0x08012FFF
gwes.exe       0x0C010000-0x0C0CBFFF
services.exe   0x12010000-0x12017FFF
nk.exe         0x81200000-0x8123DFFF  0x81FA0000-0x81FFD7BF
hd.dll         0x8123E000-0x8123FFFF  0x82002000-0x82002954
osaxst0.dll    0x81240000-0x81246FFF  0x82003000-0x82006328
giisr.dll      0x81249000-0x8124AFFF  0x82008000-0x82008504
osaxst1.dll    0x82042000-0x82046FFF
kd.dll         0x82047000-0x82060FFF

If you’re looking up the symbol for a variable, then for a DLL or EXE with a “Relocated Data Address Range” the variable would fall inside that range; for a DLL or EXE that does not have relocated data, the variable would fall inside the image address range.  So you may need to sort on both in order to find the right address range.

Another option you have is that you can type “gi proc” or “gi mod” in the Target Control Window to get the location where the code is loaded.  The exception for this is the kernel, nk.exe, which reports a different value under “gi proc” than where its code is loaded.  (These lists don’t show relocated data addresses, so they’re only useful for looking up function symbols.)

Target Control Window

Windows CE>gi proc
PROC: Name            hProcess: CurAKY :dwVMBase:CurZone
 P00: NK.EXE          03fb4002 00000001 c2000000 00000000
 P01: filesys.exe     03f4b69e 00000002 04000000 00000020
 P02: shell.exe       03dcae86 00000004 06000000 00000000
 P03: device.exe      e3d60f6e 00000008 08000000 00000000
 P05: gwes.exe        4389f8aa 00000020 0c000000 00000000
 P08: services.exe    03497182 00000100 12000000 00000000
 
Windows CE>gi mod
 MOD: Name            pModule :dwInUSE :dwVMBase:CurZone
 M35: coredll.dll     83fb4794 7ffbffff 03f80000 00000000
 M45: devmgr.dll      83d36250 00000008 03ee0000 00000000
 M58: fsdmgr.dll      83fb1ea0 00000002 03f50000 00000000
 M60: giisr.dll       8390c5dc 00000009 81249000 00000000
 M65: hd.dll          83fb425c 00000001 8123e000 00000000
 M75: kbdmouse.dll    8383be54 00000020 01ba0000 00000000
 M76: kd.dll          82ce7e40 00000001 82047000 00000000
 M80: lmemdebug.dll   83d60d30 0000003c 01ac0000 00000000
 M98: osaxst0.dll     83fb44c0 00000001 81240000 00000000
 M99: osaxst1.dll     83103d6c 00000001 82042000 00000000

Other possible options – if you have data from a Remote Kernel Tracker / CeLog output file (.clg), that file usually contains a “re-sync” (the output of the CeLogReSync API) where CeLog dumps a list of all the existing threads, processes and modules.  (This list doesn’t contain relocated data addresses either.)  Here is some sample output from the “readlog” command-line tool which parses CeLog output files, edited a little to fit this web format.

Readlog.exe Output File
--:--:--.---.--- : Marker, counter frequency=1193180 Hz, default thread quantum=100
 0:18:33.578.716 : ProcessCreate, dwVMBase=0x81200000, NK.EXE
 0:18:33.578.779 : ProcessCreate, dwVMBase=0x04000000, filesys.exe
 0:18:33.578.870 : ProcessCreate, dwVMBase=0x06000000, shell.exe
 0:18:33.578.901 : ProcessCreate, dwVMBase=0x08000000, device.exe
 0:18:33.579.666 : ProcessCreate, dwVMBase=0x0C000000, gwes.exe
 0:18:33.579.979 : ProcessCreate, dwVMBase=0x12000000, services.exe
 0:18:33.581.421 : ModuleLoad, dwBase=0x82047000, kd.dll
 0:18:33.581.450 : ModuleLoad, dwBase=0x82042000, osaxst1.dll
 0:18:33.582.422 : ModuleLoad, dwBase=0x01BA0000, kbdmouse.dll
 0:18:33.582.559 : ModuleLoad, dwBase=0x81249000, giisr.dll
 0:18:33.583.481 : ModuleLoad, dwBase=0x03EE0000, devmgr.dll
 0:18:33.583.494 : ModuleLoad, dwBase=0x01AC0000, lmemdebug.dll
 0:18:33.583.673 : ModuleLoad, dwBase=0x03F50000, fsdmgr.dll
 0:18:33.583.699 : ModuleLoad, dwBase=0x03F80000, coredll.dll
 0:18:33.583.714 : ModuleLoad, dwBase=0x81240000, osaxst0.dll
 0:18:33.583.727 : ModuleLoad, dwBase=0x8123E000, hd.dll
--:--:--.---.--- : Sync End Marker

Another option you have is to use the ToolHelp APIs (CreateToolhelp32Snapshot, etc) to programmatically gather this data.

It’s relatively straightforward to figure out, given an address, which module that address falls inside.  Use the “Price Is Right” matching: whichever module is closest to that address, without going over, is the matching module.  [“The Price is Right” is an old US TV game show.]  Address 0x8123FFFF is inside hd.dll; 0x81240000 is inside osaxst0.dll.

The main thing you have to be careful about is recognizing addresses that are inside process slots, and converting them to slot 0 addresses.  Address 0x09EE5678 is inside devmgr.dll, not device.exe.  (ERRATA: See comments below for the error in this statement.  I've left the text here unchanged so that you know what I was talking about.  For the sake of the remainder of this write-up, just pretend devmgr.dll loads at address range 0x01EE0000-0x01EE9FFF instead of the range I listed in all the examples above.)

Figuring out what function or global variable it is

Once you know what DLL or EXE the address is within, you can simply subtract the DLL or EXE base address from the address in question, to get the offset of the address from the start of the DLL or EXE.  So address 0x09EE5678 is at offset 0x00005678 from the start of devmgr.dll.

Then you look up the offset from the .map file for the DLL or EXE.  You should already have the .map files from most modules you work with, but if you are missing one, it may be because the WINCEMAP environment variable must be set to 1 when the module is  linked, in order to produce the .map file.

Here’s an excerpt from devmgr.map.  The important things in the .map file are the function names and the addresses on the right side (highlighted):

Preferred load address is 10000000

 0001:000044b0 _ConvertStringToGuid       100054b0 f devcore:devpnp.obj
 0001:00004546 _FindDeviceInterface       10005546 f devcore:devpnp.obj
 0001:0000456c _I_AdvertiseDeviceInterface 1000556c f devcore:devpnp.obj
 0001:00004774 _PnpAdvertiseInterfaces    10005774 f devcore:devpnp.obj
 0001:00004a9b _PnpDeadvertiseInterfaces  10005a9b f devcore:devpnp.obj
 0001:00004ae5 _InitializePnPNotifications 10005ae5 f devcore:devpnp.obj

The address on the right is saying where the function would live in RAM if the DLL was loaded at its PreferredLoadAddress – which is stored at the top of the .map file, but is almost always 0x10000000 for a DLL.  Which makes it easy, because you just chop off the top part of the address to find out where the function would live if the DLL was loaded starting at 0.  So offset 0x00005678 from the beginning of devmgr.dll is the symbol I_AdvertiseDeviceInterface.  Any offset between 0x0000556C and 0x00005773 inside devmgr.dll falls inside I_AdvertiseDeviceInterface.  And, since devmgr.dll is loaded at address 0x03EE0000, any address between 0x03EE556C and 0x03EE5773 corresponds to I_AdvertiseDeviceInterface.  So again you are basically using “Price is Right” matching.

EXEs are a little different, but even easier.  Most EXEs will be linked with a PreferredLoadAddress of 0x00010000, but that is the spot the Windows CE kernel always loads an EXE at anyway.  So there’s no subtraction involved – just use the address to the right of the symbol.

Once you get used to doing this stuff, it’s not that hard, just a little manual labor.  In general the formula for going from ADDRESS à SYMBOL is:

   (.map file offset) = (address) – (module base addr) + (module preferred load addr)

And remember to use “Price is Right” matching for the module and the symbol: whichever has the closest address without going over is the closest match.

Figuring out what source line it is

You can even get down to the source line if you have the COD files to look at.  I have rarely had to do this kind of lookup, but it is occasionally useful.  The two main uses I’ve had for it are: figuring out exactly which line of source/disassembly a crash or debug break occurred on, and figuring out which instance of a function call is involved in a callstack.  For example if function Foo calls function Bar twice, and I’m looking at a callstack which involves Foo calling Bar, I may want to know which invocation of Bar it was inside Foo.  The address given for Foo in the stack is the address where the jump to Bar occurred, so I can use the address given for Foo in the stack to figure out what line of code the jump occurred on.

COD files come from compiling the code with the WINCECOD environment variable set to 1.  By default WINCECOD is not set, so you will not have COD files by default.  And COD files are not shipped with the Platform Builder CDs or with the Windows CE source.  So unless you have buildable source for the OS, this work is only possible for your own code.

You may have noticed that the MAP file lists the OBJ file the function came from.  For example I_AdvertiseDeviceInterface came from the devpnp.obj file which was linked into devcore.lib.  This means the code was in “devpnp.c” or “devpnp.cpp.”  If you know where this source file is, you can go to the right directory, set WINCECOD=1, build –c, and then open the COD file from obj\%_TGTCPU%\%WINCEDEBUG%.  It’ll be named after the C or CPP file – in this case devpnp.cod.  The function name will appear several times in the COD file, but in only one case will the function name be followed by the word “PROC,” and this is the beginning of the assembly code for that function.  Here’s a sample that’s munged a little from reality.

_I_AdvertiseDeviceInterface PROC NEAR                 ; COMDAT

; 348  : {

  001e8     55          push ebp
  001e9     8b ec       mov  ebp, esp
  001eb     81 ec 14 02 00
      00          sub  esp, 532         ; 00000214H
  001f1     a1 00 00 00 00    mov  eax, DWORD PTR ___cke
  001f6     53          push ebx
  001f7     56          push esi

; 349  :     ULONG result = ERROR_SUCCESS;

  001f8     33 f6       xor  esi, esi

As you can see, the source code and line numbers are included in the COD file next to the assembly code that implements the source.  Each assembly instruction is labeled with an address.  So the COD file provides the mapping from function offset to source line.  Sometimes the assembly for the function is labeled starting with 0, and sometimes not.  (I believe it may depend on which compiler is being used – meaning it varies by CPU type.)  If not, you have to take the function start label into account.  So in this example the function starts at label 0x001E8.

So the formula for the line of assembly to look for is:

   (assembly label) = (.map file offset) – (function start addr from .map) + (function start label from .cod)

And to complete my example, subtracting the start offset of I_AdvertiseDeviceInterface 0x0000556C from the address 0x00005678, I find that the address I'm interested in is at offset 0x0000010C from the beginning of the function.  Add that to the start label 0x001E8 of I_AdvertiseDeviceInterface from the COD file, and I am looking for the assembly line labeled 0x002F4.  This line will tell me exactly what assembly code is at that address, and what line of source code that corresponds to.

More Posts Next page »
 
Page view tracker