
Nicolas Besson, one of our MVPs, posted a nice series of articles about power management in Windows CE that I thought I'd bring some attention to:

 

For those of you that enjoyed Sue's excellent article CE6 OAL: What you need to know, the presentation the article draws from is now posted online at Channel9 here: Porting a CE5.0 BSP to CE6.0.

Hopefully we'll post a similar presentation about porting kernel-mode drivers in the future.

Posted by Wes Barcalow

Following on to Sue’s previous posts describing the paging pool and memory management, I wanted to talk about how drivers can be made pageable for additional virtual memory savings.

Windows CE has features to allow for more data and code to be used on a device than the available RAM.  It does this by ‘paging’ resources into RAM from fixed or read-only storage (ROM/Flash), and discarding pages if the overall amount of RAM available in the system becomes too low.  In systems where code cannot execute directly from ROM, this paging is the only available way to use storage to offset RAM usage.  This is the case for NAND Flash, which is more performant and of lower cost than NOR Flash (which does allow XIP, or eXecute In Place).

Some code and data in the system is read (‘paged’) into RAM and ‘locked’ there – it is marked as non-pageable after it is loaded.  This code and data must remain available even when the storage it was retrieved from is not.  For example, to achieve the best power saving on entry to a low-power or deep-idle mode, it is preferable to turn off the power to a NAND Flash chip, so anything that runs in that state cannot be paged in from it.

Applications are typically pageable, since the operating system completely stops the threads of applications before entering a low power mode.  At this point, since the application's code will not be executed and its data cannot be accessed, such code and data is not needed and can be ‘paged out’.  For device drivers things are slightly different.  Most drivers written for Windows CE / Windows Mobile are by default loaded non-pageable by device manager.  This means that no matter how big the driver is, it takes up all the RAM it wants to once it is loaded – none of it can be paged out.  In the case of user mode drivers, udevice.exe loads the driver instead of device manager, but it too uses the same criteria for choosing between pageable and non-pageable modes.

With an increase of functionality or flexibility in a driver comes an increase in size.  A camera driver that supports many formats or many features may be very large.  However, if the camera is not used for a long time, then the RAM it takes up is not being put to efficient use.  It makes sense to make this type of driver pageable by default instead.

To make a driver pageable, these steps have to be taken.

1)      Tell device manager you want the driver to be pageable.

2)      Tell the kernel that pageable mode is allowed.

3)      Identify and flag code that is needed to be non-pageable.

The last step is slightly more complicated than the first two steps.  Even though you may have a large driver, you may still need portions of it to be non-pageable.  The most important parts of a Windows CE / Windows Mobile driver that cannot be paged out are functions that execute when the file system is not in operation.  If you do not have such functions in your driver then you do not have to worry about making them non-pageable.

Marking a driver as pageable happens in two steps.  The first is a registry setting for that driver.  The driver may or may not already have a “Flags” registry entry.  To enable the driver to be paged, ensure there is a registry value named “Flags” of type DWORD and that the DEVFLAGS_LOADLIBRARY bit (0x02) is set in its value.  If other flag bits are already set, simply bitwise-OR this bit with what is already there.

Here is an example of what a GPIO driver registry setting might look like in platform.reg:

[HKEY_LOCAL_MACHINE\Drivers\BuiltIn\GPIO]

   "Dll"="gpio.dll"

   "Flags"=dword:10002   ;Trusted caller only & pageable

   ...

The second step for marking a driver pageable is ensuring the ‘M’ flag of the binary image builder file (BIB file) is not set. The purpose of the ‘M’ flag is to inform the kernel not to demand page the driver, thus forcing the driver to be completely loaded into RAM.

Here is an example of what a GPIO bib file entry might look like that allows the driver to be loaded in a pageable mode by the kernel:

gpio.dll $(_FLATRELEASEDIR)\gpio.dll  NK SH

 

Notice that there is no ‘M’ flag among the flags at the end of the statement.  A user wishing to force the driver into a non-pageable mode would use “SHM” instead of “SH”, or alternatively would clear the DEVFLAGS_LOADLIBRARY bit in the registry.  Either approach is valid.

It is also worth pointing out that a trusted user can potentially change the registry at run time, thus changing a driver from non-pageable to pageable and back again.  The bib file flag, however, is built into the image and cannot be overridden.  Both are viewed as equally secure, as only a trusted caller can change the registry, though the bib file flag provides a predictable pageable status when loading the driver.
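The two settings can be summarized in a few lines of C.  This is only a sketch of the decision as described above, not the device manager's actual code: the DriverConfig structure and both function names are made up for illustration, and the 0x02 value is the DEVFLAGS_LOADLIBRARY bit quoted earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEVFLAGS_LOADLIBRARY 0x00000002u  /* bit value quoted in this post */

/* Hypothetical model of the two settings that control pageability. */
typedef struct {
    uint32_t registry_flags;  /* the driver's "Flags" registry value */
    bool     bib_m_flag;      /* true if the BIB entry carries the 'M' flag */
} DriverConfig;

/* Add the pageable bit while leaving every other flag bit untouched. */
uint32_t flags_with_pageable(uint32_t existing_flags)
{
    return existing_flags | DEVFLAGS_LOADLIBRARY;
}

/* A driver loads pageable only when BOTH conditions hold: the registry
   asks for it and the BIB entry does not forbid it with 'M'. */
bool driver_loads_pageable(const DriverConfig *cfg)
{
    assert(cfg != NULL);
    return (cfg->registry_flags & DEVFLAGS_LOADLIBRARY) != 0 && !cfg->bib_m_flag;
}
```

Starting from a Flags value of 0x10000, flags_with_pageable yields the 0x10002 used in the GPIO registry example, and setting bib_m_flag makes driver_loads_pageable return false regardless of the registry value.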

The final, more complicated step from above is to identify and isolate code that can’t be pageable. As mentioned above, this is code that runs in single threaded mode where the file system cannot page in or out code and data.  The most well-known examples of this are:

-          XXX_PowerUp

-          XXX_PowerDown

-          Interrupt Service Threads and Interrupt Service Routines (ISTs and/or ISRs) that may execute while the file system is inactive.

-          Read-Only constants that are accessed by these functions.

-          Any supporting code called by these functions.

-          All code associated with the file system path, as it is responsible for bringing in new pages.

Once the code is identified, it should be wrapped in #pragma statements that tell the compiler and linker which section the code belongs to and what properties that section has.  Below is an example of making XXX_PowerUp and XXX_PowerDown non-pageable.

// Section attributes: E = execute, R = read, !P = not pageable
#pragma comment(linker, "/section:.no_page,ER!P")

#pragma code_seg(push, ".no_page")

void XXX_PowerDown(DWORD hDeviceContext)
{
      // Perform single-threaded power off logic
}

void XXX_PowerUp(DWORD hDeviceContext)
{
      // Perform single-threaded power on logic
}

void UtilityFuncOne(void)
{
      // Non-paged utility function that can be called by
      // both paged and non-paged code
}

#pragma code_seg(pop)

void UtilityFuncTwo(void)
{
      // Paged utility function that can only be called by other
      // paged code.
}

 

This sample code shows the XXX_PowerDown and XXX_PowerUp code being marked as non-pageable.  This keeps the code resident in RAM, so the processor can access it while the file system is not in operation (during suspend and resume operations).  UtilityFuncOne is also in the non-paged section of code, which makes it safe to call from within XXX_PowerUp/Down.  However, the UtilityFuncTwo code is outside of the non-paged area, and is therefore pageable and at risk of not being available if the processor were to try to access it while performing suspend / resume operations.

To test for drivers that are marked as pageable but are critical to the suspend, resume, and shutdown code paths, the PageOutAllModules registry value can be used to instruct the kernel to page out all pageable code.  This helps find drivers that rely on pageable code when XXX_PowerUp and XXX_PowerDown are called while the file system is inactive.  By forcing page faults, problematic drivers can be identified more easily.  Below is what the registry setting looks like:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Power]

"PageOutAllModules"=dword:1

 

Set this registry value and force the system to suspend, resume, or shut down.  The OS will then page out all code marked as pageable and proceed with the suspend / resume / shutdown operation.  If a critical driver is improperly marked as pageable, this process will generate a page fault and the device will crash.  This technique helps ensure that all drivers are properly marked as pageable or non-pageable before the device is released to market.

By making your driver pageable you can decrease the load on the system for resources while a component or feature is not being used.  It is important to take care as outlined above to make sure some important parts of your driver can still function even though in general the bulk of it is ‘paged’.

 

I didn't learn about Reed & Steve's blog until today, but got there by learning about these posts:

If you have memory issues on CE/Mobile (especially if you already know that they're virtual memory problems) you may find those useful.

 

Posted by: Sue Loh

I’d like to explain a little more about memory management in Windows CE.  I already explained a bit about paging in Windows CE when I discussed virtual memory.  In short, the OS will delay committing memory as long as possible by only allocating pages on first access (known as demand paging).  And when memory is running low, the OS will page data out of memory if it corresponds to a file – a DLL or EXE, or a file-backed memory-mapped file – because the OS can always page the data from the file back into memory later.  (Win32 allows you to create “memory-mapped files” which do or do not correspond to files on disk – I call these file-backed and RAM-backed memory-mapped files, respectively.)  Windows CE does not use a page file, which means that non file-backed data such as heap and thread stacks is never paged out to disk.  So for the discussion of paging in this blog post I’m really talking only about memory that is used by executables and by file-backed memory-mapped files.

It’s relatively easy to guess how the OS decides when to page data in to memory – it doesn’t page it in until it absolutely has to, when you actually access it.  But how does the OS decide when to remove pageable data from memory?  Ahh, that’s the question!

The Paging Pool and How It Works

Back in the old days of CE 3.0 or so (I’m not sure) – Windows CE did not have a paging pool.  What that means is that the OS had no limit on the number of pages it could use for holding executables and memory-mapped files.  If you ran a lot of programs or accessed large memory-mapped files, you’d see memory usage climb correspondingly.  Usage would continue to go up until the system ran out of memory.  Other allocations could fail; memory would appear to be nearly gone when really there was a lot of potential to free up space by paging data out again.  Finally, when the system hit a low memory limit, the kernel would walk through all of the pageable data, paging everything (yes, everything) out again.  Then suddenly there would be a lot of free memory, and you’d take page faults to page in any data you were still actually using.

The algorithm is simple, but it has a few bad effects.  First, and most obviously, the system could encounter preventable RAM shortages.  Also, it was really tough for applications or tools to measure free memory – where “free” includes currently-unused pages plus “temporary” pages that could be decommitted when necessary.  Conversely, it was difficult for users to determine how much of an application’s memory usage was fixed in RAM vs. “temporary” pageable pages.   Even today it is tough to answer the question “how much memory is my process using?” in simple terms without diving into explanations of paging, cross-process shared memory, etc.  Another possible problem you can encounter when there’s no paging pool is that the rest of the system can take up all of the free memory, and leave you thrashing over just a few pages.

So we introduced the paging pool.  The purpose of the paging pool is to serve as a limit on the amount of memory that could be consumed by pageable data.  It also includes the algorithm for choosing the order in which to remove pageable data from memory.  Pool behavior is under the OEM’s control – Microsoft sets default parameters for the paging pool, but OEMs have the ability to change those settings.  Applications do not have the ability to set the behavior for their own executables or memory-mapped files.

Up to and including CE 5.x, the paging pool behavior was fairly simple.

·         The pool only managed read-only pageable data.  Executable code is read-only so it used the pool, and so did read-only file-backed memory-mapped files.  Read-write memory-mapped files did not use the pool, however.  The reason is that paging out read-write data can involve writing back to a file.  This is more complicated to implement and requires more care to avoid file system deadlocks and other undesirable situations.  So read-write memory-mapped files had no memory usage limitations and could still consume all of the available system RAM.

·         The pool had one parameter, the size.  OEMs could turn the pool off by setting the size to 0.  Turning off the paging pool meant that the OS did not limit pageable data – behavior would follow the pattern described above from before we had a paging pool.  Turning on the pool meant that the OS would reserve a fixed amount of RAM for paging.  Setting the pool size too low meant that pages could be paged out too early, while they’re still in use.  Setting the pool size too high meant that the OS would reserve too much RAM for paging.  Pool memory would NOT be available for applications to use if the pool was underutilized.  A 4MB pool took 4MB of physical RAM, no matter whether there was only 2MB of pageable data in use or 100MB.  Setting the size of the pool was a tricky job, because you had to decide whether to optimize a typical steady-state situation with several applications running (and judge how much pool those applications would need), or optimize “spike” situations such as system boot where many more pages were needed for a short period of time.

·         The kernel kept a round-robin FIFO ring of pool pages: the oldest page in memory – the earliest one to be paged in – was the first one paged out when something else needed to be paged in, regardless of whether the oldest page was still in use or not.

 

So the short roll-up of how the paging pool worked up through CE 5.x is that the paging pool allowed OEMs to set aside a fixed amount of memory to hold read-only pageable data, and it was freed in simple round-robin fashion.
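To make the round-robin behavior concrete, here is a toy model in C.  It is only a sketch of the policy described above, not kernel code: the FifoPool type, the function names, and the four-page pool size are all invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_PAGES 4      /* illustrative pool size, in pages */
#define NO_PAGE    (-1)

/* Toy model of a fixed-size pool that evicts in round-robin (FIFO) order. */
typedef struct {
    int    slot[POOL_PAGES];  /* page id held by each physical slot */
    size_t next;              /* slot that will be reused next (the oldest) */
} FifoPool;

void fifo_pool_init(FifoPool *pool)
{
    assert(pool != NULL);
    for (size_t i = 0; i < POOL_PAGES; i++) {
        pool->slot[i] = NO_PAGE;
    }
    pool->next = 0;
}

bool fifo_pool_contains(const FifoPool *pool, int page)
{
    assert(pool != NULL);
    for (size_t i = 0; i < POOL_PAGES; i++) {
        if (pool->slot[i] == page) {
            return true;
        }
    }
    return false;
}

/* Touch a page.  Returns the page id that was evicted to make room, or
   NO_PAGE if the page was already resident or an empty slot was used. */
int fifo_pool_touch(FifoPool *pool, int page)
{
    assert(pool != NULL && page >= 0);
    if (fifo_pool_contains(pool, page)) {
        return NO_PAGE;                 /* already resident: no page fault */
    }
    int evicted = pool->slot[pool->next];
    pool->slot[pool->next] = page;      /* oldest slot is reused, in use or not */
    pool->next = (pool->next + 1) % POOL_PAGES;
    return evicted;
}
```

Note that a hit does not refresh a page's position in the ring.  Touching pages 1, 2, 3, 4, then 1 again, then 5 evicts page 1 even though it was the most recently used, which is exactly the weakness of a pure FIFO.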

In CE 6.0, the virtual memory architecture changes involved major rewriting of the Windows CE memory system, including the paging pool.  The CE 6.0 paging pool behavior is still fairly simplistic, but is a little bit more flexible.

·         CE 6.0 has two paging pools – the “loader” pool for executable code, and the “file” pool which is used by all file-backed memory-mapped files as well as the new CE 6.0 file cache filter, or “cache manager.”  This way, OEMs can put limitations on memory usage for read-write data in addition to read-only data.  And they can set separate limitations for memory usage by code vs. data.

·         The two pools have several parameters.  Chief among these are the target and maximum sizes.  The idea is that the OS always guarantees the pool will have at least its target amount of memory to use.  If memory is available, the kernel allows the pool to consume memory above its target.  But when that happens, it also wakes up a low-priority thread which starts paging data out again, back down to slightly below the target.  That way, during busy “spikes” of memory usage, such as during system boot, the system can consume more memory for pageable data.  But in the steady-state, the system will hover near its target pool memory usage.  The maximum size puts a hard limit on the memory consumption – or OEMs could set the maximum to be very large to avoid placing a limit on the pool.  OEMs can also get the old pre-CE6 behavior by setting the pool target and maximum to the same size.

·         Due to the details of the new CE6 memory implementation, the FIFO ring of pages by age was not possible.  The CE6 kernel pages out memory by walking the lists of modules and files, paging out one module/file at a time.  This is no better than the FIFO ring, but still leaves us potential for implementing better use-based algorithms in the future.
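The target/maximum behavior can be sketched as a small state machine.  This is an illustrative model only: the Pool structure and function names are hypothetical, the trim "thread" is modeled as a flag plus a function call, and for simplicity it trims to exactly the target rather than slightly below it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one CE 6.0 paging pool; all sizes are in pages. */
typedef struct {
    size_t target;        /* pages always guaranteed to the pool */
    size_t maximum;       /* hard limit on pool pages */
    size_t in_use;        /* pages currently held by the pool */
    bool   trim_pending;  /* models "the low-priority trim thread was woken" */
} Pool;

/* Try to give the pool one more page.  Below the target the page is
   guaranteed; above it, the request succeeds only if the rest of the
   system has free memory and the hard maximum has not been reached. */
bool pool_alloc_page(Pool *p, size_t free_system_pages)
{
    assert(p != NULL && p->target <= p->maximum);
    if (p->in_use >= p->maximum) {
        return false;
    }
    if (p->in_use >= p->target && free_system_pages == 0) {
        return false;
    }
    p->in_use++;
    if (p->in_use > p->target) {
        p->trim_pending = true;   /* growing past the target wakes the trimmer */
    }
    return true;
}

/* What the trim thread does when it runs: page out back down to the
   target.  Returns the number of pages released. */
size_t pool_trim(Pool *p)
{
    assert(p != NULL);
    if (!p->trim_pending) {
        return 0;
    }
    size_t released = (p->in_use > p->target) ? p->in_use - p->target : 0;
    p->in_use -= released;
    p->trim_pending = false;
    return released;
}
```

With target 3 and maximum 5, the fourth allocation succeeds but sets trim_pending, the sixth fails, and pool_trim brings usage back to 3.  Setting target equal to maximum means trim_pending is never set, which corresponds to the fixed-size pre-CE6 behavior.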

 

There are some more details in our documentation under “Paging Pool” and “Paging Pool: Windows CE 5.0 vs. Windows Embedded CE 6.0.”

Overall, enabling the paging pool means that there is always some RAM reserved for code paging and we will be less likely to reach low-memory conditions.  In general it's better to turn on the paging pool because it gives you more predictable performance, rather than occasional long delays you’d hit when cleaning up memory when you run out.  But it does need to be sized based on the applications in use, which leads to my next point...

Choosing a Pool Size

In Windows CE (embedded) 5.0, the pool is turned off by default.  In Windows Mobile, the pool is turned on and set to a default size chosen by Microsoft.  I believe it varies between versions, but is somewhere in the neighborhood of 4-6 MB.  In CE6, the loader pool has a target size of 3MB and the file pool has a target size of 1MB.  Only the OEM of a device can set the pool size; applications cannot change it.

So how do you decide on the right pool size for your platform?  I’m afraid it’s still a bit of a black art.  :-(  There aren’t many tools to help.  You can turn on CeLog during boot and see how many page faults it records.  You can see the page faults in Remote Kernel Tracker, but in truth that kind of view isn’t much help here.  The best tool I know is that readlog.exe will print you a page fault report if you turn on the “verbose” and “summary” options.  If you get multiple faults on the same pages, your pool may be too small (you may also be unloading and re-loading the same module, ejecting its pages from memory, so look for module load events in the log too).  If you don’t get many repeats, your pool may be bigger than you need.  In CE6 you can use IOCTL_KLIB_GET_POOL_STATE to get additional information about how many pages are currently in your pool and how many times the kernel has had to free up pool pages to get down to the target size.  There aren’t any tools like “mi” that query the pool state, so you’ll have to call the IOCTL yourself.  On debug builds of the OS, there is also a debug zone in the kernel you can turn on to see a lot of detail about paging and when the pool trim thread is running.  But CeLog is probably a better choice to collect all of that data.
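If you export the faulting page addresses from a log into an array, counting the repeats takes only a few lines.  The sketch below assumes a hypothetical input, an array of page addresses in the order the faults occurred; it does not parse real readlog output, whose format is not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the faults that are repeats, i.e. faults on a page that had
   already faulted earlier in the same log.  A simple O(n^2) scan is
   enough for the short traces this kind of check is meant for. */
size_t count_repeat_faults(const uint32_t *fault_pages, size_t count)
{
    assert(fault_pages != NULL || count == 0);
    size_t repeats = 0;
    for (size_t i = 0; i < count; i++) {
        for (size_t j = 0; j < i; j++) {
            if (fault_pages[j] == fault_pages[i]) {
                repeats++;
                break;   /* count each fault at most once */
            }
        }
    }
    return repeats;
}
```

A high ratio of repeats to total faults suggests the pool may be too small (or that modules are being unloaded and reloaded); a ratio near zero suggests it may be larger than it needs to be.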

As I already mentioned, as of CE6 you can set separate “target” and “max” values for the paging pools.  I don’t really like the semantics of having a “max” – it isn’t dependent on the other usage or availability in the system.  If some application takes most of the available memory in the system, you’d want the pool to let go of more pages.  If you have a lot of free memory, and some application is reading a lot of file data, you’d want the pool to grow to use most of the available memory.  We supported the “max” as an option to limit the pool size, but I’m starting to think the best idea is to set your max to infinity, to let the pool grow up to the size of available memory.  We’ll still page out down to the target in the background.  I’d have liked to add more sophisticated settings like “leave at least X amount of free memory” but that’s quite difficult to implement.

You’ll want to examine your pool behavior during important “user scenarios” like boot or running a predefined set of applications.  If the user runs a lot of applications at once, or a really big application, or one that reads a lot of file data, they could go through pool pages pretty quickly.  There isn’t really a lot you can do about that.  We don’t even have a set of recommended scenarios for you to examine.  I wish we had more information and more tools for this, but I’ve described about all we have.

The approach I think most OEMs take is that they leave the pool at the default size until they discover a perf problem with too much paging (by profiling or otherwise observing) in a scenario that's important to users.  Then they bump it up until the problem goes away.  Not very scientific but it works, and it's not like we have any answer that's more scientific anyway.

What goes into the paging pool

This is repeating some of the information above, but in more detail – how do you know exactly what pages will use the pool and what pages won’t?  Keep in mind that paging is actually independent of the paging pool.  Paging can happen with or without the paging pool.  If you turn off the paging pool then you turn off the limit that we set on the amount of RAM that can be taken up for paging.  But pages can still be paged.  If you turn ON the paging pool then we enforce some limits, that’s all.  So this isn’t really a question of what pages can use the pool, it’s a question of what pages are “pageable.”

Executables from the FILES section of ROM will use the paging pool for their code and R/O data.  R/W data from executables can’t be paged out, so it will not be part of the pool.  Compressed executables from the MODULES section of ROM will use the pool for their code and R/O data.  If the image is running from NOR or from RAM, uncompressed executables from MODULES will run directly out of the image without using any pool memory.  Executables from MODULES in images on NAND will be paged using the pool.  (And by the way, I’m not terribly familiar with how we manage data on NAND/IMGFS so I might be missing some details here.)

Executables that would otherwise page but are marked as “non-pageable” will be paged fully into RAM as soon as they’re loaded, and not paged out again until they’re unloaded.  These pages don’t use the pool.  You can also create “partially pageable” executables by telling the linker to make individual sections of the executable non-pageable.  Generally code and data can’t be pageable if it’s part of an interrupt service routine (ISR) or if it’s called during suspend/resume or other power management, because paging could cause crashes and deadlocks.  And code/data shouldn’t be pageable if it’s accessed by an interrupt service thread (IST) because paging would negatively impact real-time performance.

Memory-mapped files which don’t have a file underneath them (a.k.a. RAM-backed mapfiles) will not use the pool.  In CE5 and earlier, R/O file-backed mapfiles will use the pool while R/W mapfiles will not.  In CE6, all file-backed memory-mapped files use the file pool.  And the new file cache filter (cache manager) essentially memory-maps all open files, so the cached file data uses the file pool.

To look at that information from the opposite angle, if you are running all executables directly out of your image – all are uncompressed in the MODULES section of ROM, and the image is executing out of NOR or RAM, then the loader paging pool is probably a waste.  You might still want to use the file pool to limit RAM use for file caching and memory-mapped files, but in that case you might want to turn off the loader pool.

Other Paging Pool Details

Someone once asked me whether the pool size affects demand paging.  It doesn’t change demand paging behavior or timing.  Demand paging is about delaying committing pages as long as possible, and it applies to pages regardless of the paging pool.  Pages can be demand paged without being part of the pool; they won’t be paged in until absolutely necessary, and then they’ll stay in RAM without being paged out.  Pool pages will be demand paged in, and may eventually be paged out again.

Another question was whether the paging pool uses up virtual address space.  Actually, no, it doesn’t.  The pool pages that are currently in use are assigned to virtual addresses that are already reserved.  For example, when you load a DLL, you reserve virtual address space for the DLL; and when you touch a page in the DLL, a physical page from the pool is assigned to the already-reserved virtual address in your DLL.  The pool pages that are NOT in use are not assigned virtual addresses.  The kernel tracks them using their physical addresses only.  The pool *does* use up physical RAM.  In CE5 it uses the whole size of the paging pool; on CE6 it consumes physical memory equal to the “target” size of the pool.  This guarantees that you have at least a minimum number of pages to page with, to avoid heavy thrashing over just a few pages when the rest of the memory in the system is taken.

Other Paging-Related Details

A related detail that occasionally confuses people is the “Paging” flag on file system drivers.  This flag doesn’t control whether the driver code itself is pageable.  Rather, it controls whether the file system allows files to be loaded into memory a page at a time or all at once.  On typical file systems like FATFS the “Paging” flag is turned on, allowing executables and memory-mapped files to be accessed a page at a time.   On other file system drivers, such as our release directory file system (RELFSD) and our network redirector, it’s turned off by default, causing executables and memory-mapped files to be read into memory all at once.  I believe the reasoning is to improve performance and minimize problems when the network connection is lost.

This flag actually derives from the original Windows CE implementation of memory-mapped files.  If the file system supported a couple of APIs, ReadFileWithSeek and WriteFileWithSeek, memory-mapped files on that file system would be pageable.  If the file system did not support those APIs, the memory-mapped files would be non-pageable, in which case they’d be read entirely into RAM at load time and never paged out until the memory-mapped file is unloaded.  The OS required pageability for special memory-mapped files like registry hives and CEDB database volumes, so file systems that did not support the required APIs could not hold these files.  (If you ask me, there is no real need to require the seek + read/write to occur in one atomic API call, so the requirement on the “WithSeek” APIs was unnecessary, but perhaps there was a good reason back in the old days.)

As I already mentioned, the new CE6 file cache also uses the paging pool.  The file cache is basically just memory-mapping files to hold the file data in RAM for a while.  The file cache is enabled by default on top of FATFS volumes.

Posted by Kurt Kennett, Senior Development Lead, Windows CE OS Core

Operating system code, as one of my fellow developers recently realized, is “just code”.  It’s not voodoo and it does not exist on a higher plane of knowledge.  In fact, an operating system kernel is usually remarkably well structured and well designed in comparison to other pieces of software.  When you think about it, it has to be.  More than one person needs to understand and maintain a core set of code that must work and must support debugging of all other software that runs upon it.  People move on, and change jobs to look for new challenges to keep learning.  If only one person understood the way an operating system worked, there would be a huge amount of risk.

One of the most interesting facets to Operating Systems that I’ve followed in my career is how they start.  Initialization is the last step in design, but at the same time it uncovers the most fundamental bedrock of the principles used.  You start with literally nothing but a CPU which can execute instructions (sometimes not even with memory to use), and must take a platform from that point to a fully functioning system - one that not only utilizes available hardware, but abstracts it to a common understanding.

What I’m going to do in this article is discuss the details of how the Windows CE 6.0 kernel starts, and the association of the ‘Microsoft’ kernel code with the code that comes from an Original Equipment Manufacturer (OEM).  It is hoped that by relating this understanding more people will have a better idea of the ‘hows’ and ‘whys’ of the Microsoft design.

To start, let’s quickly review how the operating system software is built.  The Microsoft tool chain emits .EXE and/or .DLL program files.  Both file types use the “Portable Executable” (PE) format, and they are practically identical in every aspect:

  • They are extended Common Object File Format (COFF) files
  • They have import tables and export tables (EXE export tables are usually blank)
  • They have an entry point defined in their headers for where execution should start

There is nothing extraordinary about the operating system kernel program – it is compiled using the standard compiler and with a minimal set of definitions for that compiler.  An EXE file is produced (called NK.EXE).  It does not link to any external library or DLL – it can’t.  When this code starts there is nothing in the system, or even a system for that matter.  Since the EXE is in a known format (PE COFF), you can determine the entry point from looking at the EXE header.  This means we know where to set the CPU’s instruction pointer to so that the program can start.
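As an illustration of reading the entry point out of the header, here is a minimal C sketch for a 32-bit PE file held in memory.  The field offsets come from the published PE/COFF layout; the function name is made up, and the sketch deliberately ignores everything else in the file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 32-bit value from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Locate the entry point of a 32-bit PE file.  On success, stores
   ImageBase + AddressOfEntryPoint in *entry_va and returns true. */
bool pe32_entry_point(const uint8_t *file, size_t size, uint32_t *entry_va)
{
    assert(file != NULL && entry_va != NULL);
    if (size < 0x40 || file[0] != 'M' || file[1] != 'Z') {
        return false;                              /* no DOS header */
    }
    uint32_t pe_off = read_le32(file + 0x3C);      /* e_lfanew */
    /* Need: signature (4) + COFF header (20) + 32 bytes of optional header. */
    if (pe_off > size || size - pe_off < 4 + 20 + 32) {
        return false;
    }
    if (file[pe_off] != 'P' || file[pe_off + 1] != 'E' ||
        file[pe_off + 2] != 0 || file[pe_off + 3] != 0) {
        return false;                              /* no "PE\0\0" signature */
    }
    const uint8_t *opt = file + pe_off + 4 + 20;   /* optional header */
    if (opt[0] != 0x0B || opt[1] != 0x01) {
        return false;                              /* not PE32 (magic 0x010B) */
    }
    uint32_t entry_rva  = read_le32(opt + 16);     /* AddressOfEntryPoint */
    uint32_t image_base = read_le32(opt + 28);     /* ImageBase */
    *entry_va = image_base + entry_rva;
    return true;
}
```

For a file linked at image base 0x80000000 with an entry point RVA of 0x1000, the function reports 0x80001000, which is the address the instruction pointer would be set to.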

One additional property is that a PE file can be arranged so that it may “execute in place”.  This means that if the file data is placed at a particular virtual address, no changes need to be made to the program code in the file in order for it to address other code and data at the correct addresses.  For example, I can tell the Microsoft linker program to place the kernel program file at the virtual address 0x80000000.  Then references to code (function entry points) will be placed in the EXE file such that other code can jump to them by address.  If function foo() is at address 0x80001000 and inside its body it calls a function bar() which sits at address 0x80005000, there will be an instruction stored directly in the program code for ‘foo()’ that calls to 0x80005000.  The dotted lines are just the delineation of function code start or end.

If the EXE program file for the kernel could not sit at 0x80000000 and had to be moved, the ‘bar()’ function would move with it and the call instruction in ‘foo()’ would have to be changed to have the correct, new address.  Otherwise it would call to the wrong place:

 

You can see in the example above that if the kernel EXE file that is designed to be placed at 0x80000000 is loaded at 0x80050000 instead, the instructions in the program will be incorrect.

The process of changing an EXE or DLL program file after it has been loaded to reflect the actual load address is called “fixing up”.  Records are placed in a standard EXE file which allow the program file to be fixed up.  However, until the fixup process is done the addresses of functions in the EXE will be incorrect.  To get around this, Windows CE kernel EXE files are fixed up beforehand to be loaded at a specific address.  A program called ROMIMAGE actually pre-processes the kernel EXE file and some DLLs that are used and fixes them up when it builds the operating system image file (NK.BIN).
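In its simplest form, fixing up means adding the difference between the actual and the intended load address to every stored absolute address.  The C sketch below shows just that idea; it assumes a flat list of byte offsets where 32-bit addresses live, which is a simplification of the real PE relocation records, and the function name is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Adjust every recorded 32-bit absolute address in an image that was
   linked for preferred_base but has been placed at actual_base.
   fixup_offsets holds byte offsets from the start of the image.
   Values are read and written in the host's byte order. */
void apply_fixups(uint8_t *image, size_t image_size,
                  const size_t *fixup_offsets, size_t fixup_count,
                  uint32_t preferred_base, uint32_t actual_base)
{
    assert(image != NULL && image_size >= sizeof(uint32_t));
    assert(fixup_offsets != NULL || fixup_count == 0);

    /* Unsigned arithmetic wraps, so this also works when moving down. */
    uint32_t delta = actual_base - preferred_base;

    for (size_t i = 0; i < fixup_count; i++) {
        size_t off = fixup_offsets[i];
        assert(off <= image_size - sizeof(uint32_t));

        uint32_t addr;
        memcpy(&addr, image + off, sizeof addr);   /* avoids unaligned access */
        addr += delta;
        memcpy(image + off, &addr, sizeof addr);
    }
}
```

Using the numbers from the example, an image built for 0x80000000 but loaded at 0x80050000 has a delta of 0x50000, so the stored call target 0x80005000 for bar() becomes 0x80055000.  ROMIMAGE does the equivalent work at build time, so nothing needs to be patched on the device.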

To recap, we get a fixed-up EXE file which is called NK.EXE which contains portions of the operating system kernel.  This EXE has an entry point defined in it, the same as every other COFF EXE or DLL.  For execution to start, the bootloader for the system is supposed to put the image file at the right address, find this EXE entry point and jump to it. The bootloader is a separate discussion, and its startup and execution is very platform-specific.  For the context of this article, we will simply assume that a bootloader places the OS image file into memory at a specific address.  We will see below how the bootloader can find the NK.EXE file within the image and then find its entry point.

The NK.EXE is only part of the Windows Embedded CE 6.0 kernel – it comprises the OEM Adaptation Layer (OAL) and boilerplate code to start the system.  The main portion of the operating system kernel that does all the process, thread and memory functionality lives in a Microsoft-supplied DLL called ‘kernel.dll’.  This is a DLL which is also ‘fixed up’ by the ROMIMAGE program to live at a specific virtual address in memory.  So this means there are at least two executable modules that we need to know the location and the entry point of.  The entry point address is stored inside the EXE or DLL file, but what about the location of the EXE and DLL files inside the image?

Windows CE images have an important structure set up by ROMIMAGE that is placed into the image file, called the “Table Of Contents”, or TOC.  This TOC holds pointers and metadata for the operating system image file.  Somewhere near the beginning of the image file a marker is placed – the signature “CECE” (the DWORD 0x43454345, built from the ASCII codes 0x43 for ‘C’ and 0x45 for ‘E’).  Right after this marker is an offset to the TOC.  This allows a bootloader or other program looking at the file to find information about the image.  In addition to this offset value that is prefixed by a marker, the OAL must define a public symbol called ‘pTOC’ (exported using ‘C’ naming conventions), which ROMIMAGE can find and fill in with the virtual address of the TOC when it prepares the image file.  When compiled, the pTOC variable in the NK.EXE must have the value 0xFFFFFFFF.  When it prepares the NK.BIN OS image, ROMIMAGE does the following (in addition to other tasks):

  1. Load NK.EXE and fix it up.
  2. Make the TOC and find a place for it in the image file (it will live in virtual memory when the OS image is loaded).
  3. Find the ‘pTOC’ variable in the NK.EXE file and make sure it has the current value 0xFFFFFFFF.
  4. Set the pTOC variable value to the virtual address of the TOC that was created in step (2).

This way, when the NK.EXE starts it can reference this variable to know where the TOC is.  Using the TOC, the program can find all the other pieces of the operating system image.
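As a rough illustration of steps (3) and (4), the check-and-patch could be sketched as below.  This is not ROMIMAGE's actual code: the helper name `patch_ptoc`, its parameters, and the use of a plain variable to stand in for the pTOC slot inside the NK.EXE file are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder value the OAL compiles into pTOC (taken from the text). */
#define PTOC_UNSET 0xFFFFFFFFu

/* Hypothetical helper: 'ptoc_slot' stands in for the storage of the pTOC
 * variable inside the fixed-up NK.EXE; 'toc_va' is the virtual address
 * that was chosen for the TOC in step (2). */
bool patch_ptoc(uint32_t *ptoc_slot, uint32_t toc_va)
{
    if (*ptoc_slot != PTOC_UNSET) {
        return false;       /* step (3) failed: not the expected placeholder */
    }
    *ptoc_slot = toc_va;    /* step (4): record where the TOC will live */
    return true;
}
```

The placeholder check is what makes the scheme safe: if the symbol that was found does not hold 0xFFFFFFFF, it is probably not the variable the OAL meant to expose, and overwriting it would corrupt the image.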

ROMIMAGE uses the configuration .BIB files to know where the image is supposed to go and where RAM is.  There are two important parts of the CONFIG.BIB file – the RAMIMAGE and the RAM lines.  Here is an example from the Device Emulator’s CONFIG.BIB:

    NK      0x80070000   0x02000000    RAMIMAGE
    RAM     0x82070000   0x01E7F000    RAM

These entries tell ROMIMAGE what to do.  It knows to place the OS image file at 0x80070000, and that it can start using read/write memory at 0x82070000.  With this information it can place modules such as NK.EXE and KERNEL.DLL into virtual memory, and then build a TOC and put that into the image as well.  To help the kernel start, the TOC also contains information on where RAM is.  A more detailed look at what is in memory when the image file has been placed is shown below:

In order for the actual operating system to start, the bootloader needs to:

  1. Put the image file at the right place in memory.
  2. Find the “CECE” marker.
  3. Use the TOC pointer that comes right after it to find the TOC.
  4. Search the TOC for the “NK.EXE” file entry.
  5. Scan the EXE file to find its entry point (it is a standard PE format file).
  6. Jump to the address that corresponds to the entry point.
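Steps (2) and (3) can be sketched in C roughly as follows.  This is only an illustration under simplifying assumptions: the function name `find_toc_pointer`, the 4-byte-aligned scan, and the `max_scan` limit are hypothetical, and the marker value is passed in rather than hard-coded.  A real bootloader would then go on to parse the TOC and the PE headers, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: scans the start of an in-memory image for 'marker'
 * on 4-byte boundaries and returns the DWORD stored immediately after it
 * (the TOC pointer), or 0 if the marker is not found within 'max_scan'
 * bytes. */
uint32_t find_toc_pointer(const uint8_t *image, size_t image_len,
                          uint32_t marker, size_t max_scan)
{
    size_t limit = image_len < max_scan ? image_len : max_scan;

    for (size_t off = 0; off + 8 <= limit; off += 4) {
        uint32_t value;
        memcpy(&value, image + off, sizeof value);       /* safe unaligned read */
        if (value == marker) {
            uint32_t toc;
            memcpy(&toc, image + off + 4, sizeof toc);   /* DWORD after marker */
            return toc;
        }
    }
    return 0;
}
```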

The really interesting stuff happens once the NK.EXE program is started.  In broad strokes, it has its own tasks to perform:

  1. Set up virtual memory and turn it on.
  2. Gather important information that the KERNEL.DLL will need to use to run the system.
  3. Use the pTOC to scan the TOC for the KERNEL.DLL file inside the operating system image.
  4. Find the entry point of KERNEL.DLL (it is a standard PE format file).
  5. Pass critical information gathered in (2) to KERNEL.DLL in a call to its entry point.

We will walk through these activities in detail to better understand them.  Some parts of the startup process are CPU-type-specific.  For instance, the ARM CPU and the X86 CPU have different virtual memory management hardware and mapping structures.  However, to keep things consistent a general process is maintained.  Whenever possible I will attempt to call out any operations specific to an architecture.

When the NK.EXE starts, there are a few prerequisites of the system:

  1. All caches are disabled
  2. The entire RAMIMAGE and RAM regions specified in the CONFIG.BIB file are physically addressable and readable. 
  3. Virtual Memory is in a predefined state (CPU typically executes in physical address mode).

An additional prerequisite can be satisfied before NK.EXE starts, or can be done in the very beginning of NK.EXE execution:

  4. RAM should be writeable without any supplemental configuration (for example, of a memory controller).

These assumptions allow the NK.EXE startup code to do what is necessary to bring any particular system up, and not have to worry about some things being done and others not being done.  Point (3) above may be counterintuitive, but since the kernel must be entirely self-contained, it does not make sense for it to rely on the bootloader to configure virtual memory properly before it starts.  This ‘decouples’ the OS from whatever bootloader is used to start it.

When it starts executing instructions in physical address mode, the first action taken by NK.EXE is to calculate the physical address of the OEMAddressTable symbol.  This is a table that is built into the kernel that defines the static (unchanging) default regions of virtual memory. NK.EXE knows:

  1. Its own location in virtual memory (where it will be executing instructions)
  2. Its own location in physical memory (where it currently is executing instructions)
  3. The virtual address of the OEMAddressTable (it was determined when the NK.EXE was built and subsequently fixed up by ROMIMAGE).

Using this information, a simple calculation tells it the physical address of OEMAddressTable:

NK::PhysicalBase + (NK::Virtual OEMAddressTable – NK::Virtual Base) → NK Physical OEMAddressTable
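The same calculation written out in C, as a small sketch.  The function and parameter names are made up for this example; the real startup code typically does this in a few assembly instructions.

```c
#include <stdint.h>

/* Physical address of OEMAddressTable = where NK.EXE physically starts,
 * plus the table's offset from NK.EXE's virtual base. */
uint32_t oem_address_table_pa(uint32_t nk_phys_base,
                              uint32_t nk_virt_base,
                              uint32_t table_va)
{
    return nk_phys_base + (table_va - nk_virt_base);
}
```

For instance, with a hypothetical NK.EXE running physically at 0x00070000, fixed up for 0x80070000, and a table at virtual address 0x80071234, the result is 0x00071234.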

Each line of the OEMAddressTable is a triad of DWORDs, with the following format:

<region virtual start>          <region physical start>       <region size in MB>

<region virtual start>          <region physical start>       <region size in MB>

...

From the information found in this table, the NK.EXE program can set up the virtual memory mapping tables for the Memory Management Unit (MMU) to function.  Where the MMU-formatted mapping tables are kept and what they look like is platform-specific – the OEMAddressTable is a simplistic format that works for any architecture.  Virtual memory is set up using the data in the OEMAddressTable and enabled, and then the NK.EXE transitions to the virtual address where it can execute code.
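To make the table format concrete, here is a minimal C sketch that walks such a table to translate a virtual address into a physical one.  The struct and function names are hypothetical, and the sketch assumes the table is terminated by an entry whose size is zero; the real startup code uses the table to build MMU mapping structures rather than to translate single addresses.

```c
#include <stdint.h>

/* One line of the table: virtual start, physical start, size in MB. */
typedef struct {
    uint32_t virt_start;
    uint32_t phys_start;
    uint32_t size_mb;
} AddressTableEntry;

/* Looks up 'va' in the table.  Returns 1 and stores the physical address
 * in '*pa' if some region contains it, otherwise returns 0.
 * Assumption: the table ends with an entry whose size_mb is 0. */
int table_virt_to_phys(const AddressTableEntry *table, uint32_t va, uint32_t *pa)
{
    for (const AddressTableEntry *e = table; e->size_mb != 0; ++e) {
        uint64_t size = (uint64_t)e->size_mb << 20;   /* MB to bytes */
        if (va >= e->virt_start && (uint64_t)va - e->virt_start < size) {
            *pa = e->phys_start + (va - e->virt_start);
            return 1;
        }
    }
    return 0;
}
```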

One thing to note at this point is that anything that is supposed to be in RAM that needs to be pre-initialized (set to zero or some other known value) is not yet available.  RAM is still a clean slate and can have any contents whatsoever.  The initialization values in the image file (the .data sections of NK.EXE and other modules) for read/write data must be copied from the image to actual RAM addresses before they can be properly used.  How does the NK.EXE know what to copy or where to place things in virtual RAM for these modules?  The TOC.

The TOC not only lists the start addresses of all modules in the image, but it also describes RAM and where the read/write portions of each module are to be located so that the kernel can work with them.  Pieces of the OS image that need to be copied to RAM are called “copy entries”.  Before the NK.EXE can access its own read/write variables, it needs to copy the copy entries to RAM.  This raises a question – the pTOC is a variable, isn’t it?  How could the NK.EXE know where the pTOC is if it hasn’t been set up?  The answer is that the pTOC is a read-only variable – only ROMIMAGE writes to it, when the image file is created.  The storage for pTOC is not located in RAM, and does not need to be copied before its value can be used.  The function inside NK.EXE that copies all the copy entries described by the TOC (found through pTOC) to RAM is typically called “KernelRelocate()”.  It simply walks a table of structures and copies ranges of virtual memory from one place to another.  Once it is finished, all NK.EXE variables can be read from or written to just as in any other program.
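The copy loop can be sketched as follows.  This is a simplified illustration and not the actual KernelRelocate() source: the `CopyEntry` layout and the name `relocate_copy_entries` are assumptions, and zero-filling the part of the destination beyond the copied bytes is assumed here as the way uninitialized data gets its known value.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical copy entry: where the initial values sit in the image,
 * where they belong in RAM, and how large each part is. */
typedef struct {
    const void *source;    /* initial values inside the image          */
    void       *dest;      /* final read/write location in RAM         */
    size_t      copy_len;  /* bytes of initialized data to copy        */
    size_t      dest_len;  /* total size of the RAM region (>= copy)   */
} CopyEntry;

/* Copies every entry to RAM and clears whatever is left of each region. */
void relocate_copy_entries(const CopyEntry *entries, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        const CopyEntry *e = &entries[i];
        if (e->copy_len > 0) {
            memcpy(e->dest, e->source, e->copy_len);
        }
        if (e->dest_len > e->copy_len) {
            memset((char *)e->dest + e->copy_len, 0,
                   e->dest_len - e->copy_len);
        }
    }
}
```

Note that this code must not touch any of its own read/write globals while it runs, since those are exactly what it is in the middle of setting up.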

 

At this point we have a working program, just like any other program for decades past.  It executes instructions, can call functions, and can read and write memory locations.  There are no threads, no processes, and no operating system constructs, but everything is placed in a known location and can be accessed to let us do the rest of the startup of the higher-level systems.

Virtual Memory allows a tremendous amount of flexibility.  Windows CE reserves a few regions of the virtual address range for its own private use inside the OS kernel.  There are several ranges of 4k ‘pages’ of virtual memory that are set aside in the highest address ranges, from about 0xFFFE0000 upwards.  The kernel maps some physical memory into this range to store its ‘global’ dynamic data.  Some of this memory can be used for memory mapping tables for an architecture-specific MMU.  Some is reserved for the kernel-mode and interrupt stacks.  Most importantly, at least one of the 4k pages is reserved specifically as a ‘Kernel Data Page’.  This page contains a plethora of data fields that are specific to a version of the kernel.  The NK.EXE sets up the location and initial contents of this page directly.

NK.EXE stores three important values in this structure:

  1. A copy of pTOC
  2. The address of OEMAddressTable. 
  3. The address of the function OEMInitGlobals()

The first two pieces of information are placed in the Kernel Data Page so that any code that knows the address of the Page can find what is in the OS image and the basic layout of virtual memory.  The last piece of information is specifically used so that the NK.EXE contents can be used once control has been passed to KERNEL.DLL.  In general, the contents of the reserved portion of virtual memory look like:

 

Now that the Kernel data page has been initialized and virtual memory is active, we can jump into the Microsoft KERNEL.DLL executable’s entry point.  Remember, we can find the KERNEL.DLL file in the image by using the TOC, and then we can scan for the entry point of the module.  Even though NK.EXE knows where it is going to put the kernel data page in virtual memory beforehand, the KERNEL.DLL cannot assume its location.  Therefore, we pass the virtual address of the kernel data page to the entry point of KERNEL.DLL.  Although the Microsoft code can call back into the NK.EXE function addresses, control is never fully restored to the NK.EXE program.

After the jump, we are now executing Microsoft kernel code.  The code at the entry point is given the address of the Kernel Data Page and, through its fields, can reach the TOC to learn anything it needs to know about the OS image.  The kernel does some basic setup of its own and sets some critical data fields for its own use into the Kernel Data Page.

The KERNEL.DLL has a static table of functions and data, called “NKGlobals”, which is built into its DLL simply as a static data structure.  Since the KERNEL.DLL is fixed up by ROMIMAGE to run from a particular virtual address, the function pointers in the NKGlobals will be correct when the KERNEL.DLL code starts to run.  Some of the functions pointed to by this structure are ones like SetLastError() and NKwvsprintfW().  These are routines that the NK.EXE is allowed to call directly.  However, it is important to note that at this point the NK.EXE does not know where these functions are in KERNEL.DLL – it still needs to be told where this table of functions and data is inside KERNEL.DLL.

The KERNEL.DLL passes the address of “NKGlobals” back to NK.EXE in a function call to OEMInitGlobals(), the address of which was left in the Kernel Data Page.  So, in essence the function call graph looks like this:

 

As shown above, the OEMInitGlobals() function stores a pointer to the NKGlobals structure that resides in KERNEL.DLL.  After it stores this pointer, NK.EXE can use it to find the addresses of the KERNEL.DLL functions it is allowed to call.

OEMInitGlobals also passes back (via function return value) a pointer to its own structure, called “OEMGlobals”.  This structure is how the kernel gets access to all the platform-specific functionality inside NK.EXE.  The KERNEL.DLL module is constructed so that it will run on any processor belonging to a certain architecture (X86, ARM, etc).  The NK.EXE is the abstraction of a specific species of the architecture (such as an XSCALE or OMAP processor) and the platform that supports that architecture.  The OEMGlobals structure consists of function pointers and data just like NKGlobals.  Some of its members include:

  • PFN_InitDebugSerial(), PFN_WriteDebugByte(), PFN_ReadDebugByte()
  • PFN_SetRealTime(), PFN_GetRealTime(), PFN_SetAlarmTime()
  • PFN_Ioctl()

These function pointers point to the legacy OEM functions like OEMInitDebugSerial and OEMIoctl that live inside NK.EXE.  Many other functions are listed so that KERNEL.DLL can do what is necessary for a particular platform.  The functions are fairly self-explanatory in name and are well documented on MSDN.

Once the call to OEMInitGlobals() completes, the KERNEL.DLL has everything it needs to do architecture-generic and platform-specific processing.  It knows where memory is and how it is laid out virtually, as well as the location of every module in the image.  The NK.EXE also has a pointer to a table of functions it can call.  In essence, the two code modules have executed a manual ‘handshake’ by executing a simplistic method of manual dynamic linking.
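The handshake can be modeled in a few lines of C.  The sketch below is purely illustrative: every name ending in `Sketch`, the two-member tables, and the stub functions are hypothetical stand-ins, and both "modules" live in one file here, whereas in the real system they are separate binaries connected only through the address left in the Kernel Data Page.

```c
#include <stddef.h>

/* Heavily trimmed, hypothetical versions of the two tables. */
typedef struct {
    void (*pfnSetLastError)(unsigned long err);
} NKGlobalsSketch;

typedef struct {
    void (*pfnInitDebugSerial)(void);
    int  (*pfnIoctl)(unsigned long code);
} OEMGlobalsSketch;

/* ---- "NK.EXE" side ------------------------------------------------ */
const NKGlobalsSketch *g_pNKGlobal;   /* filled in by the handshake */
int g_debugSerialReady;

void OalInitDebugSerial(void)      { g_debugSerialReady = 1; }
int  OalIoctl(unsigned long code)  { (void)code; return 0; }

OEMGlobalsSketch g_oemGlobals = { OalInitDebugSerial, OalIoctl };

/* Stores the kernel's table and hands back the OAL's table. */
OEMGlobalsSketch *OEMInitGlobalsSketch(const NKGlobalsSketch *pNKGlob)
{
    g_pNKGlobal = pNKGlob;
    return &g_oemGlobals;
}

/* ---- "KERNEL.DLL" side -------------------------------------------- */
unsigned long g_lastError;
void KernSetLastError(unsigned long err) { g_lastError = err; }

const NKGlobalsSketch g_nkGlobals = { KernSetLastError };

typedef OEMGlobalsSketch *(*PFN_OEMInitGlobalsSketch)(const NKGlobalsSketch *);

/* The kernel only knows the function's address (in the real system it is
 * read from the Kernel Data Page), so it calls through a pointer. */
OEMGlobalsSketch *KernelHandshakeSketch(PFN_OEMInitGlobalsSketch pfnInit)
{
    return pfnInit(&g_nkGlobals);
}
```

After `KernelHandshakeSketch(OEMInitGlobalsSketch)` returns, each side holds a pointer to the other's table and can call through it, which is all the "manual dynamic linking" amounts to.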

Everything up to this point that NK.EXE and KERNEL.DLL have done has been done without any processes or threads, and without any kernel services running. To bring the rest of the system up, the KERNEL.DLL has to do three things:

  1. Architecture-specific setup
  2. Architecture-neutral setup
  3. Platform-specific setup (specific CPU and BSP initialization)

The architecture-specific setup is done first by a call to a KERNEL.DLL function called <architecture>Setup.  On an ARM platform this would be called ARMSetup().  On an X86 platform this would be called X86Setup().  The actions taken by the architecture-specific code are numerous, but they all execute in a single-threaded context with no processes running.  The actions taken here include but are not limited to:

  • Set up hard required page tables and reserve VM for kernel page tables
  • Update cache information in Page Tables
  • Flush the Translation Lookaside Buffer (TLB)
  • Set up architecture-specific buses and components (companion chips, coprocessors, etc).

The one other thing this architecture-specific code does is set up the Interlocked API code so that NK.EXE knows where it is and can call it.  This is a bit of an aside, but I will explain in detail because it is a critically important piece of the OS.

Even at the most basic level, Windows CE needs to coordinate actions among different threads of execution – even some that run inside the kernel, outside the scope of any specific process.  The mechanism used to do this with the highest amount of efficiency is the Interlocked API.  The API consists of a handful of functions, the most important of which is InterlockedCompareExchange().  The purpose of this function is to:

  1. Read a memory location (M) into register (R)
  2. Compare the value read (R) with a match value in another register (R2)
  3. If (R) and (R2) are not equal, exit
  4. Write the value of another register (R3) back to memory location (M)

These four steps are meant to execute atomically, and they form the basis of coordination between different threads.  That is, there should be no interruption between each of (1), (2), (3) and (4).  The only way to guarantee this on some of today’s processors where the operation is not available directly in hardware is to ensure interrupts are disabled.  Herein lies a problem, since user-mode processes do not have sufficient privilege to disable interrupts, and it would be very inefficient to have to do a system call to the kernel and disable interrupts every time two threads wanted to coordinate with each other.
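Written as plain C, the four steps look like the sketch below.  The function name is made up, and the code is deliberately not atomic on its own: it only shows the logic that has to be protected.  Like the real InterlockedCompareExchange(), it returns the value that was originally read, so the caller can tell whether the exchange happened.

```c
/* Non-atomic illustration of the compare-exchange logic. */
long compare_exchange_sketch(volatile long *target, long exchange, long comparand)
{
    long observed = *target;       /* (1) read memory location M into R     */
    if (observed != comparand) {   /* (2) compare R with R2                  */
        return observed;           /* (3) not equal: exit without writing    */
    }
    *target = exchange;            /* (4) write R3 back to M                 */
    return observed;
}
```

A caller typically loops: read the current value, compute a new one, and retry the compare-exchange until the returned value matches the one it started from.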

To be efficient, there is one single place in the entire system where the InterlockedCompareExchange() happens.  The code for the four steps above is placed in the Kernel Data Page, at a particular location that is well known.   Then the NK.EXE and KERNEL.DLL (and any process which has the Kernel Data Page mapped) can call the code, and the instructions all occur in the same place.  This is done so that the API is restartable.  What does this mean?  Why do we do this?

Thread switches in an operating system can happen for three reasons:

  • It has been specifically requested by the executing thread
  • The thread’s time-slice has expired (noted by a timer interrupt event) and it is another thread’s turn to run. 
  • Another type of interrupt occurs, which causes a situation where a thread of higher priority should execute.

The last two cases are really the same – an interrupt occurs that ultimately causes a thread switch.  Since an interrupt can occur between any of the steps (1) to (4) and potentially switch out the thread, the operation we needed to be atomic might not be – some other thread might run in between (2) and (3), for example.

To ensure that the instructions (1) to (4) occur atomically, every time there is an interrupt a simple bounds check is made to see if the CPU was currently executing somewhere in (1) to (4).  If the interrupt occurred when the CPU was executing after (1) and before (4), then the instruction pointer for the current thread is reset to point to instruction (1), so that the operation may be retried.  In order for the interrupt code to be able to check if the CPU was executing in between (1) and (4), the code for it must be in a single known location.  That location is inside the Kernel Data Page.
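The bounds check itself is tiny.  Here is a sketch in C, with a hypothetical saved-context structure and hypothetical names; the real check lives in the architecture-specific interrupt entry code and works on the actual saved program counter.

```c
#include <stdint.h>

/* Hypothetical saved state of the interrupted thread. */
typedef struct {
    uintptr_t pc;   /* address of the next instruction the thread will run */
} ThreadContextSketch;

/* 'seq_start' is the address of instruction (1); 'seq_end' is the address
 * just past the store in step (4).  If the thread was interrupted inside
 * that range, rewind it so the whole sequence runs again from the top. */
void restart_if_in_sequence(ThreadContextSketch *ctx,
                            uintptr_t seq_start, uintptr_t seq_end)
{
    if (ctx->pc >= seq_start && ctx->pc < seq_end) {
        ctx->pc = seq_start;
    }
}
```

Rewinding is harmless because nothing visible has changed until the final store completes: re-running steps (1) to (3) simply re-reads the memory location, which by then may hold a value written by another thread.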

Once the Interlocked API code has been copied to the Kernel Data Page, the NK.EXE knows where it is and can coordinate actions with KERNEL.DLL when multiple threads become active – ultimately by using the Interlocked API.

Back to our main discussion: the next step in the KERNEL.DLL startup is the architecture-neutral setup.  One of the first architecture-neutral steps is to check whether the OS image includes a KITL.DLL to allow communication with and debugging of the OS kernel.

KITL stands for “Kernel Independent Transport Layer”.  This is basically a mechanism by which data ‘packets’ specific to the Windows CE system can be passed between the kernel of the device and Platform Builder running on the desktop.  Usually, the portions of KITL that are implemented in NK.EXE deal purely with encoding the data packets for transport and with transporting them.  A Board Support Package (BSP) does not have to know anything about the data being sent and received between the device and the desktop – it just has to facilitate the correct transmission and reception.  Mechanisms for transport of the KITL packets include but are not limited to RS232 Serial, Ethernet, and USB.  A full description of KITL is beyond the scope of this blog article.

Other actions that happen during the architecture-neutral setup include:

  1. Initialize Kernel Debug Output (by calling OEMInitDebugSerial() through the function pointer in the OEMGlobals structure)
  2. Write a masthead debug string (“Windows CE Kernel Version xxxx”) to the debug output.
  3. Select the kernel processor type from the available options

When the architecture-neutral portions have been completed, we can do the platform-specific setup.  This code lives in NK.EXE since it is OEM and board specific.  To initialize this part, the kernel calls into OEMInit() through the function pointer that is in the OEMGlobals structure.  OEMInit does board-specific initialization, and can do one other important thing – start KITL.

If KITL is built into the NK.EXE, then its functions are directly accessible from NK.EXE.  If KITL is in a DLL, then that DLL will have been loaded by the kernel at the beginning of the architecture-neutral setup, as shown above.  In either event, the OEMInit() function can call a Kernel IO control saying that KITL should be started.  Based on whether the KITL.DLL was found or not, the kernel knows what to do.

Upon return from OEMInit(), the kernel is ready to start processes and threads to run.  It synchronizes its cache, and then enters the processor architecture’s service mode if it is not already running in it.  Then it does any one-time inits that do not require a current thread. These actions include:

  1. Enumerate available Memory  (optional call to OEMEnumExtensionDRAM() )
  2. Initialize critical sections in the kernel (critical section code uses the Interlocked API, the setup of which was discussed above).
  3. Initialize heap structures
  4. Initialize process and thread tracking structures
  5. Any other actions done before multi-threading is enabled.

After all single-threaded initialization is done, the kernel is ready to schedule the first thread.  This first thread is called “SystemStartupFunc()”, and lives in KERNEL.DLL.  To start the thread, the kernel specifies that there is no current thread to switch from, sets the first thread as the only one available to run, then calls into the thread scheduler code.   The scheduler code takes a look at all available threads and chooses the next one to run.  At this point in startup we only have one thread that has been manually set up to run, so that one is the one that is switched to.

The SystemStartupFunc() function begins execution by flushing the system cache, then does things that require a ‘current’ thread to be running in order to happen.  These actions include:

  1. Initialize the system loader
  2. Initialize the paging pool
  3. Initialize system logging
  4. Initialize system debugger

The SystemStartupFunc() will call one more OEM function before it completes initialization – it will call the OEMIoctl() function through the function pointer in the OEMGlobals, with an argument ‘OEM_HAL_POSTINIT’.  This tells the NK.EXE that all system startup has completed and we are about to schedule threads and proces