
Microsoft TechEd just announced they'll be holding a preconference session for Windows CE on Sunday, May 10th 2009.  The content will be 300 and 400-level presentations by CE experts, and you'll get a SPARK kit for registering.  For more information or to register, go here:

http://www.microsoft.com/windowsembedded/en-us/news/events/teched.mspx

In case you can't make it but want to get a SPARK kit anyway: http://www.microsoft.com/windowsembedded/en-us/products/spark/default.mspx

Some CE-related presentations from ESC Boston are now publicly available.  Topics include building a real-time system, the debugger, and the CE build system.  You can find the presentations here: http://msdn.microsoft.com/en-us/embedded/dd253223.aspx

Some of the demos will also be made available as labs that you can do on your own.  I'll update this post with a link once the labs are posted.

Update Feb 2009: Sorry, we decided not to post the labs after all due to cost issues.

Just a quick note that a detailed presentation about driver porting in CE6 is now available on Channel 9:

http://channel9.msdn.com/posts/TravisHobrla/Porting-Drivers-to-Windows-CE-60/

This presentation was developed by Juggs Ravalia and myself and has been floating around technical conferences (like MEDC) for a couple years.  Now it is finally available online!

Posted by: Russ Keldorph

In my previous post, I talked about how structure packing works.  Now I’d like to talk about when and why it’s commonly used as well as why you may or may not want to use it.  Let me start out by saying that by "structure packing" I'm referring to the use of the /Zp compiler switch or #pragma pack directive to make the packing of a structure something other than the default.  For example, using #pragma pack(2) around a structure containing an int type reduces the structure's alignment from its default of 4 to 2.  Alternatively, #pragma pack(1) around a structure containing only char (1-byte) types has no effect and is (technically) harmless.

Why use packing?

People usually use structure packing for one of two reasons:

1.       they want to save space in data structures, or

2.       they want to format a stream of bytes into fields according to some existing specification like a network protocol.

These can both be valid reasons, but, more often than not, the implications of a decision to use packing are not fully understood, leading to unforeseen side effects that can, in some cases, have long-term negative consequences.  The point of this post is to identify the costs of packing and suggest best practices around its use.

First, let’s look at a common example of how packing affects code generation.  Take the following C++ code compiled for all four architectures supported by Windows Embedded CE.

// To compile: cl -c -O2 t.cpp -DPACKING=<packing size>

#pragma pack(push, PACKING)

struct S {

    char i8;

    int i32;

};

#pragma pack(pop)

 

int extract(S * ps) {

    return ps->i32;

}

The following table lists the sequences of code required to load the i32 member of S.  Remember that when PACKING=4, padding is inserted such that the i32 member’s offset from the beginning of S is a multiple of its alignment (4).  When PACKING=1, i32’s alignment becomes 1, so no padding is inserted.

 

Architecture  PACKING=4                      PACKING=1
------------  -----------------------------  -----------------------------
ARM           ldr     r0, [r0, #4]           ldrb    lr, [r0, #1]!
                                             ldrb    r3, [r0, #1]
                                             ldrb    r2, [r0, #2]
                                             ldrb    r1, [r0, #3]
                                             orr     r3, lr, r3, lsl #8
                                             orr     r3, r3, r2, lsl #16
                                             orr     r0, r3, r1, lsl #24

MIPS          lw      v0,4(a0)               addiu   t0,a0,1
                                             lwl     v0,3(t0)
                                             lwr     v0,0(t0)

SuperH        mov.l   @(4,r4),r0             add     #1,r4
                                             mov.b   @(3,r4),r0
                                             mov     r0,r3
                                             mov.b   @(2,r4),r0
                                             shll8   r3
                                             extu.b  r0,r2
                                             mov.b   @(1,r4),r0
                                             or      r3,r2
                                             extu.b  r0,r1
                                             mov.b   @r4,r0
                                             shll8   r2
                                             or      r2,r1
                                             shll8   r1
                                             extu.b  r0,r0
                                             or      r1,r0

x86           mov     eax,dword ptr [eax+4]  mov     eax,dword ptr [eax+1]

 

Notice how the difference packing makes depends a lot on the architecture you’re targeting.  For the RISC targets (ARM, MIPS, SH), the compiler must assume that the i32 member is misaligned and must generate special code since normal 4-byte load instructions do not work in that case.  In terms of code size, SuperH and ARM suffer the most since they have to load one byte at a time and combine them with a series of shifts and logical ORs.  MIPS is quite a bit better with its special “left” and “right” load instructions, and x86 isn’t affected at all since the CPU supports misaligned addresses for most memory accesses.  I don’t want to speculate too much, but it’s possible that the reason structure packing is so popular is that x86 is so popular.  If more people had to target SH-4, they’d think twice before packing their data types.  Oh, and one thing I should mention is that the 8-bit i8 member isn’t really necessary for this discussion.  Even if it were absent such that i32’s offset from S were zero (0), the generated code would be almost identical.  This is because packing works by modifying the alignment of members.  It’s the alignment of the member, not its offset, which determines how the compiler accesses it.

Saving space

Let’s now take a look at the first reason you might want to use structure packing: to save space.  It’s true that the structure above with PACKING=1 is smaller than the structure with PACKING=4.  The sizeof operator indicates 5 bytes for the former and 8 bytes for the latter.  This might lead one to believe that all data should be packed.  However, if you look at the impact on code size, the benefit is not so obvious.  The code required for each access to misaligned data can be much larger than for a normal access, and that cost is multiplied by the total number of accesses across the code base.  In one case I know of, a colleague removed a #pragma pack(1) from the main header of his ARM DLL, reducing its size from 300kB to 200kB.  Remember that data is often transient, i.e. it comes and goes and space for it isn’t always allocated.  However, code will usually live for the entire lifetime of a process, and can also take up space indefinitely in ROM or on disk.
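The sizes quoted above are easy to check at compile time. The following is a minimal sketch, assuming a compiler that honors #pragma pack and a 4-byte int; the names S4, S1, OnlyInt, and padding_bytes are made up for this illustration and are not part of the original sample.

```cpp
#include <cstddef>  // offsetof, std::size_t

#pragma pack(push, 4)
struct S4 {        // same members as S, as if compiled with PACKING=4
    char i8;
    int  i32;
};
#pragma pack(pop)

#pragma pack(push, 1)
struct S1 {        // same members as S, as if compiled with PACKING=1
    char i8;
    int  i32;
};
struct OnlyInt {   // hypothetical: no i8 member at all, but still packed
    int i32;
};
#pragma pack(pop)

static_assert(sizeof(int) == 4, "this sketch assumes a 4-byte int");

// PACKING=4: three bytes of padding follow i8 so that i32 lands at offset 4.
static_assert(sizeof(S4) == 8 && offsetof(S4, i32) == 4, "padded layout");

// PACKING=1: no padding, so i32 sits at the odd offset 1.
static_assert(sizeof(S1) == 5 && offsetof(S1, i32) == 1, "packed layout");

// Packing lowers the alignment of the type even when the member offset is 0,
// which is why the compiler still has to assume a misaligned i32.
static_assert(alignof(OnlyInt) == 1 && offsetof(OnlyInt, i32) == 0,
              "alignment, not offset, drives the generated code");

// Bytes of padding the compiler added to a struct holding one char and one int.
template <typename T>
constexpr std::size_t padding_bytes() {
    return sizeof(T) - (sizeof(char) + sizeof(int));
}
```

Here padding_bytes<S4>() evaluates to 3 and padding_bytes<S1>() to 0, which is the entire data-size saving that has to be weighed against the longer access sequences.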

In short, make sure you take into account the code size implications if you think packing will save space.  Make sure you know the performance impact as well.  It should come as no surprise that the ARM and SuperH sequences for misaligned accesses are slower than the aligned sequences.  However, even the x86 sequence is usually slower if the memory is misaligned, because modern CPUs have to access both of the enclosing (aligned) words in order to access a misaligned word.

Recommendation: Instead of packing to save space, consider reordering your data structures so that larger members always precede smaller members (or, rather, more-aligned members precede less-aligned members).  That way, you will have little or no padding except possibly at the end of the structure.  Padding at the end of a structure affects array allocations, but little else.
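As a rough illustration of this recommendation, here is a sketch with two hypothetical structures that hold the same four members in different orders. The type and member names are invented for the example, and the sizes assume the usual natural alignments (4 for a 32-bit integer, 2 for a 16-bit integer) under default packing.

```cpp
#include <cstddef>
#include <cstdint>

// Members declared in an arbitrary order: padding appears in two places.
struct Unordered {
    std::uint8_t  flag;   // offset 0, followed by 3 bytes of padding
    std::uint32_t count;  // offset 4
    std::uint8_t  kind;   // offset 8, followed by 1 byte of padding
    std::uint16_t id;     // offset 10
};

// Same members, most-aligned first: no padding is needed at all.
struct Reordered {
    std::uint32_t count;  // offset 0
    std::uint16_t id;     // offset 4
    std::uint8_t  flag;   // offset 6
    std::uint8_t  kind;   // offset 7
};

static_assert(sizeof(Unordered) == 12, "4 of the 12 bytes are padding");
static_assert(sizeof(Reordered) == 8,  "no padding after reordering");

// Both layouts keep every member naturally aligned, so each field can still
// be read with a single ordinary load on every architecture.
static_assert(offsetof(Unordered, count) % alignof(std::uint32_t) == 0, "aligned");
static_assert(offsetof(Reordered, count) % alignof(std::uint32_t) == 0, "aligned");
```

Reordering recovers the same four bytes that packing would, without giving up aligned accesses.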

Matching byte stream formats

The other common reason people use packing is to implement network protocols or to parse byte streams.  Packing can make it more convenient to write code for certain data formats.  Take this (made up) packet format as an example:

 

Signature (16-bit) | Size (32-bit) | Protocol (16-bit) | Checksum (32-bit) | Payload (N-bit)

 

If we were to declare the structure like this:

struct packet1 {

      unsigned short signature;  // offset 0

      unsigned long size;        // offset 2 or 4?

      unsigned short protocol;   // offset 6 or 8?

      unsigned long checksum;    // offset 8 or 10 or 12?

      unsigned char payload[1];

};

by default, the compiler will insert padding between the signature and size fields in order to maintain the latter’s alignment (4).  One solution to this would be to use #pragma pack(2), which would remove the need for padding.  In some cases, this might be the right thing to do, particularly if the alignment of the beginning of the packet is at most 2-byte.  But wait, as you may have noticed, the offset of the checksum member is a multiple of its natural alignment.  That means that if the beginning of the structure is aligned, it can be accessed safely with a normal 4-byte load or store.  However, if we use #pragma pack(2), the alignment of all fields is capped at 2-byte, forcing the compiler to load it with at least two instructions for most architectures.
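To answer the "offset 2 or 4?" questions in the comments above, the two layouts can be compared directly. This is only a sketch: it substitutes the fixed-width types std::uint16_t and std::uint32_t for unsigned short and unsigned long (unsigned long is 4 bytes on Windows CE but not on every platform), and the names packet1_default and packet1_pack2 are invented for the comparison.

```cpp
#include <cstddef>
#include <cstdint>

struct packet1_default {        // default packing
    std::uint16_t signature;    // offset 0
    std::uint32_t size;         // offset 4 (2 bytes of padding before it)
    std::uint16_t protocol;     // offset 8
    std::uint32_t checksum;     // offset 12 (2 more bytes of padding)
    unsigned char payload[1];   // offset 16
};

#pragma pack(push, 2)
struct packet1_pack2 {          // matches the wire format
    std::uint16_t signature;    // offset 0
    std::uint32_t size;         // offset 2
    std::uint16_t protocol;     // offset 6
    std::uint32_t checksum;     // offset 8
    unsigned char payload[1];   // offset 12
};
#pragma pack(pop)

static_assert(offsetof(packet1_default, size) == 4,      "padded");
static_assert(offsetof(packet1_default, checksum) == 12, "padded");
static_assert(offsetof(packet1_pack2, size) == 2,        "wire offset");
static_assert(offsetof(packet1_pack2, protocol) == 6,    "wire offset");
static_assert(offsetof(packet1_pack2, checksum) == 8,    "wire offset");

// The cost of pack(2): the whole type is now only 2-byte aligned, so the
// compiler can no longer assume checksum is 4-byte aligned, even though its
// offset (8) is a multiple of 4.
static_assert(alignof(packet1_default) == 4, "natural alignment");
static_assert(alignof(packet1_pack2) == 2,   "capped by pack(2)");
```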

What if we can ensure that the beginning of our packet buffer will always be 4-byte aligned?  Is it possible to match the packet format while still loading all fields as efficiently as possible?  Yes, if you’re willing to write a little more code.  One option is to replace the size field with two smaller fields with less strict alignment requirements:

struct packet2 {

      unsigned short signature;  // offset 0

      unsigned short sizeLow;    // offset 2

      unsigned short sizeHigh;   // offset 4

      unsigned short protocol;   // offset 6

      unsigned long checksum;    // offset 8

      unsigned char payload[1];

};

Now we have what we want in terms of layout.  In fact, this is the same as, or similar to, what we would have to write if we didn’t have the ability to pack structures at all.  The problem is that now we have to write extra code to get at the size member, which is the main reason we wanted to use packing in the first place.  The key to fixing this is to realize that we just need to reduce the alignment requirement of the size member.  How?  One option is to use #pragma pack.

#pragma pack(push,2)

struct u32_a16 {

      unsigned long u32;

};

#pragma pack(pop)

struct packet3 {

      unsigned short signature;  // offset 0

      struct u32_a16 size;       // offset 2

      unsigned short protocol;   // offset 6

      unsigned long checksum;    // offset 8

      unsigned char payload[1];

};

Note that we have to encapsulate the scalar unsigned long type in a structure because #pragma pack doesn’t affect scalars that are not members of a structure.  The one drawback to this is that, in C, we have to write a little extra code to access the size member, i.e. we’d have to write p->size.u32 instead of just p->size.  You could perhaps hide this overhead in an accessor function.  In C++, however, you can add a little syntactic sugar to make the code look just like we want:

#pragma pack(push,2)

struct u32_a16 {

      inline unsigned long operator=(const unsigned long &that) {

            return this->u32 = that;

      }

      inline operator unsigned long() const { return u32; }

      unsigned long u32;

};

#pragma pack(pop)

Now the compiler can generate the most efficient code for aligned fields and correct code for the misaligned ones.  Remember, though, if the entire structure may not be aligned, you’re probably best off packing the whole thing since the compiler needs to generate unaligned access code for everything anyway.
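Putting the pieces together, a self-contained version of this approach might look like the sketch below. It is an illustration rather than the exact code from this post: it uses std::uint16_t and std::uint32_t in place of unsigned short and unsigned long so the layout is the same on any host, and the names Packet and read_header are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#pragma pack(push, 2)
struct u32_a16 {                       // a 32-bit value with 16-bit alignment
    std::uint32_t u32;
    u32_a16& operator=(std::uint32_t v) { u32 = v; return *this; }
    operator std::uint32_t() const { return u32; }
};
#pragma pack(pop)

struct Packet {
    std::uint16_t signature;           // offset 0
    u32_a16       size;                // offset 2, accessed as misaligned
    std::uint16_t protocol;            // offset 6
    std::uint32_t checksum;            // offset 8, accessed as aligned
    unsigned char payload[1];          // offset 12
};

static_assert(offsetof(Packet, size) == 2,     "matches the wire format");
static_assert(offsetof(Packet, protocol) == 6, "matches the wire format");
static_assert(offsetof(Packet, checksum) == 8, "matches the wire format");
static_assert(alignof(Packet) == 4,            "checksum keeps the struct 4-byte aligned");

// Hypothetical helper: copies the 12-byte fixed header out of a raw buffer.
// Copying into a real Packet object guarantees the 4-byte alignment that the
// checksum access relies on, whatever the alignment of 'bytes' is.
inline Packet read_header(const unsigned char* bytes) {
    Packet p{};
    std::memcpy(&p, bytes, offsetof(Packet, payload));
    return p;
}
```

With the conversion operator in place, callers can write std::uint32_t n = p.size; or p.size = n; as if size were a plain integer field. Byte order is not addressed here; the sketch reads fields in the host's native order.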

Other tips about packing and alignment

·         Be careful when taking the address of a field in a packed structure.   If you assign it to a “normal” pointer, the compiler will lose the fact that it is misaligned.  For example:

 

struct S sample;        // struct from above

int * pi = &sample.i32; // alignment information lost

*pi = 4;                // DATATYPE_MISALIGNMENT exception

 

This can be particularly confusing when including an unpacked structure inside a packed structure.  The compiler has a warning (C4366) to attempt to detect this practice, but it’s not completely reliable. 

·         Try to avoid using packing in public interfaces that have (or will have) backward compatibility requirements.  Even though packing may seem beneficial now, it’s likely that it could be harmful in the future, particularly if the interface is implemented on a different architecture.  It's ok to use #pragma pack in a header file to protect it from other users (see below), but the packing value should be the compiler default (8).

·         If you must use #pragma pack in a header file, be careful not to let it “leak” out and affect structures you never intended.  Always use the push/pop features like you see above, and try to limit the packing scopes to just around the structures you care about.  The latter practice helps avoid someone unintentionally creating packed structures when adding types to your header.

·         Be very wary of One Definition Rule (ODR) violations with packing.  Defining the same type under different packing values in different translation units can lead to bugs that are very difficult to track down.

o   Always define your types in a single header and include that wherever you need it.

o   Don’t #include headers under #pragma pack

o   Use #pragma pack(push,8) at the beginning and #pragma pack(pop) at the end of your headers to protect them from /Zp switches and other people including them under #pragma pack
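Two of the tips above can be demonstrated in a few lines. This sketch assumes a 4-byte int; the structure and function names are invented for the illustration. The first half simulates a header that protects itself with #pragma pack(push,8) while being included under someone else's #pragma pack(1). The second half shows one way to read or write a possibly misaligned field through a raw address without relying on a "normal" int pointer.

```cpp
#include <cstddef>
#include <cstring>

#pragma pack(push, 1)        // stands in for a /Zp1 build or a careless includer

// ---- start of a self-protecting header ----
#pragma pack(push, 8)        // force the compiler default for our own types
struct Protected {
    char tag;
    int  value;
};
#pragma pack(pop)            // restore whatever the includer had in effect
// ---- end of header ----

struct Leaked {              // declared after the header, still under pack(1)
    char tag;
    int  value;
};
#pragma pack(pop)

static_assert(sizeof(int) == 4, "this sketch assumes a 4-byte int");
static_assert(sizeof(Protected) == 8, "the header kept the default layout");
static_assert(sizeof(Leaked) == 5,    "the outer packing applies here");

// Reading or writing an int at an address that may be misaligned: copy the
// bytes instead of dereferencing an int*, which would assume 4-byte alignment.
inline int load_int(const void* address) {
    int v;
    std::memcpy(&v, address, sizeof v);
    return v;
}

inline void store_int(void* address, int v) {
    std::memcpy(address, &v, sizeof v);
}
```

Accessing the member through the packed structure itself (for example s.value = 4) is also fine, because the compiler still knows the alignment at that point; the helpers are only needed once the field's address has been handed around as a plain pointer.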

Conclusions

Packing can be a useful feature, but like many useful features it needs to be understood fully in order to avoid misuse.  Always test your assumptions about packing before making a decision to use it.  “Premature optimization is the root of all evil.”

As always, feel free to ask questions.  I hope my next post will come sooner than this one did. :)

 

When I was a developer, and customer, using MSDN in my day-to-day work, I occasionally found myself frustrated by document discoverability. MSDN often had the information I was looking for -- sometimes in multiple formats -- but finding just what you want in MSDN can be quite a task.

We're working to improve this situation for a number of critical scenarios, including device bring-up. One important task for board support package (BSP) developers is porting a BSP from a previous version of Embedded CE to Embedded CE 6.0. Luckily, BSP porting information exists in a number of places.

First, the MSDN Library contains information on porting BSPs, starting at: http://msdn.microsoft.com/en-us/library/aa917748.aspx. You can also find information on porting device drivers, another key device bring-up task, at: http://msdn.microsoft.com/en-us/library/aa931071.aspx.

Channel 9 has an excellent talk by Travis Hobrla of the Embedded CE team on the process of porting a BSP from Embedded CE 5.0 to CE 6.0:  http://channel9.msdn.com/posts/mikehall/Porting-a-CE-50-BSP-to-CE-60-Travis-Hobrla/.

Doug Boling gave a great presentation on the new CE 6.0 kernel which includes porting information at MEDC 2006; you can find that presentation here: http://download.microsoft.com/documents/australia/medc2006/Windows_CE6_Architecture_Boling.ppt.

Please let us know if there are other crucial scenarios about which you're trying to find information! 

I've been working as a technical author and editor since 1995.  I've worked at Microsoft as a Programming Writer since 2006.  I joined the Embedded CE and Windows Mobile developer documentation team at the beginning of 2008, where I've worked on documentation for file systems and storage, the kernel, device bring-up, power management, and other Core OS functionality.

In this blog, I'll discuss the Embedded CE and Windows Mobile developer documentation (on MSDN and elsewhere) as it pertains to these areas.  Any feedback on our developer docs is welcome, and appreciated! :)

Posted by: Sue Loh

Hello out there, it's been a long time since I posted anything real, and I feel sorry about that.  As I began writing this article, I had just come from the first day of TechEd where I saw my colleagues present about CE6 and drivers, and was reminded of a subject I was suddenly inspired to write up for you all.  Today is now the last day of TechEd and I'm back home, but my comments still apply.

I'll let you in on something - not so much of a secret.  We all make mistakes.  And this is a blog post about one of my own.  You may have already read about the marshalling APIs on this blog, or otherwise learned of them.  When we designed these APIs, we planned them to hide away complexity in the decisions we made for performance and security reasons - so that OEMs and driver writers would not have to thread a maze of difficult details.  With that in mind, consider the CeAllocAsynchronousBuffer API.  The purpose of this API is to marshal a buffer into a driver's (or server's or service's) process space such that the driver/server/service could access the buffer asynchronously.  The work required to do the marshalling depends on the circumstances.  In kernel mode it probably just needs to be aliased (VirtualCopied) into the kernel, while in user mode it must be duplicated (memcpy'd).  The work also depends on what work CeOpenCallerBuffer might have done beforehand - for example if it is already duplicated into the process.  So, CeAllocAsynchronousBuffer hides all of these details.  You can call it and trust the API to make the right choices for security and perf.  We designed it to hide these details while asking the caller to make no assumptions about what's going on underneath.  Use CeFlushAsynchronousBuffer to guarantee changes have been written back, and CeFreeAsynchronousBuffer to do that plus release any resources.

So that's all well and good.  Enter older ARM CPUs and their virtually-tagged caches.  In the early days of CE6, we hadn't quite come to terms with how to prevent the cache coherency problems you could get if you aliased/VirtualCopied memory.  In later days, we fixed aliasing so that it would make both source and dest buffer uncached for the duration of the alias.  (Specifically, we fixed VirtualAllocCopyEx, NOT VirtualCopy, since I am a stickler for little details.)  But in the early days, when we built the marshalling APIs, we were concerned about cache coherency.  So at that time, in CeAllocAsynchronousBuffer we made ARM virtually-tagged CPUs duplicate the memory instead of alias it.  This, of course, concerned us greatly about performance, and we knew we'd ship a lot of ARM virtually-tagged devices.  So we added MARSHAL_FORCE_ALIAS with the expectation that callers would use it with caution, and deal with cache coherency problems themselves.  That, at least, could probably win some performance on large buffers, even if it did cost complexity.

Later, we got our heads on right and fixed aliasing to leave memory uncached.  So duplication was no longer as important.  But we also made a discovery -- on small buffers, duplication was *faster* than aliasing!  We did some benchmarking and decided that for buffers below 16KB, we'd duplicate, while on larger buffers we'd alias.  But we'd only benchmarked ARM virtually-tagged devices, and so we left the code similar to its original state.  Meaning that we only made the aliasing vs. duplication decision based on size on ARM virtually-tagged devices.  For all other cases, CeAllocAsynchronousBuffer usually aliased.

At that point, in my opinion, we should have removed the MARSHAL_FORCE_ALIAS flag.  Instead, we left it, and now we're in a state where it confuses people.  At TechEd I saw my colleagues recommend it to driver developers for performance reasons - when in my opinion it should never be used.  Let the OS make the decision what's best for performance.  The only case where we don't alias is for small buffers on ARM virtually-tagged caches, where we've demonstrated that duplication is faster than aliasing.  I think it's safe to say, you can look forward to this getting cleaned up in the future.  But remember, my recommendation remains: don't (blindly) use MARSHAL_FORCE_ALIAS!  It won't break anything, but you'll potentially be forcing the wrong thing for performance.
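To make the decision rule easier to see, here is a toy model of the kernel-mode behavior described above. It is not the OS implementation and it does not call any marshalling API: the enum, the function name, and the parameters are all hypothetical, and the only facts taken from this post are the 16KB threshold, the ARM virtually-tagged special case, and the effect of MARSHAL_FORCE_ALIAS.

```cpp
#include <cstddef>

enum class MarshalStrategy { Alias, Duplicate };

// Threshold quoted in the post: buffers below 16KB are duplicated.
constexpr std::size_t kDuplicateBelowBytes = 16 * 1024;

// Hypothetical model of the choice made for a kernel-mode caller.
//   cbBuffer            size of the buffer being marshalled
//   armVirtuallyTagged  true on ARM CPUs with virtually-tagged caches
//   forceAlias          models passing MARSHAL_FORCE_ALIAS
inline MarshalStrategy choose_marshal_strategy(std::size_t cbBuffer,
                                               bool armVirtuallyTagged,
                                               bool forceAlias) {
    if (forceAlias) {
        return MarshalStrategy::Alias;       // caller overrides the OS choice
    }
    if (armVirtuallyTagged && cbBuffer < kDuplicateBelowBytes) {
        return MarshalStrategy::Duplicate;   // benchmarked as faster here
    }
    return MarshalStrategy::Alias;           // the usual case everywhere else
}
```

In this model the flag changes the outcome in exactly one situation, a small buffer on an ARM virtually-tagged device, which is the situation where duplication was measured to be faster.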

 

Hi, I'm Chaitanya Raje and I am a developer on Compiler and Tools team for Windows Mobile and Windows Embedded CE. This is my first blog on msdn. I hope I will be able to share out some insights into new features and commonly known issues about using the compilers and related tools through my blogs.

 

I would like to start with a write-up on dynamic initialization of variables in C++. C++ (but not C) allows you to initialize global variables with non-constant initializers. For example:

 

Foo.cpp

#include <stdio.h>

int alpha(void)

{

    return 20;

}

 

int i = alpha(); //dynamic initialization

 

int main()

{

    printf("i = %d",i);

    return i;

}

 

According to the C/C++ standards, global variables should be initialized before entering main(). In the above program, variable 'i' should be initialized with the return value of function alpha(). Since the return value is not known until the program is actually executed, this is called dynamic initialization of the variable.

 

The Problem:

Let us compile the above program and link with entrypoint ‘main’.

 

cl Foo.cpp /link /entry:main

 

Here’s your output when you run the exe –

 

i = 0

 

Surprised? We all expected the output to be "i = 20". Let us try to understand why we got an unexpected output.

 

The Theory:

The global ‘i’ has a dynamic initializer, so it does not receive its value until startup code runs. Since we linked the exe with ‘main’ as the entry point, execution began directly in ‘main()’, bypassing the C Runtime startup code that would normally run the initializers. ‘alpha()’ was never invoked and ‘i’ was never initialized, hence the unexpected output.

 

Now the question is how do we invoke these dynamic initializers before ‘main()’ and still keep the entry point of our program as ‘main()’?

 

The Solution:

The answer lies in C Runtime's startup routines. C Runtime (CRT) defines different startup routines corresponding to your standard entry points as follows –

 

Your entrypoint    CRT entrypoint
---------------    ------------------
main               mainCRTStartup
wmain              wmainCRTStartup
WinMain            WinMainCRTStartup
wWinMain           wWinMainCRTStartup
DllMain            _DllMainCRTStartup

 

The above CRT startup routines are designed to invoke the dynamic initializers in your program to initialize the global variables and then call the corresponding standard entry point. So, if your program uses dynamic initializers, you should set your entry point to one of the CRT startup routines (corresponding to your real entry point from the table above) while linking. If you use a standard entrypoint directly instead of the CRT startup routine, the global variables that need dynamic initialization are left uninitialized.

 

Now let’s compile and link Foo.cpp with CRT entrypoint –

 

cl Foo.cpp /link /entry:mainCRTStartup

 

Here’s your output as expected –

 

i = 20

 

NOTE: The above program will generate a compiler error if compiled as a C program (instead of C++) because dynamic initializers are not allowed by C language.

 

Here are a few more examples of dynamic initializers-

 

1.

class B {

public:

    int i;

    B() {

        i=10;

    }

    ~B() {}

};

B b; //requires dynamic initializer to call constructor B().

 

A global object is a classic example of dynamic initializer. The constructor on a global object needs to be invoked before we enter main.

 

2.

extern char ValueKnown[];

char* Name1 = ValueKnown; //statically initialized with &ValueKnown[0]

#if defined(__cplusplus)

    extern char* ValueUnknown;

    char* Name2 = ValueUnknown; // requires dynamic initializer

#endif

 

ValueKnown and ValueUnknown look very similar, but there is a subtle difference between them. ValueKnown is a statically initialized array, so its value (and location) is guaranteed to be known when linking with the module in which it is defined (it lives in that module's .data section). ValueUnknown, on the other hand, is a char pointer variable whose value may or may not be known at compile time or when linking with the module that defines it. It could be pointing to a constant string or it could have a dynamic initializer itself (in the module defining it). This makes the compiler generate a dynamic initializer for variable Name2.

 

 

More details:

Some of you might be curious to know how CRT finds information about dynamic initializers.  The compiler actually sets up things for the CRT. It creates a section named .CRT$XCU in your object file with useful information for the CRT. This section is essentially a list of function pointers or pointers to class constructors which are dynamic initializers for your program. The CRT just loops through this list and invokes them as it goes along. The compiler generates an entry into this section every time it finds a dynamic initializer in your code.

 

The section name is .CRT and XCU is name of the group.              

 

The CRT also defines 2 pointers

- __xc_a in section .CRT$XCA

- __xc_z in section .CRT$XCZ

 

The linker then merges all .CRT groups into one section and orders them alphabetically by group name. This causes the pointers to be laid out as follows -

 

.CRT$XCA

            __xc_a

.CRT$XCU

            Pointer to Global Initializer 1

            Pointer to Global Initializer 2

.CRT$XCZ

            __xc_z

 

__xc_a and __xc_z thus act as demarcations for the start and end of the dynamic initializer list. The CRT can now loop through this list at startup. Note that the order of initialization across modules is neither defined nor easily predictable.
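The loop itself is simple. Below is a portable simulation of the idea, not the CRT's source: an ordinary array stands in for the merged .CRT section, null entries stand in for the __xc_a and __xc_z markers, and every name in the block (g_i, init_g_i, init_table, run_initializers) is made up for the illustration.

```cpp
// Each entry in the list is a pointer to a function taking no arguments.
using Initializer = void (*)();

int alpha() { return 20; }

// Stand-in for the global 'i' from Foo.cpp. It starts at zero, just as
// zero-initialized data does before any dynamic initializer has run.
int g_i = 0;

// What a compiler-generated dynamic initializer for 'i' would do.
void init_g_i() { g_i = alpha(); }

// Stand-in for the merged section: start marker, one entry, end marker.
Initializer init_table[] = { nullptr, init_g_i, nullptr };

// Walks the half-open range [first, last) and calls every non-null entry.
// Returns how many initializers were invoked.
int run_initializers(Initializer* first, Initializer* last) {
    int called = 0;
    for (Initializer* p = first; p < last; ++p) {
        if (*p != nullptr) {
            (*p)();
            ++called;
        }
    }
    return called;
}
```

Until run_initializers(init_table, init_table + 3) is called, g_i is still 0, which mirrors what happened when the program was linked with /entry:main and the startup code never ran.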

 

 

 

I hope this has given you some insight into the C Runtime's initialization mechanism, but the real point I wanted to convey from this blog is - try to use CRT entrypoints instead of the standard main/WinMain to avoid surprises in your output.

 

If you have any question or comments regarding this topic, please let us know. We'll be more than happy to answer them! If you would like us to write on any particular topic related to compilers and related tools like linker, runtime libraries, etc. we are open to recommendations.

 

Thanks.

 

Chaitanya Raje

on behalf of  Windows Devices Compiler Team

New sample code called the BSP Template is now available for download.  This code serves two major purposes:

1. Provide a stub version of a BSP that illustrates all required and optional BSP functions.

2. Educate newcomers to CE on the basics of BSPs in an incremental fashion.

The BSP Template is compatible with CE6.0 and CE6R2.  You can find it attached to this post.

Nicolas Besson, one of our MVPs, posted a nice series of articles about power management in Windows CE that I thought I'd bring some attention to:

 

For those of you that enjoyed Sue's excellent article CE6 OAL: What you need to know, the presentation the article draws from is now posted online at Channel9 here: Porting a CE5.0 BSP to CE6.0.

Hopefully we'll post a similar presentation about porting kernel-mode drivers in the future.

Posted by Wes Barcalow

Following on to Sue’s previous posts describing the paging pool and memory management, I wanted to talk about how drivers can be made pageable for additional virtual memory savings.

Windows CE has features to allow for more data and code to be used on a device than the available RAM.  It does this by ‘paging’ resources into RAM from fixed or read-only storage (ROM/Flash), and discarding pages if the overall amount of RAM available in the system becomes too low.  In systems where code cannot execute directly from ROM, this paging is the only available way to use storage to offset RAM usage.  This is the case for NAND Flash, which is more performant and of lower cost than NOR Flash (which does allow XIP or eXecute In Place).

Some code and data in the system is read (‘paged’) into RAM and ‘locked’ there – it is marked as non-pageable after it is loaded.  This code and data must be available, namely when the storage it was retrieved from is no longer available.  For example, to achieve the best power saving on entry to a low-power or deep-idle mode, it is preferable to turn off the power to a NAND Flash chip. 

Applications are typically pageable, since the operating system completely stops the threads of applications before entering a low power mode.  At this point, since the application's code will not be executed and its data cannot be accessed, such code and data can be ‘paged out’ and is not needed.  For device drivers things are slightly different.  Most drivers written for Windows CE / Windows Mobile are by default loaded non-pageable by device manager.  This means that no matter how big the driver is, it takes up all the RAM it wants to once it is loaded – none of it can be paged out. In the case of user mode drivers, udevice.exe loads the driver instead of device manager, but it too uses the same criteria for choosing between pageable and non-pageable modes.

With an increase of functionality or flexibility in a driver comes an increase in size.  A camera driver that supports many formats or many features may be very large.  However, if the camera is not used for a long time, then the RAM resources taken up by it are not being put to efficient use.  It makes sense to make this type of driver pageable by default instead.

To make a driver pageable, these steps have to be taken.

1)      Tell device manager you want the driver to be pageable.

2)      Tell the kernel that pageable mode is allowed.

3)      Identify and flag code that is needed to be non-pageable.

The last step is slightly more complicated than the first two steps.  Even though you may have a large driver, you may still need portions of it to be non-pageable.  The most important parts of a Windows CE / Windows Mobile driver that cannot be paged out are functions that execute when the file system is not in operation.  If you do not have such functions in your driver then you do not have to worry about making them non-pageable.

Marking a driver as pageable needs to happen in two steps; the first is with a registry setting for that driver. It may or may not already have a “Flags” registry entry. To enable the driver to be paged, ensure there is a registry value named “Flags” of type DWORD and that the DEVFLAGS_LOADLIBRARY bit (0x02) is set in it.  If there are other flag bits set, simply bitwise ‘or’ this bit with what is already there.

Here is an example of what a GPIO driver registry setting might look like in platform.reg:

[HKEY_LOCAL_MACHINE\Drivers\BuiltIn\GPIO]

   "Dll"="gpio.dll"

   "Flags"=dword:10002   ;Trusted caller only & pageable

   ...
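The value in the example is just two bits combined. Here is a small sketch of that arithmetic. The constant names are local to the example; the 0x02 value comes from the text above, while the 0x10000 "trusted caller only" bit is inferred from the dword:10002 value and its comment, so treat it as an assumption and check it against your own headers.

```cpp
#include <cstdint>

// Bit described in the text as DEVFLAGS_LOADLIBRARY.
constexpr std::uint32_t kLoadLibraryBit = 0x00000002;

// Assumed from the registry example: the "trusted caller only" bit.
constexpr std::uint32_t kTrustedCallerOnlyBit = 0x00010000;

// Sets the pageable bit while preserving any flags that are already there.
constexpr std::uint32_t make_pageable(std::uint32_t flags) {
    return flags | kLoadLibraryBit;
}

// Clears the pageable bit, leaving every other flag untouched.
constexpr std::uint32_t make_non_pageable(std::uint32_t flags) {
    return flags & ~kLoadLibraryBit;
}

// True when the registry value asks for the driver to be loaded pageable.
constexpr bool requests_pageable(std::uint32_t flags) {
    return (flags & kLoadLibraryBit) != 0;
}

static_assert(make_pageable(kTrustedCallerOnlyBit) == 0x10002,
              "matches the \"Flags\"=dword:10002 example");
static_assert(make_non_pageable(0x10002) == 0x10000,
              "other bits survive when the pageable bit is cleared");
```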

The second step for marking a driver pageable is ensuring the ‘M’ flag of the binary image builder file (BIB file) is not set. The purpose of the ‘M’ flag is to inform the kernel not to demand page the driver, thus forcing the driver to be completely loaded into RAM.

Here is an example of what a GPIO bib file entry might look like that allows the driver to be loaded in a pageable mode by the kernel:

msm7x00_gpio.dll $(_FLATRELEASEDIR)\gpio.dll  NK SH

 

Notice that the flags at the end of the statement do not include the ‘M’ flag. A user wishing to force the driver into a non-pageable mode would use “SHM” instead of “SH”. Alternatively, the user could clear the DEVFLAGS_LOADLIBRARY bit in the registry. Either approach is valid.

It is also worth pointing out that a trusted user can potentially change the registry at run time, thus changing a driver from non-pageable to pageable and back again. The bib file flag, however, is built into the image and cannot be overridden. Both are viewed as equally secure, as only a trusted caller can change the registry, though the bib file flag provides a predictable pageable status when loading the driver.
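Since both settings have to agree before a driver is loaded pageable, the combined rule can be written down as a small predicate. This is only a model of the rule as described in this post, not code from device manager or the kernel, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Bit described in the text as DEVFLAGS_LOADLIBRARY.
constexpr std::uint32_t kLoadLibraryBit = 0x00000002;

// True when the flags column of a BIB entry (for example "SH" or "SHM")
// contains 'M', which tells the kernel not to demand page the module.
inline bool bib_forces_resident(const std::string& bibFlags) {
    return bibFlags.find('M') != std::string::npos;
}

// A driver ends up pageable only if the registry asks for it AND the BIB
// entry does not forbid it. Either setting alone can force non-pageable.
inline bool loads_pageable(std::uint32_t registryFlags,
                           const std::string& bibFlags) {
    const bool registryAllows = (registryFlags & kLoadLibraryBit) != 0;
    return registryAllows && !bib_forces_resident(bibFlags);
}
```

For the GPIO example, loads_pageable(0x10002, "SH") is true, while changing the BIB flags to "SHM" or the registry value to 0x10000 makes it false.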

The final, more complicated step from above is to identify and isolate code that can’t be pageable. As mentioned above, this is code that runs in single threaded mode where the file system cannot page in or out code and data.  The most well-known examples of this are:

-          XXX_PowerUp

-          XXX_PowerDown

-          Interrupt Service Threads and Interrupt Service Routines (ISTs and/or ISRs) that may execute while the file system is inactive.

-          Read-Only constants that are accessed by these functions.

-          Any supporting code called by these functions.

-          All code associated with the file system path, as it is responsible for bringing in new pages.

Once the code is identified, it should be wrapped in compiler #pragma statements to inform the linker about the properties of the code.  Below is an example of making XXX_PowerUp and XXX_PowerDown non-pageable.

#pragma comment(linker, "/section:.no_page,ER!P")   // E = executable, R = readable, !P = not pageable

#pragma code_seg(push, ".no_page")

void XXX_PowerDown(DWORD hDeviceContext)
{
      // Perform single-threaded power off logic
}

void XXX_PowerUp(DWORD hDeviceContext)
{
      // Perform single-threaded power on logic
}

void UtilityFuncOne(void)
{
      // Non-paged utility function that can be called by
      // both paged and non-paged code
}

#pragma code_seg(pop)

void UtilityFuncTwo(void)
{
      // Paged utility function that can only be called by other
      // paged code
}

 

This sample code shows the XXX_PowerDown and XXX_PowerUp code being marked as non-pageable. This allows the processor to access this code in RAM while the file system is not in operation (during suspend and resume operations). UtilityFuncOne is also in the non-paged section of code, making it safe to call from within XXX_PowerUp/Down. However, the UtilityFuncTwo code is outside of the non-paged area, and is therefore pageable and at risk of not being available if the processor tries to access it while performing suspend / resume operations.

To find drivers that are marked as pageable but are critical to the suspend, resume, and shutdown code paths, the registry value PageOutAllModules can be used to instruct the kernel to page out all pageable code. This exposes drivers that touch pageable code from the XXX_PowerUp and XXX_PowerDown APIs while the file system is inactive.  By generating page faults, problematic drivers can be identified more easily. Below is what the registry setting looks like:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Power]

"PageOutAllModules"=dword:1

 

Set this registry value and force the system to suspend, resume, or shut down. The OS will then page out all code marked as pageable and proceed with the suspend / resume / shutdown operation. If a critical driver is improperly marked as pageable, this process will generate a page fault and the device will die. This technique helps ensure that all drivers are properly marked as pageable or non-pageable before the device is released to market.

By making your driver pageable you can decrease the load on system resources while a component or feature is not being used.  Take care, as outlined above, that the parts of your driver that must run while paging is unavailable stay resident in RAM, even though the bulk of the driver is ‘paged’.

 

I didn't learn about Reed & Steve's blog until today, but got there by learning about these posts:

If you have memory issues on CE/Mobile (especially if you already know that they're virtual memory problems) you may find those useful.

 

Posted by: Sue Loh

I’d like to explain a little more about memory management in Windows CE.  I already explained a bit about paging in Windows CE when I discussed virtual memory.  In short, the OS will delay committing memory as long as possible by only allocating pages on first access (known as demand paging).  And when memory is running low, the OS will page data out of memory if it corresponds to a file – a DLL or EXE, or a file-backed memory-mapped file – because the OS can always page the data from the file back into memory later.  (Win32 allows you to create “memory-mapped files” which do or do not correspond to files on disk – I call these file-backed and RAM-backed memory-mapped files, respectively.)  Windows CE does not use a page file, which means that non file-backed data such as heap and thread stacks is never paged out to disk.  So for the discussion of paging in this blog post I’m really talking only about memory that is used by executables and by file-backed memory-mapped files.

It’s relatively easy to guess how the OS decides when to page data in to memory – it doesn’t page it in until it absolutely has to, when you actually access it.  But how does the OS decide when to remove pageable data from memory?  Ahh, that’s the question!

The Paging Pool and How It Works

Back in the old days, around CE 3.0 or so (I’m not sure exactly), Windows CE did not have a paging pool.  What that means is that the OS had no limit on the number of pages it could use for holding executables and memory-mapped files.  If you ran a lot of programs or accessed large memory-mapped files, you’d see memory usage climb correspondingly.  Usage would continue to go up until the system ran out of memory.  Other allocations could fail; memory would appear to be nearly gone when there was actually a lot of potential to free up space by paging data out again.  Finally, when the system hit a low memory limit, the kernel would walk through all of the pageable data, paging everything (yes, everything) out again.  Then suddenly there would be a lot of free memory, and you’d take page faults to page in any data you were still actually using.

The algorithm was simple, but it had a few bad effects.  First, obviously, the system could encounter preventable RAM shortages.  Also, it was really tough for applications or tools to measure free memory, where “free” includes currently-unused pages plus “temporary” pages that could be decommitted when necessary.  Conversely, it was difficult for users to determine how much of an application’s memory usage was fixed in RAM vs. “temporary” pageable pages.  Even today it is tough to answer the question “how much memory is my process using?” in simple terms without diving into explanations of paging, cross-process shared memory, etc.  Another possible problem you can encounter when there’s no paging pool is that the rest of the system can take up all of the free memory, and leave you thrashing over just a few pages.

So we introduced the paging pool.  The purpose of the paging pool is to serve as a limit on the amount of memory that could be consumed by pageable data.  It also includes the algorithm for choosing the order in which to remove pageable data from memory.  Pool behavior is under the OEM’s control – Microsoft sets default parameters for the paging pool, but OEMs have the ability to change those settings.  Applications do not have the ability to set the behavior for their own executables or memory-mapped files.

Up to and including CE 5.x, the paging pool behavior was fairly simple.

·         The pool only managed read-only pageable data.  Executable code is read-only so it used the pool, and so did read-only file-backed memory-mapped files.  Read-write memory-mapped files did not use the pool, however.  The reason is that paging out read-write data can involve writing back to a file.  This is more complicated to implement and requires more care to avoid file system deadlocks and other undesirable situations.  So read-write memory-mapped files had no memory usage limitations and could still consume all of the available system RAM.

·         The pool had one parameter, the size.  OEMs could turn the pool off by setting the size to 0.  Turning off the paging pool meant that the OS did not limit pageable data – behavior would follow the pattern described above from before we had a paging pool.  Turning on the pool meant that the OS would reserve a fixed amount of RAM for paging.  Setting the pool size too low meant that pages could be paged out too early, while they’re still in use.  Setting the pool size too high meant that the OS would reserve too much RAM for paging.  Pool memory would NOT be available for applications to use if the pool was underutilized.  A 4MB pool took 4MB of physical RAM, no matter whether there was only 2MB of pageable data in use or 100MB.  Setting the size of the pool was a tricky job, because you had to decide whether to optimize a typical steady-state situation with several applications running (and judge how much pool those applications would need), or optimize “spike” situations such as system boot where many more pages were needed for a short period of time.

·         The kernel kept a round-robin FIFO ring of pool pages: the oldest page in memory – the earliest one to be paged in – was the first one paged out when something else needed to be paged in, regardless of whether the oldest page was still in use or not.

 

So the short roll-up of how the paging pool worked up through CE 5.x is that the paging pool allowed OEMs to set aside a fixed amount of memory to hold read-only pageable data, and it was freed in simple round-robin fashion.
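To make the round-robin behavior concrete, here is a small simulation sketch. It is not the kernel's code; the pool size, the structure, and the function names are all made up for illustration. It only shows the policy described above: on a miss, the frame at the ring cursor is reused regardless of whether its page is still in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_FRAMES 4            /* hypothetical pool size, in pages */
#define NO_PAGE     UINT32_MAX   /* marks an empty frame */

typedef struct {
    uint32_t frame[POOL_FRAMES]; /* page number held by each frame */
    size_t   next;               /* ring cursor: next frame to reuse */
    unsigned faults;             /* page-ins performed so far */
} FifoPool;

static void fifo_init(FifoPool *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < POOL_FRAMES; i++)
        p->frame[i] = NO_PAGE;
    p->next = 0;
    p->faults = 0;
}

/* Touch a page. On a miss, evict whatever sits at the ring cursor (the
   oldest page-in), whether or not that page is still being used. */
static void fifo_touch(FifoPool *p, uint32_t page)
{
    for (size_t i = 0; i < POOL_FRAMES; i++)
        if (p->frame[i] == page)
            return;                          /* hit: already resident */
    p->frame[p->next] = page;                /* miss: replace the oldest */
    p->next = (p->next + 1) % POOL_FRAMES;
    p->faults++;
}

/* Runs an access sequence through a fresh pool and returns the fault count. */
static unsigned fifo_run(const uint32_t *pages, size_t n)
{
    FifoPool p;
    fifo_init(&p);
    for (size_t i = 0; i < n; i++)
        fifo_touch(&p, pages[i]);
    return p.faults;
}
```

For the access sequence 1, 2, 3, 4, 1, 5, 1 this takes six page faults: page 1 is the oldest page-in, so it is thrown out to make room for page 5 even though it was touched just before, and has to be paged straight back in.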

In CE 6.0, the virtual memory architecture changes involved major rewriting of the Windows CE memory system, including the paging pool.  The CE 6.0 paging pool behavior is still fairly simplistic, but is a little bit more flexible.

·         CE 6.0 has two paging pools – the “loader” pool for executable code, and the “file” pool which is used by all file-backed memory-mapped files as well as the new CE 6.0 file cache filter, or “cache manager.”  This way, OEMs can put limitations on memory usage for read-write data in addition to read-only data.  And they can set separate limitations for memory usage by code vs. data.

·         The two pools have several parameters.  Primary among these are the target and maximum sizes.  The idea is that the OS always guarantees the pool will have at least its target amount of memory to use.  If memory is available, the kernel allows the pool to consume memory above its target.  But when that happens, it also wakes up a low-priority thread which starts paging data out again, back down to slightly below the target.  That way, during busy “spikes” of memory usage, such as during system boot, the system can consume more memory for pageable data.  But in the steady-state, the system will hover near its target pool memory usage.  The maximum size puts a hard limit on the memory consumption, or OEMs could set the maximum to be very large to avoid placing a limit on the pool.  OEMs can also get the old pre-CE6 behavior by setting the pool target and maximum to the same size.

·         Due to the details of the new CE6 memory implementation, the FIFO ring of pages by age was not possible.  The CE6 kernel pages out memory by walking the lists of modules and files, paging out one module/file at a time.  This is no better than the FIFO ring, but still leaves us potential for implementing better use-based algorithms in the future.
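The target/maximum idea can be sketched in a few lines. This is a toy model under my own assumptions, not the CE 6.0 implementation: the structure, the function names, and the "slack" margin used for "slightly below the target" are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t target;          /* pages the pool is always guaranteed */
    size_t maximum;         /* hard cap on pool pages */
    size_t in_use;          /* pages currently held by the pool */
    bool   trim_requested;  /* set when usage climbs above target */
} Pool;

/* Try to take one more page for pageable data. free_pages counts pages
   that are free elsewhere in the system. */
static bool pool_take_page(Pool *p, size_t *free_pages)
{
    assert(p != NULL && free_pages != NULL);
    if (p->in_use >= p->maximum)
        return false;                /* hard limit reached */
    if (p->in_use >= p->target) {
        /* Growing above target: only allowed if the system has memory. */
        if (*free_pages == 0)
            return false;
        (*free_pages)--;
        p->trim_requested = true;    /* wake the low-priority trimmer */
    }
    p->in_use++;
    return true;
}

/* Background trim: page data out until usage is `slack` pages below the
   target, handing pages that were above target back to the system. */
static void pool_trim(Pool *p, size_t *free_pages, size_t slack)
{
    assert(p != NULL && free_pages != NULL);
    size_t goal = p->target > slack ? p->target - slack : 0;
    while (p->in_use > goal) {
        if (p->in_use > p->target)
            (*free_pages)++;
        p->in_use--;
    }
    p->trim_requested = false;
}
```

In this model, setting target equal to maximum means pool_take_page never borrows from free memory, which corresponds to the fixed-size pre-CE6 behavior.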

 

There are some more details in our documentation under “Paging Pool” and “Paging Pool: Windows CE 5.0 vs. Windows Embedded CE 6.0.”

Overall, enabling the paging pool means that there is always some RAM reserved for code paging and we will be less likely to reach low-memory conditions.  In general it's better to turn on the paging pool because it gives you more predictable performance, rather than occasional long delays you’d hit when cleaning up memory when you run out.  But it does need to be sized based on the applications in use, which leads to my next point...

Choosing a Pool Size

In Windows CE (embedded) 5.0, the pool is turned off by default.  In Windows Mobile, the pool is turned on and set to a default size chosen by Microsoft.  I believe it varies between versions, but is somewhere in the neighborhood of 4-6 MB.  In CE6, the loader pool has a target size of 3MB and the file pool has a target size of 1MB.  Only the OEM of a device can set the pool size; applications cannot change it.

So how do you decide on the right pool size for your platform?  I’m afraid it’s still a bit of a black art.  :-(  There aren’t many tools to help.  You can turn on CeLog during boot and see how many page faults it records.  You can see the page faults in Remote Kernel Tracker, but in truth that kind of view isn’t much help here.  The best tool I know is that readlog.exe will print you a page fault report if you turn on the “verbose” and “summary” options.  If you get multiple faults on the same pages, your pool may be too small (you may also be unloading and re-loading the same module, ejecting its pages from memory, so look for module load events in the log too).  If you don’t get many repeats, your pool may be bigger than you need.  In CE6 you can use IOCTL_KLIB_GET_POOL_STATE to get additional information about how many pages are currently in your pool and how many times the kernel has had to free up pool pages to get down to the target size.  There aren’t any tools like “mi” that query the pool state, so you’ll have to call the IOCTL yourself.  On debug builds of the OS, there is also a debug zone in the kernel you can turn on to see a lot of detail about paging and when the pool trim thread is running.  But CeLog is probably a better choice to collect all of that data.
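If you pull the faulting page addresses out of a log by hand, the "multiple faults on the same pages" check is easy to automate. The sketch below assumes you already have the page-aligned fault addresses in an array, in log order; it does not parse readlog output, and the function name is my own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts faults that hit a page which already faulted earlier in the log.
   `pages` holds the page-aligned address of each fault, in log order.
   This is O(n^2), which is fine for a quick look at a boot log. */
static size_t count_repeat_faults(const uint32_t *pages, size_t n)
{
    assert(pages != NULL || n == 0);
    size_t repeats = 0;
    for (size_t i = 1; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            if (pages[j] == pages[i]) {
                repeats++;      /* this page was paged in before */
                break;
            }
        }
    }
    return repeats;
}
```

A high ratio of repeats to total faults hints that the pool is too small for the scenario, subject to the caveat above about modules being unloaded and re-loaded.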

As I already mentioned, as of CE6 you can set separate “target” and “max” values for the paging pools.  I don’t really like the semantics of having a “max” – it isn’t dependent on the other usage or availability in the system.  If some application takes most of the available memory in the system, you’d want the pool to let go of more pages.  If you have a lot of free memory, and some application is reading a lot of file data, you’d want the pool to grow to use most of the available memory.  We supported the “max” as an option to limit the pool size, but I’m starting to think the best idea is to set your max to infinity, to let the pool grow up to the size of available memory.  We’ll still page out down to the target in the background.  I’d have liked to add more sophisticated settings like “leave at least X amount of free memory” but that’s quite difficult to implement.

You’ll want to examine your pool behavior during important “user scenarios” like boot or running a predefined set of applications.  If the user runs a lot of applications at once, or a really big application, or one that reads a lot of file data, they could go through pool pages pretty quickly.  There isn’t really a lot you can do about that.  We don’t even have a set of recommended scenarios for you to examine.  I wish we had more information and more tools for this, but I’ve described about all we have.

The approach I think most OEMs take is that they leave the pool at the default size until they discover a perf problem with too much paging (by profiling or otherwise observing) in a scenario that's important to users.  Then they bump it up until the problem goes away.  Not very scientific but it works, and it's not like we have any answer that's more scientific anyway.

What goes into the paging pool

This is repeating some of the information above, but in more detail – how do you know exactly what pages will use the pool and what pages won’t?  Keep in mind that paging is actually independent of the paging pool.  Paging can happen with or without the paging pool.  If you turn off the paging pool then you turn off the limit that we set on the amount of RAM that can be taken up for paging.  But pages can still be paged.  If you turn ON the paging pool then we enforce some limits, that’s all.  So this isn’t really a question of what pages can use the pool, it’s a question of what pages are “pageable.”

Executables from the FILES section of ROM will use the paging pool for their code and R/O data.  R/W data from executables can’t be paged out, so it will not be part of the pool.  Compressed executables from the MODULES section of ROM will use the pool for their code and R/O data.  If the image is running from NOR or from RAM, uncompressed executables from MODULES will run directly out of the image without using any pool memory.  Executables from MODULES in images on NAND will be paged using the pool.  (And by the way, I’m not terribly familiar with how we manage data on NAND/IMGFS so I might be missing some details here.)

Executables that would otherwise page but are marked as “non-pageable” will be paged fully into RAM as soon as they’re loaded, and not paged out again until they’re unloaded.  These pages don’t use the pool.  You can also create “partially pageable” executables by telling the linker to make individual sections of the executable non-pageable.  Generally code and data can’t be pageable if it’s part of an interrupt service routine (ISR) or if it’s called during suspend/resume or other power management, because paging could cause crashes and deadlocks.  And code/data shouldn’t be pageable if it’s accessed by an interrupt service thread (IST) because paging would negatively impact real-time performance.

Memory-mapped files which don’t have a file underneath them (a.k.a. RAM-backed mapfiles) will not use the pool.  In CE5 and earlier, R/O file-backed mapfiles will use the pool while R/W mapfiles will not.  In CE6, all file-backed memory-mapped files use the file pool.  And the new file cache filter (cache manager) essentially memory-maps all open files, so the cached file data uses the file pool.

To look at that information from the opposite angle, if you are running all executables directly out of your image – all are uncompressed in the MODULES section of ROM, and the image is executing out of NOR or RAM, then the loader paging pool is probably a waste.  You might still want to use the file pool to limit RAM use for file caching and memory-mapped files, but in that case you might want to turn off the loader pool.

Other Paging Pool Details

Someone once asked me whether the pool size affects demand paging.  It doesn’t change demand paging behavior or timing.  Demand paging is about delaying committing pages as long as possible, and it applies to pages regardless of the paging pool.  Pages can be demand paged without being part of the pool; they won’t be paged in until absolutely necessary, and then they’ll stay in RAM without being paged out.  Pool pages will be demand paged in, and may eventually be paged out again.

Another question was whether the paging pool uses up virtual address space.  Actually, no, it doesn’t.  The pool pages that are currently in use are assigned to virtual addresses that are already reserved.  For example, when you load a DLL, you reserve virtual address space for the DLL; and when you touch a page in the DLL, a physical page from the pool is assigned to the already-reserved virtual address in your DLL.  The pool pages that are NOT in use are not assigned virtual addresses.  The kernel tracks them using their physical addresses only.  The pool *does* use up physical RAM.  In CE5 it uses the whole size of the paging pool; on CE6 it consumes physical memory equal to the “target” size of the pool.  This guarantees that you have at least a minimum number of pages to page with, to avoid heavy thrashing over just a few pages when the rest of the memory in the system is taken.

Other Paging-Related Details

A related detail that occasionally confuses people is the “Paging” flag on file system drivers.  This flag doesn’t control whether the driver code itself is pageable.  Rather, it controls whether the file system allows files to be loaded into memory a page at a time or all at once.  On typical file systems like FATFS the “Paging” flag is turned on, allowing executables and memory-mapped files to be accessed a page at a time.   On other file system drivers, such as our release directory file system (RELFSD) and our network redirector, it’s turned off by default, causing executables and memory-mapped files to be read into memory all at once.  I believe the reasoning is to improve performance and minimize problems when the network connection is lost.

This flag actually derives from the original Windows CE implementation of memory-mapped files.  If the file system supported a couple of APIs, ReadFileWithSeek and WriteFileWithSeek, memory-mapped files on that file system would be pageable.  If the file system did not support those APIs, the memory-mapped files would be non-pageable, in which case they’d be read entirely into RAM at load time and never paged out until the memory-mapped file is unloaded.  The OS required pageability for special memory-mapped files like registry hives and CEDB database volumes, so file systems that did not support the required APIs could not hold these files.  (If you ask me, there is no real need to require the seek + read/write to occur in one atomic API call, so the requirement on the “WithSeek” APIs was unnecessary, but perhaps there was a good reason back in the old days.)

As I already mentioned, the new CE6 file cache also uses the paging pool.  The file cache is basically just memory-mapping files to hold the file data in RAM for a while.  The file cache is enabled by default on top of FATFS volumes.

Posted by Kurt Kennett, Senior Development Lead, Windows CE OS Core

Operating system code, as one of my colleague developers recently realized, is “just code”.  It’s not voodoo and it does not exist on a higher plane of knowledge.  In fact, an operating system kernel is usually remarkably well structured and well designed in comparison to other pieces of software.  When you think about it, it has to be.  More than one person needs to understand and maintain a core set of code that must work and must support debugging of all other software that runs upon it.  People move on, and change jobs to look for new challenges to keep learning.  If only one person understood the way an operating system worked, there would be a huge amount of risk.

One of the most interesting facets of operating systems that I’ve followed in my career is how they start.  Initialization is the last step in design, but at the same time it uncovers the most fundamental bedrock of the principles used.  You start with literally nothing but a CPU which can execute instructions (sometimes not even with memory to use), and must take a platform from that point to a fully functioning system, one that not only utilizes available hardware, but abstracts it to a common understanding.

What I’m going to do in this article is discuss the details of how the Windows CE 6.0 kernel starts, and the association of the ‘Microsoft’ kernel code with the code that comes from an Original Equipment Manufacturer (OEM).  I hope that sharing this will give more people a better idea of the ‘hows’ and ‘whys’ of the Microsoft design.

To start, let’s quickly review how the operating system software is built.  The Microsoft tool chain emits .EXE and/or .DLL program files.  Both file types use the “Portable Executable” (PE) format, and they are practically identical in every aspect:

  • They are extended Common Object File Format (COFF) files
  • They have import tables and export tables (EXE export tables are usually empty)
  • They have an entry point defined in their headers for where execution should start

There is nothing extraordinary about the operating system kernel program; it is compiled using the standard compiler and with a minimal set of definitions for that compiler.  An EXE file is produced (called NK.EXE).  It does not link to any external library or DLL because it can’t.  When this code starts there is nothing in the system, or even a system for that matter.  Since the EXE is in a known format (PE COFF), you can determine the entry point by looking at the EXE header.  This means we know where to point the CPU’s instruction pointer so that the program can start.
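As an illustration of "looking at the EXE header", here is a sketch that pulls the entry point out of a PE file held in a memory buffer. It relies only on the standard PE layout: the "MZ" stub header stores the offset of the "PE\0\0" signature at byte 0x3C, a 20-byte COFF header follows the signature, and AddressOfEntryPoint sits 16 bytes into the optional header. The function names are mine, and returning 0 for "malformed" is a simplification for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a little-endian 32-bit value from a byte buffer. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the entry point RVA of the PE file in buf, or 0 if the headers
   do not look like a PE file. */
static uint32_t pe_entry_rva(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 0x40 || buf[0] != 'M' || buf[1] != 'Z')
        return 0;
    uint32_t pe = rd32(buf + 0x3C);              /* e_lfanew */
    if (pe > len || len - pe < 4 + 20 + 20)      /* need sig + COFF + 20 */
        return 0;
    if (memcmp(buf + pe, "PE\0\0", 4) != 0)
        return 0;
    /* Skip the 4-byte signature and the 20-byte COFF header, then read
       AddressOfEntryPoint at offset 16 of the optional header. */
    return rd32(buf + pe + 4 + 20 + 16);
}
```

The value returned is relative to the module's base address, so the address to jump to is the base the module was placed at plus this value.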

One additional property is that a PE file can be arranged so that it may “execute in place”.  This means that if the file data is placed at a particular virtual address, no changes need to be made to the program code in the file in order for it to address other code and data at the correct addresses.  For example, I can tell the Microsoft linker program to place the kernel program file at the virtual address 0x80000000.  Then references to code (function entry points) will be placed in the EXE file such that other code can jump to them by address.  If function foo() is at address 0x80001000 and inside its body it calls a function bar() which sits at address 0x80005000, there will be an instruction stored directly in the program code for ‘foo()’ that calls to 0x80005000.  (In the accompanying diagram, the dotted lines simply mark where each function’s code starts and ends.)

If the EXE program file for the kernel could not sit at 0x80000000 and had to be moved, the ‘bar()’ function would move with it and the call instruction in ‘foo()’ would have to be changed to have the correct, new address.  Otherwise it would call to the wrong place:

 

You can see in the example above that if the kernel EXE file that is designed to be placed at 0x80000000 is loaded at 0x80050000 instead, the instructions in the program will be incorrect.

The process of changing an EXE or DLL program file after it has been loaded to reflect the actual load address is called “fixing up”.  Records are placed in a standard EXE file which allow the program file to be fixed up.  However, until the fixup process is done the addresses of functions in the EXE will be incorrect.  To get around this, Windows CE kernel EXE files are fixed up beforehand to be loaded at a specific address.  A program called ROMIMAGE actually pre-processes the kernel EXE file and some DLLs that are used and fixes them up when it builds the operating system image file (NK.BIN).
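Stripped to its core, fixing up means adding the difference between the actual and the intended load address to every stored absolute address. The sketch below is a simplification under my own assumptions: the module is viewed as an array of 32-bit words and the fixup records are just word indexes, which is not the real PE relocation record format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Adds the load delta to every 32-bit absolute address listed in `sites`.
   image: module contents viewed as 32-bit words (a simplification)
   sites: indexes of the words that hold absolute addresses */
static void apply_fixups(uint32_t *image, const size_t *sites, size_t n_sites,
                         uint32_t preferred_base, uint32_t actual_base)
{
    assert(image != NULL && (sites != NULL || n_sites == 0));
    uint32_t delta = actual_base - preferred_base;  /* wraps correctly */
    for (size_t i = 0; i < n_sites; i++)
        image[sites[i]] += delta;
}
```

Using the numbers from the example above: a stored call target of 0x80005000 in a module built for 0x80000000 but loaded at 0x80050000 becomes 0x80055000 after the fixup. ROMIMAGE does the equivalent work at image build time, so nothing has to be patched at boot.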

To recap, we get a fixed-up EXE file which is called NK.EXE which contains portions of the operating system kernel.  This EXE has an entry point defined in it, the same as every other COFF EXE or DLL.  For execution to start, the bootloader for the system is supposed to put the image file at the right address, find this EXE entry point and jump to it. The bootloader is a separate discussion, and its startup and execution is very platform-specific.  For the context of this article, we will simply assume that a bootloader places the OS image file into memory at a specific address.  We will see below how the bootloader can find the NK.EXE file within the image and then find its entry point.

The NK.EXE is only part of the Windows Embedded CE 6.0 kernel – it comprises the OEM Adaptation Layer (OAL) and boilerplate code to start the system.  The main portion of the operating system kernel that does all the process, thread and memory functionality lives in a Microsoft-supplied DLL called ‘kernel.dll’.  This is a DLL which is also ‘fixed up’ by the ROMIMAGE program to live at a specific virtual address in memory.  So this means there are at least two executable modules that we need to know the location and the entry point of.  The entry point address is stored inside the EXE or DLL file, but what about the location of the EXE and DLL files inside the image?

Windows CE images have an important structure set up by ROMIMAGE that is placed into the image file, called the “Table Of Contents”, or TOC.  This TOC holds pointers and metadata for the operating system image file.  Somewhere near the beginning of the image file a marker is placed: the “CECE” signature (the 32-bit value 0x43454345, where 0x43 and 0x45 are the ASCII codes for ‘C’ and ‘E’).  Right after this marker is placed an offset to the TOC.  This allows a bootloader or other program looking at the file to be able to find information about the image.  In addition to this offset value that is prefixed by a marker, the OAL must define a public symbol called ‘pTOC’ (exported using ‘C’ naming conventions), which ROMIMAGE can find and fill in with the virtual address of the TOC when it prepares the image file.  When compiled, the pTOC variable in the NK.EXE must have the value 0xFFFFFFFF.  When it prepares the NK.BIN OS system image, ROMIMAGE does the following (in addition to other tasks):

  1. Load NK.EXE and fix it up.
  2. Make the TOC and find a place for it in the image file (it will live in virtual memory when the OS image is loaded).
  3. Find the ‘pTOC’ variable in the NK.EXE file and make sure it has the current value 0xFFFFFFFF.
  4. Set the pTOC variable value to the virtual address of the TOC that was created in step (2).

This way, when the NK.EXE starts it can reference this variable to know where the TOC is.  Using the TOC, the program can find all the other pieces of the operating system image.

ROMIMAGE uses the configuration .BIB files to know where the image is supposed to go and where RAM is.  There are two important parts of the CONFIG.BIB file – the RAMIMAGE and the RAM lines.  Here is an example from the Device Emulator’s CONFIG.BIB:

    NK      0x80070000   0x02000000    RAMIMAGE
    RAM     0x82070000   0x01E7F000    RAM

These entries tell ROMIMAGE what to do.  It knows to place the OS image file at 0x80070000, and that it can start using read/write memory at 0x82070000.  With this information it can place modules such as NK.EXE and KERNEL.DLL into virtual memory, and then build a TOC and put that into the image as well.  To help the kernel start, the TOC also contains information on where RAM is.  A more detailed look at what is in memory when the image file has been placed is shown below:

In order for the actual operating system to start, the bootloader needs to:

  1. Put the image file at the right place in memory.
  2. Find the “CECE” marker.
  3. Use the TOC pointer that comes right after it to find the TOC.
  4. Search the TOC for the “NK.EXE” file entry.
  5. Scan the EXE file to find its entry point (it is a standard PE format file).
  6. Jump to the address that corresponds to the entry point.
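Steps 2 and 3 can be sketched as a scan over the start of the image. The assumptions here are mine: the marker is taken to be the 32-bit value 0x43454345 (ASCII ‘C’ and ‘E’ pairs) stored little-endian on a 4-byte boundary, and the function simply returns the 32-bit value that follows it. A real bootloader would typically know the exact offset from the OS headers instead of scanning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ROM_MARKER 0x43454345u   /* assumed marker value ('C','E' pairs) */

/* Reads a little-endian 32-bit value from a byte buffer. */
static uint32_t rd32le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Scans the first `limit` bytes of the image on 4-byte boundaries for the
   marker. On success, stores the 32-bit value that immediately follows
   the marker in *toc_value and returns true. */
static bool find_toc_value(const uint8_t *image, size_t len, size_t limit,
                           uint32_t *toc_value)
{
    assert(image != NULL && toc_value != NULL);
    if (limit > len)
        limit = len;
    for (size_t off = 0; off + 8 <= limit; off += 4) {
        if (rd32le(image + off) == EX_ROM_MARKER) {
            *toc_value = rd32le(image + off + 4);
            return true;
        }
    }
    return false;
}
```

From the TOC the bootloader can then look up the NK.EXE entry and continue with steps 4 through 6.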

The really interesting stuff happens once the NK.EXE program is started.  In broad strokes, it has its own tasks to perform:

  1. Set up virtual memory and turn it on.
  2. Gather important information that the KERNEL.DLL will need to use to run the system.
  3. Use the pTOC to scan the TOC for the KERNEL.DLL file inside the operating system image.
  4. Find the entry point of KERNEL.DLL (it is a standard PE format file).
  5. Pass critical information gathered in (2) to KERNEL.DLL in a call to its entry point.

We will walk through these activities in detail to better understand them.  Some parts of the startup process are CPU-type-specific.  For instance, the ARM CPU and the X86 CPU have different virtual memory management hardware and mapping structures.  However, to keep things consistent a general process is maintained.  Whenever possible I will attempt to call out any operations specific to an architecture.

When the NK.EXE starts, there are a few prerequisites of the system:

  1. All caches are disabled
  2. The entire RAMIMAGE and RAM regions specified in the CONFIG.BIB file are physically addressable and readable. 
  3. Virtual Memory is in a predefined state (CPU typically executes in physical address mode).

An additional prerequisite can be satisfied before NK.EXE starts, or can be done in the very beginning of NK.EXE execution:

    4. RAM should be writeable without any supplemental configuration (for example, of a memory controller).

These assumptions allow the NK.EXE startup code to do what is necessary to bring any particular system up, and not have to worry about some things being done and others not being done.  Point (3) above may be counterintuitive, but since the kernel must be entirely self-contained, it does not make sense for it to rely on the bootloader to configure virtual memory properly before it starts.  This ‘decouples’ the OS from whatever bootloader is used to start it.

When it starts executing instructions in physical address mode, the first action taken by NK.EXE is to calculate the physical address of the OEMAddressTable symbol.  This is a table that is built into the kernel that defines the static (unchanging) default regions of virtual memory. NK.EXE knows:

  1. Its own location in virtual memory (where it will be executing instructions)
  2. Its own location in physical memory (where it currently is executing instructions)
  3. The virtual address of the OEMAddressTable (it was determined when the NK.EXE was built and subsequently fixed up by ROMIMAGE).

Using this information, a simple calculation tells it the physical address of OEMAddressTable:

NK::PhysicalBase + (NK::Virtual OEMAddressTable - NK::Virtual Base) → NK::Physical OEMAddressTable
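As a minimal sketch of that calculation in C: the function name and the example addresses are my own assumptions for illustration, not symbols from the kernel source.  The only idea it shows is that a symbol sits at the same offset from the start of the image in both address spaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a kernel symbol): translate a virtual address
   inside the NK.EXE image into the matching physical address. */
static uint32_t nk_virt_to_phys(uint32_t nk_phys_base,
                                uint32_t nk_virt_base,
                                uint32_t virt_addr)
{
    /* The address must lie at or above the virtual base of the image. */
    assert(virt_addr >= nk_virt_base);

    /* The offset from the image start is identical in both spaces. */
    return nk_phys_base + (virt_addr - nk_virt_base);
}
```

For example, with an image loaded at physical 0x00200000 and linked to run at virtual 0x80200000, a table at virtual 0x80201234 would be found at physical 0x00201234.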

The OEMAddressTable has triads of DWORDS making up a line in a table, with the following format: 

<region virtual start>          <region physical start>       <region size in MB>

<region virtual start>          <region physical start>       <region size in MB>

...

From the information found in this table, the NK.EXE program can set up the virtual memory mapping tables for the Memory Management Unit (MMU) to function.  Where the MMU-formatted mapping tables are kept and what they look like is platform-specific – the OEMAddressTable is a simplistic format that works for any architecture.  Virtual memory is set up using the data in the OEMAddressTable and enabled, and then the NK.EXE transitions to the virtual address where it can execute code.
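To make the table format concrete, here is a rough sketch in C of walking the triads to translate a virtual address.  The struct and function names are hypothetical, and I am assuming the table ends with an all-zero row.  The real startup code builds MMU page tables from these rows rather than doing lookups like this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One row of the table: three DWORDs, in the order described above. */
typedef struct {
    uint32_t virt_start;  /* region virtual start  */
    uint32_t phys_start;  /* region physical start */
    uint32_t size_mb;     /* region size in MB     */
} AddressTableEntry;

/* Hypothetical lookup: returns 1 and stores the physical address when
   `va` falls inside a region, 0 otherwise.
   Assumption: the table is terminated by a row whose size is zero. */
static int table_virt_to_phys(const AddressTableEntry *table,
                              uint32_t va, uint32_t *pa_out)
{
    assert(table != NULL && pa_out != NULL);

    for (const AddressTableEntry *e = table; e->size_mb != 0; ++e) {
        /* Use 64-bit arithmetic so large regions cannot overflow. */
        uint64_t size_bytes = (uint64_t)e->size_mb << 20;

        if (va >= e->virt_start &&
            (uint64_t)va - e->virt_start < size_bytes) {
            *pa_out = e->phys_start + (va - e->virt_start);
            return 1;
        }
    }
    return 0;  /* address is not covered by any static region */
}
```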

One thing to note at this point is that anything that is supposed to be in RAM that needs to be pre-initialized (set to zero or some other known value) is not yet available.  RAM is still a clean slate and can have any contents whatsoever.  The initialization values in the image file (the .data sections of NK.EXE and other modules) for read/write data must be copied from the image to actual RAM addresses before they can be properly used.  How does the NK.EXE know what to copy or where to place things in virtual RAM for these modules?  The TOC.

The TOC not only lists the start addresses of all modules in the image, but it also describes RAM and where the read/write portions of each module are to be located so that the kernel can work with them.  Pieces of the OS image that need to be copied to RAM are called “copy entries”.  Before the NK.EXE can access its own read/write variables, it needs to copy the copy entries to RAM.  This raises a question: the pTOC is a variable, isn’t it?  How could the NK.EXE know where the pTOC is if it hasn’t been set up?  The answer is that the pTOC is a read-only variable: only ROMIMAGE writes to it, when the image file is created.  The storage for pTOC is not located in RAM, and does not need to be copied before its value can be used.  The function inside NK.EXE that copies all the copy entries described by the pTOC to RAM is typically called “KernelRelocate()”.  It simply walks a table of structures and copies ranges of virtual memory from one place to another.  Once it is finished, all NK.EXE variables can be read from or written to just like those of any other program.
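The copy loop can be sketched roughly as follows.  This is not the actual KernelRelocate() source: the CopyEntry layout and field names are assumptions for illustration, and zero-filling the part of the destination beyond the copied bytes is my assumption about how uninitialized read/write data gets a known value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical copy-entry layout, for illustration only. */
typedef struct {
    const void *source;    /* initial values stored in the image      */
    void       *dest;      /* where the data must live in RAM         */
    size_t      copy_len;  /* bytes of initialized data to copy       */
    size_t      dest_len;  /* total size of the destination region    */
} CopyEntry;

/* Walk the table and bring each read/write region to its initial state. */
static void relocate_copy_entries(const CopyEntry *entries, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        const CopyEntry *ce = &entries[i];

        /* The destination must be at least as large as the data copied. */
        assert(ce->dest_len >= ce->copy_len);

        /* Copy the initialized data out of the image into RAM. */
        memcpy(ce->dest, ce->source, ce->copy_len);

        /* Assumption: the remainder of the region starts out as zero. */
        memset((unsigned char *)ce->dest + ce->copy_len, 0,
               ce->dest_len - ce->copy_len);
    }
}
```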

 

At this point we have a working program, just like any other program for decades past.  It executes instructions, can call functions, and can read and write memory locations.  There are no threads, no processes, and no operating system constructs, but everything is placed in a known location and can be accessed to let us do the rest of the startup of the higher-level systems.

Virtual Memory allows a tremendous amount of flexibility.  Windows CE reserves a few regions of the virtual address range for its own private use inside the OS kernel.  There are several ranges of 4k ‘pages’ of virtual memory that are set aside in the highest address ranges, from about 0xFFFE0000 upwards.  The kernel maps some physical memory into this range to store its ‘global’ dynamic data.  Some of this memory can be used for memory mapping tables for an architecture-specific MMU.  Some is reserved for the kernel-mode and interrupt stacks.  Most importantly, at least one of the 4k pages is reserved specifically as a ‘Kernel Data Page’.  This page contains a plethora of data fields that are specific to a version of the kernel.  The NK.EXE sets up the location and initial contents of this page directly.

Three important values stored in the structure by NK.EXE:

  1. A copy of pTOC
  2. The address of OEMAddressTable. 
  3. The address of the function OEMInitGlobals()

The first two pieces of information are placed in the Kernel Data Page so that any code that knows the address of the Page can find what is in the OS image and the basic layout of virtual memory.  The last piece of information is specifically used so that the NK.EXE contents can be used once control has been passed to KERNEL.DLL.  In general, the contents of the reserved portion of virtual memory look like:

 

Now that the Kernel data page has been initialized and virtual memory is active, we can jump into the Microsoft KERNEL.DLL executable’s entry point.  Remember, we can find the KERNEL.DLL file in the image by using the TOC, and then we can scan for the entry point of the module.  Even though NK.EXE knows where it is going to put the kernel data page in virtual memory beforehand, the KERNEL.DLL cannot assume its location.  Therefore, we pass the virtual address of the kernel data page to the entry point of KERNEL.DLL.  Although the Microsoft code can call back into the NK.EXE function addresses, control is never fully restored to the NK.EXE program.

After the jump, we are now executing Microsoft kernel code.  The code at the entry point is given the address of the Kernel Data Page and, through its fields, the TOC, so it can find anything it needs to know about the OS image.  The kernel does some basic setup of its own and sets some critical data fields for its own use into the Kernel Data Page.

The KERNEL.DLL has a static table of functions and data, called “NKGlobals”, which is built into its DLL simply as a static data structure.  Since the KERNEL.DLL is fixed up by ROMIMAGE to run from a particular virtual address, the function pointers in the NKGlobals will be correct when the KERNEL.DLL code starts to run.  Some of the functions pointed to by this structure are ones like SetLastError() and NKwvsprintfW().  These are routines that the NK.EXE is allowed to call directly.  However, it is important to note that at this point the NK.EXE does not know where these functions are in KERNEL.DLL – it still needs to be told where this table of functions and data is inside KERNEL.DLL .

The KERNEL.DLL passes the address of “NKGlobals” back to NK.EXE in a function call to OEMInitGlobals(), the address of which was left in the Kernel Data Page.  So, in essence the function call graph looks like this:

 

As shown above, the OEMInitGlobals() function stores a pointer to the NKGlobals structure that resides in KERNEL.DLL.  After it stores this pointer, NK.EXE can use it to find the addresses of the KERNEL.DLL functions it is allowed to call.

OEMInitGlobals also passes back (via function return value) a pointer to its own structure, called “OEMGlobals”.  The kernel relies on this structure to reach all the platform-specific functionality inside NK.EXE.  The KERNEL.DLL module is constructed so that it will run on any processor belonging to a certain architecture (X86, ARM, etc).  The NK.EXE is the abstraction of a specific species of the architecture (such as an XSCALE or OMAP processor) and the platform that supports that architecture.  The OEMGlobals structure consists of function pointers and data just like NKGlobals.  Some of its members include:

  • PFN_InitDebugSerial(), PFN_WriteDebugByte(), PFN_ReadDebugByte()
  • PFN_SetRealTime(), PFN_GetRealTime(), PFN_SetAlarmTime()
  • PFN_Ioctl()

These function pointers point to the legacy OEM functions like OEMInitDebugSerial and OEMIoctl that live inside NK.EXE.  Many other functions are listed so that KERNEL.DLL can do what is necessary for a particular platform.  The functions are fairly self-explanatory in name and are well documented on MSDN.

Once the call to OEMInitGlobals() completes, the KERNEL.DLL has everything it needs to do architecture-generic and platform-specific processing.  It knows where memory is and how it is laid out virtually, as well as the location of every module in the image.  The NK.EXE also has a pointer to a table of functions it can call.  In essence, the two code modules have executed a manual ‘handshake’ by executing a simplistic method of manual dynamic linking.
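The handshake can be modeled in a few lines of C.  Everything here is a heavily reduced stand-in: the struct names ending in "Sketch", their single members, and the helper functions are all hypothetical, and the real NKGlobals and OEMGlobals structures carry many more fields.  The point is only the shape of the exchange: one table pointer goes in as an argument, the other comes back as the return value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-member stand-ins for the real tables. */
typedef struct {
    void (*set_last_error)(unsigned long code);   /* kernel service */
} NKGlobalsSketch;

typedef struct {
    void (*init_debug_serial)(void);              /* OEM service */
} OEMGlobalsSketch;

typedef OEMGlobalsSketch *(*InitGlobalsFn)(NKGlobalsSketch *nk);

/* Stand-in for the Kernel Data Page field that NK.EXE fills in. */
typedef struct {
    InitGlobalsFn oem_init_globals;
} KDataSketch;

/* ---- "NK.EXE" side ---- */
static NKGlobalsSketch *g_nk_from_kernel;   /* saved during the handshake */
static int g_serial_ready;

static void oem_init_debug_serial(void) { g_serial_ready = 1; }

static OEMGlobalsSketch g_oem_globals = { oem_init_debug_serial };

static OEMGlobalsSketch *oem_init_globals(NKGlobalsSketch *nk)
{
    assert(nk != NULL);
    g_nk_from_kernel = nk;     /* remember the kernel's table        */
    return &g_oem_globals;     /* hand our own table back in return  */
}

/* ---- "KERNEL.DLL" side ---- */
static unsigned long g_last_error;

static void nk_set_last_error(unsigned long code) { g_last_error = code; }

static NKGlobalsSketch g_nk_globals = { nk_set_last_error };

/* The kernel finds the OEM function through the data page and calls it. */
static OEMGlobalsSketch *kernel_entry(KDataSketch *kdata)
{
    return kdata->oem_init_globals(&g_nk_globals);
}
```

After kernel_entry() returns, each side holds a pointer to the other's table and can call through it, which is all that "manual dynamic linking" amounts to here.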

Everything up to this point that NK.EXE and KERNEL.DLL have done has been done without any processes or threads, and without any kernel services running. To bring the rest of the system up, the KERNEL.DLL has to do three things:

  1. Architecture-specific setup
  2. Architecture-neutral setup
  3. Platform-specific setup (specific CPU and BSP initialization)

The architecture-specific setup is done first by a call to a KERNEL.DLL function called <architecture>Setup.  On an ARM platform this would be called ARMSetup().  On an X86 platform this would be called X86Setup().  The actions taken by the architecture-specific code are numerous, but they all execute in a single-threaded context with no processes running.  The actions taken here include but are not limited to:

  • Set up hard required page tables and reserve VM for kernel page tables
  • Update cache information in Page Tables
  • Flush the Translation Lookaside Buffer (TLB)
  • Set up architecture-specific buses and components (companion chips, coprocessors, etc).

The one other thing this architecture-specific code does is set up the Interlocked API code so that NK.EXE knows where it is and can call it.  This is a bit of an aside, but I will explain in detail because it is a critically important piece of the OS.

Even at the most basic level, Windows CE needs to coordinate actions among different threads of execution – even some that run inside the kernel, outside the scope of any specific process.  The mechanism used to do this with the highest amount of efficiency is the Interlocked API.  The API consists of a handful of functions, the most important of which is InterlockedCompareExchange().  The purpose of this function is to:

  1. Read a memory location (M) into register (R)
  2. Compare the value read (R) with a match value in another register (R2)
  3. If (R) and (R2) are not equal, exit
  4. Write the value of another register (R3) back to memory location (M)

These four steps are meant to execute atomically, and they form the basis of coordination between different threads.  That is, there should be no interruption between each of (1), (2), (3) and (4).  The only way to guarantee this on some of today’s processors where the operation is not available directly in hardware is to ensure interrupts are disabled.  Herein lies a problem, since user-mode processes do not have sufficient privilege to disable interrupts, and it would be very inefficient to have to do a system call to the kernel and disable interrupts every time two threads wanted to coordinate with each other.
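For reference, here are the four steps written out as a plain C model.  The function names are made up, and this version is deliberately NOT atomic; it only shows the semantics that the real routine must provide without interruption.  Like the Win32 InterlockedCompareExchange(), the model returns the value it read in step (1), which is what lets a caller tell whether the write happened.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Non-atomic reference model of steps (1) to (4). */
static int32_t compare_exchange_model(volatile int32_t *m,
                                      int32_t exchange,
                                      int32_t comparand)
{
    assert(m != NULL);

    int32_t r = *m;            /* (1) read memory location M      */
    if (r == comparand) {      /* (2)+(3) compare, exit on mismatch */
        *m = exchange;         /* (4) write the new value to M    */
    }
    return r;                  /* caller compares this to comparand */
}

/* Typical usage pattern: retry until no one changed M in between. */
static int32_t increment_model(volatile int32_t *m)
{
    int32_t old;
    do {
        old = *m;
    } while (compare_exchange_model(m, old + 1, old) != old);
    return old + 1;
}
```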

To be efficient, there is one single place in the entire system where the InterlockedCompareExchange() happens.  The code for the four steps above is placed in the Kernel Data Page, at a particular location that is well known.   Then the NK.EXE and KERNEL.DLL (and any process which has the Kernel Data Page mapped) can call the code, and the instructions all occur in the same place.  This is done so that the API is restartable.  What does this mean?  Why do we do this?

Thread switches in an operating system can happen for three reasons:

  • It has been specifically requested by the executing thread
  • The thread’s time-slice has expired (noted by a timer interrupt event) and it is another thread’s turn to run. 
  • Another type of interrupt occurs, which causes a situation where a thread of higher priority should execute.

The last two cases are really the same: an interrupt occurs that ultimately causes a thread switch.  Since an interrupt can occur between any of the steps (1) to (4) and potentially switch out the thread, the operation we needed to be atomic might not be; some other thread might run in between (2) and (3), for example.

To ensure that the instructions (1) to (4) occur atomically, every time there is an interrupt a simple bounds check is made to see if the CPU was currently executing somewhere in (1) to (4).  If the interrupt occurred when the CPU was executing after (1) and before (4), then the instruction pointer for the current thread is reset to point to instruction (1), so that the operation may be retried.  In order for the interrupt code to be able to check if the CPU was executing in between (1) and (4), the code for it must be in a single known location.  That location is inside the Kernel Data Page.
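The bounds check might look something like the sketch below.  The context structure, the function name, and the address convention are all assumptions of mine: I am treating the saved instruction pointer as the address of the next instruction to execute, seq_start as the address of instruction (1), and seq_last as the address of the final store in step (4).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical saved state of the interrupted thread. */
typedef struct {
    uint32_t pc;   /* next instruction the thread would execute */
} SavedContext;

/* Called on every interrupt.  If the thread had started the sequence but
   not yet executed the final store, rewind it to the first instruction. */
static void fixup_interlocked_restart(SavedContext *ctx,
                                      uint32_t seq_start,
                                      uint32_t seq_last)
{
    assert(seq_start <= seq_last);

    /* pc == seq_start: nothing has run yet, so no rewind is needed.
       pc >  seq_last : the store already completed, leave it alone. */
    if (ctx->pc > seq_start && ctx->pc <= seq_last) {
        ctx->pc = seq_start;
    }
}
```

Because the check is just two comparisons against fixed addresses, it adds very little cost to every interrupt, which is why the sequence has to live at one known location.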

 Once the Interlocked API code has been copied to the Kernel Data Page, the NK.EXE knows where it is and can coordinate actions with KERNEL.DLL when multiple threads become active – ultimately by using the Interlocked API.

Returning to our main discussion, the next step in the KERNEL.DLL startup is the architecture-neutral setup.  One of the first architecture-neutral steps is to check whether the OS image includes a KITL.DLL to allow communication with and debugging of the OS kernel.

 KITL stands for “Kernel Independent Transport Layer”.  This is basically a mechanism by which data ‘packets’ specific to the Windows CE system can be passed between the kernel of the device and Platform Builder running on the desktop.  Usually, the portions of KITL which are implemented in NK.EXE purely revolve around the encoding for transport and the transport of the data packets.  A Board Support Package (BSP) does not have to know anything about the data being sent and received between the device and the desktop – it just has to facilitate the correct transmission and reception.  Mechanisms for transport of the KITL packets include but are not limited to RS232 Serial, Ethernet, and USB.  A full description of KITL is beyond the scope of this blog article.

 Other actions that happen during the architecture-neutral setup include:

  1. Initialize Kernel Debug Output (by calling OEMInitDebugSerial() through the function pointer in the OEMGlobals structure)
  2. Write a masthead debug string (“Windows CE Kernel Version xxxx”) to the debug output.
  3. Select the kernel processor type from the available options

When the architecture-neutral portions have been completed, we can do the platform-specific setup.  This code lives in NK.EXE since it is OEM and board specific.  To initialize this part, the kernel calls into OEMInit() through the function pointer that is in the OEMGlobals structure.  OEMInit does board-specific initialization, and can do one other important thing – start KITL.

If KITL is built into the NK.EXE, then its functions are directly accessible from NK.EXE.  If KITL is in a DLL, then that DLL will have been loaded by the kernel at the beginning of the architecture-neutral setup, as shown above.  In either event, the OEMInit() function can call a Kernel IO control saying that KITL should be started.  Based on whether the KITL.DLL was found or not, the kernel knows what to do.

Upon return from OEMInit(), the kernel is ready to start processes and threads to run.  It synchronizes its cache, and then enters the processor architecture’s service mode if it is not already running in it.  Then it does any one-time inits that do not require a current thread. These actions include:

  1. Enumerate available Memory  (optional call to OEMEnumExtensionDRAM() )
  2. Initialize critical sections in the kernel (critical section code uses the Interlocked API, the setup of which was discussed above).
  3. Initialize heap structures
  4. Initialize process and thread tracking structures
  5. Any other actions done before multi-threading is enabled.

After all single-threaded initialization is done, the kernel is ready to schedule the first thread.  This first thread is called “SystemStartupFunc()”, and lives in KERNEL.DLL.  To start the thread, the kernel specifies that there is no current thread to switch from, sets the first thread as the only one available to run, then calls into the thread scheduler code.   The scheduler code takes a look at all available threads and chooses the next one to run.  At this point in startup we only have one thread that has been manually set up to run, so that one is the one that is switched to.
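As a toy illustration of why the scheduler needs no special case for the first thread: if it simply picks the best runnable thread from whatever is available, a list containing one thread yields that thread.  The structure and function below are hypothetical and far simpler than the real scheduler.  The sketch assumes the Windows CE convention that a lower priority number means a more urgent thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, minimal thread record. */
typedef struct {
    const char *name;
    int priority;   /* lower value = more urgent (0 is highest) */
    int runnable;   /* non-zero when the thread is ready to run */
} ThreadSketch;

/* Return the most urgent runnable thread, or NULL if none is ready. */
static ThreadSketch *pick_next(ThreadSketch *threads, size_t count)
{
    assert(threads != NULL || count == 0);

    ThreadSketch *best = NULL;
    for (size_t i = 0; i < count; ++i) {
        if (threads[i].runnable &&
            (best == NULL || threads[i].priority < best->priority)) {
            best = &threads[i];
        }
    }
    return best;
}
```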

 The SystemStartupFunc() function begins execution by flushing the system cache, then does things that require a ‘current’ thread to be running in order to happen.  These actions include:

  1. Initialize the system loader
  2. Initialize the paging pool
  3. Initialize system logging
  4. Initialize system debugger

The SystemStartupFunc() will call one more OEM function before it completes initialization – it will call the OEMIoctl() function through the function pointer in the OEMGlobals, with an argument ‘IOCTL_HAL_POSTINIT’.  This tells the NK.EXE that all system startup has completed and we are about to schedule threads and processes.

Upon exit from this first call to OEMIoctl(), the SystemStartupFunc() initializes the system message queue, any watchdogs, and then creates and starts the threads for the power manager and file system.  Thus, the rest of the higher-level parts of the operating system begin to execute here.  The last operation taken by the SystemStartupFunc() is to create another thread which executes the function “RunAppsAtStartup()”.  This function creates the first user processes.

We are now at the point where the kernel, power manager, and file system are all executing, and the applications that the system registry lists to run at startup can begin to execute.

This concludes the blog entry on how Windows Embedded CE 6.0 starts.  The internals of Windows CE are quite interesting and very well structured, and the startup process described above gives insight into the most critical system components.  In the future I hope to publish other articles on the internals of the system registry, the file system, and the device and power managers.
