Here's a quick blog about an issue that we just hit today; most will merely find it interesting, but I hope it saves someone somewhere a little time, effort, and confusion.

We recently got a new codec library drop which we integrated into our mainline code tree. The codec team spends a lot of time developing optimized ARM versions of Windows Media codecs, and every once in a while we get a new library that we need to integrate into our build system.

When we checked the library into our source tree and ran a Smartphone build, we got roughly this error from one of our build tools:

    wmvdmod.dll(0) : fatal error RM0024 : Input File has more than 16 sections

In the ensuing investigation we discovered two things we hadn't previously known:

1. Our codec team has been subdividing their C/C++/Assembly language routines into multiple sections to keep certain code paths together and improve cache/page hit rates. As a result, they had created about 14 extra sections with names like ".decodeX_Pass1" (names changed to protect the innocent ;-). In general, one can view this type of information for any lib or dll by running "dumpbin -headers" on it.

2. Windows CE has some limitations on the number of sections that a module can contain (due to design decisions in the kernel and ROM image filesystems). Ultimately this results in a limit of 16 sections for some scenarios, which is the case we hit in our build tools.

The simplest short-term solution to this problem was to use the merge linker directive to force the linker to merge the different sections in the library back into the .text section. To accomplish this, we added something like the following to the appropriate sources file. This solved the build error without the need to rebuild the library (at the expense of removing all the goodness of using multiple sections to control code placement).

LDEFINES=$(LDEFINES) \
 -merge:.decodeX_Pass1=.text \
 -merge:.decodeX_Pass2=.text \
 -merge:.decodeY_Pass1=.text \
 -merge:.decodeY_Pass2=.text \
...

Note: I'm told one can accomplish the same thing within c/cpp files using #pragma comment(linker,"-merge:.foo=.bar")

In the ensuing discussion of how to fix this in the correct way (e.g. removing the restriction on the number of sections, or using fewer sections in the codec lib), our compiler/linker guru came down firmly on the side that there's no reason to need more than 16 sections (or really more than four or five), and noted that this whole situation could have been easily avoided using the following techniques:

For performance, if you want page alignment, use __declspec(align).  If you need to control code layout, use the linker’s /ORDER switch with a file containing the symbol ordering you need.  Alternatively, use the linker’s automatic sorting of section suffixes, e.g. .text$FOO_A, .text$FOO_B, and .text$FOO_C are automatically merged with .text in alphabetical order.

 

We didn't previously know about the linker options to automatically sort and merge sections using the $ delimiter, and I suspect that most other people don't either. We'll now go back to the codec team and suggest that future drops can just use the automatic sorting mechanism to ensure that code is grouped as needed while keeping all the code in the .text section. As a nice side benefit, grouping code into the same section saves on the amount of ROM required for the code. Each section must start on a 4k boundary, so on average each section will waste 2k of ROM. Note that section names are case sensitive, so .TEXT is not the same as .text.
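
To put rough numbers on the ROM savings, here is a small sketch of the padding arithmetic. It is only an illustration: the function names are made up, and it assumes the only cost is rounding each section up to the next 4k boundary (it ignores any alignment the linker applies inside a section).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes of padding needed to round `size` up to the next multiple of
// `alignment`. The 4096 default matches the 4k boundary described above.
inline std::size_t PaddingFor(std::size_t size, std::size_t alignment = 4096) {
    assert(alignment > 0);
    return (alignment - size % alignment) % alignment;
}

// Total padding when every entry in `sizes` is emitted as its own section.
inline std::size_t PaddingIfSeparate(const std::vector<std::size_t>& sizes) {
    std::size_t total = 0;
    for (std::size_t s : sizes) total += PaddingFor(s);
    return total;
}

// Padding when all entries are merged into one section (for example .text).
inline std::size_t PaddingIfMerged(const std::vector<std::size_t>& sizes) {
    std::size_t sum = 0;
    for (std::size_t s : sizes) sum += s;
    return PaddingFor(sum);
}
```

For three hypothetical code groups of 1000, 5000 and 300 bytes, keeping them in separate sections pads out 10084 bytes, while merging them into one section pads out only 1892 bytes.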

 

Here are some other related details about sections which I've shamelessly stolen from some other developers here at MS:

Paging:

Code may be paged into a size-limited RAM buffer called a "page pool". The page pool helps limit the RAM impact of code by keeping resident only the code pages currently in use. Code that must always stay resident in RAM can be marked as non-pageable, but this will cause the full extent of that code section to be copied into RAM for as long as the module is loaded.

To limit the footprint of a module in the page pool, it’s best to group the functions and constant data that are in the working set together.  This will allow the working set of code to exist in the page pool in the smallest number of pages. You can group them together using custom section naming. If section names are unique they will each be page-aligned (4k), so unless they truly need unique attributes, it’s best to name them such that automatic section merging can take place. Automatic section merging happens on sections named using a “section_name$subsection_name” convention, such that they all merge into one section named “section_name”.

 

For readability, give the subsection a name related to the grouping reason, such as “initialization”, “debug”, or “core”.

 

Example

To group function1 and function3 together in a custom subsection, you can do the following.

 

#pragma code_seg(".text$initialization")  // Code that follows goes into named subsection

void function1(void)  {return;}

#pragma code_seg()                        // Code that follows goes into default .text section

void function2(void)  {return;}

#pragma code_seg(".text$initialization")  // Code that follows goes into named subsection

void function3(void)  {return;}

#pragma code_seg()                        // Code that follows goes into default .text section

 

Non-Pageable Sections

If you need only a small bit of the code to stay in RAM always for performance or reliability reasons (like time-critical driver code), you can make the module partially pageable by creating a completely new section with custom attributes.

 

The following pragma defines a section called "NonPageableCode" which is set to non-pageable.

 

#pragma comment(linker, "/SECTION:NonPageableCode,ER!P")

 

There is also a newer, more readable way of specifying the section properties which has been available since CE5:

 

#pragma section("NonPageableCode", execute, read, nopage)

 

Now, in the source code, to make a section of code non-pageable, put the following line before the code:

 

#pragma code_seg("NonPageableCode")

 

Afterward, you may use the following line to force following code to be placed back in the default .text section:

 

#pragma code_seg()

 

Tools

DUMPBIN /HEADERS FOO.DLL (to see what sections exist in the module)

 

That's it for now.

 

I haven't seen this information consolidated online, so here it is:

 

A DLL Forwarder is used if you want to export an entry point from one DLL (or, more likely, for historical purposes you've already exported it from one dll), but you want to actually implement it in a different DLL.

 

For example, suppose you want to implement a function Foo() in a DLL I'll call impl.dll, but for whatever reason you need to export it from a DLL I'll call export.dll. Of course, one simple solution is to add code to export.dll to explicitly call into impl.dll. However, there are a couple of issues with this:

 

- There's an additional function call/return overhead.

- If the functions exported from the two DLLs have the same name, you might have some trouble convincing the linker to call the Foo() in impl.dll from the Foo() in export.dll.

 

A forwarder solves this problem by directly interacting with the loader to forward exports from one DLL to another DLL without actually adding anything to the code path.

 

Forwarders are implemented in the .def file of the DLL you're forwarding through, and have the following syntax:

 

EXPORTS

    <FuncName>=<ForwardedDll>.<ForwardedFuncName>|#<ForwardedOrdinal> @<FuncOrdinal> NONAME

 

Where: 

<FuncName> is the name of the function as exported (e.g. from export.dll).

@<FuncOrdinal> is the ordinal of the function as exported (e.g. from export.dll). Note the use of the '@' symbol. Optional.

 

<ForwardedDll> is the name of the DLL into which you're forwarding the call (e.g. into impl.dll). Optional; if not specified, the forwarded function is assumed to be in this dll.

<ForwardedFuncName> is the name of the function in the ForwardedDll.

#<ForwardedOrdinal> is the ordinal of the function in the ForwardedDll. Note the use of the '#' symbol.

 

NONAME is the keyword that causes the linker to throw away the name of the function you're exporting so that it can only be referenced by its ordinal. This saves some space in the DLL and forces all callers to use the ordinal to link or GetProcAddress on the function. Optional.

Note: You need to specify the forwarded function name or ordinal (not both). You'll get slightly better load perf and smaller code size by specifying the ordinal. The ordinal is also necessary if the function in the DLL you're forwarding to is specified as NONAME in its def file.

 

Using our example, to forward Foo() from export.dll to impl.dll, the export.def file would have a line that looks like this:

 

EXPORTS 

    Foo = impl.Foo

 

If you run "dumpbin /exports" against export.dll, you should see an entry showing the forward that looks something like this:

 

    ordinal  hint   RVA      name
    nnnn    mm                Foo (forwarded to Impl.Foo)

 

Tricky detail 1: When linking export.dll, the linker needs to figure out that the function you're exporting is a "C" style function, which it would normally do by looking at the function signature in the code implementation of the function. I've found that the easiest way to work around this is to implement a code stub that is linked into export.dll so it can get the right name in the export.lib file (e.g. decorated/undecorated). The actual code is thrown away at link time, so it doesn't contribute to the size of the DLL. For example, one would need to implement the following and link it into export.dll to make the linker happy.

 

#define FORWARD(fn) extern "C" void fn(){}

FORWARD(Foo)

 

Tricky detail 2: I've run into one issue with ROMIMAGE: if export.dll is in the modules section of the .bib file, but impl.dll is in the files section, ROMIMAGE will generate an error when it tries to resolve the import at makeimg time. This will likely be fixed in the future, but for now it's just something that needs to be avoided.

 

Tricky detail 3: Forwarders are used at load time to forward references between DLLs. The linker will not use forwarders at link time to resolve links within your DLL. Therefore, if the DLL you're forwarding from includes code with references to the function you're forwarding, the link phase of your DLL will fail with an unresolved external error.

 

Tricky example: If you want to export a function at ordinal 1000 (and not export it by name), and want to forward it to a DLL which exports it at ordinal 2000 (without a name), the syntax in your .def file is:

 

EXPORTS

    SHIM_ORD_1000=IMPL.@2000 @1000 NONAME

 

The SHIM_ORD_1000 is an arbitrary name; it's only there to satisfy the def file syntax rules. It doesn't really matter what you call it as long as it doesn't alias to another function exported in your .def. If you then run dumpbin /exports on the resulting dll, you'll see something like:

 

    1000 [NONAME] (forwarded to IMPL.@2000)

 

 

In our previous entry, we talked about how video is synchronized to audio. In this short entry, we will talk about time stamps, master clocks, how adjustments to the master clock are made and how to deal with live streams.

About Reference Clocks and Stream Time

All filters in a filter graph are synchronized to the same clock, the reference clock. The stream time is based off the reference time, but it is relative, and depends on which state the graph is in. For instance, stream time doesn't move when the graph is paused; stream time goes back to 0 after a seek.

DirectShow provides a base class CBaseReferenceClock that implements the IReferenceClock interface. The base class clock object maintains two times internally:

  • internal private time
  • reference time

The internal private time is the actual time kept by the clock, and can be accessed through GetPrivateTime(). The internal private time can go backwards for brief periods of time. The reference time is based off the private time, and cannot go backwards.
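
The relationship between the two times can be sketched like this. This is not the CBaseReferenceClock source, just a minimal illustration of "the private time may step backwards, the reported time may not"; the class name and the RefTime alias are made up for the example.

```cpp
#include <cstdint>

using RefTime = std::int64_t;  // 100 ns units, like DirectShow's REFERENCE_TIME

// Hypothetical stand-in for the clock object: it remembers the last time it
// handed out and never reports anything earlier than that.
class MonotonicClockSketch {
public:
    MonotonicClockSketch() : last_reported_(0) {}

    // `privateTimeNow` plays the role of the value from GetPrivateTime().
    RefTime GetTime(RefTime privateTimeNow) {
        if (privateTimeNow > last_reported_) {
            last_reported_ = privateTimeNow;
        }
        // If the private time went backwards, keep reporting the old value
        // until the private time catches up again.
        return last_reported_;
    }

private:
    RefTime last_reported_;
};
```

So if the private time reads 100, then 90, then 120, callers of GetTime() see 100, 100, 120.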

Whenever a filter provides the reference clock, it will usually inherit from CBaseReferenceClock. It can either override the GetPrivateTime() function to return the time directly from the device (if available), or it can issue adjustments to the stream time through the SetTimeDelta() function. If it chooses the second method, it will need to monitor the difference between the system time and the time provided by the device.

The default reference clock in Windows CE is provided by our audio renderer. It uses the SetTimeDelta() method to issue adjustments to the stream time. To change the reference clock in a filter graph, query the filter graph for the IMediaFilter interface, then use SetSyncSource() to change the reference clock.

All filters can access both the reference time and the stream time. The base filter class CBaseFilter has an m_pClock member. The reference time can be accessed by calling m_pClock->GetTime(), and the stream time is available through the CBaseFilter member function StreamTime().

About Time Stamps & Stream Time

The samples being processed in the filter graph may or may not have a time stamp, which is the media sample start and finish time. The time stamps are used in conjunction with the stream time. If a sample has a time stamp that is greater than the current stream time, it means that the sample is early. If a sample has a time stamp that is smaller than the current stream time, it is late. In a playback scenario, usually a splitter is the one that attaches time stamps to the samples. Filters may use time stamps for different reasons. For instance, time stamps may be used for presentation purposes, or to control the amount of buffering. The video renderer will use time stamps to schedule the samples for presentation, and thus, will end up throttling the video playback pipeline. When a sample arrives at the video renderer, there are several possibilities:

  • no timestamp - sample is scheduled immediately
  • in the future (timestamp > stream time) - video renderer needs to schedule the sample, and will usually call m_pClock->AdviseTime()
  • in the past (timestamp < stream time) - may render immediately, or not render at all.
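
The three cases above boil down to a simple comparison. The sketch below is only an illustration of that decision, not the video renderer's code; the enum and function names are hypothetical, and treating a time stamp exactly equal to the stream time as "due now" is my own assumption.

```cpp
#include <cstdint>

using RefTime = std::int64_t;  // 100 ns units, like DirectShow's REFERENCE_TIME

enum class SampleAction {
    RenderNow,         // no time stamp, or (assumption) due exactly now
    ScheduleForLater,  // early: wait, e.g. via m_pClock->AdviseTime()
    Late               // late: render immediately or drop, renderer's choice
};

// `startTime` is null when the sample carries no time stamp.
inline SampleAction ClassifySample(const RefTime* startTime, RefTime streamTime) {
    if (startTime == nullptr) return SampleAction::RenderNow;
    if (*startTime > streamTime) return SampleAction::ScheduleForLater;
    if (*startTime < streamTime) return SampleAction::Late;
    return SampleAction::RenderNow;
}
```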

The Reference Clock & The Audio Renderer

Let's assume in this section that the default Windows CE audio renderer is the reference clock. The audio renderer uses time stamps and stream time in a different way. Being the reference clock implies that the stream time is controlled by this component, so it will not follow a behavior similar to the video renderer.

In this case, as soon as the audio renderer receives a sample, it is ready to send it out to the audio driver. If the sample is late, it will drop it. Otherwise, it will send it immediately if the audio driver has buffer space available. It will never wait for the time to be right. For cases where the media sample times are not contiguous, that is, the end time of a sample is smaller than the start time of the next sample, the audio renderer will write silence to the driver, and will wait until the start time for the second sample has arrived. In the normal scenario, there is no gap between media samples, so the audio renderer will write samples as fast as it can.

When the default audio renderer has finished processing a sample, it will read the device clock and the system clock, and compute the difference between them. Unfortunately, there is no way to get to the device clock directly, so the audio renderer uses amsndOutGetPosition(), which can be imprecise. The audio renderer will accumulate differences, and will use a low pass filter on these differences. Whenever the average difference goes above a certain threshold, it will issue an adjustment to the stream time through the SetTimeDelta() function. As soon as it does that, all filters calling GetTime() will receive the adjusted time, so the stream time will not be continuous. Note that all other filters that used m_pClock->AdviseTime() to get notified when a certain stream time has arrived (such as the video renderer) do not need to know about the stream time "change" caused by an adjustment. They will be advised when the stream time reaches the desired value.
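
To make the "filter the differences, adjust past a threshold" idea concrete, here is a rough sketch. The actual filter and threshold used by the audio renderer aren't described here, so this swaps in a plain exponential moving average; the class name, the smoothing factor and the reset-after-adjustment behavior are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

using RefTime = std::int64_t;  // 100 ns units, like DirectShow's REFERENCE_TIME

class DriftTrackerSketch {
public:
    // `smoothing` of N means each new measurement moves the average by 1/N
    // of the way toward it (a simple low pass filter).
    DriftTrackerSketch(RefTime threshold, RefTime smoothing)
        : threshold_(threshold), smoothing_(smoothing), average_(0) {
        assert(threshold >= 0 && smoothing >= 1);
    }

    // Feed one (device clock, system clock) pair. Returns 0 while the
    // smoothed difference stays within the threshold; otherwise returns the
    // correction that real code would hand to SetTimeDelta().
    RefTime AddMeasurement(RefTime deviceTime, RefTime systemTime) {
        const RefTime diff = deviceTime - systemTime;
        average_ += (diff - average_) / smoothing_;
        const RefTime magnitude = average_ < 0 ? -average_ : average_;
        if (magnitude <= threshold_) return 0;
        const RefTime correction = average_;
        average_ = 0;  // assumption: start over once the clock is adjusted
        return correction;
    }

private:
    RefTime threshold_;
    RefTime smoothing_;
    RefTime average_;
};
```

The point of the filter is that a single noisy position reading doesn't trigger an adjustment; only a sustained difference pushes the average over the threshold.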

Live Sources & Clock Slaving

If the default Windows CE audio renderer is not the reference clock, it will write samples as it receives them. There's no automatic slave mode in Windows CE, so the audio renderer will not wait for the time to be right before it sends the next sample.

For the case of live streams, there is one interface in our audio renderer that causes a speed up or slow down in the audio driver so that we try to match against the live source. The source filter will usually use IAudioRenderer->SetDriftRate() to control the audio matching speed. In this case, the audio renderer continues to be the master clock.

Another possibility when the audio renderer can't be the master clock is to simulate a slaving mode by inserting a filter in front of the audio renderer. This filter's responsibility would be to throttle the samples, so that they will be delivered just when it is almost time to send them to the audio driver. Of course, more complicated schemes are possible to try to match the rate of the incoming samples, but we will not go there here...

Foreword 

I've been working on a wavedev2 porting guide over the last few weeks and decided that it's better to post what I've got so far rather than wait until it's what I would consider finished. Expect future updates/additions as time allows, and feel free to ask for specific information in the comments.

Overview

 

This whitepaper gives an overview of porting the wavedev2 audio driver to new hardware. For additional background on the history and features of wavedev2, please refer to other articles on http://blogs.msdn.com/medmedia.

Different Versions

 

Like most software, each release of Windows CE includes new features and bug fixes. For this article I’ll be referring to the version of wavedev2 which shipped with Windows CE 6 (AKA Yamazaki) under public\COMMON\oak\drivers\wavedev\wavedev2\ensoniq. This version is backward compatible with previous versions and the porting process is comparable. I’ve included some notes on the differences between Windows CE 6 and previous implementations at the end.

File Layout 

All the files needed to build wavedev2 are in a single directory. For porting purposes, these files can be grouped into the following categories:

1.       Files which are device independent, and which you should not need to touch during the porting process beyond just copying them:

audiosys.h: Proprietary wave message definitions used by wavedev2.

devctxt.cpp: Implementation of device context class.

devctxt.h: Definition of device context class.

input.cpp: Implementation of audio input streams.

makefile: Used by build system

midinote.cpp: Implementation of tone generator.

midistrm.cpp: Implementation of MIDI stream and MIDI parser.

midistrm.h: Definition of MIDI note and stream classes.

mixerdrv.cpp: Implementation of Mixer API classes (* may need to change if you want to take advantage of mixer API extensions).

mixerdrv.h: Definition of MIXER API classes.

output.cpp: Implementation of audio output streams.

strmctxt.cpp: Implementation of base audio stream class.

strmctxt.h: Definition of base, input, output stream classes.

wavemain.cpp: Device driver interface.

wavemain.h: Common include header used by all source files.

wavepdd.h: Basic PCM sample definitions.

wfmtmidi.h: MIDI structure definitions.

2.       Files which are device dependent but are logically part of the wavedev2 driver infrastructure. These files will need to be copied and modified during the port. These files are:

 

hwctxt.cpp: Implementation of hardware context class.

hwctxt.h: Definition of hardware context class.

oemsettings.h: HW-specific definitions used by hw-independent code.

sources: Used by build system.

wavedev2_ensoniq.def: Driver exports. Probably just need to rename.

wavedev2_ensoniq.reg: Registry entries used to install driver.

3.       Files which are device dependent and are not logically part of the driver infrastructure. You can ignore these files in your port (unless they happen to be appropriate to your hardware). For the Ensoniq sample driver, these are:

 

AC97.H: AC97 codec-specific definitions.

Es1371.cpp: Ensoniq 1371-specific functions.

Es1371.h: Ensoniq 1371-specific header.

Hw_ac97.cpp: AC97 codec-specific functions.

 

In the above list, note that the vast majority of files shouldn’t need to be touched, and there is really only one source file and two headers you will need to modify to bring up a driver.

Class Descriptions

Class Overview

 

The wavedev2 driver largely consists of three main base classes:

HardwareContext

 

This class represents the actual audio hardware. This is the only class you will typically need to modify to port the driver. It is the only device dependent class, and takes care of hardware initialization, power management, DMA and Codec control, handling audio interrupts, and any other proprietary features the driver may implement. There is one instantiated HardwareContext object in the driver which is pointed to by the g_pHWContext global variable. I’ll go into detail about each HardwareContext’s methods later.

DeviceContext

 

This class represents a specific audio device. You should not have to modify anything in this class to port the driver. There is a DeviceContext virtual base class, from which are derived an InputDeviceContext and an OutputDeviceContext. A typical wavedev2 audio driver (such as the Ensoniq driver) implements a single input device represented by an InputDeviceContext; and a single output device represented by an OutputDeviceContext. In the Ensoniq sample, these objects are directly embedded within HardwareContext class as member variables.

DeviceContext methods include:

StreamContext

 

This class represents a specific audio stream. You should not have to modify anything in this class to port the driver. There is a StreamContext virtual base class from which are derived a variety of stream classes for various flavors of PCM audio and MIDI data. Each stream is associated with a specific device context. This association is implemented as a linked list of stream contexts hanging off of each device context. In addition, each stream context includes a pointer back to its associated device context.

The class hierarchy is roughly as follows:

StreamContext

        CMidiStream 

        WaveStreamContext

                InputStreamContext

                OutputStreamContext

                    OutputStreamContextM8

                    OutputStreamContextM16

                    OutputStreamContextS8

                    OutputStreamContextS16

The reason for the multitude of output contexts is that the mixing/sample-rate-conversion code on the output side is optimized for each type of PCM data (Stereo/Mono, 8/16-bit samples). This avoids some tests in the inner loop. The same optimization wasn’t done for the input side (input isn’t typically used as often as output, and the code is a little simpler).

StreamContext methods include:

Porting HardwareContext

 

In this section, I’ll go over each of the methods in HardwareContext and describe what they do.

HardwareContext::CreateHWContext

This is a static method which is called during driver initialization (from the WAV_Init code in wavemain.cpp). This function should create and initialize the global g_pHWContext with a new HardwareContext object and call g_pHWContext->Init. You probably won’t need to change this function, as most changes will be in the Init method.

HardwareContext::Init

This method is only called by CreateHWContext, and is where initialization of the hardware is typically implemented.  Portions of this function may need to be modified for new hardware. Its role is to initialize any hardware, allocate DMA buffers, and start up the interrupt service thread. In addition, during initialization it needs to call into some of the device independent sections to initialize them; specifically:

 

-          Call SetBaseSampleRate on each device context to tell it what sample rate the hardware is running at. Note that these functions can be called at any time to tell the device context that the hardware sample rate has changed, but for devices with a fixed sample rate setting this up during initialization makes sense.

-          Call InitMixerControls to initialize the Mixer API support.

HardwareContext::Deinit

This method is called when the driver is unloaded and the system calls WAV_Deinit. In the current design, wave drivers are never unloaded so this method has limited usefulness.

HardwareContext::UpdateOutputGain

HardwareContext::UpdateInputGain

HardwareContext::SetOutputGain

HardwareContext::SetOutputMute

HardwareContext::GetOutputGain

HardwareContext::GetOutputMute

HardwareContext::GetInputMute

HardwareContext::SetInputMute

HardwareContext::GetInputGain

HardwareContext::SetInputGain

 

These methods are associated with the master input and output gain controls provided by the default mixer API implementation and the device gain waveOutSetVolume API. In the Ensoniq implementation these defer processing to the DeviceContext SetGain methods, which automatically handle volume control in software. There is no need to modify the existing code unless you want to handle some aspects of volume control in hardware. However, keep in mind that individual stream gain controls are still handled in software, and there is no additional overhead in handling device gain as well. Therefore, there is no performance advantage in modifying this code to use hardware gain controls.

HardwareContext::StartOutputDMA

This method starts the DMA controller for audio output. This includes:

1.       Check to see if output dma is already running and ignore the call if it is.

2.       Clear the variables that track how much “live” data is in each DMA buffer.

3.       “Prime” the output DMA buffer with data.

4.       Start the DMA channel if (and only if) data was available to be transferred.

The only line you should need to change is the one that specifically turns on the DMA channel, which in the Ensoniq implementation is:

m_CES1371.StartDMAChannel( ES1371_DAC0 );
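
The four steps above can be sketched roughly as follows. This is not the Ensoniq source: the struct, the function name, the 2k page size and the "fill page 0 first, then page 1" priming order are assumptions made for the illustration, and setting a flag stands in for the hardware-specific StartDMAChannel call.

```cpp
#include <cstddef>

constexpr std::size_t kDmaPageSize = 2048;  // assumed: half of a 4k DMA buffer

struct OutputDmaState {
    bool running;              // is the output DMA channel currently active?
    std::size_t liveBytes[2];  // application bytes queued in DMA page 0 and 1
};

// `available` is how many bytes of application data are waiting to be played.
// Returns true if the DMA channel was started by this call.
inline bool StartOutputDmaSketch(OutputDmaState& dma, std::size_t available) {
    // 1. Ignore the call if output DMA is already running.
    if (dma.running) return false;

    // 2. Clear the counters that track "live" data in each DMA page.
    dma.liveBytes[0] = 0;
    dma.liveBytes[1] = 0;

    // 3. Prime the DMA pages with whatever application data is available.
    for (int page = 0; page < 2; ++page) {
        const std::size_t n = available < kDmaPageSize ? available : kDmaPageSize;
        dma.liveBytes[page] = n;
        available -= n;
    }

    // 4. Start the channel if (and only if) some data was transferred.
    if (dma.liveBytes[0] == 0) return false;
    dma.running = true;  // stand-in for the hardware-specific start call
    return true;
}
```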

HardwareContext::StopOutputDMA

This method stops the audio output DMA controller. The only line you should need to change is:

m_CES1371.StopDMAChannel( ES1371_DAC0 );

HardwareContext::StartInputDMA

HardwareContext::StopInputDMA

 

These methods are analogous to the methods described above for the output DMA. However, note that the code to start the input DMA doesn’t need to “prime” the buffer or keep track of how much application data is in the buffer, and is therefore somewhat simpler than the output case.

 

HardwareContext::GetDriverRegValue

HardwareContext::SetDriverRegValue

These methods relate to reading driver-specific registry keys. You should not need to change them.

 

HardwareContext::InitInterruptThread

This method initializes the audio driver’s IST and sets it to a realtime priority. If your driver has a single IST shared by both input and output you will not need to modify this code.

 

HardwareContext::PowerUp

HardwareContext::PowerDown

These methods are called by the system’s power management subsystem. In the Ensoniq driver they are stubbed out.

 

HardwareContext::TransferInputBuffer

HardwareContext::TransferOutputBuffer

These methods are called from the IST to transfer data into or out of the DMA buffers. The code determines the starting address and size of the DMA buffer and passes the information to the device context, which performs the actual transfer. You will not need to modify this code unless you change the organization or data structures representing the DMA buffers.

 

HardwareContext::InterruptThread

This method implements the Interrupt Service Thread which is shared by both input and output DMA. Its operation is basically:

1.       Wait for an input or output DMA done interrupt.

2.       Determine whether an input or output (or both) DMA interrupt occurred.

3.       If an output DMA interrupt occurred:

         - Transfer/mix application data into the DMA buffer that was just completed.

         - If there is no application data remaining in either DMA buffer, halt output DMA.

4.       If an input DMA interrupt occurred:

         - Transfer data out of the DMA buffer that was just completed into application buffers.

         - If we were unable to transfer any data (due to no application buffer being available), halt input DMA.

5.       Go back to step 1.
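
The halt decisions in steps 3 and 4 can be pulled out into a small pure function, which makes them easy to reason about separately from the hardware. This is a sketch only; the struct and function names are hypothetical and the real InterruptThread does the transfers inline.

```cpp
#include <cstddef>

// What one pass through the IST learned after servicing the interrupt(s).
struct IstEvent {
    bool outputInterrupt;          // an output DMA page completed
    bool inputInterrupt;           // an input DMA page completed
    std::size_t outputBytesLive;   // application bytes left across both output
                                   // pages after refilling the completed one
    std::size_t inputBytesCopied;  // bytes moved from the completed input page
                                   // into application buffers
};

struct IstDecision {
    bool haltOutputDma;
    bool haltInputDma;
};

inline IstDecision DecideAfterInterrupt(const IstEvent& e) {
    IstDecision d = {false, false};
    // Step 3: nothing left to play in either page, so stop output DMA.
    if (e.outputInterrupt && e.outputBytesLive == 0) d.haltOutputDma = true;
    // Step 4: no application buffer could take the data, so stop input DMA.
    if (e.inputInterrupt && e.inputBytesCopied == 0) d.haltInputDma = true;
    return d;
}
```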

 

HardwareContext::SetSpeakerEnable

HardwareContext::RecalcSpeakerEnable

HardwareContext::ForceSpeaker

These methods handle the WODM_FORCESPEAKER message which may be used to request that audio data be routed to an auxiliary speaker on the back of the phone (this speaker is typically larger and more powerful than the earpiece speaker, and is used for ringtones). If your hardware supports this functionality, you will need to add code to SetSpeakerEnable to switch the speaker on or off.

 

HardwareContext::PmControlMessage

This method receives messages from the Power Manager IOCTL calls:

IOCTL_POWER_CAPABILITIES

IOCTL_POWER_QUERY

IOCTL_POWER_SET        

IOCTL_POWER_GET

 

You probably will not need to modify this code.

 

HardwareContext::IsSupportedOutputFormat

This method is called during waveOutOpen to allow the OEM to support additional custom audio formats beyond the standard PCM functionality. Normally this method should just return FALSE. The Ensoniq driver supports directly playing WMAPro compressed audio content over its S/PDIF interface and therefore returns TRUE for this specific case.

DMA Buffer Organization and Data Transfer

 

In the Ensoniq implementation, which is fairly typical, input and output are each allocated a DMA buffer using HalAllocateCommonBuffer. The size of each buffer is only 4k (the same as a memory page), so it’s unlikely that the allocation will fail (especially since it takes place during boot). Other implementations may choose to preallocate a fixed area of memory for the audio DMA buffers.

During audio transfer, each DMA buffer is logically subdivided into equally-sized DMA pages 0 and 1, and the hardware is programmed to:

a.       Transfer nonstop from the DMA buffer to the codec, and automatically reload the DMA address register with the start address of the buffer when it reaches the end.

b.      Generate an interrupt to the audio system whenever the DMA address moves either past the midpoint of the buffer (e.g. from page 0 to page 1), or reaches the end and restarts itself (e.g. from page 1 to page 0).

On each DMA interrupt, the HardwareContext code needs to determine which DMA page the DMA controller has just finished copying data into/out of and call the DeviceContext’s TransferBuffer method to copy application data into/out of that buffer.
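
One way to work out which page just completed is to look at where the DMA controller currently is: whichever half it is in now, the other half is the one it just finished. The sketch below assumes the hardware lets you read the current offset within the buffer; the real Ensoniq code may well use interrupt status bits instead, and all names and sizes here are illustrative.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kDmaBufferSize = 4096;                // one 4k DMA buffer
constexpr std::size_t kDmaPageSize = kDmaBufferSize / 2;    // pages 0 and 1

// Page the DMA controller is working in right now.
inline int CurrentPage(std::size_t dmaOffset) {
    assert(dmaOffset < kDmaBufferSize);
    return dmaOffset < kDmaPageSize ? 0 : 1;
}

// Page that was just completed, i.e. the one the driver may refill (output)
// or drain (input) while the controller works on the other one.
inline int CompletedPage(std::size_t dmaOffset) {
    return 1 - CurrentPage(dmaOffset);
}

// Byte offset of the start of a page within the DMA buffer.
inline std::size_t PageStartOffset(int page) {
    assert(page == 0 || page == 1);
    return static_cast<std::size_t>(page) * kDmaPageSize;
}
```

For example, an interrupt taken just after the controller crosses the midpoint reports page 0 as complete, and one taken just after it wraps back to the start reports page 1.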

Buffer Security/Copying

 

(Still working on this section) 

Support for S/PDIF

 

(Still working on this section) 

Differences between Windows CE 5 & Windows CE 6

 

All current versions of Windows Mobile (including Windows Mobile 6) are based on Windows CE 5 or earlier OS releases. However, the most recent version of wavedev2 is shipped with Windows CE 6, and that’s the version I’m examining for this guide. Therefore, it’s important to touch on differences between how the two OSes interact with wave drivers.

For the most part, the audio driver architecture between CE5 and CE6 is the same. Audio drivers written for CE5 can generally run with little or no modification on CE6.

Virtual Addressing Differences

 

When porting the Ensoniq wavedev2 driver from CE5 to CE6, the only change specifically related to moving between operating systems was to surround the call to SetProcPermissions in hwctxt.cpp as follows:

#if (_WINCEOSVER < 600)
    SetProcPermissions((DWORD)-1);
#endif

 

On Windows CE6, the SetProcPermissions and GetProcPermissions APIs are no longer supported due to changes in the virtual memory architecture. They are still exported for backward compatibility, but they have no effect (other than printing a nasty warning message in the debugger). This change bears a little explanation:

On pre-Windows CE6 systems there is a limit of 32 processes, and all processes run in a shared virtual memory space. The system provides cross-process protection to ensure that processes don’t access each other’s memory, and this protection is enforced on a per-thread basis. Device drivers (such as the wave driver) run inside device.exe, which is one of these 32 processes. The Interrupt Service Thread (IST) in the driver is responsible for accessing audio data in various application buffers residing in multiple processes. The audio driver’s IST overrides this protection by calling SetProcPermissions(0xFFFFFFFF); each bit in the parameter represents one of 32 processes in the system, so 0xFFFFFFFF enables access to all of them.

Windows CE 6 adopts a more traditional memory architecture, with the kernel taking the upper 2GB of virtual space and each user process occupying the same lower 2GB region. Switching between user processes involves swapping a new process into the lower 2GB. While this greatly expands the amount of virtual memory available to each user process, it also means that the IST (now running in the kernel) can no longer freely access the address space of an arbitrary process.

To solve this problem, the waveapi middleware (which sits above the audio driver on the stack) now takes care of mapping each application data buffer into the kernel's address space (via CeAllocAsynchronousBuffer). The mapping and unmapping take place during waveOutPrepareHeader/waveOutUnprepareHeader, so the cost of memory management doesn’t impact performance.

 

The video renderer is the last filter in the video pipe, and it is responsible for displaying the output of upstream filters. The video renderer is just a controller for the underlying display driver, and does not do any processing on the image samples themselves.

The video renderer operates in two distinct modes:

  • GDI
  • DirectDraw

When the graph is first connected, the video renderer always tries to connect using GDI, and for that it needs a connection with an RGB media type that matches the display format of the primary monitor. Only when it goes into Paused mode does the video renderer try to allocate surfaces using DirectDraw. This dual mode of operation was designed so that there is always a fallback in case DirectDraw surfaces are not available.

Choosing an Accelerated Media Type

When the video renderer goes into Paused mode, it is time to allocate the DirectDraw surfaces. The video renderer will do so by enumerating all media types of the upstream filter, and then trying to allocate a surface matching that media type. For instance, let's assume the upstream filter is the WMV DMO. It currently supports the following output media types (the preferred media type is the first one):

  • YV12
  • NV12
  • YUY2
  • I420
  • IYUV
  • UYVY
  • YVYU
  • RGB565
  • RGB555
  • RGB32
  • RGB24
  • RGB8

The video renderer will try to allocate flipping overlay surfaces first, then non-flipping surfaces:

    • For each media type of the upstream filter, in the order dictated by the upstream filter
      • try to allocate a flipping surface of that media type
      • If it succeeds, call QueryAccept on the upstream filter's output pin
      • If it succeeds, use it
    • If the previous didn't succeed, try to allocate a primary flipping surface (if enabled)
      • If it succeeds, call QueryAccept on the upstream filter's output pin
      • If it succeeds, use it
    • If the previous didn't succeed, for each media type of the upstream filter, in the order dictated by the upstream filter
      • try to allocate a surface (not flipping) of that media type
      • If it succeeds, call QueryAccept on the upstream filter's output pin
      • If it succeeds, use it

In this way, if the upstream filter has optimized for certain YUV formats, it can control the choice of media type. In case the display driver can also provide a surface of that type, the accelerated media type is chosen. The whole process is driven by the upstream filter, with the display driver in a passive role.
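The three passes above can be sketched roughly as follows. This is only an illustration of the ordering, not the video renderer's actual code: SurfaceKind, SurfaceChoice, ChooseSurface and the two callbacks are hypothetical names, media types are represented as plain strings, and I'm assuming the primary flipping surface is offered upstream using the display's own format (primaryFormat).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum class SurfaceKind { None, FlippingOverlay, PrimaryFlipping, NonFlipping };

struct SurfaceChoice {
    SurfaceKind kind;
    std::string mediaType;
};

// Stand-ins for "ask the display driver for a surface" and
// "call QueryAccept on the upstream filter's output pin".
using TryAllocate = std::function<bool(SurfaceKind, const std::string&)>;
using QueryAccept = std::function<bool(const std::string&)>;

SurfaceChoice ChooseSurface(const std::vector<std::string>& upstreamTypes,
                            bool primaryFlippingEnabled,
                            const std::string& primaryFormat,
                            const TryAllocate& tryAllocate,
                            const QueryAccept& queryAccept)
{
    assert(tryAllocate && queryAccept);

    // Pass 1: flipping surfaces, in the order dictated by the upstream filter.
    for (const std::string& type : upstreamTypes) {
        if (tryAllocate(SurfaceKind::FlippingOverlay, type) && queryAccept(type))
            return {SurfaceKind::FlippingOverlay, type};
    }

    // Pass 2: primary flipping surface, if enabled.
    if (primaryFlippingEnabled &&
        tryAllocate(SurfaceKind::PrimaryFlipping, primaryFormat) &&
        queryAccept(primaryFormat))
        return {SurfaceKind::PrimaryFlipping, primaryFormat};

    // Pass 3: non-flipping surfaces, again in upstream order.
    for (const std::string& type : upstreamTypes) {
        if (tryAllocate(SurfaceKind::NonFlipping, type) && queryAccept(type))
            return {SurfaceKind::NonFlipping, type};
    }

    // Nothing accelerated is available; the caller stays on GDI.
    return {SurfaceKind::None, std::string()};
}
```

Note how the upstream filter's ordering decides the outcome: the first media type for which both the driver and the upstream pin agree wins, even if a later type would also have worked.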

Dynamic Format Changes from the Video Renderer

Of course, for an optimal pipe, we would like to always have overlay flipping surfaces available. Nevertheless, that may not be the case in some situations. For instance, depending on the display driver capabilities, the flipping overlay may only be available when the user is watching the video at its original size. When the user is stretching or shrinking the video, overlays might not be available. This is controlled by the DirectDraw hardware capabilities dwMinOverlayStretch and dwMaxOverlayStretch (see http://msdn2.microsoft.com/en-us/library/aa915204.aspx). So, if the display driver doesn't support overlay stretching, and the video renderer is currently using overlays, it will need to switch to GDI (and thus to RGB format), so that GDI can do the necessary scaling.

Note that every time the upstream filter requests a new buffer from the video renderer, the video renderer will try to return a DirectDraw buffer. If all the conditions for using a DirectDraw buffer are met (clipping, stretching, video memory, etc.), then it will use one. Only if one of the conditions fails will it fall back to GDI.

Debugging Video Renderer Connection Problems

We have seen some common connection problems when initially bringing up new decoder filters and/or capture drivers:

  • Color space converter is inserted in the graph
  • Video renderer doesn't connect
  • YUV surfaces are not being used, just GDI

Analyzing the DirectShow logs 

The first step in this case is to turn on the debug zones for the DirectShow DLL, quartz.dll, and observe the connection and video renderer messages. 

Run your test scenario, and save the debug output. Look for the section that says "Filter Graph Dump", and verify which filters got inserted in the graph. Here's an example of a filter graph dump:

Filter graph dump
Filter 1a199a30 'Video Renderer' Iunknown 1a199a20
    Pin 1a199f10 Input (Input) connected to 1a0e1880
Filter 1a0e1200 'WMVideo & MPEG4 Decoder DMO' Iunknown 1a0e11f0
    Pin 1a0e16e0 in0 (Input) connected to 1a0e0600
    Pin 1a0e1880 out0 (PINDIR_OUTPUT) connected to 1a199f10
    Pin 1a0e1a00 ~out1 (PINDIR_OUTPUT) connected to 0
Filter 1a0ecc60 'ASF ICM Handler' Iunknown 1a0ecc50
    Pin 1a0ecd70 In (Input) connected to 1a0aa3a0
    Pin 1a0e0600 Out (PINDIR_OUTPUT) connected to 1a0e16e0
Filter 1a0ec240 'Audio Renderer' Iunknown 1a0ec230
wo: GetPin, 0
    Pin 1a0ec4e0 Audio Input pin (rendered) (Input) connected to 1a0eb880
Filter 1a0eb220 'WMAudio Decoder DMO' Iunknown 1a0eb210
    Pin 1a0eb660 in0 (Input) connected to 1a0ea800
    Pin 1a0eb880 out0 (PINDIR_OUTPUT) connected to 1a0ec4e0
Filter 1a0e9380 'ASF ACM Handler' Iunknown 1a0e9370
    Pin 1a0e9490 In (Input) connected to 1a0aa000
    Pin 1a0ea800 Out (PINDIR_OUTPUT) connected to 1a0eb660
Filter 1a0a2ae0 '\Hard Disk2\clips\wmv\0-1.asf' Iunknown 1a0a2ad0
    Pin 1a0aa000 Stream 1 (PINDIR_OUTPUT) connected to 1a0e9490
    Pin 1a0aa3a0 Stream 2 (PINDIR_OUTPUT) connected to 1a0ecd70
End of filter graph dump

After that, verify which media type the video renderer is using when trying accelerated mode (and if it succeeded). Search for "Allocating video resources":

Allocating video resources
Initialising DCI/DirectDraw
Searching for direct format
Entering ReleaseSurfaces
Entering HideOverlaySurface
Enumerated 32315659
Entering FindSurface
Entering GetMediaType
Not a RGB format
Entering CreateYUVFlipping
Entering CheckCreateOverlay
GWES Hook fails surface creation. IDirectDraw::CreateSurface fails.
No surface
Entering ReleaseSurfaces
Entering HideOverlaySurface
Enumerated 3231564e
Entering FindSurface
Entering GetMediaType
Not a RGB format
Entering CreateYUVFlipping
Entering CheckCreateOverlay
GWES Hook fails surface creation. IDirectDraw::CreateSurface fails.
No surface
Entering ReleaseSurfaces
Entering HideOverlaySurface
Enumerated 32595559
Entering FindSurface
Entering GetMediaType
Not a RGB format
Entering CreateYUVFlipping
Entering CheckCreateOverlay
Entering InitOverlaySurface
Entering InitDrawFormat
Entering InitDrawFormat
Entering GetDefaultColorKey
Returning default colour key
Entering InitDefaultColourKey
Entering SetSurfaceSize
Preparing source and destination rectangles
Entering ClipPrepare
Entering InitialiseClipper
Entering InitialiseColourKey
overlay color key on
Colour key
No palette
Found AMDDS_YUVFLP surface
Proposing output type  M type MEDIATYPE_Video  S type MEDIASUBTYPE_YUY2

Note in the above log that the video renderer tried to create surfaces in the order specified by the WMV DMO. With the display driver used for this log, it managed to create a YUY2 surface, the WMV decoder's third choice. The last section in this blog entry has more information about FourCC codes.

Here are some solutions for common connection problems we have faced in the past.

Color Space Converter is inserted in the graph 

The number one problem is that the upstream filter doesn't report any RGB format, just YUV formats. If that's the case, the video renderer can't connect directly to the filter since it requires a matching RGB format. Usually, the color space converter will be inserted in the graph in these cases. We don't want this to happen, as it will imply a memory copy of each frame buffer, so we want to make sure the upstream filter does provide RGB formats.

Sometimes the color converter gets inserted in the graph even though the upstream filter does support the needed RGB format. This can happen because the upstream filter requires an alignment other than 1 when the allocator is being negotiated. Currently, the video renderer accepts only 1-byte alignment.

Another common reason for the color converter to be inserted in the graph is that the color bitmasks are missing or incorrect. They must follow the BITMAPINFOHEADER that the upstream filter supplies when its output media types are enumerated. Please make sure that the bitmasks are inserted correctly. For instance, for RGB565, we should have:

        *pdwBitfield++  = 0xF800;       // Red - 5
        *pdwBitfield++  = 0x07E0;       // Green - 6
        *pdwBitfield    = 0x001F;       // Blue - 5
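To show how those three masks relate to the actual pixel layout, here is a small self-contained sketch. It deliberately avoids the Windows headers, so WriteRgb565Masks and PackRgb565 are hypothetical helpers of my own; the only things taken from above are the mask values and their red, green, blue order.

```cpp
#include <cassert>
#include <cstdint>

// The three masks for a 16-bit RGB565 format, in the order they are written.
constexpr std::uint32_t kRed565   = 0xF800;  // top 5 bits
constexpr std::uint32_t kGreen565 = 0x07E0;  // middle 6 bits
constexpr std::uint32_t kBlue565  = 0x001F;  // bottom 5 bits

// Sanity checks: the masks cover all 16 bits and never overlap.
static_assert((kRed565 | kGreen565 | kBlue565) == 0xFFFF, "masks must cover 16 bits");
static_assert((kRed565 & kGreen565) == 0, "red and green overlap");
static_assert((kGreen565 & kBlue565) == 0, "green and blue overlap");
static_assert((kRed565 & kBlue565) == 0, "red and blue overlap");

// Write the masks into the three DWORDs that follow the bitmap header.
inline void WriteRgb565Masks(std::uint32_t* pdwBitfield)
{
    assert(pdwBitfield != nullptr);
    *pdwBitfield++ = kRed565;
    *pdwBitfield++ = kGreen565;
    *pdwBitfield   = kBlue565;
}

// Pack 8-bit channels into one RGB565 pixel laid out according to the masks.
inline std::uint16_t PackRgb565(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    return static_cast<std::uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```

A pure red pixel packs to exactly the red mask, which is an easy way to double check that the masks and the pixel data agree.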

Graph doesn't connect at all

If the upstream filter supports only a subset of YUV formats, and none of them is recognized by the color space converter, then the video renderer cannot connect at all. Again, the solution is for the upstream filter to provide RGB formats.

YUV Surfaces are not used, just GDI

Another common occurrence is for the upstream filter to provide its own allocator. In that case, the video renderer cannot use DirectDraw (as it can't pass upstream memory buffers to DirectDraw). If we want the optimal overlay flipping path, the video renderer *needs* to be the allocator, so that it can provide DirectDraw surfaces upstream.

Surface Types: Controlling Which Surfaces the Video Renderer Creates

You can control which accelerated surfaces the video renderer is allowed to create. This is useful when debugging the connection process, especially to reduce the number of options and the number of attempts made in the display driver. This is controlled via a registry key (see http://msdn2.microsoft.com/en-us/library/aa930626.aspx):

HKEY_LOCAL_MACHINE\Software\Microsoft\DirectX\DirectShow\Video Renderer\SurfaceTypes

The following table shows the AMDDS values for use with the SurfaceTypes named value.

Flag            Hexadecimal value   Description
AMDDS_NONE      0x00                No support for Device Control Interface (DCI) or DirectDraw.
AMDDS_DCIPS     0x01                Use DCI primary surface.
AMDDS_PS        0x02                Use DirectDraw primary surface.
AMDDS_RGBOVR    0x04                RGB overlay surfaces.
AMDDS_YUVOVR    0x08                YUV overlay surfaces.
AMDDS_RGBOFF    0x10                RGB off-screen surfaces.
AMDDS_YUVOFF    0x20                YUV off-screen surfaces.
AMDDS_RGBFLP    0x40                RGB flipping surfaces.
AMDDS_YUVFLP    0x80                YUV flipping surfaces.
AMDDS_ALL       0xFF                Use all available surfaces.
AMDDS_DEFAULT   0xFF                Use all available surfaces.
AMDDS_YUV       0xA8                (AMDDS_YUVOFF | AMDDS_YUVOVR | AMDDS_YUVFLP)
AMDDS_RGB       0x54                (AMDDS_RGBOFF | AMDDS_RGBOVR | AMDDS_RGBFLP)
AMDDS_PRIMARY   0x03                (AMDDS_DCIPS | AMDDS_PS)

If you want to enable only YUV overlay flipping surfaces for debugging purposes, set the SurfaceTypes registry value to AMDDS_YUVFLP (0x80). Remember to turn all surfaces back on after you have finished debugging your problem...
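Since SurfaceTypes is a bitmask, it can help to compute the value you want rather than add hex digits by hand. The sketch below just mirrors the single-bit flags from the table as constants and derives the combined ones with a bitwise OR. IsSurfaceAllowed and WithoutSurface are hypothetical helpers for illustration, not part of DirectShow, and the code doesn't touch the registry.

```cpp
#include <cstdint>

// Single-bit flags, copied from the table above.
constexpr std::uint32_t AMDDS_NONE   = 0x00;
constexpr std::uint32_t AMDDS_DCIPS  = 0x01;
constexpr std::uint32_t AMDDS_PS     = 0x02;
constexpr std::uint32_t AMDDS_RGBOVR = 0x04;
constexpr std::uint32_t AMDDS_YUVOVR = 0x08;
constexpr std::uint32_t AMDDS_RGBOFF = 0x10;
constexpr std::uint32_t AMDDS_YUVOFF = 0x20;
constexpr std::uint32_t AMDDS_RGBFLP = 0x40;
constexpr std::uint32_t AMDDS_YUVFLP = 0x80;
constexpr std::uint32_t AMDDS_ALL    = 0xFF;

// Combined flags are just the OR of the single-bit ones.
constexpr std::uint32_t AMDDS_YUV     = AMDDS_YUVOFF | AMDDS_YUVOVR | AMDDS_YUVFLP;
constexpr std::uint32_t AMDDS_RGB     = AMDDS_RGBOFF | AMDDS_RGBOVR | AMDDS_RGBFLP;
constexpr std::uint32_t AMDDS_PRIMARY = AMDDS_DCIPS | AMDDS_PS;

// True if the given SurfaceTypes value permits at least one of the
// surface kinds named by `flag`.
constexpr bool IsSurfaceAllowed(std::uint32_t surfaceTypes, std::uint32_t flag)
{
    return (surfaceTypes & flag) != 0;
}

// Debugging aid: start from a mask and switch one kind of surface off.
constexpr std::uint32_t WithoutSurface(std::uint32_t surfaceTypes, std::uint32_t flag)
{
    return surfaceTypes & ~flag;
}
```

For example, WithoutSurface(AMDDS_ALL, AMDDS_YUVFLP) gives a value that keeps every surface type except YUV flipping, which is handy when you want to confirm that a problem is specific to the flipping path.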

About FourCC codes:

Note that the example log lists the FOURCC codes being tried in lines such as "Enumerated 32315659". The four characters are stored least significant byte first, so read the bytes of the hex number from right to left to recover the code:

 Enumerated 32315659

0x32 = '2', 0x31 = '1', 0x56 = 'V', 0x59 = 'Y' ===> 0x32315659 = YV12

 Enumerated 3231564e

0x32 = '2', 0x31 = '1', 0x56 = 'V', 0x4e = 'N' ===> 0x3231564e = NV12

 Enumerated 32595559

0x32 = '2', 0x59 = 'Y', 0x55 = 'U', 0x59 = 'Y' ===> 0x32595559 = YUY2

 Enumerated 56555949

0x56 = 'V', 0x55 = 'U', 0x59 = 'Y', 0x49 = 'I' ===> 0x56555949 = IYUV

 Enumerated 59565955

0x59 = 'Y', 0x56 = 'V', 0x59 = 'Y', 0x55 = 'U' ===> 0x59565955 = UYVY

 Enumerated 55595659

0x55 = 'U', 0x59 = 'Y', 0x56 = 'V', 0x59 = 'Y' ===> 0x55595659 = YVYU
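If you would rather not do this by hand while reading logs, the conversion is easy to script. Here is a minimal sketch; FourCCToString and StringToFourCC are helper names I made up, and the only assumption is the byte order shown in the examples above (first character in the least significant byte).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Turn a 32-bit FOURCC value into its four-character code.
// The first character lives in the least significant byte.
inline std::string FourCCToString(std::uint32_t fourcc)
{
    std::string code(4, ' ');
    for (int i = 0; i < 4; ++i)
        code[i] = static_cast<char>((fourcc >> (8 * i)) & 0xFF);
    return code;
}

// The reverse mapping: pack four characters back into a 32-bit value.
inline std::uint32_t StringToFourCC(const std::string& code)
{
    assert(code.size() == 4);
    std::uint32_t fourcc = 0;
    for (int i = 0; i < 4; ++i)
        fourcc |= static_cast<std::uint32_t>(static_cast<unsigned char>(code[i])) << (8 * i);
    return fourcc;
}
```

Feeding it the hex number from an "Enumerated" log line gives you the code name directly, and the reverse helper tells you what to search for in the log when you care about one particular format.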

Also, the file public\directx\sdk\inc\uuids.h contains several FOURCC media subtype definitions:

  • // 32595559-0000-0010-8000-00AA00389B71 'YUY2' == MEDIASUBTYPE_YUY2
  • OUR_GUID_ENTRY(MEDIASUBTYPE_YUY2,
  • 0x32595559, 0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 0x00, 0x38, 0x9b, 0x71)
  • // 55595659-0000-0010-8000-00AA00389B71 'YVYU' == MEDIASUBTYPE_YVYU
  • OUR_GUID_ENTRY(MEDIASUBTYPE_YVYU,
  • 0x55595659, 0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 0x00, 0x38, 0x9b, 0x71)
  • // 59565955-0000-0010-8000-00AA00389B71 'UYVY' == MEDIASUBTYPE_UYVY
  • OUR_GUID_ENTRY(MEDIASUBTYPE_UYVY,
  • 0x59565955, 0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 0x00, 0x38, 0x9b, 0x71)
  • // 31313259-0000-0010-8000-00AA00389B71 'Y211' == MEDIASUBTYPE_Y211
  • OUR_GUID_ENTRY(MEDIASUBTYPE_Y211,
  • 0x31313259, 0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 0x00, 0x38, 0x9b, 0x71)
  • // 32315659-0000-0010-8000-00AA00389B71 'YV12' == MEDIASUBTYPE_YV12
  • OUR_GUID_ENTRY(MEDIASUBTYPE_YV12,
  • 0x32315659, 0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 0x00, 0x38, 0x9b, 0x71)
  • // 36313259-0000-0010-8000-00AA00389B71 'YV16' == MEDIASUBTYPE_YV16
  • OUR_GUID_ENTRY(MEDIASUBTYPE_YV16,
  • 0x36315659, 0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 0x00, 0x38, 0x9b, 0x71)
  • // 56595549-0000-0010-8000-00AA00389B71 'IUYV' == MEDIASUBTYPE_IUYV
  • OUR_GUID_ENTRY(MEDIASUBTYPE_IUYV,
  • 0x56595549, 0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 0x00, 0x38, 0x9b, 0x71)
  • // 3231564E-0000-0010-8000-00AA00389B71 'NV12' == MEDIASUBTYPE_NV12
  • OUR_GUID_ENTRY(MEDIASUBTYPE_NV12,
  • 0x3231564E, 0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 0x00, 0x38, 0x9b, 0x71)
  • // 30323449-0000-0010-8000-00AA00389B71 'I420' == MEDIASUBTYPE_I420
  • OUR_GUID_ENTRY(MEDIASUBTYPE_I420,
  • 0x30323449, 0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 0x00, 0x38, 0x9b, 0x71)
  • // 56555949-0000-0010-8000-00AA00389B71 'IYUV' == MEDIASUBTYPE_IYUV
  • OUR_GUID_ENTRY(MEDIASUBTYPE_IYUV,
  • 0x56555949, 0x0000, 0x0010, 0x80, 0x00, 0x00, 0xaa, 0x00, 0x38, 0x9b, 0x71)

Please do leave feedback, and let us know if this has been useful. Thanks,

Lucia

All filters in a DirectShow graph should be synchronized to the same clock, the reference clock. The filter graph manager picks one component to be the reference clock, in the following simplified order: a user-specified clock, a renderer that provides a clock (usually the audio renderer), or the system clock if neither is available.

The stream time is based off the reference clock, but relative to the time the graph last started running (so the stream time doesn't move if the graph is paused). If a media sample that enters a renderer has a time stamp t, then it means it should be rendered at stream time t. This is the basic mechanism by which a/v synchronization occurs.
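Here is a small sketch of that relationship. It is a toy model rather than DirectShow code: StreamClock and TimeUntilDue are hypothetical names, and RefTime is just a 64-bit integer standing in for REFERENCE_TIME (100 ns units). The point is only that stream time advances with the reference clock while running and stands still while paused.

```cpp
#include <cassert>
#include <cstdint>

using RefTime = std::int64_t;  // 100 ns units, like REFERENCE_TIME

class StreamClock {
public:
    // The graph starts (or resumes) running at reference time `now`.
    void Run(RefTime now)
    {
        if (!running_) {
            start_ = now - elapsed_;  // resume where the last pause left off
            running_ = true;
        }
    }

    // The graph pauses at reference time `now`; stream time is frozen.
    void Pause(RefTime now)
    {
        if (running_) {
            elapsed_ = now - start_;
            running_ = false;
        }
    }

    // Stream time corresponding to reference time `now`.
    RefTime StreamTime(RefTime now) const
    {
        if (!running_)
            return elapsed_;
        assert(now >= start_);
        return now - start_;
    }

private:
    bool running_ = false;
    RefTime start_ = 0;
    RefTime elapsed_ = 0;
};

// How long a renderer still has to wait before a sample stamped
// `sampleStart` is due. A negative result means the sample is late.
inline RefTime TimeUntilDue(const StreamClock& clock, RefTime now, RefTime sampleStart)
{
    return sampleStart - clock.StreamTime(now);
}
```

A renderer that sees a negative TimeUntilDue has received the sample after its presentation time, which is exactly the "late" condition that comes up again in the frame dropping discussion.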

The audio hardware usually has its own crystal, though, and there is no guarantee that the hardware timer will match the system clock. That's why the audio renderer usually serves as the reference clock for the whole DirectShow graph. If the audio renderer receives a sample late, or if the audio clock consistently drifts from the system clock, the audio renderer issues stream time adjustments.

An audio renderer implementation will usually inherit from the CBaseReferenceClock class and call the SetTimeDelta() function whenever it needs to adjust the stream time. Note that it should apply a low-pass filter before sending adjustments to the master clock so that no unnecessary jitter is introduced.
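As a rough illustration of the low-pass idea, the sketch below smooths the measured clock error with a first-order filter and only reports an adjustment once the smoothed error crosses a threshold. DriftFilter is a hypothetical class of my own, the shift and threshold values are arbitrary tuning knobs, and in a real renderer the non-zero return value is what you would hand to SetTimeDelta().

```cpp
#include <cassert>
#include <cstdint>

using RefTime = std::int64_t;  // 100 ns units, like REFERENCE_TIME

class DriftFilter {
public:
    // shift: smoothing strength (each update moves 1/2^shift of the way).
    // threshold: smallest smoothed error worth correcting.
    DriftFilter(int shift, RefTime threshold)
        : shift_(shift), threshold_(threshold), smoothed_(0)
    {
        assert(shift >= 0 && shift < 16);
        assert(threshold > 0);
    }

    // Feed one raw error measurement (audio clock minus reference clock).
    // Returns the adjustment to apply, or 0 if no adjustment is needed yet.
    RefTime Update(RefTime rawError)
    {
        // First-order IIR low-pass: smoothed += (raw - smoothed) / 2^shift.
        smoothed_ += (rawError - smoothed_) / (RefTime(1) << shift_);

        const RefTime magnitude = smoothed_ < 0 ? -smoothed_ : smoothed_;
        if (magnitude < threshold_)
            return 0;

        const RefTime adjustment = smoothed_;
        smoothed_ = 0;  // the error has been handed over to the clock
        return adjustment;
    }

private:
    int shift_;
    RefTime threshold_;
    RefTime smoothed_;
};
```

With this kind of filter a single noisy measurement is absorbed without touching the clock, while a persistent drift builds up over a few updates and eventually produces one adjustment.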

The video renderer uses the incoming timestamps to schedule samples for presentation, and its scheduler is based on stream time. Since the audio renderer can adjust the stream time, the video and audio renderers end up on the same timeline.

About the Video Renderer & Frame Dropping

If the video is running slow and every video frame is rendered, then the video renderer will receive samples with timestamps in the past and schedule them for immediate rendering. If this situation persists, the video will lag behind the audio. This shows the need for frame dropping.

In fact, audio and video synchronization in DirectShow works by a combination of two elements:

  • Audio renderer controlling the DirectShow stream time;
  • IQualityControl and IDMOQualityControl interfaces guiding the frame dropping algorithm.

Dropping frames at the video renderer level is of course not very effective. If using overlay flipping surfaces, for instance, dropping a frame doesn't get you much farther trying to catch up (because the flipping itself is very cheap). Even in the case of Blits, it is still going to help very little (rendering time is small compared to decoding time). That's why there is the need to indicate the state and lateness of the renderer to upstream filters/DMOs, which is done through the quality notification messages.

The video renderer originates the notification messages (since it is the filter that needs to run in real time) and sends them upstream. If the upstream filter is a decoder that can act on the message, it does not pass the message further upstream. If it can't handle the message, it passes it on. Note that the video renderer will drop frames anyway if it is very late.
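To give an idea of what "handling" a quality message could look like inside a decoder, here is a deliberately simple sketch. QualityMsg is a stand-in whose fields mirror the DirectShow Quality structure (Type, Proportion, Late, TimeStamp), and FrameDropPolicy with its fixed lateness threshold is a hypothetical policy of my own, not what any shipping decoder does.

```cpp
#include <cassert>
#include <cstdint>

using RefTime = std::int64_t;  // 100 ns units, like REFERENCE_TIME

enum class QualityType { Famine, Flood };

// Stand-in for the Quality structure passed to Notify().
struct QualityMsg {
    QualityType type;
    long        proportion;  // 1000 means "keep the current rate"
    RefTime     late;        // how late the renderer was on the last sample
    RefTime     timeStamp;   // stream time of that sample
};

class FrameDropPolicy {
public:
    explicit FrameDropPolicy(RefTime lateThreshold)
        : lateThreshold_(lateThreshold), lastLate_(0)
    {
        assert(lateThreshold >= 0);
    }

    // Called whenever a quality message arrives from downstream.
    void OnQuality(const QualityMsg& q) { lastLate_ = q.late; }

    // Key frames are always decoded because later frames depend on them.
    // Other frames are skipped while the renderer reports being too late.
    bool ShouldDecode(bool isKeyFrame) const
    {
        return isKeyFrame || lastLate_ <= lateThreshold_;
    }

private:
    RefTime lateThreshold_;
    RefTime lastLate_;
};
```

Skipping the decode (rather than just the render) is what actually buys back time, which is the whole reason the lateness information is sent upstream in the first place.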

Here's a coarse example of how to use the Quality interface to be able to drop frames in a decoder filter:

HRESULT CDecoderFilterPin::Notify(IBaseFilter *pSender, Quality q)

{

       if (quality sink has been set)           // m_pQSink

       {

              status = Pass Notify on the quality sink   (base sender is the decoder filter now)

}

else

{

if (has frame dropping algorithm)

              {

                     status = Call decoder filter to do frame dropping

              }

              else

              {

                     if (upstreamQualityControl)

                     {

                           status = Pass Notify() on to upstream quality control interface (base sender is the decoder filter now)

                     }

                     else

                     {

                            status = not handled;

                   &