Buffer size alignment and the audio period

I got an email from someone today, paraphrased below:

Q: When I set the sampling frequency to 48 kHz, and ask Windows what the audio period is, I get exactly 10 milliseconds. When I set it to 44.1 kHz, I get very slightly over 10 milliseconds: 10.1587 milliseconds, to be precise. Why?
A: Alignment.

A while back I talked a bit about the WASAPI exclusive-mode alignment dance. Some audio drivers require that buffer sizes be a multiple of a certain byte count; for example, a common alignment restriction for HD Audio hardware is 128 bytes.

A more general audio requirement is that buffer sizes be a multiple of the size of a PCM audio frame.
For example, suppose the audio format of a stream is stereo 16-bit integer. A single PCM audio frame will be 2 * 2 = 4 bytes. The first two bytes will be the 16-bit signed integer with the sample value for the left channel; the last two bytes will be the right channel.
As another example, suppose the audio format of a stream is 5.1 32-bit floating point. A single PCM audio frame will be 6 * 4 = 24 bytes. Each of the six channels is a four-byte IEEE floating-point value; the channel order in Windows will be {Left, Right, Center, Low-Frequency Effects, Side Left, Side Right}.
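To make the frame-size arithmetic concrete, here's a quick sketch in plain C. The pcm_frame_bytes helper is just an illustrative name, not a Windows API; in a WAVEFORMATEX this quantity is what nBlockAlign holds.

#include <stdio.h>

/* Bytes in one PCM audio frame: one sample for each channel.
   (Illustrative helper; a WAVEFORMATEX carries this value as nBlockAlign.) */
static unsigned int pcm_frame_bytes(unsigned int channels, unsigned int bits_per_sample)
{
    return channels * (bits_per_sample / 8);
}

int main(void)
{
    /* stereo 16-bit integer: 2 channels * 2 bytes = 4 bytes per frame */
    printf("stereo 16-bit: %u bytes per frame\n", pcm_frame_bytes(2, 16));

    /* 5.1 32-bit float: 6 channels * 4 bytes = 24 bytes per frame */
    printf("5.1 32-bit float: %u bytes per frame\n", pcm_frame_bytes(6, 32));

    return 0;
}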

The audio engine tries to run at as close to a 10 millisecond cadence as possible, subject to the two restrictions above. Given a "desired minimum interval" (in milliseconds), a streaming format, and an "alignment requirement" (in bytes), you can calculate the closest achievable interval (without going under the desired interval) as follows:

Note: this only works for uncompressed formats
aligned_buffer(desired_milliseconds, format, alignment_bytes)
    desired_frames = nearest_integer(desired_milliseconds / 1000.0 * format.nSamplesPerSec)
    alignment_frames = least_common_multiple(alignment_bytes, format.nBlockAlign) / format.nBlockAlign
    actual_frames = ceiling(desired_frames / alignment_frames) * alignment_frames
    actual_milliseconds = actual_frames / format.nSamplesPerSec * 1000.0
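Here's the same calculation as a minimal runnable sketch in plain C. The function and parameter names are mine, for illustration; I'm passing the two format fields the formula actually uses (nSamplesPerSec and nBlockAlign) instead of a full WAVEFORMATEX.

#include <stdio.h>

/* greatest common divisor / least common multiple helpers */
static unsigned int gcd(unsigned int a, unsigned int b)
{
    while (b != 0) { unsigned int t = a % b; a = b; b = t; }
    return a;
}

static unsigned int lcm(unsigned int a, unsigned int b)
{
    return a / gcd(a, b) * b;
}

/* Smallest buffer size (in frames) that is at least the desired interval
   and satisfies both the frame-size and the byte-alignment restrictions. */
static unsigned int aligned_buffer_frames(
    double desired_milliseconds,
    unsigned int samples_per_sec, /* format.nSamplesPerSec */
    unsigned int block_align,     /* format.nBlockAlign: bytes per frame */
    unsigned int alignment_bytes)
{
    unsigned int desired_frames =
        (unsigned int)(desired_milliseconds / 1000.0 * samples_per_sec + 0.5);
    unsigned int alignment_frames = lcm(alignment_bytes, block_align) / block_align;
    /* round desired_frames up to the next multiple of alignment_frames */
    return (desired_frames + alignment_frames - 1) / alignment_frames * alignment_frames;
}

int main(void)
{
    /* 44.1 kHz stereo 16-bit integer (4 bytes per frame), 128-byte alignment */
    unsigned int frames = aligned_buffer_frames(10.0, 44100, 4, 128);
    printf("%u frames = %.4f milliseconds\n", frames, frames * 1000.0 / 44100);
    return 0;
}

With the 44.1 kHz stereo 16-bit case from the question, this prints 448 frames = 10.1587 milliseconds, matching the first row of the table below.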

Here's a table of the actual buffer size (in frames and milliseconds), given a few typical inputs:

Desired (ms) | Format                         | Alignment (bytes) | Desired (frames) | Alignment (frames) | Actual (frames) | Actual (ms)
10           | 44.1 kHz stereo 16-bit integer | 128               | 441              | 32                 | 448             | 10.16
10           | 48 kHz stereo 16-bit integer   | 128               | 480              | 32                 | 480             | 10
10           | 44.1 kHz 5.1 16-bit integer    | 128               | 441              | 32                 | 448             | 10.16
10           | 48 kHz 5.1 16-bit integer      | 128               | 480              | 32                 | 480             | 10
10           | 44.1 kHz 5.1 24-bit integer    | 128               | 441              | 64                 | 448             | 10.16
10           | 48 kHz 5.1 24-bit integer      | 128               | 480              | 64                 | 512             | 10.67

So, to be really precise about the number in the question at the top: at 44.1 kHz the buffer is 448 frames, which is 640/63 = 10.158730 milliseconds.
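If you want to check that fraction: 448 frames at 44.1 kHz is 448000/44100 milliseconds, and dividing numerator and denominator by their greatest common divisor (700) gives 640/63. Here's a tiny standalone check in plain C, nothing Windows-specific about it:

#include <stdio.h>

static unsigned int gcd(unsigned int a, unsigned int b)
{
    while (b != 0) { unsigned int t = a % b; a = b; b = t; }
    return a;
}

int main(void)
{
    unsigned int num = 448 * 1000; /* 448 frames, as a millisecond numerator */
    unsigned int den = 44100;      /* samples per second */
    unsigned int g = gcd(num, den);
    printf("%u/%u = %.6f milliseconds\n", num / g, den / g, (double)num / den);
    return 0;
}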

Comments
  • Hi,

    You say that "The audio engine tries to run at as close to a 10 millisecond cadence as possible". Is this because the user has requested a 10 millisecond period, or is it a general rule that the audio engine works best with a 10 millisecond period, and that 10 milliseconds should be used where possible?

  • Good question.

    There isn't really anything forcing a 10 ms cadence in particular. Some hardware has a natural cadence corresponding to its packet size (e.g., USB packets are 1 ms). Some audio software sources and sinks would like a higher or lower cadence, or even a variable cadence. And sometimes Windows likes to shut *everything* down for a while so it can do something like reprogram firmware in the network card.

    So 10ms is a good compromise solution. It's fast enough for voice communication but slow enough for the power guys.
