Larry Osterman's WebLog

Confessions of an Old Fogey

What IS audio on a PC anyway?


This may be well known, but maybe not (I didn’t understand it until I joined the Windows Audio team).

Just what is digital audio, anyway?  Well, at its core, all of digital audio is a “pop” sound made on the speaker.  When you get right down to it, that’s all it is.  A “sound” in digital audio is a voltage spike applied to a speaker jack, with a specific amplitude.  The amplitude determines how much the speaker diaphragm moves when the signal is received by the speaker.

That’s it, that’s all that digital audio is – it’s a “pop” noise.  The trick that makes it sound like Sondheim is that you make a LOT of pops every second – thousands and thousands of pops per second.  When you make the pops quickly enough, your ear puts the pops together and turns them into a continuous sound.  You can hear a simple example of this effect when you walk near a high voltage power transformer.  AC power in the US runs at 60 cycles per second, and as the transformer works, it emits a noise on each cycle.  The brain smears that 60 Hz sound together and turns it into the “hum” that you hear near power equipment.

Another way of thinking about this (thanks Frank) is to consider the speaker on your home stereo.  As you’re listening to music, if you pull the cover off the speaker, you can see the cone move in and out with the music.  Well, if you were to take a ruler and measure the displacement of the cone from 0, the distance that it moves from the origin is the volume of the pop.  Now start measuring really fast – thousands of times a second.  Your collected measurements make up an encoded representation of the sound you just heard.

To play back the audio, take your measurements, and move the cone the same amount, and it will reproduce the original sound.

Since a picture is worth a thousand words, Simon Cooke was gracious enough to draw the following...

Take an audio signal, say a sine wave:

Then, you sample the sine wave (in this case, 16 samples per cycle):

Each of the bars under the sine wave is a sample.  When you play back the samples, the speaker will reproduce the original sound.  One thing to keep in mind (as Simon commented) is that the output waveform doesn't look quite like the stepped function that the samples would generate.  Instead, after the Digital-to-Analog Converter (DAC) in the sound card, there's a low pass filter that smooths the output signal.
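As a rough illustration of the sampling step described above, here is a minimal Python sketch. The function names (`sample_sine`, `hold`) and the repeat count are made up for this example, and the "hold" step only imitates the stepped shape a DAC would produce before the low pass filter; it does not model the filter itself.

```python
import math

def sample_sine(samples_per_cycle=16, cycles=1, amplitude=1.0):
    """Measure a sine wave at evenly spaced instants (hypothetical helper)."""
    total = samples_per_cycle * cycles
    return [amplitude * math.sin(2 * math.pi * n / samples_per_cycle)
            for n in range(total)]

def hold(samples, repeat=4):
    """Repeat each sample to form the stepped (staircase) waveform.

    This is the unsmoothed shape; a real sound card follows it with a
    low pass filter, which is not modeled here.
    """
    return [s for s in samples for _ in range(repeat)]

samples = sample_sine()      # 16 measurements covering one cycle
stepped = hold(samples)      # each measurement held for 4 steps

for n, s in enumerate(samples):
    print(f"sample {n:2d}: {s:+.3f}")
```

Each printed value plays the role of one bar under the sine wave: the height of the signal at that instant.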

When you take an analog audio signal and encode it in this format, it’s also known as “Pulse Code Modulation”, or “PCM”.  Ultimately, all PC audio comes out in PCM; that’s typically what’s sent to the sound card when you’re playing back audio.

When an analog signal is captured (in a recording studio, for example), the volume of the signal is sampled at some frequency (typically 44.1 kHz for CD audio).  Each of the samples is captured with a particular range of amplitudes (or quantization).  For CD audio, the quantization is 16 bits, in two channels.  This means that each sample has one of at most 65,536 values, which is enough for most audio applications.  Since CD audio is stereo, there are two 16 bit values for each sample.
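To make the quantization step concrete, here is a small Python sketch. It assumes samples arrive as floats in the range -1.0 to 1.0 and are stored as signed, little-endian 16 bit integers with left and right interleaved; those conventions, the scale factor of 32767, and the helper names are assumptions for illustration, not something the text above specifies.

```python
import struct

LEVELS = 2 ** 16   # 65,536 possible values for a 16 bit sample

def quantize_16bit(x):
    """Map a float in [-1.0, 1.0] to a signed 16 bit integer (assumed scaling)."""
    x = max(-1.0, min(1.0, x))       # clip anything out of range
    return int(round(x * 32767))

def pack_stereo_frame(left, right):
    """Pack one left/right pair as two little-endian signed 16 bit values."""
    return struct.pack("<hh", quantize_16bit(left), quantize_16bit(right))

frame = pack_stereo_frame(1.0, -1.0)
print(LEVELS)        # number of distinct sample values
print(len(frame))    # bytes used by one stereo frame
```

One stereo frame takes 4 bytes: two channels at 2 bytes each.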

Other devices, like telephones, on the other hand, typically use 8 bit samples, and acquire their samples at 8kHz – that’s why the sound quality on telephone communications is so poor (btw, telephones don’t actually use direct 8 bit samples; instead their data stream is compressed using a format called mu-law (or a-law in Europe), both of which are defined by G.711).  On the other hand, the bandwidth used by typical telephone communication is significantly lower than CD audio – CD audio’s bandwidth is 44,100*16*2=1,411,200 bits/second (about 1.4Mb/second), or 176KB/second.  The bandwidth of a telephone conversation is 64Kb/second, or 8KB/second (reduced to somewhere between 3.2Kb/s and 11Kb/s with compression), more than an order of magnitude lower.  When you’re dealing with low bandwidth networks like the analog phone network or wireless networks, this reduction in bandwidth is critical.
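The bandwidth arithmetic can be checked with a few lines of Python. The helper name `pcm_bit_rate` is invented for this sketch; the figures are the ones quoted above. CD audio works out to 1,411,200 bits per second, which is about 1.41 Mb/s in decimal units, or about 1.35 Mb/s if a megabit is counted as 2^20 bits.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM data rate in bits per second (hypothetical helper)."""
    return sample_rate_hz * bits_per_sample * channels

cd_bits = pcm_bit_rate(44_100, 16, 2)     # CD audio: 16 bit stereo at 44.1 kHz
phone_bits = pcm_bit_rate(8_000, 8, 1)    # telephone: 8 bit mono at 8 kHz

print(cd_bits, "bits/s =", cd_bits // 8, "bytes/s")
print(phone_bits, "bits/s =", phone_bits // 8, "bytes/s")
print("ratio:", cd_bits / phone_bits)
```

The ratio comes out at roughly 22 to 1 before any further compression of the telephone stream.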

It’s also possible to sample at higher frequencies and higher sample sizes.  Some common sample sizes are 20bits/sample and 24bits/sample.  I’ve also seen 96 kHz sample frequencies and sometimes even higher.

When you’re ripping your CDs, on the other hand, it’s pointless to rip them at anything other than 44.1 kHz, 16 bit stereo; there’s nothing you can do to improve the resolution.  There ARE other forms of audio that have a higher bit rate: for example, DVD-Audio allows samples at 44.1, 48, 88.2, 96, 176.4 or 192 kHz, and sample sizes of 16, 20, or 24 bits/sample, with up to 6 channels of 96 kHz audio or 2 channels of 192 kHz audio.

One thing to realize about PCM audio is that it’s extraordinarily sparse – there is a huge amount of compression that can be done to the data to reduce the size of the audio data.  But in most cases, when the data finally hits your sound card, it’s represented as PCM data (this isn’t always the case, for example, if you’re using the SPDIF connector on your sound card, then the data sent to the card isn’t PCM).

Edit: Corrected math slightly.

Edit: Added a couple of pictures (Thanks Simon!)

Edit3: Not high pass, low pass filter, thanks Stefan.

  • Chris - there is a subtlety here.

    The Nyquist limit details how many samples one must take to accurately reproduce a signal of a given frequency without aliasing.

    If a sound was generated by a single stationary point source in an infinitely absorbing room (i.e. no echo), then Nyquist will tell you everything you need to know to reproduce that sound.

    However, when you start positioning sounds in space, higher frequencies become important. While the human ear can only hear frequencies up to about 22.5kHz (on average - some people can hear more, some less), it can discriminate between the arrival times of sounds at much higher resolution - on the order of what would be a frequency of 100,000Hz. That is, if the same sound wave arrives 10 microseconds apart, at one ear first and then the other, the listener can tell the difference, and interprets this as spatial separation of the sounds.

    A lot of positioning information is encoded in the higher frequency domain. So while Nyquist is strictly correct for a given signal, it's a very much idealised form when you're dealing with stereo positional audio.
  • Wow.

    Thanks for that.
  • 10/26/2004 2:31 PM Eric Lippert

    > That's why when you listen to overly
    > compressed audio, things like applause
    > and symbol crashes sound awful.

    Ann if ewe overtly comprise dictionaries four spilling chequers, sings like cymbal clashes look awe full? @ leased they sound like symbol crashes.
  • The other factor is the dynamic range (loudness) of human hearing. We can hear over a range of 120dB although sustained levels of over 90dB can damage our hearing. CD audio can achieve about 96dB. AC-3 and DVD-Audio formats achieve more than this.
  • Simon: "about 22.5kHz (on average - some people can hear more, some less)"

    Are you sure about that? I've always thought that it is significantly lower (somewhere around 16 kHz). In fact, if this would be true, it would mean that some people would be able to actually hear that there is missing some high frequency signal on a CD audio.
  • >>"there's a high pass filter that smooths the output of the signal."

    Just one comment: It's not a high pass, it is a *low* pass filter that is used to suppress the frequency portions that are "mirrored" into the signal from the next band (overlaid by a sinx/x curve), assuming it is not an ideally bandlimited signal that you are sampling. The low pass should let pass all frequency portions from DC to the highest frequency (i.e. 20kHz) and should then have a very steep curve to suppress everything at or above half the sample frequency (i.e. 22.05kHz). This is also where the quality of the analog circuitry comes into play: You can create very steeply curved low pass filters with only a few RC elements (e.g. Chebyshev filters) but those have a non-constant group delay (some people can actually hear this), which means that e.g. the lower frequencies arrive later at the listener's ear than the higher frequencies or the other way round. If you want steep filters but constant group delay you need more RC elements in the analog filter and thus more complex and expensive analog filters (e.g. Bessel filters).
  • When I was taking engineering at university one of my profs mentioned that transformers hum at twice the input frequency. So transformers in North America actually hum at 120Hz, not 60Hz.

    Here's a link that explains why:

  • There is a different method of raw audio encoding than PCM which is used by Super Audio CD; a 1 bit digital stream known as DSD.

    "The DSD technology uses a sampling frequency of 2.8224 MHz, which is 64 times higher than that of CD. This enables a frequency response up to 100 kHz and a dynamic range of 120 dB across the entire audible range."

  • Petr - sorry, I meant 22.05kHz :) I missed a decimal.

    Nyquist states that you must sample at at least twice the frequency of the signal you want to reproduce (with the caveats I previously mentioned re: spatial positioning).

    CD audio was chosen to record at 44.1kHz because most people top-out their hearing at the high end at 22.05kHz (which, by Nyquist, is sampled at 44.1kHz).

    DAT tapes record at 48kHz because they didn't want them to be compatible with CD audio on a direct binary level - the idea being to reduce piracy. In actuality, all that really happened because of it was tape went the way of the dodo.
  • DAT is still used in the music industry though.
  • Times like this make me wish I had your brain. I don't really want to go work for the Windows Audio team, but it's almost as if I don't really have to if I could just get exclusive access to just SOME of your brain. If there was just some way of harnessing it safely and cheaply.

    I guess I'll just have to settle for more posts like this. I used to think some of this stuff was over my head, but it really isn't if it's given in such a clear explanation. One can get lost in the techno-babble of the audio world quite easily, but somehow I understood everything that was said.

    Thanks again.
  • I remember being told that one of the other problems with the sampling rates is that before encoding, the source signal must be low pass filtered to prevent frequencies above the Nyquist limit sneaking through and causing aliasing.

    Because this filtering cannot be ideal (because of the group delays mentioned above) you lose some of the higher frequencies below the Nyquist limit. If you increase the sample rate, you can move the filter cutoff higher and you save more of the perceptible audio.
  • Interesting article and comments.

    48KHz was already established as a standard for digital audio when CD Audio was being defined. Steward seems to state the motive for 48KHz. The 44.1KHz standard supposedly comes from making the physical CD fit in a Japanese-size car stereo slot and still hold a certain amount of music.

    I remember in the late 80's PC games were attempting to use the original PC speaker to output digitized sound. It was worse than phone audio, and compute intensive, and 24-bit audio cards in a retail box cost $30 so it's not much use.

  • 10/27/2004 5:25 AM Petr Kadlec

    >> Simon: "about 22.5kHz (on average - some
    >> people can hear more, some less)"

    [or 22.05 kHz, it doesn't matter much to this]

    > I've always thought that it is significantly
    > lower (somewhere around 16 kHz).

    The average varies with age and gender. For adult men it might well be 16 kHz. But the overall average is still the overall average.

    > In fact, if this would be true, it would
    > mean that some people would be able to
    > actually hear that there is missing some
    > high frequency signal on a CD audio.

    That's exactly true. No matter what the average is, that means some people are below it and some people are above it. There always have been people who are put off by the missing high frequencies on CD audio -- and by inconsistent phase shifts, and by the white noise or pink noise added by ordinary electronic amplifiers.

    10/28/2004 10:59 AM ATZ Man

    > The 44.1KHz standard supposedly comes from
    > making the physical CD fit in a Japanese-
    > size car stereo slot and still hold a
    > certain amount of music.

    Huh? Japanese car stereo slots are the same size as import car stereo slots. Of course there aren't a lot of imports, and most of those are from Germany not from the NAFTA zone, but they all take the same kinds of stereos.

    There are other kinds of optical disks such as MDs, but those didn't exist when CDs were first developed, and they sure didn't figure in setting frequencies. I've read that MDs use a lossy compression algorithm. They were popular for a while because of their portability (e.g. can be worn while riding a train) but that market has moved to flash memory.