Larry Osterman's WebLog

Confessions of an Old Fogey
What IS audio on a PC anyway?

This may be well known, but maybe not (I didn’t understand it until I joined the Windows Audio team).

Just what is digital audio, anyway?  Well, at its core, all of digital audio is a “pop” sound made on the speaker.  When you get right down to it, that’s all it is.  A “sound” in digital audio is a voltage spike applied to a speaker jack, with a specific amplitude.  The amplitude determines how much the speaker diaphragm moves when the signal is received by the speaker.

That’s it, that’s all that digital audio is – it’s a “pop” noise.  The trick that makes it sound like Sondheim is that you make a LOT of pops every second – thousands and thousands of pops per second.  When you make the pops quickly enough, your ear puts the pops together to turn them into a continuous sound.  You can hear a simple example of this effect when you walk near a high voltage power transformer.  AC power in the US runs at 60 cycles per second, and as the transformer works, it emits a noise on each cycle.  The brain smears that 60 Hz sound together and turns it into the “hum” that you hear near power equipment.

Another way of thinking about this (thanks Frank) is to consider the speaker on your home stereo.  As you’re listening to music, if you pull the cover off the speaker, you can see the cone move in and out with the music.  Well, if you were to take a ruler and measure the displacement of the cone from 0, the distance that it moves from the origin is the volume of the pop.  Now start measuring really fast – thousands of times a second.  Your collected measurements make up an encoded representation of the sound you just heard.

To play back the audio, take your measurements, and move the cone the same amount, and it will reproduce the original sound.

Since a picture is worth a thousand words, Simon Cooke was gracious enough to draw the following...

Take an audio signal, say a sine wave:

Then, you sample the sine wave (in this case, 16 samples per cycle):

Each of the bars under the sine wave is a sample.  When you play back the samples, the speaker will reproduce the original sound.  One thing to keep in mind (as Simon commented) is that the output waveform doesn't look quite like the stepped function that the samples would generate.  Instead, after the Digital-to-Analog Converter (DAC) in the sound card, there's a low pass filter that smooths the output signal.
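To make the sampling step concrete, here is a minimal sketch in Python (not part of the original post).  The function name `sample_sine` and the 1 kHz / 16 kHz numbers are assumptions chosen only so that each cycle gets exactly 16 samples, as in Simon's picture.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, num_samples, amplitude=1.0):
    """Measure a sine wave num_samples times at a fixed sample rate."""
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
        for n in range(num_samples)
    ]

# Illustrative numbers: a 1 kHz tone sampled at 16 kHz gives 16 samples per cycle.
samples = sample_sine(freq_hz=1000, sample_rate_hz=16000, num_samples=16)
for n, value in enumerate(samples):
    print(f"sample {n:2d}: {value:+.3f}")
```

Each printed value corresponds to the height of one bar under the sine wave.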

When you take an analog audio signal and encode it in this format, it’s known as “Pulse Code Modulation”, or “PCM”.  Ultimately, all PC audio comes out in PCM; that’s typically what’s sent to the sound card when you’re playing back audio.

When an analog signal is captured (in a recording studio, for example), the volume of the signal is sampled at some frequency (typically 44.1 kHz for CD audio).  Each of the samples is captured with a particular range of amplitudes (or quantization).  For CD audio, the quantization is 16 bits per sample, in two channels.  This means that each sample has one of at most 65,536 values, which is typically enough for most audio applications.  Since CD audio is stereo, there are two 16 bit values (left and right) for each sampling instant.
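Here is a rough sketch of what 16 bit stereo quantization could look like in Python.  The helper names (`quantize_16bit`, `interleave_stereo`), the 440 Hz test tone, and the choice of little-endian left/right interleaving are my assumptions for illustration; the post itself doesn't specify a byte layout.

```python
import math
import struct

MAX_INT16 = 32767  # largest positive value of a signed 16-bit sample

def quantize_16bit(value):
    """Map a float in [-1.0, 1.0] to a signed 16-bit integer, clipping anything outside."""
    clipped = max(-1.0, min(1.0, value))
    return int(round(clipped * MAX_INT16))

def interleave_stereo(left, right):
    """Pack left/right samples as little-endian 16-bit frames (L, R, L, R, ...)."""
    frames = bytearray()
    for l, r in zip(left, right):
        frames += struct.pack("<hh", quantize_16bit(l), quantize_16bit(r))
    return bytes(frames)

rate = 44100
# One second of a 440 Hz tone (an arbitrary test signal), same on both channels.
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
pcm = interleave_stereo(tone, tone)
print(len(pcm))  # 44100 frames * 2 channels * 2 bytes = 176400
```

Scaling by 32767 keeps the range symmetric, so this sketch uses 65,535 of the 65,536 available values.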

Other devices, like telephones, typically use 8 bit samples and acquire their samples at 8 kHz – that’s why the sound quality on telephone communications is so poor (btw, telephones don’t actually use direct 8 bit samples; instead, their data stream is compressed using a format called mu-law (or a-law in Europe), standardized as G.711).  On the other hand, the bandwidth used by typical telephone communication is significantly lower than CD audio – CD audio’s bandwidth is 44,100*16*2=1,411,200 bits/second (about 1.35Mb/second if you count 1,048,576 bits per Mb), or roughly 176KB/second.  The bandwidth of a telephone conversation is 8,000*8=64Kb/second, or 8KB/second (reduced to between 3.2Kb/s and 11Kb/s with compression), an order of magnitude lower.  When you’re dealing with low bandwidth networks like the analog phone network or wireless networks, this reduction in bandwidth is critical.
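The bandwidth arithmetic is easy to check.  This is just a sketch of the multiplication above; the function name `pcm_bit_rate` is made up for the example, and the telephone line is treated as a single 8 bit channel.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM bandwidth in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

cd_bits = pcm_bit_rate(44100, 16, 2)    # CD audio: 44.1 kHz, 16 bit, stereo
phone_bits = pcm_bit_rate(8000, 8, 1)   # telephone: 8 kHz, 8 bit, mono

print(cd_bits, cd_bits // 8)        # 1411200 bits/s, 176400 bytes/s
print(phone_bits, phone_bits // 8)  # 64000 bits/s, 8000 bytes/s
print(cd_bits / phone_bits)         # ratio of CD to telephone bandwidth
```

The ratio comes out a little over 20, which is where "an order of magnitude lower" comes from.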

It’s also possible to sample at higher frequencies and higher sample sizes.  Some common sample sizes are 20 bits/sample and 24 bits/sample.  I’ve also seen 96 kHz sample frequencies and sometimes even higher.

When you’re ripping your CDs, on the other hand, it’s pointless to rip them at anything other than 44.1 kHz, 16 bit stereo; there’s nothing you can do to improve the resolution.  There ARE other forms of audio that have a higher bit rate; for example, DVD-Audio allows samples at 44.1, 48, 88.2, 96, 176.4 or 192 kHz, and sample sizes of 16, 20, or 24 bits/sample, with up to 6 channels at 96 kHz or 2 channels at 192 kHz.

One thing to realize about PCM audio is that it’s extraordinarily sparse – there is a huge amount of compression that can be done to the data to reduce the size of the audio data.  But in most cases, when the data finally hits your sound card, it’s represented as PCM data (this isn’t always the case; for example, if you’re using the S/PDIF connector on your sound card, then the data sent to the card isn’t necessarily PCM).
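One simple way to see the redundancy in PCM data is to store the difference between neighboring samples instead of the samples themselves.  The sketch below is only an illustration under my own assumptions (a pure 220 Hz tone at 44.1 kHz, and hypothetical helpers `delta_encode` / `delta_decode`); it is lossless and far simpler than what real audio codecs do.

```python
import math

def delta_encode(samples):
    """Keep the first sample, then store each sample's difference from its predecessor."""
    if not samples:
        return []
    return [samples[0]] + [b - a for a, b in zip(samples, samples[1:])]

def delta_decode(deltas):
    """Rebuild the original samples with a running sum over the deltas."""
    samples, total = [], 0
    for d in deltas:
        total += d
        samples.append(total)
    return samples

rate = 44100
# One second of a full-scale 220 Hz tone as 16-bit integers (an assumed test signal).
tone = [round(32767 * math.sin(2 * math.pi * 220 * n / rate)) for n in range(rate)]
deltas = delta_encode(tone)

assert delta_decode(deltas) == tone  # the round trip loses nothing
print("largest sample:", max(abs(s) for s in tone))
print("largest delta: ", max(abs(d) for d in deltas[1:]))
```

For a low-pitched tone like this the differences are much smaller than the samples, so they need fewer bits to store; a signal full of high-frequency content would not shrink nearly as well.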

Edit: Corrected math slightly.

Edit: Added a couple of pictures (Thanks Simon!)

Edit3: Not high pass, low pass filter, thanks Stefan.

  • A couple of quick notes...

    I was under the impression that phones used 8khz 8 bit data, which would be closer to 7.8kb/s, not 11kb /s.

    Also, the "pops" and "brain smearing" arguments are not comet at all; there is hardware filtering after the DAC which bandwidth limits the signal and forces it to be a smooth continuous waveform. The speaker cone itself also has mass, and the inertia and momentum of the speaker have a very similar effect on the signal, causing it to roll off the higher frequencies.

    So basically, no pops. Sure, you can create pop 'sounds' but only at the limiting frequency of the output of the DAC. The sound of the transformer really is a 60 Hz sound - the transformer coils resonate at the AC line frequency + distortion - no brain-driven signal integration required.

  • comet = correct. Damn TabletPC!
  • Simon, you're right, the analog filtering after the DAC smears them out, but from a conceptual standpoint, that was the easiest way of explaining the idea that these are discrete samples.

    And 60Hz is above the lower threshold of human hearing, which is why you can hear the transformer. I chose it because it was the most common example of a low frequency sound I could come up with.
  • Oh, and you're right, phone is 8kHz, not 11kHz. That's why I indicated that it was for devices LIKE telephones.
  • Your units are wrong on your math. Your "44,1000" also has one too many 0s. The correct formula is:

    44,100 (1/s) * 16 (b) * 2 = 1411200 b/s = 1378.125 Kb/s = 1.35 Mb/s.

    That's in bits. In bytes, you get:

    (44,100 (1/s) * 16 (b) * 2) * 1/8 (B/b) = 176400 B/s = 172.27 KB/s = 0.17 MB/s

    Thus, your numbers are correct but your units (bytes vs. bits) are wrong, assuming the standard notation of 'b' = bits and 'B' = bytes. I also used 1024 B/KB (b/Kb, KB/MB, Kb/Mb) rather than 1000.
  • Strangely, given the topic, this is one case where a picture really is worth a thousand words.
  • oh, and re: the telephone stuff, I was referring to this line, Larry.

    "The bandwidth of a telephone conversation is 88kB/second, or 11kb/second, an order of magnitude lower."
  • I'd like to do a picture, but I'm not good enough in Mathematica to do it justice, and I'm not going to steal someone else's work.
  • Now another question along the same lines: in Media Player, the visualizations when listening to music, like the graphical equalizer and such, do those pretty much tap into the same signal to draw on the screen? I would assume so, but I'm not sure; it's one of those things I just sat back and enjoyed, and was curious about but never dug into.

    But I am assuming those pulses for the speaker are the same pulses that you see on the screen.
  • Yup, visualizations operate on the samples being sent to the sound card. Internally they're implemented as dshow filters that render their samples to the screen instead of performing some kind of transform on them.
  • Might want to clarify that you're talking about cellphones, not landline phones. The yunguns might be confused.

    You might also clarify that you can transfer PCM over S/PDIF, but that isn't the only data format available. Further, in the PCM case, the PCM data might be encoded for SPDIF on the card itself, not on the host PC. :)
  • Larry - send me an email, and let me know what you need. I should be able to knock something out pretty quickly.
  • > One thing to realize about PCM audio is that it’s extraordinarily sparse – there is a huge amount of compression that can be done

    Well, sure -- but it is important to note that in many compression schemes, it's lossy compression.

    Compression works well when there are large ranges with small variations and patterns in the data, both of which are true of SOME audio. Compression works great on "bassy" music because a 220 Hz note has a profile that looks like a sine wave with two hundred samples per wavelength.

    There are therefore two ways you could compress this. You could take advantage of the small change between any two samples and just store the differences, which take up fewer bits than the values. Or, you could take advantage of the fact that a 220 Hz sine wave is computable if you know its duration, initial value, and frequency -- just store them!

    Lossy compression schemes do exactly that -- they take the Fourier transform of the signal to determine what combination of sine waves make up a particular sample, and then just store information about those sine waves.

    The question then is "what about the high-frequency stuff?" The human threshold for hearing is ~20KHz. Any signal sampled at 40KHz that contains loud sounds in the KHz range is going to have large variations between samples, and the details of the overall wave shape are likely to change rapidly.

    That's what makes lossy compression schemes lossy. They just throw away the high frequency information because it's too hard to compress!

    That's why when you listen to overly compressed audio, things like applause and cymbal crashes sound awful. Applause and cymbal crashes tend to have lots of random, hard-to-compress, high amplitude, high frequency signal in them.
  • Actually, modern audio compression algorithms can do much more than simply cut off the higher frequencies -- encoders like MP3 have a psycho-acoustic model of the way we hear things to improve compression even further. For example, a strong tone will often "mask" a weak tone which is close to it in frequency, so the second one can often be thrown out as part of the lossy compression.
  • I've read that a sample rate higher than 2x the highest sampled frequency is unnecessary according to Nyquist's theorem. If this is so, and human hearing tops out at around 20KHz (or let's say 30KHz for the super humans), then sample rates higher than 40-60KHz are just taking up more space on our storage media for no real gain.

    Some say that there are still effects from those inaudible frequencies on those in the audible range. If that is so then recording microphones (which also top out at around 20KHz) would still record the effects of those frequencies. Any other improvement in perceived sound quality is attributed to the quality of the equipment in the recording and/or playback chain (or is psychological).

    Search for Dan Lavry on usenet or elsewhere for a more in depth discussion/argument of this.