How (and Why) Are MP3 Files So Damn Small?

June 12th, 2014 by Justin Colletti

A technician works on one of Google's enormous data centers.

A technician works on one of Google’s gargantuan data centers.

The MP3 format was first released 21 years ago this Summer. It goes almost without saying that it, and the other compressed audio formats that followed, changed the music world forever.

Today, there are tens of millions of Americans who grew up with a music marketplace almost entirely different from the one that exists now. And there are tens of millions more who simply can’t remember a time before exhaustive libraries of music were everywhere, ready to roll at the click of a mouse.

Although data compressed audio has been around long enough to vote, drive, serve in the armed forces and buy a bottle of bourbon, there are few audio engineers, nevermind casual listeners, who can quite describe how these lossy formats work.

So many among us also remain puzzled as to why they’ve stayed so small in size in a world where bandwidths and processor speeds continue to become exponentially faster and more powerful. But there are answers, and they are not beyond understanding.

Just How Much Smaller Are They?

In a streaming world, it’s useful to talk about audio files in terms of their “bit rate.” This is a term we can use to describe how many bits of data a file spits out every second.

It’s pretty easy to calculate the bit rate of standard full-resolution “CD quality sound.” Just take the number of samples the file uses per second (44,100 of them) and multiply that by the number of bits used by each sample (16, in this case).

sponsored

Although it is definitely possible for trained listeners to hear the difference at some resolutions, studies show that most musicians and many audio nerds have trouble telling even today’s relatively low-res 128kbps MP3 codecs apart from standard resolution formats. Not too bad for a file 1/11th the size. (You can test yourself at http://mp3ornot.com. Although some of us can get it right 10 times out of 10, even we have to admit it’s a pretty subtle difference.)

When we increase the bit rate to 256kbps, most trained listeners can’t tell the mp3 apart from the source in a blind test. And at 320kbps? There is currently no evidence of even trained listeners telling an MP3 apart from any higher resolution file in a properly controlled listening test. (That includes super-duper-high-resolution files like 24/192 WAV. Not bad for a file 4x – 5x smaller than a standard resolution!)

Even at old and outmoded resolutions like 128 kbps, today’s MP3s are arguably higher in fidelity than AM/FM radio, cassette, vinyl, and essentially any other historical audio format. At higher bit rates, it’s no contest at all. We are now beyond the days of “convenience vs quality” when it comes to consumer audio formats.

How Do They Get So Small?

A technician walks the halls of a Google data center.

The process of compressing audio files is interesting in itself.

We can’t reasonably make an audio file smaller by just throwing data away will-nilly. If we were to try and just halve the size of a track that way — by say, throwing away 8 bits of information — almost anyone could hear the difference.

If we did this, we wouldn’t just lose half of our dynamic range — We’d end up cutting it exponentially, from 65,536 possible values way down to 256. This is something you’d almost definitely hear by the way of a dramatically increased noise floor. All that sacrifice, and we’re still not coming anywhere near the data savings of even the largest MP3s.

The key to effective lossy compression lies in something called “perceptual coding.” This is basically a fancy way of saying “exploiting all the ways in which our brains don’t work quite right.”

Much like a film camera works by exploiting a quirk of our perception, recording just 24 snapshots per second to create a fluid image and throwing the rest away, an MP3 works because our ear and brain are simply incapable of processing all the acoustic information around us. When we remove this information, we do not miss it, because it is information that we are not equipped to process by nature.

Just like with film and video frame rates, there are two questions to ask: What’s the minimum we can get away with and still make people happy, and what’s the maximum resolution that will confer some kind of advantage?

Below a certain point, we are not likely to create a satisfying aesthetic experience. Above a certain point, the human mind and body can simply not process any additional information, making any addition of resolution an exercise in senseless self-indulgence and misleading marketeering.

So, in order to design effective audio codec, engineers and programmers have to know what we can and can’t hear, and under what circumstances. The three most important quirks of our hearing in this context are Temporal Masking, Simultaneous Masking, and the Absolute Hearing Threshold.

To understand simultaneous masking, imagine standing next to a roaring jackhammer and dropping a pin to the ground. Do you think you’d be able to hear the pin drop? If you answered “no”, then congratulations: You are not lying, and/or not stupid.

Temporal masking is similar: If you set a firecracker off next to ear (please do not do this) and then dropped a pin immediately after, do you think you’d hear that pin? If you dropped it quickly enough, the answer would once again be a clear and certain “no”.

In both cases, it’s not that the pin doesn’t make a sound. It does, and with equipment sensitive enough, we could measure that sound.It’s just that our systems of perception are incapable of hearing it — much in the same way we are incapable of seeing 24 frames of film per second as a series of still images. With all confidence, we safely throw away the sound of that pin dropping without affecting anyone’s listening experience in the slightest.

One of the cool things about the masking effect is that since it is dependent on frequency, by testing enough people, we can discover what combinations of loudness and frequency reliably trigger it. As we figure this out, we can develop better and better audio encoders, surgically removing only those sounds that we know will not be missed. And this is exactly what engineers and programmers have been doing for decades.

To offer this kind of surgical precision, lossy codecs split the audio stream into frames, and then split those frames into hundreds of frequency bands, each of which are fed into FFT (“Fast Fourier Transform“) analyzers. Then, an algorithm measures the interplay between bands, and removes the data used to represent the parts of the sound wave that human beings just can’t hear.

Why Have They Stayed So Small?

Estimates are that server farms like this one, now use roughly 2% of the electricity in U.S.

In recent years, we’ve moved from telephone modems offering 28kbps and 56kbps, to cable modems capable of up to 30Mbps, to fiber lines that can provide speeds in excess of 500Mbps. This represents an increase of almost 18,000 times in maximum bandwidth available to consumers.

For the time being, the speed and power of computer technology may continue to grow at an exponential rate. But as it grows, so does the power consumption, the size of internet data centers, and the amount of users, traffic and storage on the web.

This helps explain why MP3s themselves have barely increased in size, moving from 96kbps and 128kbps a decade ago to just 256kbps and 320kbps today, while general bandwidth has increased by factors in the tens of thousands. Although it’s plausible that we could increase the size of streaming media files, it is not necessarily practical.

Whatever level the rest of the technology reaches, switching from bit rates around 256kbps to 320kbps to 1,441kbps or more for uncompressed audio would mean increasing the size of already enormous server farms by a factor of as much as 4.5 or 5.5 times.

Is it worth it to go through 5.5 times as many hard drives, 5.5 times as much electricity, 5.5 times as much real estate, 5.5 times as much bandwidth, 5.5 times as much battery power, all for a difference it seems no one can hear? (Let’s not even get started on 24/192 files which would require at least 29 times the resources for no appreciable difference.)

If you can find someone who wants to foot the bill, by all means. Until then, it’s unlikely the standard for streaming audio files on the web will get very much larger anytime soon.

Justin Colletti is an audio engineer, educator and journalist.

Please note: When you buy products through links on this page, we may earn an affiliate commission.

How (and Why) Are MP3 Files So Damn Small?

2 Comments on How (and Why) Are MP3 Files So Damn Small?

Joel Douek

Related posts:

2 Comments on How (and Why) Are MP3 Files So Damn Small?

Joel Douek