There is a cool service called Shazam, which take a short sample of music, and identifies the song. There are couple ways to use it, but one of the more convenient is to install their free app onto an iPhone. Just hit the “tag now” button, hold the phone’s mic up to a speaker, and it will usually identify the song and provide artist information, as well as a link to purchase the album.
What is so remarkable about the service, is that it works on very obscure songs and will do so even with extraneous background noise. I’ve gotten it to work sitting down in a crowded coffee shop and pizzeria.
So I was curious how it worked, and luckily there is a paper written by one of the developers explaining just that. Of course they leave out some of the details, but the basic idea is exactly what you would expect: it relies on fingerprinting music based on the spectrogram.
Here are the basic steps:
1. Before hand, Shazam fingerprints a comprehensive catalog of music, and stores the fingerprints in a database.
2. A user “tags” a song they hear, which fingerprints a 10 second sample of audio.
3. The Shazam app uploads the fingerprint to Shazam’s service, which runs a search for a matching fingerprint in their database.
4. If a match is found, the song info is returned to the user, otherwise an error is returned.
Here’s how the fingerprinting works:
You can think of any piece of music as a time-frequency graph called a spectrogram.. On one axis is time, on another is frequency, and on the 3rd is intensity. Each point on the graph represents the intensity of a given frequency at a specific point in time. Assuming time is on the x-axis and frequency is on the y-axis, a horizontal line would represent a continuous pure tone and a vertical line would represent an instantaneous burst of white noise. Here’s one example of how a song might look:

Spectrogram of a song sample with peak intensities marked in red. Wang, Avery Li-Chun. An Industrial-Strength Audio Search Algorithm. Shazam Entertainment, 2003. Fig. 1A,B.
The Shazam algorithm fingerprints a song by generating this 3d graph, and identifying frequencies of “peak intensity.” For each of these peak points it keeps track of the frequency and the amount of time from the beginning of the track. Based on the paper’s examples, I’m guessing they find about 3 of these points per second. [Update: A commenter below notes that in his own implementation he needed more like 30 points/sec.] So an example of a fingerprint for a 10 seconds sample might be:
| Frequency in Hz | Time in seconds |
| 823.44 | 1.054 |
| 1892.31 | 1.321 |
| 712.84 | 1.703 |
| . . . | . . . |
| 819.71 | 9.943 |
Shazam builds their fingerprint catalog out as a hash table, where the key is the frequency. When Shazam receives a fingerprint like the one above, it uses the first key (in this case 823.44), and it searches for all matching songs. Their hash table might look like the following:
| Frequency in Hz | Time in seconds, song information |
| 823.43 | 53.352, “Song A” by Artist 1 |
| 823.44 | 34.678, “Song B” by Artist 2 |
| 823.45 | 108.65, “Song C’ by Artist 3 |
| . . . | . . . |
| 1892.31 | 34.945, “Song B” by Artist 2 |
[Some extra detail: They do not just mark a single point in the spectrogram, rather they mark a pair of points: the "peak intensity" plus a second "anchor point". So their key is not just a single frequency, it is a hash of the frequencies of both points. This leads to less hash collisions which in turn speeds up catalog searching by several orders of magnitude by allowing them to take greater advantage of the table's constant (O(1)) look-up time. There's many interesting things to say about hashing, but I'm not going to go into them here, so just read around the links in this paragraph if you're interested.]
If a specific song is hit multiple times (based on examples in the paper I think it needs about 1 frequency hit per second), it then checks to see if these frequencies correspond in time. They actually have a clever way of doing this They create a 2d plot of frequency hits, on one axis is the time from the beginning of the track those frequencies appear in the song, on the other axis is the time those frequencies appear in the sample. If there is a temporal relation between the sets of points, then the points will align along a diaganol! They use another signal processing method to find this line, and if it exists with some certainty, then they label the song a match.

Top graph: Songs and sample have many frequency matches, but they do not align in time, so there is no match. Bottom Graph: frequency matches occur at the same time, so the song and sample are a match. Wang, Avery Li-Chun. An Industrial-Strength Audio Search Algorithm. Shazam Entertainment, 2003. Fig. 2B.
Tags: Avery Wang, Computer Science, data structures, hash, iPhone, search, shazam, spectrogram
February 20, 2009 at 5:23 pm |
Very very well explained !
Do you actually know any code (audacity may be ?) to obtain a time-frequency graph from a common audio file (wav / mp3 … any other) ??
February 20, 2009 at 6:03 pm |
Thanks for the kind words!
Yes, Audacity does provide spectrograms. From the audacity 1.2 reference: “You can view any audio track as a Spectrogram instead of a Waveform by selecting one of the Spectral views from the Track Pop-Down Menu.”
Now finding the code to do this within their source might be tricky, so I would recommend another project. The key to this code is in the Fourier Transform, which is usually done by the FFT algorithm ( http://en.wikipedia.org/wiki/Fast_Fourier_transform ).
Whenever I’m looking for source code for something like this I usually try http://www.sourceforge.net. Just try a search for FFT and you should find some good stuff.
February 24, 2009 at 3:02 am |
Thanks for the nice description – sure makes the paper more readable…
Do you happen know how the “peak intensity” points are found?
I understand they have a limit on the number of constellation points based on some requirements.
However, I didn’t understand how to select the “right” candidates from all the local maximum points in the spectrogram (how do I choose a constellation point? how do I choose an anchor point?)
February 24, 2009 at 6:13 pm |
Ralphy, they do not go into detail in what exactly it means to be a peak point. but they give some clues. Here’s the relevant paragraph from section 2.1 of the paper:
“The peaks are chosen using a criterion to ensure that the density of chosen local peaks is within certain desired bounds so that the time-frequency strip for the audio file has reasonably uniform coverage. The peaks in each time-frequency locality are also chosen according amplitude, with the justification that the highest amplitude peaks are most likely to survive the distortions listed above.”
So I’m guessing they might do something such as grabbing a small area of the spectrogram and picking the highest 1 or 2 amplitude points, then continuing to the next section. This heuristic would meet their requirement of a uniform distribution while finding local amplitude maximas.
As far as the anchor points go, I got the impression that it really didn’t matter, as long as it would be a unique and reproducible pairing of points. I believe it was more for hashing performance benefits than for anything else.
February 25, 2009 at 12:41 pm |
Actually that idea won’t work because there is no way way to correlate sections on the time dimension for different samples. Any other ideas?
September 27, 2009 at 10:49 pm |
How do the frequency, and peak points match at all, when the frequencies vary so greatly based on ambience, and sound quality? The low frequencies must definitely be lost.
October 20, 2009 at 11:11 am |
[...] OK, but how does Shazam make these fingerprints? As Avery Wang, Shazam’s chief scientist and one of its co-founders, explained to Scientific American in 2003, the company’s approach was long considered computationally impractical—there was thought to be too much information in a song to compile a simple signature. But as he wrestled with the problem, Wang had a brilliant idea: What if he ignored nearly everything in a song and focused instead on just a few relatively “intense” moments? Thus Shazam creates a spectrogram for each song in its database—a graph that plots three dimensions of music: frequency vs. amplitude vs. time. The algorithm then picks out just those points that represent the peaks of the graph—notes that contain “higher energy content” than all the other notes around it, as Wang explained in an academic paper he published to describe how Shazam works (PDF). In practice, this seems to work out to about three data points per second per song. [...]
October 27, 2009 at 7:27 am |
For those interested in more details, I made my own implementation based on the paper, using the Matlab signal processing environment:
http://labrosa.ee.columbia.edu/~dpwe/resources/matlab/fingerprint/
If you can read Matlab source, you can see how it works. I played around with it a fair bit to try to get noise robustness similar to the Shazam app. I ended up needing a lot more than 3 landmarks/sec – more like 30.
October 27, 2009 at 11:32 am |
This is very cool, I’ll have to check out your code. The 3 points/second number was extrapolated from the graphs of the examples in their paper, but it might not be accurate for real-world use cases.
October 29, 2009 at 8:01 am |
[...] now I know, thanks to an article in Slate that linked to a explanation of the original paper describing the method. The key is in compressing the database of 2M tunes [...]
October 30, 2009 at 11:03 am |
Too good article…very well explained. Thanks for sharing the information !!
October 31, 2009 at 3:51 am |
It’s pretty easy to get a spectrogram of a wav file using Python using matplotlib:
http://www.yhvh.co.uk/blog/spectrogram-python
November 4, 2009 at 11:37 pm |
[...] http://laplacian.wordpress.com/2009/01/10/how-shazam-works/ [...]
November 11, 2009 at 8:24 pm |
Playing with Shazam fingerprints…
[...] A few weeks ago, I took some time to read the paper and implement [...]…
November 11, 2009 at 8:42 pm |
[...] http://aubio.org/news/20091111-2339_shazam [...]
November 20, 2009 at 5:23 pm |
[...] http://laplacian.wordpress.com/2009/01/10/how-shazam-works/ [...]
November 23, 2009 at 4:11 pm |
This is so much easier to understand than Wang’s paper :)
Thanks.
I’m still amazed by what Shazam is able to do.
The question that’s still lurking in my head is – how does the program know how far into a song it is? I’d think that people usually start Shazaming a song after it has started playing (oftentimes, a few minutes into the song). How can the program match all the peak intensities so fast? And they must need a huge server/powerful computers to process all these inquiries going in eh?
November 23, 2009 at 4:54 pm |
The program knows how far it is into the song because the time for each point is stored in the hash table. It looks up that time in O(1) which is why it is so fast. It does not do a linear search which you seem to be implying which is O(n) and much slower. Definitely read more about hashtables and you will understand why they are so powerful. Yes, they probably have large servers to handle the bandwidth and potentially tens of thousands of concurrent matching requests. The good thing is that the matching requests are read-only, so that makes the problem much easier.
November 24, 2009 at 8:15 am
Thanks Bryan :)
one more question – how do they deal with the “legal” issues? Did they actually buy all the songs stored in their massive database? (just out of curiosity)