How Shazam Works

January 10, 2009

There is a cool service called Shazam, which take a short sample of music, and identifies the song.  There are couple ways to use it, but one of the more convenient is to install their free app onto an iPhone.  Just hit the “tag now” button, hold the phone’s mic up to a speaker, and it will usually identify the song and provide artist information, as well as a link to purchase the album.

What is so remarkable about the service, is that it works on very obscure songs and will do so even with extraneous background noise.  I’ve gotten it to work sitting down in a crowded coffee shop and pizzeria.

So I was curious how it worked, and luckily there is a paper written by one of the developers explaining just that.  Of course they leave out some of the details, but the basic idea is exactly what you would expect:  it relies on fingerprinting music based on the spectrogram.

Here are the basic steps:

1. Beforehand, Shazam fingerprints a comprehensive catalog of music, and stores the fingerprints in a database.
2. A user “tags” a song they hear, which fingerprints a 10 second sample of audio.
3. The Shazam app uploads the fingerprint to Shazam’s service, which runs a search for a matching fingerprint in their database.
4. If a match is found, the song info is returned to the user, otherwise an error is returned.

Here’s how the fingerprinting works:

You can think of any piece of music as a time-frequency graph called a spectrogram.  On one axis is time, on another is frequency, and on the 3rd is intensity.  Each point on the graph represents the intensity of a given frequency at a specific point in time. Assuming time is on the x-axis and frequency is on the y-axis, a horizontal line would represent a continuous pure tone and a vertical line would represent an instantaneous burst of white noise.  Here’s one example of how a song might look:


Spectrogram of a song sample with peak intensities marked in red. Wang, Avery Li-Chun. An Industrial-Strength Audio Search Algorithm. Shazam Entertainment, 2003. Fig. 1A,B.

The Shazam algorithm fingerprints a song by generating this 3d graph, and identifying frequencies of “peak intensity.”  For each of these peak points it keeps track of the frequency and the amount of time from the beginning of the track.  Based on the paper’s examples, I’m guessing they find about 3 of these points per second. [Update: A commenter below notes that in his own implementation he needed more like 30 points/sec.]  So an example of a fingerprint  for a 10 seconds sample might be:

Frequency in Hz Time in seconds
823.44 1.054
1892.31 1.321
712.84 1.703
. . . . . .
819.71 9.943

Shazam builds their fingerprint catalog out as a hash table, where the key is the frequency.  When Shazam receives a fingerprint like the one above, it uses the first key (in this case 823.44), and it searches for all matching songs.  Their hash table might look like the following:

Frequency in Hz Time in seconds, song information
823.43 53.352, “Song A” by Artist 1
823.44 34.678, “Song B” by Artist 2
823.45 108.65, “Song C’ by Artist 3
. . . . . .
1892.31 34.945, “Song B” by Artist 2

[Some extra detail: They do not just mark a single point in the spectrogram, rather they mark a pair of points: the “peak intensity” plus a second “anchor point”.  So their key is not just a single frequency, it is a hash of the frequencies of both points.  This leads to less hash collisions which in turn speeds up catalog searching by several orders of magnitude by allowing them to take greater advantage of the table’s constant (O(1)) look-up time.  There’s many interesting things to say about hashing, but I’m not going to go into them here, so just read around the links in this paragraph if you’re interested.]


Top graph: Songs and sample have many frequency matches, but they do not align in time, so there is no match. Bottom Graph: frequency matches occur at the same time, so the song and sample are a match. Wang, Avery Li-Chun. An Industrial-Strength Audio Search Algorithm. Shazam Entertainment, 2003. Fig. 2B.

If a specific song is hit multiple times (based on examples in the paper I think it needs about 1 frequency hit per second), it then checks to see if these frequencies correspond in time.  They actually have a clever way of doing this  They create a 2d plot of frequency hits, on one axis is the time from the beginning of the track those frequencies appear in the song, on the other axis is the time those frequencies appear in the sample.  If there is a temporal relation between the sets of points, then the points will align along a diagonal.  They use another signal processing method to find this line, and if it exists with some certainty, then they label the song a match.

Add to DeliciousAdd to DiggAdd to FaceBookAdd to Google BookmarkAdd to MySpaceAdd to NewsvineAdd to RedditAdd to StumbleUponAdd to TechnoratiAdd to Twitter

Z2K9 and E2K9

January 4, 2009

This New Year’s we witnessed 2 catastrophic tech failures:  Zune and Evite.  On New Year’s Eve, Zune 30s all over the world froze.  Microsoft said it was due to “‘a bug in the internal clock driver,” and that affected Zunes would work the following day.  For evite, the site went down on one of the busiest party nights of the year around 6pm EST and did not come up until after the ball dropped.  I, along with possibly millions of others, could not access NYE party information which is only stored on the evite site.  This is particularly bad as evite has more competition than ever, and people are just looking for reasons to switch to something else more web 2.0-ish.

So, some industrious poster tracked down the Zune clock source code.  Here is the bad code, which is called whenever the Zune needs the current date:

year = ORIGINYEAR; /* = 1980 */

while (days > 365)
    if (IsLeapYear(year))
        if (days > 366)
            days -= 366;
            year += 1;
        days -= 365;
        year += 1;

This method take an argument of “days”, which is the number of days since 1/1/1980.  If days=366 and isLeapYear() is true, then this loop does nothing and will loop forever.  These conditions always occur on the last day of a leap year (e.g 12/31/2008) !  The bottom-line is that when you write a loop, you need to be certain it will terminate.  Having a convoluted assortment of if-statements it not a good way to accomplish this.

So we know the Zune crash was caused by a date bug, but the evite error has not been explained.  I donned my investigative reporting hat, and sent an email to evite support asking what went wrong on NYE.  They replied with an email which clearly was anot written by a human, which lead me to believe that perhaps the outage was caused by a take-over of robots.  Anyway, I did not give up that easily and replied again this time with a concise: “What caused the outage?”  I haven’t heard back yet, but I’ll keep you posted!

Universal Instruments

December 21, 2008

I’ve written before about how great Netflix streaming movies are.  The library isn’t spectacular; however, if you’re into documentaries, you can instantly watch a huge selection.  I love docs, and over the past couple weeks I’ve seen a few specifically about artists: Rivers and Tides, Scratch, B-Boy Planet, and Touch the Sound.

Scratch is about turntabalism, a musical form where the instruments are record players.  While these turntables produce highly unique sounds, they also are capable of mimicking any other instrument depending on the given input (i.e. record).  This will instantly remind any computer scientist of universal Turing machines, which are capable of mimicking any other Turing machine.  (Turing Machines, first defined by Alan Turing, are the basis of all computation including human intelligence (more about this in future posts)).

A musical instrument which can mimic any other might be called a “universal instrument.”  It seems there are 2 categories of universal machines: audio players which can play an arbitrary input (e.g. record players, cd players, mp3 players), and instruments which can change their tones in real-time (e.g. tone generators, synthesizers, computers).  Any device which can output arbitrary sine waves in unison can re-create any other instrument due to Fourier decomposition.

Some might debate whether a cd player is really an instrument, but it becomes clear when you realize a musician can manipulate the output through even rudimentary controls (e.g. seeking, volume).   A more realistic debate is whether such a player is really a universal instrument, if all of the information is only in the input string.  Interestingly, this parallels a similar debate about a disputed univeral Turing machine! Stephen Wolfram offered $25,000 for a proof that this machine was in fact universal.  It only has 2 states and 3 symbols and is the simplest possible universal machine, although some have debated whether the universality is only a result of a specific input, which actually encodes the universality, and not the machine itself.

Group Theory

December 14, 2008

I already wrote about Set theory and I eventually want to write about symmetry, so this seems like the perfect opportunity to discuss Group theory.  Let’s start where we left off with Set theory.  We had a very abstract concept of a Set: a collection of “things”.  Then we added some basic structure to that collection (the operation of set concatenation), and before we knew it we were counting and adding.

Well, to formalize it a bit, when you take a set and add a binary operation, and an identity element, then what you get is called a monoid.  Monoids are extremely abstract yet very powerful.  (In a way they even form the underpinning of all of computer science, where the set is a collection of functions, and the product is simply chaining these functions together, called function composition, and the identity is simply a “noop“.)  I’ll talk more about this in a future post because it is so cool how these ideas tie together.

A Group is just like a Monoid, except every element in the set must also contain its inverse.  When an element operates on its inverse, it results in the identity!  Some basic examples, are the integers\mathbb{Z}, where every element also has its addtive inverse, called its “negative”.  Another example are the rationals, \mathbb{Q}, where every element has its multiplicative inverse, called its “reciprocal”.  In the former example, the operation is addition and addtive identity is 0, while in the latter the product is multiplication and the multiplicative identity is 1.  Both of these groups also commute (i.e. the order of operation does not matter), and we call these groups “abelian“.  They are named after the mathematician Niels Henrik Abel, but strangely the word is not capitalized.

These groups may seem pretty abstract at first, but here are a few more tangible examples.  First, is that to physicists, a set with elements and their inverses will immediately remind one of a set of particles and their antiparticles.  But we can actually get much more tangible than that if we pull the same trick we pulled above with our monoid:  instead of thinking about objects, get Categorical, and think about processes.  It’s easy to imagine a set of transformations and their inverses: raise and lower, shrink and grow, cut and paste.  All of these examples have the identity of  “do nothing”.

Groups are strongly related to symmetry because an object which is equivalent to the original after transformations is at the heart of symmetry!  In future posts we will talk about the groups of rotations and reflections to explore these concepts.

Biodynamic Marketing

November 30, 2008

Marketing is not only a good way to sell a product, but, it is also a remarkable way to make the product  “better” than other products which are actually equivalent.  If you read, Malcolm Gladwell’s book, Blink: The Power of Thinking Without Thinking, you might remember the research of Louis Cheskin.  He was one of the first researchers to apply scientific principles to marketing.  In several studies, by modifying only the packaging, the consumers would say one brand tasted better than another which was actually identically except for the packaging!

I live in the San Francisco Bay Area, and there is nothing quite like the wine market in this area.  Literally thousands of different wines, which probably only experts could discern between, are fighting for consumers dollars.  One variety of wine which is doing quite well in the Bay Area marketplace, is labeled as “Biodynamic”.  I’ve had biodyanic wine in the past, I just assumed it meant “organic” and/or grown properly with the local ecology (whatever that means).  I was never sure what it meant really, except it was obvious the name was marketing gold.  I didn’t think much about it though, until I read a remarkable article in a local free paper, SF Weekly, about the phenomenom.

It turns out, Biodyanic wine is based on the occult, astrology, and bullshit (yes, both meanings).  There are ritualistic preparations involving burying animal parts stuffed with various substances where they can ferment for a season.  All of this is done in “correspondence” with the stars and planets.

For example preparation 502:

Yarrow flowers are buried sheathed in a stag’s bladder. This is hung in the summer sun, buried over winter, then dug up the following spring. The bladder’s contents are removed and inserted in the compost (the used bladder is discarded).




This leads to an interesting marketing challenge: focus on the term “biodynamic” and other hot marketing buzzwords such as “organic”, “green”, and “eco-friendly”, but hide the underlying occult rituals.  We must wait and see how the marketing proceeds, but I know one thing: I will stay away from these wacky wines!

Our Ribosome

November 23, 2008

When you think about nano-machines, which perform a complex mechanical operations at the molecular level, you might think of science-fiction and the promise of nanotechnology.  Well, one of the most incredible nanomachines is in our cells: the Ribosome!  The magnificent contraption created after hundreds of millions of years of evolution reads an mRNA strand as input, and outputs a protein made of amino acids.  (Yes, we read about these chilralitic wonders in a previous post.)

The ribosome is found in all living cells, both Eukaryotes (protozoa, algae, humans, etc . . .) and Prokaryotes (bacteria and archaea).  It is composed of 2 subunits, one large and one small.  For Prokaryotes the small subunit is labeled 30S, and the large subunit 50S.  For Eukaryotes, they are 40S and 60S respctively.  (The S is for Svedberg units, a measure of sedimentation.)  The are a composed of a complicated intertwining of proteins and rRNA and look like this:

the small subunit on the left contains an RNA molecule (cyan) and 20 proteins (dark blue); the large subunit on the right contains two RNA molecules (grey and slate) and more than 30 proteins (magenta). The image also shows a transfer RNA (orange) bound to the active site of the ribosome. (Harry Noller, UCSC)

The small subunit on the left contains an RNA molecule (cyan) and 20 proteins (dark blue); the large subunit on the right contains two RNA molecules (grey and slate) and more than 30 proteins (magenta). The image also shows a transfer RNA (orange) bound to the active site of the ribosome. (Harry Noller, UCSC)

Here’s how the process works in Prokaryotes:

  1. After mRNA is transctibed from DNA, it reaches (I’m not sure how), the small Ribosome 30S subunit.  Attached to 30S are 2 initiation factors, IF1 and IF3, which keep 30S seperated from the larger 50S.  the mRNA contains a special start codon (usually AUG), which marks the beginning of the mRNA sequence to read for protien synthesis.  A codon is a set of 3 genetic bases, which corresponds to a specific amino acid, but more about that later.
  2. IF2 binds to the intitiator tRNA which holds the start anticodon, and IF2 brings this initiator tRNA to the ribosome’s P-site.  If the initiator codon is AUG, the anticodon on the initiator tRNA will be UAC.  Those sequences will bind together.
  3. Once IF2 engages, it deposits the initialtor tRNA at the P-site, then all IFs disengage.  This allows 50S to attach with 30S, surround the P-site.
  4. The tRNA which contains the anticodon for the next codon triplet in the mRNA sequence is guided into the A-Site inside the ribosome.  Each tRNA contains an anticodon on one end and an amino acid on the other end.  It is these amino acids which will be bound through peptide bond formation to form our protein!
  5. After the bond is formed, the tRNA in the P-site is ejected out of the ribosome, and the tRNA in the A-Site is ratcheted into its place.  Then the cycle repeats until the special stop codon on the mRNA is reached.
  6. When the stop codon is reached, the subunits disengage, and IF1 and 2 re-engage 30S, starting the process all over gain.

There is a large amount of research into modeling this incredible process.  One example is this remarkable video produced by the Weizmann and Max-Planck Institutes:

UPDATE:  There is some debate as to the realism of this video.  Unfortunately I have not found much information on it, which leads me to believe that this video is only remotely representative of actual processes.   It is still quite instructive, so please just take it with a grain of salt.

P/E 10

November 18, 2008

Here is a logarithmic chart of the S&P 500 since 1950:


It appears to consistently grow at an exponential rate.  Remember this is a logarithmic graph, so exponential growth is represented as a straight line.  This is growth of about 6% per year over 68 years.  If you invested $1000 in 1950, that would be worth about $50,000 today.   If you kept on adding $1000 per year, you would have nearly $1 million. That’s simply the magic of compounding interest.

Complex dynamic systems, like the stock market, can often be described by fractals.  One attribute of fractals is that they are self-similar, that is if you zoom in on one part of a fractal, it often appears similar to the whole image.  The same phenomenon happens with coastlines and plants for example, but more about that in other post.

Anyway, when a non-trivial period of time does not follow this basic growth rate, people begin to ask why.  Take, for example, this chart of the S&P 500 of the past 12 months:


Some features of this chart appear to resemble the 68 year chart above, except for one glaring exception:  it appears upside down!  A long investor sees this chart and they cry themselves to sleep :0 .  In this chart the growth rate is a dismal -40%.

When this happens, investors look for people who predicted such a huge divergence from the trend.  One predictor, getting such attention right now, is the famous economist Robert Shiller.  In 2000, he wrote the book Irrational Exuberance, where he explained why he believed the stock market was overvalued.  One of his main statistical weapon of choice, is called the P/E 10 (or Shiller P/E).  It is a chart of price-to-earnings ratios, but earnings are averaged over the previous 10 years instead of the usual 1 year.  Here is a chart from James Hamilton’s economics blog:

On average, when the market has a P/E of greater than 20, it means generally that prices will fall and/or earnings will rise.  Conversely, when the P/E 10 is less than 10, it means generally that prices will rise and/or earnings will fall.  The historical P/E 10 average is 16.3 (the red line in the chart), while the current P/E 10 is 14.  This measure demonstrates that the market is reasonably valued.

Hindsight is 20/20, so is this chart actually predictive?  Well, we’ll have to wait and see, but the idea behind it is there are times when the market is over-valued and times when it is under-valued.  This seems logical, but according to the Efficient-Market Hypothesis, this will never happen because the market defines the value.  Shiller is an economic behaviorist, and believes the market is not efficient but rather based largely on psychological factors which are not entirely logical.  To me, both positions seem to make sense, and I intuitively feel (I’m sure naively) that there is common ground with these two perspectives.  (My basic idea has to do with the relativity of value by individual investors and specifically their time-frame in which they must sell the equity.)

One last interesting fact about the P/E 10: it is difficult to find!  You cannot go to any financial site and pull up the P/E 10 for a given index or equity.  In fact the chart above was built from Shiller’s own data in Excel format which he thankfully distributes on his site!

UPDATE: Today, Gregory Mankiw posted this remarkable graph on his blog:


It’s hard to look at this and deny the inverse relationship between the P/E 10 and future earnings.  Currently we’re smack-in-the-middle of the x-axis, and you can see how future returns along the vertical appears random.  Just a few years ago we would have been on the right-side of the chart, leading to the conclusion that returns would likely decline.

Abstract Thinking

November 16, 2008

As a software engineer, abstracting problems is an important step in creating modular code.  Good abstract concepts are not only useful for a given problem, but they can also be applied in a variety of different situations.  It is especially exciting when 2 seemingly completely different problems can be described by the same abstraction, as this can lead to new insights and perspectives.  In fact, this paragraph itself can be abstracted by applying it to various fields besides software engineering!  I’m sure this analysis of the analysis would make Douglas Hofstadter proud ;) .

The most abstract ideas are those in mathematics.  Alone, the rules of math are not terribly useful (like a software interface without an underlying implementation), but they can be applied in extremely powerful ways to all other fields.  Let’s take the simple, yet incredible, example of numbers.

What is a number?  One explanation is to think of a practical example: pretend you have a basket of tomatoes.  You may want to count how many tomatoes you have and assign some label as the result.  These labels are numbers (or more specifically integers).  Why would you perform such a measurement?  Well in this case it seems very useful to take an inventory for storing, selling, cooking, or whatever you would like to do with your tomatoes. Now let’s say someone else gives you some more tomatoes or takes some away from you.  Numbers appear useful here too.  So we can define “operations” of addition and subtraction to accomplish this.

This concept of numbers and operations is abstract enough that you can clearly apply it to more than just baskets of tomatoes! In fact, you can apply it to “sets” composed of any “element” you can possibly imagine!  That is what a good abstraction is all about.

Let’s get formal about this in mathematical terms using Set Theory.  Given a Set S, there is a function f:S\rightarrow \mathbb{Z} which takes a set and counts how many elements are in it. This measure is called the cardinality of the set.

This function is often expressed by inserting the set in between vertical bars . Here are some examples of sets and their cardinality:

  • The empty set: |{}| = 0
  • A set of one tomato: |({tomato}| = 1
  • A set of letters: |({A, B, C, X, Y, Z}| = 6
  • A set of sets: |{{telephone}, {}, {1, 3}}| = 3

We can also talk about adding sets with another function: g:S\times S\rightarrow S.  This function takes two sets and outputs a third.  You can come up with a bunch of simple functions which do this, but one nice example is a function which concatenates the elements from both input sets and puts them into a single output set.  For example:$latex g({1, 3, 5}, {A}) = {1, 3, 5, A}.  What’s nice about this is that we can use this function to perform addition! since |{1, 3, 5}| = 3, |{A}| = 1, and |{1, 3, 5, A}| = 4.  And what is the additive identity?  It’s the empty set of course!  You can obviously cook up a bunch of other functions which relate to other mathematical operations as well.

This is all just scratching the surface, so we’ll have more cool stuff to go over in future posts.

Is Macro-Economics a Science?

November 14, 2008

A couple of posts ago I asked whether leading researchers in fields besides physics actively debate on blogs.  Well the answer is a resounding “YES,” and one example I recently came across is an active debate between world-class economists on causes of the Great Depression.  Even recent Nobel Prize winner Paul Krugman gets involved!  What I find so surprising about this is they were debating the core concepts of one of the most important events in 20th century economics based on what seemed to be philosophical grounds: whether they were “Keynesians” or “Monetarists.”  This also somewhat corresponds to the political parties of Democrats and Republicans respectively.

The basic debate is whether it is better for wealth to be controlled by public (Keynesians) or private (Monetarists) entities.  On these blogs the disagreement is whether Keynesian policies caused the depression or brought the country out of it.  The fact that there is still no consensus by experts on the causes of the Great Depression is especially troubling at the present time since we appear to be entering a serious recession.

I believe the main problem is that it is difficult to experimentally test macro-economic theories, so economists are left to follow their philosophies.  We can only hope that in the future there will be ways to actually test theories, because otherwise is it really a science?

UPDATE: In one week, Greg Mankiw defends Keynsian policies during this recession, and he takes Krugman to task for using philophy instead of a metric!  He appears to be the good exception to the rule, by not letting his polical views interfere with his analysis.

How Polling Works

November 13, 2008

Terrence Tao has a great blog entry about the mathematics of polling.  One thing he points out, which to me at first was unintuitive, is that polling accuracy is independent of the total population size given a few basic assumptions (simple random sampling , honest responses, fixed poll size, etc…)  I higly recommend reading his post, if only for his intuitive examples of why the total population size is irrelevant.

[Update: some of his examples are so good, so I will just share them here. To test the salinity of the ocean accurately how many water samples would you need to take?  Although the ocean contains ~10^{21} liters of water, only a sample of a few milliliters (10^{-3} ) is necessary to get a good estimation.   That’s -24 orders of magnitude (compared to  -5 orders of magnitude for presidential polls).   Although I would guess that the entropy (uniformity) of ocean salinity is much higher than that of political opinions.  Another example is identifying faces.  Faces have their appearance based on millions of cells, however, people can readily identify a face based on only a few bytes of data such as a black and white pixelated image.]

So what can we do with this polling magic?  FiveThirtyEight is a great website where baseball statistician Nate Silver used his techniques to predict the presidential election.   He predicted, before the polls opened, Obama would receive 349 electoral votes and he actually received 365.  The only states he missed were Indiana and Missouri — pretty impressive.

Another great site which deals with predictions is Intrade.  This is where people buy contracts on certain future events.  The price of the contract reflects how likely the market believes it to be true. (although as economist Gregory Mankiw has pointed out this might not necessarily be true, since it is also an effective hedging instrument.)   I believe Intrade also only missed predicting Indiana and Missouri before the polls opened.