Archive for the ‘Computer Science’ Category

How IBM’s Watson computer “thinks” on Jeopardy

February 27, 2011

A computer on Jeopardy soundly defeated two of the best human players of all time in a two-game event.  In this article I will attempt to explain how this was achieved, using as my main source an article published in AI Magazine by some of the lead engineers on the project.  I’ll focus on the software algorithms which made this apparent “thinking” possible, and gloss over most of the hardware specifics as well as the Jeopardy-specific details.  In some ways the hardware and Jeopardy rules are arbitrary: within 10 years we’ll all be able to run a “Watson” on our cell phones, and, as has already been done, Watson can be applied successfully to other, non-Jeopardy QA tasks.  At the end of this article, I’ll add my own opinion on the significance of this achievement.

Overview

Watson is the product of IBM’s DeepQA project, whose goal is to develop a computer system which excels at the classic AI problem of question answering (QA).  As with any complicated engineering system, one of the best ways to learn how it works is with a high-level diagram:

DeepQA High-Level Architecture. (Ferrucci 2010, fig. 6)

Two attributes of this system stand out in the diagram.  The first is that Watson does a two-pass search: a broad “primary search” followed by a more detailed “evidence scoring” search.  I’ll talk more about these below.  The other is a parallel path (the gray boxes) for breaking a single question into multiple ones, when the question has multiple parts or several potential meanings due to ambiguities in the language.  Watson evaluates all of the resulting questions in parallel and compares the results.  Anyway, let’s go into more detail on each of the steps.
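
To make the shape of the pipeline concrete, here is a minimal sketch in Python of how I read the diagram.  The function names are my own stand-ins, not IBM’s API; each stage gets a bit more detail in the sections below.

# A rough sketch of the DeepQA flow as I read the diagram (my own stand-in
# names, not IBM's API). The point is the shape: analyze, search broadly,
# filter cheaply, gather evidence per candidate, then merge and rank.
def deep_qa(question, analyze, primary_search, soft_filter, score_evidence, rank):
    analysis = analyze(question)                      # question analysis / decomposition
    candidates = primary_search(analysis)             # pass 1: broad, recall-oriented search
    candidates = soft_filter(analysis, candidates)    # cheap trim before the expensive pass
    evidence = score_evidence(analysis, candidates)   # pass 2: per-candidate evidence search
    return rank(evidence)                             # merge equivalents, rank by confidence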

Content Acquisition

Before the first question can even be answered, Watson must acquire its knowledge.  This is a non-real-time process, done before and independently of the actual question answering.  The resulting knowledge base contains many sources of structured (e.g., a database or triplestore) and unstructured (e.g., text passages) data.

. . . the sources for Watson include a wide range of encyclopedias, dictionaries, thesauri, newswire articles, literary works,  . . . dbPedia, WordNet (Miller 1995), and the Yago ontology. (Ferrucci 2010, p. 69)

Watson accesses this content during the real-time answering process, within the Hypothesis Generation and the Hypothesis and Evidence Scoring steps described below.

Question Analysis

When a human is asked a question, deriving the semantic meaning of the question (what it is actually asking) is usually the easy part; for AI, understanding even simple questions is a formidable task.  Question analysis is where Watson determines how many parts or meanings are embedded in the question.  Later in the process, Watson will analyze the resulting questions separately (the gray boxes in the diagram above) and compare the results.  In this stage, Watson also determines the expected lexical answer type (LAT).

. . . a lexical answer type is a word or noun phrase in the question that specifies the type of the answer without any attempt to understand its semantics.  Determining whether or not a candidate answer can be considered an instance of the LAT is an important kind of scoring and a common source of critical errors. (Ferrucci 2010, p. 70)
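
The paper doesn’t spell out how LAT detection works, but a toy sketch makes the idea concrete.  In many Jeopardy clues the LAT is simply the noun right after “this” or “these”; the snippet below is my own heuristic, nothing like Watson’s full parsing machinery.

import re

def guess_lat(clue):
    # Toy heuristic, not Watson's method: take the word right after "this"/"these".
    m = re.search(r"\b(?:this|these)\s+(\w+)", clue, re.IGNORECASE)
    return m.group(1).lower() if m else None

print(guess_lat("In 1903 this president created the Department of Commerce and Labor"))
# prints: president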

Hypothesis Generation

The next step is to use the data from the question-analysis step to generate candidate answers (hypotheses).  Watson does this by executing a primary search across its answer sources (described above), a process known as information retrieval, of which Internet search engines are a familiar example.

A variety of search techniques are used, including the use of multiple text search engines with different underlying approaches (for example, Indri and Lucene), document search as well as passage search, knowledge base search using SPARQL . . . . (Ferrucci 2010, p. 71)

These results are used to generate candidate answers, which will later be scored and compared.  The hypotheses are created in different ways depending on the type of data they came from.  For example, an answer drawn from a text passage (unstructured data) might come from named entity recognition techniques, while an answer from a knowledge base (structured data) might be the exact result of the query.

If the correct answer(s) are not generated at this stage as a candidate, the system has no hope of answering the question. This step therefore significantly favors recall over precision, with the expectation that the rest of the processing pipeline will tease out the correct answer, even if the set of candidates is quite large. (Ferrucci 2010, p.72)
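
To see why recall is favored so heavily, here is a toy stand-in (my own, far cruder than Watson’s named entity recognition) for pulling candidate answers out of a retrieved passage; it happily generates junk along with the right answer and leaves the later stages to sort them out.

import re

def candidates_from_passage(passage):
    # Toy stand-in for NER: treat every capitalized phrase as a candidate answer.
    return set(re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", passage))

print(candidates_from_passage(
    "Theodore Roosevelt signed the act creating the Department of Commerce and Labor."))
# e.g. {'Theodore Roosevelt', 'Department', 'Commerce', 'Labor'}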

Watson is configured to return about 250 candidate answers at this stage, which are then trimmed down to about 100 by soft filtering.  These soft filters are simpler and far cheaper to run than the intensive evidence scoring which follows.  A basic soft filter estimates the likelihood that a candidate answer is an instance of the LAT determined during question analysis.
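
As a hypothetical sketch of what a soft filter might look like (the real filters and their type model are IBM’s and aren’t published), you could rank candidates by a cheap LAT-match score and keep only the best ~100:

def soft_filter(candidates, lat, lat_match_prob, keep=100):
    # Hypothetical sketch: rank the ~250 candidates by a cheap estimate of
    # whether each is an instance of the LAT, and pass only the top `keep`
    # onward. lat_match_prob(candidate, lat) stands in for a lightweight
    # type-matching model.
    ranked = sorted(candidates, key=lambda c: lat_match_prob(c, lat), reverse=True)
    return ranked[:keep]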

Hypothesis and Evidence Scoring

In this step, Watson performs new searches that look for evidence that each generated hypothesis is the correct answer.  For example:

One particularly effective technique is passage search where the candidate answer is added as a required term to the primary search query derived from the question.  This will retrieve passages that contain the candidate answer used in the context of the original question terms. (Ferrucci 2010, p. 72)

The candidate answers and their corresponding evidence are then sent to various scorers.  Watson uses about 50 different scorers, which run in parallel; their results are combined at the end.

These scorers consider things like the degree of match between a passage’s predicate-argument structure and the question, passage source reliability, geospatial location, temporal relationships, taxonomic classification, the lexical and semantic relations the candidate is known to participate in, the candidate’s correlation with question terms, its popularity (or obscurity), its aliases, and so on. (Ferrucci 2010, p. 72)
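
The paper doesn’t publish the scorers themselves, but their interface is easy to imagine: each one maps a (question, candidate, evidence) triple to a number, and together they produce a feature vector per candidate.  Here’s a sketch with two made-up stand-ins; neither is a real Watson scorer.

def term_overlap_score(question, candidate, passages):
    # Crude stand-in: average overlap between question terms and each evidence passage.
    q_terms = set(question.lower().split())
    if not passages:
        return 0.0
    return sum(len(q_terms & set(p.lower().split())) for p in passages) / len(passages)

def source_reliability_score(question, candidate, passages):
    return 1.0  # stand-in: a real scorer would consult per-source reliability statistics

SCORERS = [term_overlap_score, source_reliability_score]

def score_candidate(question, candidate, passages):
    # One number per scorer -> a feature vector for the final ranking stage.
    return [scorer(question, candidate, passages) for scorer in SCORERS]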

Final Merging and Ranking

This final processing stage first merges answers which are deemed equivalent, using basic matching techniques and more sophisticated coreference-resolution algorithms.  Scores from equivalent answers are then combined.  Finally, Watson uses machine-learning techniques to assign each merged answer a confidence level, an estimate of how likely it is to be correct.  If the confidence of the highest-ranking hypothesis is high enough, it is presented as the answer.
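
Putting the last stage into a hedged sketch (again my own stand-in, not IBM’s implementation): merge equivalent answers by combining their feature vectors, then run the merged vectors through a trained model, here just a logistic function, to get a confidence for each answer.

import math

def merge_and_rank(scored, weights, bias, are_equivalent):
    # scored: {answer text: feature vector from the scorers}. are_equivalent
    # stands in for the matching / coreference logic; weights and bias play
    # the role of the trained ranking model.
    merged = {}
    for answer, features in scored.items():
        for kept in merged:
            if are_equivalent(answer, kept):
                merged[kept] = [a + b for a, b in zip(merged[kept], features)]
                break
        else:
            merged[answer] = list(features)
    def confidence(features):
        z = bias + sum(w * f for w, f in zip(weights, features))
        return 1.0 / (1.0 + math.exp(-z))  # logistic confidence in [0, 1]
    return sorted(((a, confidence(f)) for a, f in merged.items()),
                  key=lambda pair: pair[1], reverse=True)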

Final Thoughts

There is much debate over the significance of this achievement, even within the AI community.  I personally feel it is very significant: a task we once thought was reserved for human thought has now been shown to be possible for machines.  It’s easy to forget that before Deep Blue defeated Kasparov in 1997, that task was also thought to be impossible, and now chess programs running on cell phones can beat grandmasters.  Before we know it, deep QA software will be running on cell phones as well.

This also sets the stage for what is considered the hallmark test of AI, the Turing Test.  Again, the task still seems impossible, but perhaps a bit less so after Watson.  After the Turing Test is passed, people will debate whether it is “true” intelligence, but if you cannot measure the difference between computer intelligence and human intelligence, that doesn’t seem to be a very valuable distinction.

E2K9 update

February 8, 2009

I love emailing customer service and getting back meaningless form letters.  Trying to get to the bottom of what happened to Evite on New Year’s Eve led to a typical exchange, including one point where they did not respond, so I replied again and added a few corporate CCs.  This is an excellent tactic for getting a response.  Anyway, seven emails later I ended up with the following information from a senior CSR:

“This outage was due to unforeseen system difficulties.”  and “Unfortunately,  . . . we cannot provide the specific system issues.”

As you can see from the following screen shot, this contradicts the message that the disruption was due to “planned maintenance.”

Evite was very fortunate that there was not more public fallout from this.

How Shazam Works

January 10, 2009

There is a cool service called Shazam, which takes a short sample of music and identifies the song.  There are a couple of ways to use it, but one of the more convenient is to install their free app on an iPhone.  Just hit the “tag now” button, hold the phone’s mic up to a speaker, and it will usually identify the song and provide artist information, as well as a link to purchase the album.

What is so remarkable about the service is that it works on very obscure songs, and it will do so even with extraneous background noise.  I’ve gotten it to work sitting in a crowded coffee shop and in a pizzeria.

So I was curious how it worked, and luckily there is a paper written by one of the developers explaining just that.  Of course they leave out some of the details, but the basic idea is exactly what you would expect:  it relies on fingerprinting music based on the spectrogram.

Here are the basic steps:

1. Beforehand, Shazam fingerprints a comprehensive catalog of music, and stores the fingerprints in a database.
2. A user “tags” a song they hear, which fingerprints a 10-second sample of audio.
3. The Shazam app uploads the fingerprint to Shazam’s service, which runs a search for a matching fingerprint in their database.
4. If a match is found, the song info is returned to the user, otherwise an error is returned.

Here’s how the fingerprinting works:

You can think of any piece of music as a time-frequency graph called a spectrogram.  On one axis is time, on another is frequency, and on the 3rd is intensity.  Each point on the graph represents the intensity of a given frequency at a specific point in time. Assuming time is on the x-axis and frequency is on the y-axis, a horizontal line would represent a continuous pure tone and a vertical line would represent an instantaneous burst of white noise.  Here’s one example of how a song might look:

Spectrogram of a song sample with peak intensities marked in red. Wang, Avery Li-Chun. An Industrial-Strength Audio Search Algorithm. Shazam Entertainment, 2003. Fig. 1A,B.

The Shazam algorithm fingerprints a song by generating this 3D graph and identifying frequencies of “peak intensity.”  For each of these peak points it keeps track of the frequency and the amount of time from the beginning of the track.  Based on the paper’s examples, I’m guessing they find about 3 of these points per second. [Update: A commenter below notes that in his own implementation he needed more like 30 points/sec.]  So an example of a fingerprint for a 10-second sample might be:

Frequency (Hz)    Time (s)
823.44            1.054
1892.31           1.321
712.84            1.703
. . .             . . .
819.71            9.943
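
The paper doesn’t include code, but the peak-picking step is easy to sketch with standard tools.  Here’s a rough Python version (assuming numpy and scipy are available; the neighborhood size and threshold are my guesses, not Shazam’s parameters):

import numpy as np
from scipy import signal
from scipy.ndimage import maximum_filter

def fingerprint(samples, rate, neighborhood=20):
    # Rough sketch, not Shazam's code: compute a spectrogram, keep the points
    # that are local intensity maxima, and return them as (frequency, time) pairs.
    freqs, times, sxx = signal.spectrogram(samples, fs=rate)
    peaks = (sxx == maximum_filter(sxx, size=neighborhood)) & (sxx > np.median(sxx))
    f_idx, t_idx = np.nonzero(peaks)
    return [(freqs[i], times[j]) for i, j in zip(f_idx, t_idx)]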

Shazam builds its fingerprint catalog as a hash table, where the key is the frequency.  When Shazam receives a fingerprint like the one above, it takes the first key (in this case 823.44) and searches for all matching songs.  The hash table might look like the following:

Frequency (Hz)    Time (s), song information
823.43            53.352, “Song A” by Artist 1
823.44            34.678, “Song B” by Artist 2
823.45            108.65, “Song C” by Artist 3
. . .             . . .
1892.31           34.945, “Song B” by Artist 2
[Some extra detail: they do not mark just a single point in the spectrogram; rather, they mark a pair of points: the “peak intensity” plus a second “anchor point.”  So the key is not a single frequency, it is a hash of the frequencies of both points.  This leads to fewer hash collisions, which in turn speeds up catalog searching by several orders of magnitude by allowing them to take full advantage of the table’s constant-time (O(1)) look-ups.  There are many interesting things to say about hashing, but I’m not going to go into them here, so read around the links in this paragraph if you’re interested.]
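
Here’s a rough sketch of what that pair-based catalog might look like.  The key packing and the number of pairings per peak are my guesses; Shazam’s exact scheme isn’t public.

from collections import defaultdict

def pair_key(f1, f2, dt):
    # Hypothetical packing: both peak frequencies plus the time gap between them.
    return (round(f1), round(f2), round(dt, 2))

def build_catalog(songs):
    # songs: {title: [(freq, time), ...]}  ->  {pair key: [(song time, title), ...]}
    catalog = defaultdict(list)
    for title, peaks in songs.items():
        peaks = sorted(peaks, key=lambda p: p[1])
        for i, (f1, t1) in enumerate(peaks):
            for f2, t2 in peaks[i + 1:i + 6]:   # pair each peak with a few nearby ones
                catalog[pair_key(f1, f2, t2 - t1)].append((t1, title))
    return catalog

def lookup(catalog, sample_peaks):
    # Returns (song time, sample time, title) hits for a fingerprinted sample.
    hits = []
    peaks = sorted(sample_peaks, key=lambda p: p[1])
    for i, (f1, t1) in enumerate(peaks):
        for f2, t2 in peaks[i + 1:i + 6]:
            for song_time, title in catalog.get(pair_key(f1, f2, t2 - t1), []):
                hits.append((song_time, t1, title))
    return hits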

Top graph: Songs and sample have many frequency matches, but they do not align in time, so there is no match. Bottom Graph: frequency matches occur at the same time, so the song and sample are a match. Wang, Avery Li-Chun. An Industrial-Strength Audio Search Algorithm. Shazam Entertainment, 2003. Fig. 2B.

If a specific song is hit multiple times (based on examples in the paper, I think it needs about 1 frequency hit per second), Shazam then checks whether these frequency hits correspond in time.  They have a clever way of doing this.  They create a 2D plot of the frequency hits: on one axis is the time at which those frequencies appear in the catalog song, and on the other is the time at which they appear in the sample.  If the two sets of times are related, the points will line up along a diagonal.  They use another signal-processing technique to find this line, and if it exists with enough certainty, they label the song a match.
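
A simple way to see that diagonal (my simplification of the paper’s approach, not their actual detector) is to histogram the time offsets: if the sample really comes from a given song, song time minus sample time is roughly constant across hits, so one bin piles up.  A sketch, using the hits produced by the lookup above; the thresholds are my guesses:

from collections import Counter

def best_match(hits, min_aligned=20):
    # hits: (song time, sample time, title) triples from the catalog lookup.
    offsets = Counter((title, round(song_t - sample_t, 1))
                      for song_t, sample_t, title in hits)
    if not offsets:
        return None
    (title, offset), count = offsets.most_common(1)[0]
    return title if count >= min_aligned else None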

Z2K9 and E2K9

January 4, 2009

This New Year’s we witnessed two catastrophic tech failures: Zune and Evite.  On New Year’s Eve, Zune 30s all over the world froze.  Microsoft said it was due to “a bug in the internal clock driver,” and that affected Zunes would work again the following day.  As for Evite, the site went down on one of the busiest party nights of the year, around 6pm EST, and did not come back up until after the ball dropped.  I, along with possibly millions of others, could not access New Year’s Eve party information that is stored only on the Evite site.  This is particularly bad, as Evite has more competition than ever and people are just looking for reasons to switch to something more web 2.0-ish.

So, some industrious poster tracked down the Zune clock source code.  Here is the bad code, which is called whenever the Zune needs the current date:

year = ORIGINYEAR; /* = 1980 */

while (days > 365)
{
    if (IsLeapYear(year))
    {
        if (days > 366)
        {
            days -= 366;
            year += 1;
        }
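        /* note: when days == 366 in a leap year, nothing changes here and the loop repeats forever */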
    }
    else
    {
        days -= 365;
        year += 1;
    }
}

This method takes an argument “days”, which is the number of days since 1/1/1980.  If days is 366 and IsLeapYear(year) is true, then the body of the loop does nothing and it will loop forever.  These conditions always occur on the last day of a leap year (e.g., 12/31/2008)!  The bottom line is that when you write a loop, you need to be certain it will terminate, and a convoluted assortment of if-statements is not a good way to accomplish that.
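
For contrast, here’s one way to write the same computation so that every pass through the loop makes progress.  This is a sketch in Python rather than the actual firmware, and it is not Microsoft’s eventual fix; it just shows the loop structure I have in mind.

ORIGIN_YEAR = 1980

def is_leap_year(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def year_from_days(days):
    # days: number of days since 1/1/1980, where 1 means January 1, 1980.
    # Each iteration strictly decreases days, so the loop must terminate --
    # including on day 366 of a leap year.
    year = ORIGIN_YEAR
    while True:
        year_length = 366 if is_leap_year(year) else 365
        if days <= year_length:
            return year, days      # (year, day of year)
        days -= year_length
        year += 1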

So we know the Zune crash was caused by a date bug, but the Evite error has not been explained.  I donned my investigative-reporting hat and sent an email to Evite support asking what went wrong on NYE.  They replied with an email which clearly was not written by a human, which led me to believe that perhaps the outage was caused by a take-over of robots.  Anyway, I did not give up that easily and replied again, this time with a concise: “What caused the outage?”  I haven’t heard back yet, but I’ll keep you posted!

Universal Instruments

December 21, 2008

I’ve written before about how great Netflix streaming movies are.  The library isn’t spectacular; however, if you’re into documentaries, you can instantly watch a huge selection.  I love docs, and over the past couple weeks I’ve seen a few specifically about artists: Rivers and Tides, Scratch, B-Boy Planet, and Touch the Sound.

Scratch is about turntablism, a musical form where the instruments are record players.  While these turntables produce highly distinctive sounds, they are also capable of mimicking any other instrument depending on the given input (i.e., the record).  This will instantly remind any computer scientist of universal Turing machines, which are capable of mimicking any other Turing machine.  (Turing machines, first defined by Alan Turing, are the basis of all computation, including human intelligence (more about this in future posts).)

A musical instrument which can mimic any other might be called a “universal instrument.”  It seems there are two categories of universal instruments: audio players which can play an arbitrary input (e.g., record players, CD players, mp3 players), and instruments which can change their tones in real time (e.g., tone generators, synthesizers, computers).  Any device which can output arbitrary sine waves in unison can re-create any other instrument, thanks to Fourier decomposition.
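
As a tiny illustration of that Fourier point (a sketch assuming numpy, not tied to any particular instrument): a square wave is just the sum of odd-harmonic sine waves, so a box that can only emit sine tones in unison can approximate it as closely as you like.

import numpy as np

def square_wave_from_sines(freq, n_harmonics, rate=44100, seconds=1.0):
    # Fourier synthesis: sum the odd harmonics sin(2*pi*k*f*t)/k to approximate
    # a square wave; adding more harmonics sharpens the approximation.
    t = np.arange(int(rate * seconds)) / rate
    wave = np.zeros_like(t)
    for k in range(1, 2 * n_harmonics, 2):   # odd harmonics 1, 3, 5, ...
        wave += np.sin(2 * np.pi * k * freq * t) / k
    return 4 / np.pi * wave

approx = square_wave_from_sines(440.0, n_harmonics=10)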

Some might debate whether a CD player is really an instrument, but it becomes clear when you realize a musician can manipulate the output through even rudimentary controls (e.g., seeking, volume).  A more substantive debate is whether such a player is really a universal instrument if all of the information is really in the input alone.  Interestingly, this parallels a similar debate about a disputed universal Turing machine!  Stephen Wolfram offered $25,000 for a proof that this machine was in fact universal.  It has only 2 states and 3 symbols and would be the simplest possible universal machine, although some have debated whether the universality comes from a specific input, which actually encodes the universality, rather than from the machine itself.