A computer competing on Jeopardy soundly defeated two of the best human players of all time in a two-game event. In this article I will attempt to explain how this was achieved, using as my main source an article published in AI Magazine by some of the lead engineers on the project. I’ll focus on the software algorithms which made this apparent “thinking” possible, and gloss over most of the hardware specifics and even the Jeopardy domain specifics. In some ways the hardware and the Jeopardy rules are arbitrary: within 10 years we’ll all be able to run a “Watson” on our cell phones and, as has already been done, Watson can be applied successfully to other non-Jeopardy QA tasks. At the end of this article, I’ll add my own opinion as to the significance of this achievement.
Watson is the product of IBM’s DeepQA project, whose goal is to develop a computer system which excels at the classic AI problem of question answering (QA). As with any complicated engineering system, one of the best ways to learn how it works is with a high-level diagram:
Two striking attributes of this system are clear from the diagram. The first is that Watson does a two-pass search: it starts with a broad “primary search”, and then performs a more detailed “evidence scoring” search. I’ll talk more about these below. The second is that it has a parallel path (the gray boxes) for breaking up a single question into multiple ones, because a question can have multiple parts and, due to ambiguities in the language, several potential meanings. Watson will evaluate all of the resulting questions in parallel and compare the results. Let’s go into more detail on each of the steps.
Before the first question can even be answered, Watson must acquire its knowledge. This is an offline process which is done before, and independently of, the actual question-answering process. The resulting knowledge base contains many sources of structured (e.g. a database or triplestore) and unstructured (e.g. text passages) data.
. . . the sources for Watson include a wide range of encyclopedias, dictionaries, thesauri, newswire articles, literary works, . . . dbPedia, WordNet (Miller 1995), and the Yago ontology. (Ferrucci 2010, p. 69)
Watson will access this content during the real-time answering process, within the Hypothesis Generation and the Hypothesis and Evidence Scoring steps which are described below.
When a human is asked a question, deriving its semantic meaning (what the question is actually asking) is usually the easy part, but for AI, understanding even simple questions is a formidable task. Question analysis is where Watson determines how many parts or meanings are embedded in the question. Later in the process, Watson will analyze the resulting questions separately (the gray boxes in the above diagram) and compare the results. In this stage, Watson also determines the expected lexical answer type (LAT).
. . . a lexical answer type is a word or noun phrase in the question that specifies the type of the answer without any attempt to understand its semantics. Determining whether or not a candidate answer can be considered an instance of the LAT is an important kind of scoring and a common source of critical errors. (Ferrucci 2010, p. 70)
The next step of the process is to use the data from the question analysis step to generate candidate answers (hypotheses). Watson does this by executing a primary search across its answer sources (described above), a process known as information retrieval, of which Internet search engines are a common example.
A variety of search techniques are used, including the use of multiple text search engines with different underlying approaches (for example, Indri and Lucene), document search as well as passage search, knowledge base search using SPARQL . . . . (Ferrucci 2010, p. 71)
These search results are used to generate candidate answers, which will later be scored and compared. The hypotheses are created in various ways depending on the type of data they came from. For example, an answer from a text passage (unstructured data) might come from named entity recognition techniques, while an answer from a knowledge base (structured data) might be the exact result of the query.
If the correct answer(s) are not generated at this stage as a candidate, the system has no hope of answering the question. This step therefore significantly favors recall over precision, with the expectation that the rest of the processing pipeline will tease out the correct answer, even if the set of candidates is quite large. (Ferrucci 2010, p. 72)
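To make the recall-versus-precision trade-off in the quote above concrete, here is a minimal Python sketch. The function name and the toy candidate lists are illustrative assumptions, not anything taken from the DeepQA paper.

```python
def precision_recall(candidates, correct):
    """Return (precision, recall) of a candidate set against the correct answers."""
    candidates, correct = set(candidates), set(correct)
    hits = candidates & correct
    precision = len(hits) / len(candidates) if candidates else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    return precision, recall


# Hypothetical toy data: one correct answer and two candidate sets.
correct = {"Chicago"}
narrow = ["Toronto"]                               # small, "precise-looking" set
broad = ["Toronto", "Chicago", "Boston", "Denver"]  # larger, noisier set

print(precision_recall(narrow, correct))  # (0.0, 0.0): the answer is already lost
print(precision_recall(broad, correct))   # (0.25, 1.0): noisy, but recoverable
```

The broad set is mostly wrong, yet it is the only one from which later stages can still pick the correct answer, which is why this step tolerates a large candidate list.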
Watson is configured to return about 250 candidate answers during this stage, which are then trimmed down to about 100 by soft filtering. Soft filters are simpler and cheaper to run than the intensive evidence scoring which follows. A basic soft filter estimates the likelihood that a candidate answer is an instance of the LAT determined in the question analysis stage.
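A soft filter of this kind might be sketched as follows. This is only an illustration under stated assumptions: the type lexicon, the scores 1.0 / 0.3 / 0.0, the threshold, and the function names are all hypothetical, and the paper does not describe Watson's actual filter at this level of detail.

```python
def lat_likelihood(candidate, lat, type_lexicon):
    """Cheap guess at how likely the candidate is an instance of the LAT.

    type_lexicon maps a candidate string to the set of types it is known
    to have. The numeric scores are arbitrary, hypothetical values.
    """
    known_types = type_lexicon.get(candidate)
    if known_types is None:
        return 0.3  # unknown candidate: keep it in play with a low score
    return 1.0 if lat in known_types else 0.0


def soft_filter(candidates, lat, type_lexicon, threshold=0.2, keep=100):
    """Drop unlikely candidates, then keep at most `keep` of the best ones."""
    scored = [(lat_likelihood(c, lat, type_lexicon), c) for c in candidates]
    survivors = [(s, c) for s, c in scored if s >= threshold]
    # sorted() is stable, so ties keep their primary-search order.
    survivors = sorted(survivors, key=lambda pair: pair[0], reverse=True)
    return [c for _, c in survivors[:keep]]


# Hypothetical lexicon and candidates for a question whose LAT is "city".
lexicon = {"Chicago": {"city"}, "Lake Michigan": {"lake"}, "Toronto": {"city"}}
candidates = ["Lake Michigan", "Chicago", "O'Hare", "Toronto"]
print(soft_filter(candidates, "city", lexicon))
# ['Chicago', 'Toronto', "O'Hare"]
```

The point of the design is cost: each candidate needs only a dictionary lookup here, so the expensive evidence scoring runs on a much shorter list.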
Hypothesis and Evidence Scoring
In this step, Watson performs new searches which look for evidence of the generated hypotheses being the correct answer. For example:
One particularly effective technique is passage search where the candidate answer is added as a required term to the primary search query derived from the question. This will retrieve passages that contain the candidate answer used in the context of the original question terms. (Ferrucci 2010, p. 72)
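The passage search described in the quote above can be sketched in a few lines of Python. This is a toy in-memory version, assuming simple word overlap as the ranking function; Watson itself uses full text search engines, and the function names and example passages here are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def evidence_passages(question, candidate, passages, top_k=3):
    """Return passages that contain the candidate, ranked by question-term overlap."""
    question_terms = tokenize(question)
    candidate_terms = tokenize(candidate)
    scored = []
    for passage in passages:
        passage_terms = tokenize(passage)
        # The candidate answer is a *required* term: skip passages without it.
        if not candidate_terms <= passage_terms:
            continue
        overlap = len(question_terms & passage_terms)
        scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


# Hypothetical mini-corpus and question.
passages = [
    "Chicago is served by O'Hare International Airport.",
    "Toronto is the largest city in Canada.",
    "Chicago sits on the shore of Lake Michigan.",
]
question = "Its largest airport is named O'Hare"

print(evidence_passages(question, "Chicago", passages))
print(evidence_passages(question, "Toronto", passages))
```

For the candidate "Chicago" the airport passage ranks first because it shares the most terms with the question, while "Toronto" only retrieves a passage with weak overlap. That difference is the kind of evidence the scorers work with.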
The candidate answers and their corresponding evidence are then sent to various scorers. Watson uses about 50 different scorers, which work in parallel; their results are compared at the end.
These scorers consider things like the degree of match between a passage’s predicate-argument structure and the question, passage source reliability, geospatial location, temporal relationships, taxonomic classification, the lexical and semantic relations the candidate is known to participate in, the candidate’s correlation with question terms, its popularity (or obscurity), its aliases, and so on. (Ferrucci 2010, p. 72)
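As a rough illustration of independent scorers running in parallel, here is a small Python sketch with just two of them. The scorer functions, the reliability table, and the evidence dictionary layout are assumptions made for this example, and a thread pool stands in for whatever parallel infrastructure Watson actually uses.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical reliability values per source type.
RELIABILITY = {"encyclopedia": 0.9, "blog": 0.4}


def term_overlap(question, candidate, evidence):
    """Fraction of question words that also appear in the evidence text."""
    q_terms = set(question.lower().split())
    p_terms = set(evidence["text"].lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def source_reliability(question, candidate, evidence):
    """Look up how much the evidence source is trusted (0.5 if unknown)."""
    return RELIABILITY.get(evidence["source"], 0.5)


SCORERS = {"term_overlap": term_overlap, "source_reliability": source_reliability}


def score_evidence(question, candidate, evidence):
    """Run every scorer concurrently and collect one feature per scorer."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(scorer, question, candidate, evidence)
            for name, scorer in SCORERS.items()
        }
        return {name: future.result() for name, future in futures.items()}


evidence = {"text": "chicago airport named o'hare", "source": "encyclopedia"}
print(score_evidence("largest airport named o'hare", "Chicago", evidence))
# {'term_overlap': 0.75, 'source_reliability': 0.9}
```

Each scorer sees the same inputs and returns a single number, so adding another scorer only means adding an entry to the `SCORERS` dictionary.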
Final Merging and Ranking
This final processing stage first merges answers which are deemed equivalent, using basic matching techniques as well as more complicated coreference resolution algorithms. Scores from equivalent answers are then combined. Finally, Watson uses machine learning techniques to assign each merged answer a confidence level, an estimate of how likely it is to be correct. If its confidence level is high enough, the highest-ranking hypothesis is presented as the answer.
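The merge, combine, and rank sequence could look roughly like the Python sketch below. Several things are assumptions here: string normalization stands in for real coreference resolution, taking the maximum is just one possible way to combine scores, and a logistic function with hand-picked weights stands in for the trained model, since the source only says "machine learning techniques".

```python
import math
from collections import defaultdict


def normalize(answer):
    """Crude equivalence key: lowercase, drop periods, collapse whitespace."""
    return " ".join(answer.lower().replace(".", "").split())


def merge_and_rank(scored_answers, weights, bias=0.0):
    """Merge equivalent answers, combine their scores, and rank by confidence.

    scored_answers: list of (answer, {feature_name: non-negative score}).
    weights, bias: hypothetical parameters; a real system would learn them.
    """
    merged = defaultdict(lambda: defaultdict(float))
    surface = {}
    for answer, features in scored_answers:
        key = normalize(answer)
        surface.setdefault(key, answer)  # remember the first surface form
        for name, value in features.items():
            merged[key][name] = max(merged[key][name], value)

    ranked = []
    for key, features in merged.items():
        z = bias + sum(weights.get(n, 0.0) * v for n, v in features.items())
        confidence = 1.0 / (1.0 + math.exp(-z))  # logistic squashing to (0, 1)
        ranked.append((surface[key], confidence))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


def answer_or_pass(ranked, threshold=0.5):
    """Return the top answer if it is confident enough, otherwise None."""
    if ranked and ranked[0][1] >= threshold:
        return ranked[0][0]
    return None


scored = [
    ("J.F.K.", {"overlap": 0.8}),
    ("JFK", {"overlap": 0.6, "type": 1.0}),
    ("Nixon", {"overlap": 0.2}),
]
ranked = merge_and_rank(scored, weights={"overlap": 2.0, "type": 1.0}, bias=-2.0)
print(ranked)
print(answer_or_pass(ranked))
```

Here "J.F.K." and "JFK" collapse into one hypothesis that benefits from the evidence found for both spellings, and the threshold in `answer_or_pass` models the decision of whether to answer at all.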
There is much debate over the significance of this achievement, even within the AI community. I personally feel it is very significant that a task we once thought reserved for human thought has now been shown possible for machines. It’s easy to forget that before Deep Blue defeated Kasparov in 1997, that task was also thought to be impossible, and now chess programs running on cell phones can beat grandmasters. Before we know it, deep QA software will be running on cell phones as well.
This also sets the stage for what is considered the hallmark test of AI, the Turing Test. Again, this task still seems impossible, but perhaps a bit less so after Watson. After the Turing Test is passed, people will debate whether it is “true” intelligence, but if you cannot measure the difference between computer intelligence and human intelligence, it doesn’t seem to be a very valuable distinction.