INSIDE THE MIND OF A SEARCH ENGINE

Taking Things Literally

If I handed you a stack of newspapers and magazines and asked you to pick out all of the articles having to do with French Impressionism, it is very unlikely that you would pore over each article word-by-word, looking for the exact phrase. Instead, you would probably flip through each publication, skimming the headlines for articles that might have to do with art or history, and then reading through the ones you found to see if you could find a connection.

If, however, I handed you a stack of articles from a highly technical mathematical journal and asked you to show me everything to do with n-dimensional manifolds, the chances are high (unless you are a mathematician) that you would have to go through each article line-by-line, looking for the phrase "n-dimensional manifold" to appear in a sea of jargon and equations.

The two searches would generate very different results. In the first example, you would probably be done much faster. You might miss a few instances of the phrase French Impressionism because they occurred in an unlikely article - perhaps a mention of a business figure's being related to Claude Monet - but you might also find a number of articles that were very relevant to the search phrase French Impressionism, even though they didn't contain the actual words: articles about a Renoir exhibition, or visiting the museum at Giverny, or the Salon des Refusés.

With the math articles, you would probably find every instance of the exact phrase n-dimensional manifold, given strong coffee and a good pair of eyeglasses. But unless you knew something about higher mathematics, it is very unlikely that you would pick out articles about topology that did not contain the search phrase, even though a mathematician might find those articles very relevant.

These two searches represent two opposite ways of searching a document collection. The first is a conceptual search, based on a higher-level understanding of the query and the search space, including all kinds of contextual knowledge and assumptions about how newspaper articles are structured, how the headline relates to the contents of an article, and what kinds of topics are likely to show up in a given publication.

The second is a purely mechanical search, based on an exhaustive comparison between a certain set of words and a much larger set of documents, to find where the first appear in the second. It is not hard to see how this process could be made completely automatic: it requires no understanding of either the search query or the document collection, just time and patience.
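The mechanical search described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine is built: the function name find_phrase and the two sample articles are invented for the example, and a real engine would consult a prebuilt index rather than rescan every document for each query.

```python
def find_phrase(documents, phrase):
    """Return the titles of documents whose text contains the exact phrase."""
    needle = phrase.lower()
    matches = []
    for title, text in documents.items():
        # A plain substring test: the phrase is compared against the text
        # with no understanding of either the query or the document.
        if needle in text.lower():
            matches.append(title)
    return matches


# Hypothetical two-article collection, invented for illustration.
articles = {
    "Topology notes": "We consider a compact n-dimensional manifold with boundary.",
    "Knot theory survey": "Knots are embeddings of a circle in three-dimensional space.",
}

print(find_phrase(articles, "n-dimensional manifold"))
```

Note that the knot theory article is passed over, even though a mathematician might consider it relevant: it simply does not contain the phrase.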

Of course, computers are perfect for doing rote tasks like this. Human beings can never take a purely mechanical approach to a text search problem, because human beings can't help but notice things. Even someone looking through technical literature in a foreign language will begin to recognize patterns and clues to help guide them in selecting candidate articles, and start to form ideas about the context and meaning of the search. But computers know nothing about context, and excel at performing repetitive tasks quickly. This rote method of searching is how search engines work.

Every full-text search engine, no matter how complex, finds its results using just such a mechanical method of exhaustive search. While the techniques it uses to rank the results may be very fancy indeed (Google is a good example of innovation in choosing a system for ranking), the actual search is based entirely on keywords, with no higher-level understanding of the query or any of the documents being searched.

John Henry Revisited

Of course, while it is nice to have repetitive things automated, it is also nice to have our search agent understand what it is doing. We want a search agent who can behave like a librarian, but on a massive scale, bringing us relevant documents we didn't even know to look for. The question is whether we can augment the exhaustiveness of a mechanical keyword search with some kind of conceptual search that looks at the meaning of each document, not just whether a particular word or phrase appears in it.

If I am searching for information on the effects of the naval blockade on the economy of the Confederacy during the Civil War, chances are high that a number of documents pertinent to that topic will not contain every one of those keywords, or even a single one of them. A discussion of cotton production in Georgia during the period 1860-1870 might be extremely revealing and useful to me, but if it does not mention the Civil War or the naval blockade directly, a keyword search will never find it.

Many strategies have been tried to get around this 'dumb computer' problem. Some of these are simple measures designed to enhance a regular keyword search - for example, lists of synonyms for the search engine to try in addition to the search query, or fuzzy searches that tolerate bad spelling and different word forms. Others are ambitious exercises in artificial intelligence, using complex language models and search algorithms to mimic how we aggregate words and sentences into higher-level concepts.
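The simpler measures mentioned above, synonym lists and spelling-tolerant matching, might look something like the following Python sketch. Everything here is an assumption made for illustration: the tiny SYNONYMS table, the function names, the sample documents, and the similarity cutoff of 0.8 are all invented, and the fuzzy comparison relies on the standard library's difflib module rather than on whatever a production search engine would use.

```python
import difflib
import re

# Hypothetical synonym list; a real one would be far larger.
SYNONYMS = {"blockade": ["embargo", "siege"]}


def expand_query(terms):
    """Add any listed synonyms to the query terms."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded


def fuzzy_match(term, text, cutoff=0.8):
    """Return True if some word in the text is spelled much like the term."""
    words = re.findall(r"[a-z]+", text.lower())
    return bool(difflib.get_close_matches(term.lower(), words, n=1, cutoff=cutoff))


def search(documents, terms):
    """Return titles of documents matching any expanded term, loosely spelled."""
    expanded = expand_query(terms)
    return [
        title
        for title, text in documents.items()
        if any(fuzzy_match(term, text) for term in expanded)
    ]


# Hypothetical documents, invented for illustration.
documents = {
    "Trade report": "The embargo cut Southern exports sharply.",
    "Naval history": "Union ships enforced the blokade along the coast.",
    "Farm ledger": "Cotton production in Georgia fell during the decade.",
}

print(search(documents, ["blockade"]))
```

The synonym list catches the trade report and the tolerant spelling catches the naval history, but the farm ledger is still missed. Both measures widen the set of keywords being compared; neither looks at what a document is about.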

Unfortunately, these higher-level models perform very poorly. Despite years of trying, no one has been able to create artificial intelligence, or even artificial stupidity. And there is growing agreement that nothing short of an artificial intelligence program can consistently extract higher-level concepts from written human language, which has proven far more ambiguous and difficult to understand than any of the early pioneers of computing expected.

That leaves natural intelligence, and specifically expert human archivists, to do the complex work of organizing and tagging data to make a conceptual search possible.
