INSIDE THE MIND OF A SEARCH ENGINE
Taking Things Literally
If I handed you a stack of newspapers and magazines and asked you
to pick out all of the articles having to do with French
Impressionism, it is very unlikely that you would pore over
each article word-by-word, looking for the exact phrase. Instead,
you would probably flip through each publication, skimming the headlines
for articles that might have to do with art or history, and then
reading through the ones you found to see if you could find a connection.
If, however, I handed you a stack of articles from a highly technical
mathematical journal and asked you to show me everything to do with
n-dimensional manifolds, the chances
are high (unless you are a mathematician) that you would have to
go through each article line-by-line, looking for the phrase "n-dimensional
manifold" to appear in a sea of jargon and equations.
The two searches would generate very different results. In the
first example, you would probably be done much faster. You might
miss a few instances of the phrase French
Impressionism because they occurred in an unlikely article
- perhaps a mention of a business figure's being related to Claude
Monet - but you might also find a number of articles that were very
relevant to the search phrase French Impressionism,
even though they didn't contain the actual words: articles about
a Renoir exhibition, or visiting the museum at Giverny, or the Salon
With the math articles, you would probably find every instance
of the exact phrase n-dimensional manifold,
given strong coffee and a good pair of eyeglasses. But unless you
knew something about higher mathematics, it is very unlikely that
you would pick out articles about topology
that did not contain the search phrase, even though a mathematician
might find those articles very relevant.
These two searches represent two opposite ways of searching a document
collection. The first is a conceptual search, based on a higher-level
understanding of the query and the search
space, including all kinds of contextual knowledge and assumptions
about how newspaper articles are structured, how the headline relates
to the contents of an article, and what kinds of topics are likely
to show up in a given publication.
The second is a purely mechanical search, based on an exhaustive
comparison between a certain set of words and a much larger set
of documents, to find where the first appear in the second. It is
not hard to see how this process could be made completely automatic:
it requires no understanding of either the search query or the document
collection, just time and patience.
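To make the mechanical search concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular search engine is built: the function name find_phrase and the two sample documents are hypothetical. It walks through every document word by word and records where the exact phrase appears.

```python
import re

def find_phrase(phrase, documents):
    """Return {document name: [word positions]} for each exact phrase match."""
    # Split into lowercase words, keeping hyphenated terms in one piece.
    target = re.findall(r"[\w-]+", phrase.lower())
    if not target:
        return {}
    hits = {}
    for name, text in documents.items():
        words = re.findall(r"[\w-]+", text.lower())
        # Exhaustive comparison: try the phrase at every word position.
        positions = [
            i for i in range(len(words) - len(target) + 1)
            if words[i:i + len(target)] == target
        ]
        if positions:
            hits[name] = positions
    return hits

# Hypothetical sample collection, invented for illustration.
documents = {
    "paper_a": "We study the n-dimensional manifold and its tangent bundle.",
    "paper_b": "Topology of smooth surfaces and their embeddings.",
}
result = find_phrase("n-dimensional manifold", documents)
print(result)
```

Note that the sketch finds the phrase in paper_a but has no way of knowing that paper_b, which is about topology, might also interest the reader. It compares strings and nothing more.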
Of course, computers are perfect for doing rote tasks like this.
Human beings can never take a purely mechanical approach to a text
search problem, because human beings can't help but notice things.
Even someone looking through technical literature in a foreign language
will begin to recognize patterns and clues to help guide them in
selecting candidate articles, and start to form ideas about the
context and meaning of the search. But computers know nothing about
context, and excel at performing repetitive tasks quickly. This
rote method of searching is how search engines work.
Every full-text search engine, no matter how complex, finds its
results using just such a mechanical method of exhaustive search.
While the techniques it uses to rank the results may be very fancy
indeed (Google is a good example of innovation in choosing a system
for ranking), the actual search is based entirely on keywords, with
no higher-level understanding of the query or any of the documents.
John Henry Revisited
Of course, while it is nice to have repetitive things automated,
it is also nice to have our search agent understand what it is doing.
We want a search agent who can behave like a librarian, but on a
massive scale, bringing us relevant documents we didn't even know
to look for. The question is, is it possible to augment the exhaustiveness
of a mechanical keyword search with some kind of a conceptual search
that looks at the meaning of each document, not just whether or
not a particular word or phrase appears in it? If I am searching
for information on the effects of the naval
blockade on the economy of the Confederacy during the Civil War,
chances are high that a number of documents pertinent to that topic
will not contain every one of those keywords, or even a single
one of them. A discussion of cotton production in Georgia during
the period 1860-1870 might be extremely revealing and useful to
me, but if it does not mention the Civil War or the naval blockade
directly, a keyword search will never find it.
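The blind spot can be shown with a small Python sketch. Everything in it is an assumption made for illustration: the function names tokenize and keyword_search are hypothetical, and the two one-sentence documents are invented stand-ins for real articles. The search is run twice, once requiring every keyword and once accepting any single keyword.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(query, documents, require_all=True):
    """Return names of documents containing all (or any) of the query words."""
    terms = tokenize(query)
    matches = []
    for name, text in documents.items():
        found = terms & tokenize(text)
        if (require_all and found == terms) or (not require_all and found):
            matches.append(name)
    return matches

# Hypothetical documents, invented for illustration.
documents = {
    "blockade_essay": "The naval blockade strangled the economy of the "
                      "Confederacy during the Civil War.",
    "cotton_report": "Cotton production in Georgia fell sharply between "
                     "1860 and 1870 as exports collapsed.",
}
query = "naval blockade economy Confederacy Civil War"

strict = keyword_search(query, documents)                     # every keyword
loose = keyword_search(query, documents, require_all=False)   # any keyword
print(strict, loose)
```

Under either rule the cotton report is never returned, because it shares no words at all with the query, however useful it might be to the person searching.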
Many strategies have been tried to get around this 'dumb computer'
problem. Some of these are simple measures designed to enhance a
regular keyword search - for example, lists of synonyms for the
search engine to try in addition to the search query, or fuzzy
searches that tolerate bad spelling and different word forms.
Others are ambitious exercises in artificial intelligence, using
complex language models and search algorithms to mimic how we aggregate
words and sentences into higher-level concepts.
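The two simpler measures, synonym lists and fuzzy matching, can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a description of any real search engine: the SYNONYMS table, the function names expand_query and fuzzy_find, the 0.8 similarity cutoff, and the sample sentence are all hypothetical. The fuzzy comparison uses get_close_matches from Python's standard difflib module.

```python
import difflib
import re

# Hypothetical synonym table; a real one would have to be far larger.
SYNONYMS = {
    "blockade": {"embargo"},
    "war": {"conflict"},
}

def expand_query(terms, synonyms=SYNONYMS):
    """Return the query terms together with any listed synonyms."""
    expanded = set()
    for term in terms:
        term = term.lower()
        expanded.add(term)
        expanded |= synonyms.get(term, set())
    return expanded

def fuzzy_find(terms, text, cutoff=0.8):
    """Return words in the text that closely resemble any of the terms."""
    vocabulary = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
    found = set()
    for term in terms:
        # Tolerates misspellings and word forms such as plurals.
        found.update(
            difflib.get_close_matches(term, vocabulary, n=5, cutoff=cutoff)
        )
    return found

# Hypothetical sample sentence, invented for illustration.
text = ("The embargo crippled Confederate trade while blockades "
        "of Southern ports continued.")

expanded = expand_query(["blockade"])
matches = fuzzy_find(expanded, text)
misspelled = fuzzy_find(["blokade"], text)
print(expanded, matches, misspelled)
```

Both tricks widen the net, but they still compare strings of letters. Neither would find a document about cotton production in Georgia unless someone had thought to list cotton as a synonym for blockade.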
Unfortunately, these higher-level models perform poorly in practice. Despite
years of trying, no one has been able to create artificial intelligence,
or even artificial stupidity. And there is growing agreement that
nothing short of an artificial intelligence program can consistently
extract higher-level concepts from written human language, which
has proven far more ambiguous and difficult to understand than any
of the early pioneers of computing expected.
That leaves natural intelligence, and specifically expert human
archivists, to do the complex work of organizing and tagging data
to make a conceptual search possible.