INTRODUCTION - THE NEED FOR SMARTER SEARCH ENGINES
As of early 2002, there were just over two billion web pages listed
in the Google search engine index, widely taken to be the most comprehensive.
No one knows how many more web pages there are on the Internet,
or the total number of documents available over the public network,
but there is no question that the number is enormous and growing
quickly. Every one of those web pages has come into existence within
the past ten years. There are web sites covering every conceivable
topic at every level of detail and expertise, and information ranging
from numerical tables to personal diaries to public discussions.
Never before have so many people had access to so much diverse information.
Even as the early publicity surrounding the Internet has died down,
the network itself has continued to expand at a fantastic rate,
to the point where the quantity of information available over public
networks is starting to exceed our ability to search it. Search
engines have been in existence for many decades, but until recently
they have been specialized tools for use by experts, designed to
search modest, static, well-indexed, well-defined data collections.
Today's search engines have to cope with rapidly changing, heterogenous
data collections that are orders of magnitude larger than ever before.
They also have to remain simple enough for average and novice users
to use. While computer hardware has kept up with these demands -
we can still search the web in the blink of an eye - our search
algorithms have not. As any Web user knows, getting reliable, relevant
results for an online search is often difficult.
For all their problems, online search engines have come a long
way. Sites like Google are pioneering the use of sophisticated techniques
to help distinguish content from drivel, and the arms race between
search engines and the marketers who want to manipulate them has
spurred innovation. But the challenge of finding relevant content
online remains. Because of the sheer number of documents available,
we can find interesting and relevant results for any search query
at all. The problem is that those results are likely to be hidden
in a mass of semi-relevant and irrelevant information, with no easy
way to distinguish the good from the bad.
Precision, Ranking, and Recall - the Holy Trinity
In talking about search engines and how to improve them, it helps
to remember what distinguishes a useful search from a fruitless
one. To be truly useful, there are generally three things we want
from a search engine:
- We want it to give us all of the relevant information available
on our topic.
- We want it to give us only information that is relevant to our
- We want the information ordered in some meaningful way, so that
we see the most relevant results first.
The first of these criteria - getting all of the relevant information
available - is called recall. Without
good recall, we have no guarantee that valid, interesting results
won't be left out of our result set. We want the rate of false negatives
- relevant results that we never see - to be as low as possible.
The second criterion - the proportion of documents in our result
set that is relevant to our search - is called precision.
With too little precision, our useful results get diluted by irrelevancies,
and we are left with the task of sifting through a large set of
documents to find what we want. High precision means the lowest
possible rate of false positives.
There is an inevitable tradeoff between precision and recall. Search
results generally lie on a continuum of relevancy, so there is no
distinct place where relevant results stop and extraneous ones begin.
The wider we cast our net, the less precise our result set becomes.
This is why the third criterion, ranking,
is so important. Ranking has to do with whether the result set is
ordered in a way that matches our intuitive understanding of what
is more and what is less relevant. Of course the concept of 'relevance'
depends heavily on our own immediate needs, our interests, and the
context of our search. In an ideal world, search engines would learn
our individual preferences so well that they could fine-tune any
search we made based on our past expressed interests and pecadilloes.
In the real world, a useful ranking is anything that does a reasonable
job distinguishing between strong and weak results.
The Platonic Search Engine
Building on these three criteria of precision, ranking and recall,
it is not hard to envision what an ideal search engine might be
Scope: The ideal engine would be able to search every
document on the Internet
Speed: Results would be available immediately
Currency: All the information would be kept completely
Recall: We could always find every document relevant
to our query
Precision: There would be no irrelevant documents in
our result set
Ranking: The most relevant results would come first,
and the ones furthest afield would come last
Of course, our mundane search engines have a way to go before reaching
the Platonic ideal. What will it take to bridge the gap?
For the first three items in the list - scope, speed, and currency
- it's possible to make major improvements by throwing resources
at the problem. Search engines can always be made more comprehensive
by adding content, they can always be made faster with better hardware
and programming, and they can always be made more current through
frequent updates and regular purging of outdated information.
Improving our trinity of precision, ranking and recall, however,
requires more than brute force. In the following pages, we will
describe one promising approach, called latent
semantic indexing, that lets us make improvements in all
three categories. LSI was first developed at Bellcore in the late
1980's, and is the object of active research, but is surprisingly
little-known outside the information retrieval community. But before
we can talk about LSI, we need to talk a little more about how search
engines do what they do.