< previous next >

INTRODUCTION - THE NEED FOR SMARTER SEARCH ENGINES

As of early 2002, there were just over two billion web pages listed in the Google search engine index, widely taken to be the most comprehensive. No one knows how many more web pages there are on the Internet, or the total number of documents available over the public network, but there is no question that the number is enormous and growing quickly. Every one of those web pages has come into existence within the past ten years. There are web sites covering every conceivable topic at every level of detail and expertise, and information ranging from numerical tables to personal diaries to public discussions. Never before have so many people had access to so much diverse information.

Even as the early publicity surrounding the Internet has died down, the network itself has continued to expand at a fantastic rate, to the point where the quantity of information available over public networks is starting to exceed our ability to search it. Search engines have been in existence for many decades, but until recently they have been specialized tools for use by experts, designed to search modest, static, well-indexed, well-defined data collections. Today's search engines have to cope with rapidly changing, heterogenous data collections that are orders of magnitude larger than ever before. They also have to remain simple enough for average and novice users to use. While computer hardware has kept up with these demands - we can still search the web in the blink of an eye - our search algorithms have not. As any Web user knows, getting reliable, relevant results for an online search is often difficult.

For all their problems, online search engines have come a long way. Sites like Google are pioneering the use of sophisticated techniques to help distinguish content from drivel, and the arms race between search engines and the marketers who want to manipulate them has spurred innovation. But the challenge of finding relevant content online remains. Because of the sheer number of documents available, we can find interesting and relevant results for any search query at all. The problem is that those results are likely to be hidden in a mass of semi-relevant and irrelevant information, with no easy way to distinguish the good from the bad.

Precision, Ranking, and Recall - the Holy Trinity

In talking about search engines and how to improve them, it helps to remember what distinguishes a useful search from a fruitless one. To be truly useful, there are generally three things we want from a search engine:

We want it to give us all of the relevant information available on our topic.
We want it to give us only information that is relevant to our search
We want the information ordered in some meaningful way, so that we see the most relevant results first.

The first of these criteria - getting all of the relevant information available - is called recall. Without good recall, we have no guarantee that valid, interesting results won't be left out of our result set. We want the rate of false negatives - relevant results that we never see - to be as low as possible.

The second criterion - the proportion of documents in our result set that is relevant to our search - is called precision. With too little precision, our useful results get diluted by irrelevancies, and we are left with the task of sifting through a large set of documents to find what we want. High precision means the lowest possible rate of false positives.

There is an inevitable tradeoff between precision and recall. Search results generally lie on a continuum of relevancy, so there is no distinct place where relevant results stop and extraneous ones begin. The wider we cast our net, the less precise our result set becomes. This is why the third criterion, ranking, is so important. Ranking has to do with whether the result set is ordered in a way that matches our intuitive understanding of what is more and what is less relevant. Of course the concept of 'relevance' depends heavily on our own immediate needs, our interests, and the context of our search. In an ideal world, search engines would learn our individual preferences so well that they could fine-tune any search we made based on our past expressed interests and pecadilloes. In the real world, a useful ranking is anything that does a reasonable job distinguishing between strong and weak results.

The Platonic Search Engine

Building on these three criteria of precision, ranking and recall, it is not hard to envision what an ideal search engine might be like:

Scope: The ideal engine would be able to search every document on the Internet
Speed: Results would be available immediately
Currency: All the information would be kept completely up-to-date
Recall: We could always find every document relevant to our query
Precision: There would be no irrelevant documents in our result set
Ranking: The most relevant results would come first, and the ones furthest afield would come last

Of course, our mundane search engines have a way to go before reaching the Platonic ideal. What will it take to bridge the gap?

For the first three items in the list - scope, speed, and currency - it's possible to make major improvements by throwing resources at the problem. Search engines can always be made more comprehensive by adding content, they can always be made faster with better hardware and programming, and they can always be made more current through frequent updates and regular purging of outdated information.

Improving our trinity of precision, ranking and recall, however, requires more than brute force. In the following pages, we will describe one promising approach, called latent semantic indexing, that lets us make improvements in all three categories. LSI was first developed at Bellcore in the late 1980's, and is the object of active research, but is surprisingly little-known outside the information retrieval community. But before we can talk about LSI, we need to talk a little more about how search engines do what they do.

< previous next >