Using internal rank metrics in external search engines
…and how the hidden web can be revealed
In the current flame war between Google and Bing, there is a good amount of pie-throwing going on around the internet. But in the process, some very interesting tech stuff has surfaced as well. We’ve gotten a glimpse of one of the many components Bing uses as a measure of relevance, namely click-stream data from real users.
Disregarding the Google vs. Bing dispute, the use of click-stream data (a.k.a. browser usage statistics) is very interesting from a search engine perspective. This is because the relevance of search queries is often described as a combination of the precision of the results and the recall of the query, and click-stream data can help increase both.
Here’s one of the reasons why.
If you spend a lot of time on a particular web site, you have probably used its search engine. Quite often, you can choose to search through the site using Google or Bing instead. But a reason why it makes sense to use the site’s own search engine is that, in theory, that site can always build a better search experience than anyone else ever could. It can rank the documents using important internal metrics such as upvotes (Reddit), social distance (LinkedIn), reputation (Stack Overflow), and retweets (Twitter). The list goes on.
Now. If you happen to be able to watch what users do on these particular web sites, you would in fact be able to lift some of that domain-specific data (upvotes, distance, reputation, and retweets) into your own machinery. Perhaps not directly, but at least indirectly.
If someone searches on Reddit for a certain comment thread, Reddit’s search engine will (supposedly) produce the best matches, ranked according to how many upvotes and comments each thread has accumulated. Two things are of interest:
- The user is quite likely to visit a page that was returned high up in the results.
- The URL of the results page is likely to contain something like “search=TERMS” or “query=TERMS”.
If you happen to collect browser usage statistics, you can conclude that the documents the user clicked on are highly relevant according to the site’s internal metrics – whatever those might be. Better yet, because you can analyze the URL produced by the search form, you also know the particular query terms the user typed to find these documents. And you can adjust your search engine accordingly.
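As a rough sketch of the idea, here is what extracting those (query, clicked document) pairs from click-stream events could look like in Python. The event format and the list of query parameter names are assumptions for illustration, not any particular vendor’s pipeline:

```python
from urllib.parse import urlparse, parse_qs

# Query-string parameters that commonly carry the user's search terms.
# (Hypothetical list -- a real system would keep per-site rules.)
QUERY_PARAMS = ("q", "query", "search", "terms")

def extract_query(search_url):
    """Return the search terms embedded in a results-page URL, if any."""
    params = parse_qs(urlparse(search_url).query)
    for name in QUERY_PARAMS:
        if name in params:
            return params[name][0]
    return None

def relevance_pairs(click_stream):
    """Yield (query, clicked_url) pairs from raw click-stream events.

    Each event is assumed to be a (referrer_url, clicked_url) tuple:
    the search-results page the user came from, and the document
    they clicked on.
    """
    for referrer_url, clicked_url in click_stream:
        query = extract_query(referrer_url)
        if query:
            yield query, clicked_url

# Example: a click on a Reddit search result
events = [("https://www.reddit.com/search?q=homemade+pizza",
           "https://www.reddit.com/r/Pizza/comments/abc123/")]
print(list(relevance_pairs(events)))
# [('homemade pizza', 'https://www.reddit.com/r/Pizza/comments/abc123/')]
```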
Simply put, you have now leveraged a web site’s internal data in your external search engine, adding to the precision of the results using a previously unreachable metric.
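One way to picture how that signal could feed back into ranking: count clicks per (query, document) pair and blend them into your own score. The scoring function and the weight here are made-up knobs, just to show the shape of the idea:

```python
from collections import Counter

def click_boosts(pairs):
    """Aggregate click counts per (query, clicked_url) pair.

    The counts act as a proxy for the site's internal ranking:
    documents the site ranked highly (and users then clicked)
    bubble up in your engine too.
    """
    return Counter(pairs)

def boosted_score(base_score, query, url, boosts, weight=0.1):
    """Blend the engine's own score with the click-derived signal.

    `base_score` and `weight` are hypothetical; a real engine
    combines many such signals.
    """
    return base_score + weight * boosts[(query, url)]
```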
Additionally, this tactic will open up more of the “invisible web”. For example, it is not uncommon for government sites to contain large amounts of data, but the only way to get to it is by running queries through an often poorly designed search form. Links into the data sets are rarely provided, so the actual data remains hidden from the search engines.
Until now, that is: click-stream data allows the search engines to discover it by piggybacking on users’ browsing sessions, thus adding to the recall of the queries.
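In crawler terms, the piggybacking amounts to treating clicked URLs as candidates for the crawl frontier. A minimal sketch, assuming the same (referrer, clicked URL) events as above and a set standing in for the crawler’s existing index:

```python
def discover_hidden_urls(click_stream, known_urls):
    """Collect clicked URLs the crawler has never seen before.

    Pages reachable only through a site's own search form never show
    up as links on the open web, so the click stream is the crawler's
    first chance to learn that they exist. `known_urls` stands in for
    the crawler's existing index/frontier (an assumption of this sketch).
    """
    frontier = []
    for _referrer_url, clicked_url in click_stream:
        if clicked_url not in known_urls:
            frontier.append(clicked_url)
            known_urls.add(clicked_url)
    return frontier
```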
It’s all about the click stream. That’s why the controversy is really about people being blissfully unaware of how the modern search giants operate. There’s no such thing as a free lunch, and every time we use Google or Bing we inadvertently give something back in the way of tuning their ranking algorithms (well, that and clicking ads, of course). It’s actually quite beautiful, and I think it’s a shame the Bing team has received such huge amounts of flak for just doing their job.