Search Nuggets » relevance

How Elasticsearch calculates significant terms

André Lynum — Wed, 10 Jun 2015 11:02:28 +0000

The magic of the “uncommonly common”.

Many of you who use Elasticsearch may have used the significant terms aggregation and been intrigued by this example of fast and simple word analysis. The details and mechanisms behind this aggregation tend to be kept rather vague, however, and couched in terms like “magic” and the “uncommonly common”. This is unfortunate, since developing informative analyses based on this aggregation requires some adaptation to the underlying documents, especially in the face of less structured text. Significant terms seems especially susceptible to garbage in, garbage out effects, and developing a robust analysis requires some understanding of the underlying data. In this blog post we will take a look at the default relevance score used by the significant terms aggregation, the mysteriously named JLH score, as it is implemented in Elasticsearch 1.5. This score was developed especially for this aggregation, and experience shows that it tends to be the most effective one available in Elasticsearch at this point.

The JLH relevance scoring function is not given in the documentation. A quick dive into the code, however, turns up the following scoring function.

$$ JLH = \left\{\begin{matrix} (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} & p_{fore} - p_{back} > 0 \\ 0 & elsewhere \end{matrix}\right. $$

Here $p_{fore}$ is the frequency of the term in the foreground (or query) document set, while $p_{back}$ is the term frequency in the background document set, which by default is the whole index.
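To make the definition concrete, here is a minimal Python sketch of the piecewise formula. The function names and the document counts are made up for illustration, and treating "frequency" as the share of documents in a set that contain the term is an assumption, not something taken from the Elasticsearch source.

```python
def term_frequency(docs_with_term: int, total_docs: int) -> float:
    """Share of documents in a set that contain the term (assumed definition)."""
    return docs_with_term / total_docs


def jlh_score(p_fore: float, p_back: float) -> float:
    """JLH score following the piecewise formula: zero unless p_fore > p_back."""
    if p_back <= 0:
        raise ValueError("p_back must be positive")
    if p_fore - p_back <= 0:
        return 0.0
    return (p_fore - p_back) * (p_fore / p_back)


# Hypothetical counts: the term occurs in 50 of 1,000 foreground documents
# and in 1,000 of 1,000,000 background documents.
p_fore = term_frequency(50, 1_000)
p_back = term_frequency(1_000, 1_000_000)
score = jlh_score(p_fore, p_back)
print(p_fore, p_back, score)
```

A term that is no more frequent in the foreground than in the background gets a score of zero, whatever the exact numbers are.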

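Expanding the formula gives us the following, which is quadratic in $p_{fore}$.

$$ (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore} $$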

By keeping $p_{back}$ fixed and keeping in mind that both it and $p_{fore}$ are positive we get the following function plot. Note that $p_{back}$ is unnaturally large for illustration purposes.

On the face of it this looks bad for a scoring function. It can be undesirable that it changes sign, but more troublesome is the fact that this function is not monotonically increasing.

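The gradient of the function:

$$ \nabla JLH = \left( \frac{\partial JLH}{\partial p_{fore}}, \frac{\partial JLH}{\partial p_{back}} \right) = \left( \frac{2 p_{fore}}{p_{back}} - 1, \; -\frac{p_{fore}^2}{p_{back}^2} \right) $$

Setting the gradient to zero, we see by looking at the second coordinate that the JLH score does not have a minimum, but approaches one as $p_{fore}$ and $p_{back}$ approach zero, where the function is undefined. While the second coordinate never changes sign (it is negative for all $p_{fore} > 0$, so the score always falls as $p_{back}$ grows), the first coordinate shows us where the function is not increasing in $p_{fore}$.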

Fortunately the decreasing part of the function, where $p_{fore} < p_{back}/2$, lies in an area where $p_{fore} - p_{back} < 0$ and the JLH score is explicitly defined as zero. By the symmetry of the parabola around its minimum at $p_{fore} = p_{back}/2$, where the first coordinate of the gradient is zero, we also see that the entire area where the score is below zero, $0 < p_{fore} < p_{back}$, is in this region.

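With this it seems sensible to drop the linear term of the JLH score and use only the quadratic part, $p_{fore}^2 / p_{back}$. For terms whose foreground frequency is well above the background frequency this results in the same ranking, with a slightly less steep relative increase in score as $p_{fore}$ increases.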
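A quick way to get a feel for this simplification is to rank a handful of terms with both variants. The sketch below is only an illustration: the term names and their (foreground, background) frequencies are invented, and `quadratic_only` is just a label for the JLH score with the linear term dropped.

```python
def jlh(p_fore, p_back):
    """Full JLH score, zero when the term is not over-represented."""
    return (p_fore - p_back) * p_fore / p_back if p_fore > p_back else 0.0


def quadratic_only(p_fore, p_back):
    """JLH score with the linear term dropped."""
    return p_fore ** 2 / p_back


def ranking(terms, score):
    """Term names ordered from highest to lowest score."""
    return sorted(terms, key=lambda name: score(*terms[name]), reverse=True)


# Hypothetical terms mapped to (p_fore, p_back).
terms = {
    "rare_and_focused": (0.20, 0.001),
    "moderately_focused": (0.10, 0.01),
    "slightly_focused": (0.05, 0.02),
    "barely_focused": (0.04, 0.03),
}

jlh_ranking = ranking(terms, jlh)
quadratic_ranking = ranking(terms, quadratic_only)
print(jlh_ranking)
print(quadratic_ranking)
```

For these values both variants give the same order. When the foreground frequency is only barely above the background frequency the two can disagree: (0.30, 0.25) scores below (0.05, 0.02) with the full formula but above it with the quadratic part alone.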

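Looking at the level sets of the JLH score, there is a quadratic relationship between $p_{fore}$ and $p_{back}$. Solving for a fixed level $k$ we get:

$$ p_{fore} = \frac{p_{back} \pm \sqrt{p_{back}^2 + 4 k p_{back}}}{2} $$

where the negative root lies outside the area where the function is defined.
This is far easier to see in the simplified formula:

$$ \frac{p_{fore}^2}{p_{back}} = k \quad \Rightarrow \quad p_{fore} = \sqrt{k \, p_{back}} $$

An increase in $p_{back}$ must be offset by approximately a square root increase in $p_{fore}$ to retain the same score.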
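The level-set relationship can be checked numerically. The sketch below solves JLH = k for the foreground frequency using the positive root of the quadratic $p_{fore}^2 - p_{back} p_{fore} - k p_{back} = 0$, and compares it with the square-root approximation from the quadratic-only score. The level `k = 1.0` and the background frequencies are arbitrary choices for illustration.

```python
import math


def jlh(p_fore, p_back):
    """JLH score, zero when the term is not over-represented."""
    return (p_fore - p_back) * p_fore / p_back if p_fore > p_back else 0.0


def p_fore_on_level(k, p_back):
    """Positive root of p_fore^2 - p_back * p_fore - k * p_back = 0."""
    return (p_back + math.sqrt(p_back ** 2 + 4 * k * p_back)) / 2


def p_fore_on_level_simplified(k, p_back):
    """Level set of the quadratic-only score p_fore^2 / p_back = k."""
    return math.sqrt(k * p_back)


k = 1.0  # arbitrary score level
for p_back in (0.0001, 0.001, 0.01):
    exact = p_fore_on_level(k, p_back)
    approx = p_fore_on_level_simplified(k, p_back)
    # The last column recomputes the score to confirm we stay on the level set.
    print(p_back, exact, approx, jlh(exact, p_back))
```

Multiplying the background frequency by 100 multiplies the simplified foreground frequency by 10, which is the square-root trade-off described above.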

As we see, the score increases sharply as $p_{fore}$ increases, in a quadratic manner against $p_{back}$. As $p_{back}$ becomes small compared to $p_{fore}$, the growth goes from linear in $p_{fore}$ to squared.

Finally a 3D plot of the score function.

So what can we take away from all this? I think the main practical consideration is the squared relationship between $p_{fore}$ and $p_{back}$, which means that once there is a significant difference between the two, $p_{fore}$ will dominate the score ranking. The $p_{back}$ factor primarily makes the score sensitive when this factor is small, and for reasonably similar $p_{back}$ it is $p_{fore}$ that decides the ranking. There are some obvious consequences of this which would be interesting to explore in real data. First, you would like to have a large background document set if you want more fine-grained sensitivity to background frequency. Second, foreground frequencies can dominate the score to such an extent that peculiarities of the implementation may show up in the significant terms ranking, which we will look at in more detail as we try to apply the significant terms aggregation to single documents.

The results and visualizations in this blog post are also available as an iPython notebook.

Enterprise Search Optimization (ESO)

Christoffer Vig — Sat, 10 Jan 2015 11:45:54 +0000

So, you got your enterprise search engine, but still can’t find what you’re looking for? It’s time to stop your sobbing and learn to play the exciting game of Enterprise Search Optimization (ESO).

Enterprise search differs from web search in some fundamental ways. But there are also similarities. Since we all know how successful web search is, let’s see if there is something to learn by examining the differences.

	Web search	Enterprise search
Search Engine	Google, Bing, Baidu…	SharePoint, Elasticsearch, Solr, Virtualworks, Autonomy, GSA…
Sources	Web pages, web applications (++)	databases, file shares, intranet, web pages, email, SAP, CRM…
Content ambitions	everything	limited
Authority ranking	Pagerank	custom
Control of search engine	web search company	tech department, power users
Control of content	user	user
Writer/reader ratio	low	high

The most striking similarity is that both solutions involve content produced by a user.

The main differences are the search engine and who controls it, the different types of content sources, and the use of the pagerank algorithm.

Web content and web search

What makes web search so successful? Web search was revolutionized when Google introduced their web search using the pagerank algorithm. Pagerank uses the natural structure of the world wide web, and assigns high weight to pages with many incoming links. It rests on the assumption that pages with correct and important information will be used as references on other pages. Along with pagerank, a large number of other factors are used to drive relevancy: content quality, keywords, social media sharing, etc. Most of the details are not publicly available.
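The idea that incoming links carry weight can be sketched in a few lines of Python. This is the simplified textbook version of pagerank computed by repeated iteration, not what any web search company actually runs; the damping factor of 0.85, the iteration count and the tiny link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook pagerank by power iteration over a dict of page -> linked pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical mini web: "docs" has the most incoming links.
links = {
    "home": ["docs", "blog"],
    "blog": ["docs"],
    "docs": ["home"],
    "orphan": ["docs"],
}
ranks = pagerank(links)
for page, value in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(page, round(value, 3))
```

The page with the most incoming links ends up on top, and the page nobody links to ends up at the bottom.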

Content publishers on the world wide web can use the tricks of Search Engine Optimization (SEO) to make sure their content gets optimal visibility on the web. SEO is the art of combining knowledge of two things:

- how web search engines work

- what search terms people use.

Both of these areas involve a lot of guessing. Some information can be found in guides such as the Google Webmaster Guidelines, which explain what a webmaster can do to make sure her web sites are properly indexed. Parts of this read almost like instructions on how to write a nice school paper: “Create a useful, information-rich site, and write pages that clearly and accurately describe your content.”

By following these guidelines, you are helping web search engines understand your content.

Enterprise content and enterprise search

Enterprise search is a different story. Content is gathered from different sources, with varying degrees of structure, and mostly without links that could be used for pagerank.

Content publishers in an enterprise search solution are on their own, with no official guidelines describing the rules to follow to win top ranking on the intranet search. More often than not, nobody knows how the enterprise search engine really works. Compare this to the web search situation, and it should not come as a surprise if enterprise search sucks.

A solution to this dilemma requires taking a step back from the idea that enterprise search is a box that you can plug into your intranet and “there was search”.

Search tech guys often pride themselves on the abilities of their search engines, and would rather fix relevancy problems created by bad content by doing tricks on the technical side of things. On the other side of the story, content producers expect search to “just work”, and put all the responsibility on technology and the implementer.

Creating a great enterprise search solution requires cooperation between the makers of content and the makers of search solutions.

Content producers should know how their content will end up in search. They should know what factors affect findability. Search solutions should have documentation targeted towards the end user, which in the enterprise also might be a content producer.

ESO

We can define Enterprise Search Optimization (ESO) as the art of improving Enterprise Search. Where ESO has been applied, we should expect to find a well functioning search solution, where employees and content producers know how to create easily findable content.

Compared to SEO, enterprise search optimization is a simple procedure, involving little guesswork about how the search engine works. It is also a difficult procedure, since ESO needs to be individually tailored and optimized for the specific informational needs of each enterprise. To develop ESO guidelines, the search technicians need to sit down with the content producers and users to figure out the details of the information model and where the pain of missing information hurts the most.

ESO should result in a list of guidelines, or rules, similar to the lists of SEO. These rules can range from the simple and obvious, such as making sure documents have descriptive titles, correct dates and authors, to more complex ones involving consistent language use, metadata fields for categorization, etc. ESO rules should also explain how structure is imposed on data with less structure.

Recognizing authoritative content is solved in web search engines by using the pagerank algorithm. Enterprise search will rarely be able to use pagerank directly. Authority can often be determined by other means. This can range from simple facts like “This book is the company procedure bible” to “powerpoint is more important than word”.

Optimizing enterprise search

Enterprise search can suck a little less by applying a customized version of SEO.

Dictatorial control over recipe search results using elasticsearch and function_score

Christoffer Vig — Fri, 11 Jul 2014 06:02:06 +0000

Once the design for the seasonal recipes app started coming into place, we soon saw there was something fishy about the results. Elasticsearch and custom relevancy to the rescue!

Warning! Dynamic scripting has been disabled by default in elasticsearch version 1.4.3. Using the technique in this article now requires some extra steps. Details on the Elastic blog.

Our design specifies that results should be sorted by number of matching ingredients, and then by date, with the most recent recipes on top and recipes from 2008 at the bottom. The original attempt involved a simple “terms” query.

Investigating the results for July, in the garden, gave us recipes for jam at the top, with the count of matching ingredients being only 3 for the top hit. The list of ingredients in season for July in the garden is quite long, but all you need to know is that it contains “rips”, a little red sour berry, and “poteter”, that is potatoes.

You can find the queries used in this post at http://sense.qbox.io

For this query, we were somewhat surprised that none of the recipes on top contained potatoes. Using the highlight function, it is easy to see that the number of ingredients returned for the top hits should have been higher. Then it dawned upon us: TF-IDF! You’re messing up again! What we see is just the normal relevancy, promoting the least commonly used terms to the top. This works well for natural language queries, but that’s not really what we are doing here.

Luckily, elasticsearch doesn’t leave us stuck in a rut. We implemented a custom scoring function using the function score query.

For this scoring, we don’t need any points from the query terms, we just want to replace the default scoring with our custom one (boost_mode = replace).
The scoring function has two parts, one where we add up the number of ingredients, and one part to add some boost to the most recent posts.

The function to sum up term frequencies looks like this:

The tf() function returns the term frequency for a term in this field. There are a number of functions you can use to perform your own calculations based on index properties. The available functions are documented under “Text scoring in scripts” and in the scripting module documentation.

To the term frequency we add the date scoring:

We are using a linear function, but we could also have used gauss or exponential curves.

The “scale” parameter decides the point on the graph where the value specified in “decay” should be found. Setting it to 700 days allows the scoring to reach 0 for recipes dated in 2008, which was one of the requirements.
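To see how “scale” and “decay” interact, here is a small Python sketch of the linear decay formula as I read it in the function score documentation. The parameter names mirror the query parameters, distances are in days, and the decay value of 0.5 is an assumption (it is the documented default, and the post does not say whether it was changed).

```python
def linear_decay(distance, scale, decay=0.5, offset=0.0):
    """Linear decay: 1.0 at the origin, `decay` at `scale`, then down to 0.0."""
    s = scale / (1.0 - decay)
    return max((s - max(0.0, abs(distance) - offset)) / s, 0.0)


# Distances in days from the origin (the newest recipes), with scale = 700 days.
for days in (0, 350, 700, 1400, 2200):
    print(days, linear_decay(days, scale=700))
```

With these assumed values the score is 0.5 at 700 days and reaches 0 at 1400 days, so anything as old as 2008 (roughly 2200 days before the post was written) gets no date score at all.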

The number of ingredients will always be a whole number, while the date scoring is normalized to values from 1 to 0.

To allow the scoring from both functions to add up, we use the parameter score_mode=sum.
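Putting the pieces together, the request body could look something like the Python sketch below. This is not the query from the post: the field names `ingredients` and `date` and the helper `build_recipe_query` are hypothetical, the script uses the `_index[field][term].tf()` form from the text scoring documentation for the 1.x series, and inline scripts like this need dynamic scripting to be enabled, as the warning at the top explains.

```python
import json


def build_recipe_query(ingredients, field="ingredients", date_field="date", scale="700d"):
    """Build a function_score body: sum of term frequencies plus a linear date decay."""
    # One tf() lookup per in-season ingredient, added together in a single script.
    tf_script = " + ".join(
        "_index['{0}']['{1}'].tf()".format(field, term) for term in ingredients
    )
    return {
        "query": {
            "function_score": {
                "query": {"terms": {field: list(ingredients)}},
                "functions": [
                    {"script_score": {"script": tf_script}},
                    # No origin given: for date fields the decay is measured from now.
                    {"linear": {date_field: {"scale": scale}}},
                ],
                "score_mode": "sum",      # add the two function scores together
                "boost_mode": "replace",  # ignore the score of the terms query itself
            }
        }
    }


query = build_recipe_query(["rips", "poteter"])
print(json.dumps(query, indent=2))
```

The printed JSON can be pasted into a tool like Sense and adjusted to the real field names.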

Elasticsearch is an extremely powerful toolbox for search, information retrieval, analytics, big data, you name it. The possibilities are endless.

If you want to learn more about custom scoring in elasticsearch, there are some nice videos you can watch:

“Scoring for human beings” by Britta Weber, Berlin Buzzwords 2014

http://www.elasticsearch.org/videos/introducing-custom-scoring-functions/

How to visualize absolute search result quality

Espen Klem — Thu, 08 May 2014 16:51:56 +0000

Earlier, I’ve looked into how I could use the Phi spiral to possibly get a better display of what’s most relevant in a search result. A former colleague of mine, Johannes Hoff Holmedahl, did a quick test on the theory, and it may actually work.

For the recipe app I want to do something slightly different: showing the absolute search result quality for each result. In other words, if the best search result in a result set is not very good, make it smaller than if it had a higher value, thus showing an absolute value for each result. The conversational equivalent, comparing e.g. two movies, would be to call the better one better than the other, but not the best you’ve seen.

Absolute search result quality

aka. Absolute visual relevance hierarchy

How do we then measure absolute quality? In an earlier post I described what would be our relevancy hierarchy. The more in-season ingredients in a recipe, the better quality.

I’ve defined three quality groups for absolute search result quality so far (number of ingredients: n):

n >= 4
1 < n < 4
n = 1
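The three groups translate directly into a small function. The group names and the handling of recipes with no in-season ingredients are my own placeholders, since the list above only covers n of 1 and up.

```python
def quality_group(n: int) -> str:
    """Map the number of in-season ingredients to an absolute quality group."""
    if n >= 4:
        return "high"
    if 1 < n < 4:
        return "medium"
    if n == 1:
        return "low"
    return "none"  # assumption: no in-season ingredients, not covered by the list


for n in range(0, 6):
    print(n, quality_group(n))
```

The group can then decide how large each result is drawn, independently of how it ranks against the other results in the set.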

Here are three different examples of search result sets. The first has two results with four or more in-season ingredients and the second has only one. The third has none, typically something that would happen during the winter months in Norway:

The recipe app is work in progress. Check back every now and then for new blog posts on the subject.

Relevance tuning in the search domain. What is it exactly?

Espen Klem — Fri, 07 Mar 2014 16:26:06 +0000

First things first! Let’s get rid of the bullshit bingo lingo: “Relevancy tuning” in search is a fancy description for something that’s not very magical, even if it sounds like just that. It’s about getting the right results on top of your search result. End of story. If somebody asks you a question, you should start by giving that person the most likely answer first. Most search engines seem to digress. That’s because we haven’t told them clearly what we expect from them, and because we often use generic tools to solve specific problems.

One generic tool for getting the right results on top is “term frequency–inverse document frequency”, or tf-idf for short. It weighs how often a term is mentioned in a document against how often it’s mentioned in all of the documents in your index. So, a term that is rare within the whole index but used often in one document makes that document a good search result when searching for that term. But most likely, not good enough. You need to figure out what the characteristics of your content are, and what the most characteristic use cases and user stories for your users are. Only then can you achieve great relevancy, … I mean get the right result on top of your search result.
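For the curious, tf-idf fits in a few lines of Python. This is a plain textbook variant, not the exact formula your search engine uses, and the four tiny "recipes" are invented for the example.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Textbook tf-idf: relative term frequency times log inverse document frequency."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf


# Hypothetical index of four tokenized recipes.
corpus = [
    ["rips", "sugar", "rips"],
    ["poteter", "salt", "butter"],
    ["salmon", "salt", "poteter"],
    ["salt", "pepper", "beef"],
]

print(tf_idf("rips", corpus[0], corpus))  # rare in the index, frequent in this recipe
print(tf_idf("salt", corpus[1], corpus))  # common in the index
```

The rare berry scores far higher than the salt that is in almost everything, which is exactly the generic behaviour that may or may not fit your content.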

Model for relevance tuning

We’ll use our Recipe app as an example…

So, for our food recipe app, we have some obvious content characteristics:

The more in-season ingredients in a recipe, the better. We’re doing an OR-search on all ingredients in-season so this comes out-of-the-box … almost.
Quite a lot of recipes don’t stand the test of time. We know that most of the recipes at oppskrift.klikk.no from 2008 or newer are quite good and have nice photos.
We’re not sure if we need this, but we know which of the writers to trust. This may be overkill when we already have a boost on newer recipes.

And we know a lot about our users as well:

Most grown-up people in Norway have a job, and thus limited time to prepare a meal. This means that recipes that take a shorter time to prepare should be boosted from Monday through Thursday. The verdict on Friday is still not decided.
During the weekend people have more time to make dinner. The recipes that take a short time to prepare most probably cut some corners, and are not that good compared to recipes that take a little longer. So for the weekends, we should demote really quick recipes, at least for dinners.

This is the info we’re going to use to sort our search result. But we have more knowledge about our users that we can use to auto-set filters for certain times of the day:

Most work days, people don’t plan a breakfast meal or lunch. The whole day we can auto-set the “main course” filter.
During the weekend, people may also plan a lunch. We’ve decided to auto-set the “light meal”-filter during weekends up until lunch time. After that the “main course” filter is auto-set. We’ll log if the first thing our users do is to set another filter.
On Friday and Saturday a lot of Norwegians drink beer, wine or liquor. After some hours of drinking, they get hungry. Maybe we should have an “afterparty, quick and greasy and tasty-meal”-filter auto-set for late Fridays and Saturdays?
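The auto-set rules above can be sketched as a small function. The cut-off hour for lunch is an assumption, since the post doesn't say when "lunch time" ends, and the undecided afterparty filter is left out.

```python
from datetime import datetime

LUNCH_HOUR = 12  # assumption: "up until lunch time" means before noon


def default_filter(now: datetime) -> str:
    """Pick the filter to auto-set for a given point in time."""
    is_weekend = now.weekday() >= 5  # Monday is 0, so Saturday is 5 and Sunday is 6
    if is_weekend and now.hour < LUNCH_HOUR:
        return "Light meals"
    return "Main courses"


print(default_filter(datetime(2014, 3, 7, 10)))  # a Friday morning
print(default_filter(datetime(2014, 3, 8, 10)))  # a Saturday morning
print(default_filter(datetime(2014, 3, 8, 15)))  # a Saturday afternoon
```

Logging whether the user immediately switches to another filter would tell us if these defaults are any good.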

So, which filters have we decided on?

Light meals
Starters
Main courses
Desserts
… and maybe the Afterparty-thingy

What do you think?

Sounds nice? This is work in progress, so check back every now and then for new blog posts.

The seasonal recipe app: Tapping into the mental model

Espen Klem — Fri, 07 Feb 2014 15:09:26 +0000

Helping people use the best ingredients for any particular time of year is the goal of our little demo search app, and the mental model behind it. Since a lot of people in Norway actually go out into nature and forage, fetch, pick, shoot and fish their own food, we wanted to divide the ingredients into some of the most typical places where you can find those types of food. We then have two variables for our search: the place where you find the food and the time of year (month).

Tapping into the mental model

Each variable combination will do an OR-search containing a lot of ingredients for that particular place and time of year. Our relevancy model so far:

Recipes with the highest number of ingredient hits
Newest recipes
Recipes written by Christopher Sjuve

Number one on the list is given, but why number two and three? A lot of the older recipes don’t stand the test of time, and we know we trust Christopher Sjuve’s recipes.

The places we’ve chosen:

The sea
The farm
The garden
The forest
The mountain

It’s an odd bunch of places, but so far we think it will work. Logs and usability testing will tell us later if we’re hitting the target or not. The farm doesn’t fit that well with the others, since you don’t normally enter a farm and steal a cow or some potatoes. But it will be what’s closest in content to your average supermarket, and will be the default choice. Almost all of the places will have some overlapping ingredients. Each search is a combination of a place and a month. 5 places x 12 months means a sprite of 60 images where you swipe horizontally to select a place. The month will be selected for you, but to open up for exploration we think it will be valuable to have a vertical swipe to select month.
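The place and month matrix is easy to enumerate. In the sketch below the sprite layout, with one row per month and one column per place, is my assumption based on the horizontal and vertical swipes; the post doesn't describe how the images are actually arranged.

```python
from itertools import product

PLACES = ["The sea", "The farm", "The garden", "The forest", "The mountain"]
MONTHS = list(range(1, 13))

# Every search is one (place, month) pair.
combinations = list(product(PLACES, MONTHS))


def sprite_index(place, month):
    """Position of an image in the sprite: rows are months, columns are places."""
    return (month - 1) * len(PLACES) + PLACES.index(place)


print(len(combinations))
print(sprite_index("The garden", 7))
```

Each pair maps to one query and one image, so the same index can be used to look up both.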

Here are the first wireframes on the UX concept.

First version of the query matrix. So far not organized by places, but types of ingredients.

The search is already up and running, but lacks any sign of a graphical user interface.

Thanks to Qbox.io for letting us use a Hosted Elasticsearch instance for this project!

Sounds nice? This is work in progress, so check back every now and then for new blog posts.

Visual relevancy hierarchy creating a better search result using the Phi spiral?

Espen Klem — Fri, 05 Jul 2013 12:46:56 +0000

Today, most search solutions will give you the results as a list from 1 to 10. Problem is, they’re not very appealing, and don’t do the task at hand very well. At the top of the list, it’s okay. Number 1 gets most clicks, number 2 a little less, number three even less, but then in the middle, a lot of results get less than the ones at the bottom.

Using the phi-spiral as a visual relevancy hierarchy

So, how could the Phi spiral help us?

A search engine lists what it thinks is the most important first. But the list has several issues:

You could, but should you?
Just because your template really wants you to render a logical list as a visual list, doesn’t mean you have to do it like that.
Not representing the information well
A news article looks like a news article, no matter which version you see: The front page teaser, a short version or the full blown thingy. But a search result almost always looks like just that: A dull list of items.
Too many items
Results at the bottom of the list tend to get higher click rates than those just above the bottom. I guess this has to do with how people scan web content in F-shaped patterns, and that a list of 10 items is too much information for the user to digest.

So, what if we started to use space and position to show relevancy? The Phi spiral, building on the Fibonacci sequence would make a nice search result.
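One way to turn relevancy into size is to let each result be smaller than the previous one by the golden ratio, which is how the squares in a Phi spiral shrink. The sketch below is only a thought experiment: the function name and the choice to scale by rank are mine.

```python
PHI = (1 + 5 ** 0.5) / 2  # the golden ratio, roughly 1.618


def tile_sizes(count, largest=1.0):
    """Side length for each result tile, shrinking by the golden ratio per rank."""
    return [largest / PHI ** rank for rank in range(count)]


sizes = tile_sizes(5)
for rank, size in enumerate(sizes, start=1):
    print(rank, round(size, 3))
```

A nice property of this ratio is that each tile is exactly as wide as the next two put together, which is what lets the squares pack into a spiral.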

Phi spiral as a search result. We’ll get all sorts of other issues, but I think it’s a good start to getting somewhere better.

You would maybe not be able to show more than 5 result items, but we could put the search box in the middle of the page and then get 6 items.

Looking less like a search result and more like a content filled page.

So, what do you think? Others are using treemaps: Newsmap.jp. Not a very usable implementation, but a nice idea.

Using internal rank metrics in external search engines

Marcus Johansson — Thu, 03 Feb 2011 13:20:51 +0000

…and how the hidden web can be revealed

In the current flame war between Google and Bing, there is a good amount of pie-throwing going on around the internet. But in the process, some very interesting tech stuff has surfaced as well. We’ve got a glimpse of one of the many components Bing is using as a measurement of relevance, namely click-stream data from real users.

Disregarding the Google vs. Bing dispute, the use of click-stream data (aka browser usage statistics) is very interesting from a search engine perspective. This is because relevance of search queries is often described as a combination of the precision of the results and the recall of the query, and click-stream data can help increase them both.

Here’s one of the reasons why.

If you spend a lot of time on a particular web site, you have probably used its search engine. Quite often, you can choose to search through the site using Google or Bing. But a reason why it makes sense to use the site’s own search engine is that in theory, the particular site can always build a better search experience than what anyone else ever could. They can rank the documents using important internal metrics such as upvotes (Reddit), social distance (LinkedIn), reputation (Stack Overflow), and retweets (Twitter). The list goes on.

Now. If you happen to be able to watch what users do on these particular web sites, you would in fact be able to lift some of that domain-specific data (upvotes, distance, reputation, and retweets) into your own machinery. Perhaps not directly, but at least indirectly.

If someone searches on Reddit for a certain comment thread, their search engine will (supposedly) produce the best match, ranked according to how many upvotes and comments that particular thread has accumulated. Two things are of interest:

The user is quite likely to visit a page that was returned high up in the results.
The URL to the results is likely to contain something like “search=TERMS” or “query=TERMS”.

If you happen to collect browser usage statistics you can draw the conclusion that the documents that the user clicked on are highly relevant according to the site’s internal metrics – whatever those might be. Better yet, as you can analyze the URL to the actual search form you also know the particular query terms that the user typed to find these documents. And you can adjust your search engine accordingly.
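As a rough illustration of the idea, the Python sketch below pulls query terms out of a search URL and counts which pages were clicked for each term. Everything here is hypothetical: the parameter names beyond “search” and “query”, the example URLs, and the shape of the click-stream data are assumptions, and nothing is known about how any real search engine does this.

```python
from collections import Counter, defaultdict
from urllib.parse import parse_qs, urlparse

# "search" and "query" come from the text above; "q" is an extra guess.
SEARCH_PARAMS = ("search", "query", "q")


def extract_query_terms(url):
    """Return the search terms found in a URL's query string, if any."""
    params = parse_qs(urlparse(url).query)
    for name in SEARCH_PARAMS:
        if name in params:
            return params[name][0].lower().split()
    return []


def relevance_signals(click_stream):
    """Count, per query term, how often each page was clicked from a search page."""
    signals = defaultdict(Counter)
    for search_url, clicked_url in click_stream:
        for term in extract_query_terms(search_url):
            signals[term][clicked_url] += 1
    return signals


# Hypothetical click-stream: (search page URL, URL the user visited next).
click_stream = [
    ("http://example.com/find?search=elasticsearch+relevance", "http://example.com/t/42"),
    ("http://example.com/find?query=relevance&page=2", "http://example.com/t/42"),
    ("http://example.com/find?search=relevance", "http://example.com/t/7"),
]
signals = relevance_signals(click_stream)
print(signals["relevance"].most_common())
```

The page that users most often end up on for a term is, indirectly, the one the site’s own ranking considered best.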

Simply put, you have now leveraged a web site’s internal data in your external search engine, and consequently added to the precision of the results using a previously unreachable metric.

Additionally, this tactic will open up more of the “invisible web”. For example, it is not uncommon for government sites to contain large amounts of data where the only way to get to it is by running queries through an often poorly designed search form. Links into the data sets are rarely provided, so the actual data remains hidden from the search engines.

Until now, that is. Click-stream data allows the search engines to discover it by piggybacking on users’ browsing sessions, thus adding to the recall of the queries.