Elasticsearch significant terms score calculationSearch Nuggets

The magic of the “uncommonly common”.

Many of you who use Elasticsearch may have used the significant terms aggregation and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tends to be kept rather vague however and couched in terms like “magic” and the commonly uncommon. This is unfortunate since developing informative analyses based on this aggregation requires some adaptation to the underlying documents especially in the face of less structured text. Significant terms seems especially susceptible to garbage in – garbage out effects and developing a robust analysis requires some understanding of the underlying data. In this blog post we will take a look at the default relevance score used by the significance terms aggregation, the mysteriously named JLH score, as it is implemented in Elasticsearch 1.5. This score is especially developed for this aggregation and experience shows that it tends to be the most effective one available in Elasticsearch at this point.

The JLH relevance scoring function is not given in the documentation. A quick dive into the code however and we find the following scoring function.

$JLH = \left\{\begin{matrix} (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} & p_{fore} - p_{back} > 0 \\ 0 & elsewhere \end{matrix}\right.$

Here the $p_{fore}$ is the frequency of the term in the foreground (or query) document set, while $p_{back}$ is the term frequency in the background document set which by default is the whole index.

Expanding the formula gives us the following which is quadratic in $p_{fore}$ .

$(p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}$

By keeping $p_{back}$ fixed and keeping in mind that both it and $p_{fore}$ is positive we get the following function plot. Note that $p_{back}$ is unnaturally large for illustration purposes.

On the face of it this looks bad for a scoring function. It can be undesirable that it changes sign, but more troublesome is the fact that this function is not monotonically increasing.

The gradient of the function:

$\nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back} - 1} , -\frac{p_{fore}^2}{p_{back}^2}\right)$

Setting the gradient to zero we see by looking at the second coordinate that the JLH does not have a minimum, but approaches it when $p_{fore}$ and $p_{back}$ approaches zero where the function is undefined. While the second coordinate is always positive, the first coordinate shows us where the function is not increasing.

$\begin{aligned} \frac{2 p_{fore}}{p_{back}} - 1 & < 0 \\ p_{fore} & < \frac{1}{2}p_{back} \end{aligned}$

Furtunately the decreasing part of the function is in an area where $p_{fore} - p_{back} < 0$ and the JLH score explicitly defined as zero. By symmetry of the square around the minimum of the first coordinate of the gradient around $\frac{1}{2}p_{back}$ we also see that the entire area where the score is below zero is in this region.

With this it seems sensible to just drop the linear term of the JLH score and just use the quadratic part. This will result in the same ranking with a slightly less steep increase in score as $p_{fore}$ increases.

$JLH_{mod} = \frac{p_{fore}^2}{p_{back}}$

Looking at the level sets for the JLH score there is a quadratic relationship between the $p_{fore}$ and $p_{back}$ . Solving for a fixed level $k$ we get:

$\begin{aligned} JLH = & \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\ & p_{fore}^2 - p_{fore} - k\cdot p_{back} = 0 \\ & p_{fore} = \frac{1}{2} \pm \frac{\sqrt{1 + 4 \cdot k \cdot p_{back}}}{2} \end{aligned}$

Where the negative part is outside of function definition area.
This is far easier to see in the simplified formula.

$\begin{aligned} JLH = & \frac{p_{fore}^2}{p_{back}} = k \\ & p_{fore} = \sqrt{k \cdot p_{back}} \end{aligned}$

An increase in $p_{back}$ must be offset by approximately a square root increase in $p_{fore}$ to retain the same score.

As we see the score increases sharply as $p_{fore}$ increases in a quadratic manner against $p_{back}$ . As $p_{back}$ becomes small compared to $p_{fore}$ the growth goes from linear in $p_{fore}$ to squared.

Finally a 3D plot of the score function.

So what can we take away from all this? I think the main practical consideration is the squared relationship between $p_{fore}$ and $p_{back}$ which means once there is significant difference between the two the $p_{fore}$ will dominate the score ranking. The $p_{back}$ factor primarily makes the score sensitive when this factor is small and for reasonable similar $p_{back}$ the $p_{fore}$ decides the ranking. There are some obvious consequences from this which would be interesting to explore in real data. First that you would like to have a large background document set if you want more fine grained sensitivity to background frequency. Second, foreground frequencies can dominate the score to such an extent that peculiarities of the implementation may show up in the significant terms ranking, which we will look at in more detail as we try to apply the significant terms aggregation to single documents.

The results and visualizations in this blog post is also available as an iPython notebook.

Mark

June 11, 2015 at 10:39 | Permalink

Nice write up. I plotted some of the other scoring heuristics here: https://twitter.com/elasticmark/status/513320986956292096

For me, the choice of scoring function is essentially a question of emphasis between precision and recall. If I analyse the results of a search for “Bill Gates” do I suggest “Microsoft” or “bgC3″ as the most significant terms? The rarer bgc3 is more closely allied (so high precision) but perhaps of less practical use due to poor recall. The inverse is true of Microsoft.

Putting aside the question of precision/recall emphasis, I found the following features do the most to improve significance suggestions on text:
1) Removal of duplicate/near duplicate text in the results being analysed
2) Discovery of phrases
3) Analysing top-matching samples not all content (see the new “sampler” agg)
4) “Chunking” documents e.g. into sentence-docs if the foreground sample of docs is low in number.
Additionally, text analysis in elasticsearch should ideally not be reliant on field data to avoid memory issues.
These approaches are a little detailed to get into here but are important parts of improving elasticsearch’s significance algos on free text.

The mystery of “JLH” is my daughter’s initials :)
Cheers,
Mark

André Lynum

June 11, 2015 at 12:02 | Permalink

Ayup, thanks letting us in on the name and mentioning the new sampler aggregation :) That looks quite interesting.

The precision/recall viewpoint is indeed illuminating. I tend to shy away from it since it feels tied to specific task or goal. I’m not really sure I would call bgC3 more precise than Microsoft in this case, although bgC3 probably points to a more limited set of documents. It’s all in the context I guess.

If you can structure your data in a manner that supports your task that is certainly a good idea. Chunking up words I feel is something that has to be done with more care. Pure collocational or n-gram approaches can end up emphasizing turns of phrase without interesting content, you might call them phrase stopwords. I’ld be more inclined to extract meaningful multiword units such as verb complexes or entities.

Doing sentence/paragraph docs incurs indexing costs and seem to me to be a poor mans estimate of the term-frequency. It’s certainly a practical solution but it would nice if there was an option for using the TF in the significant terms aggregation for foreground sets where it is feasible.

Anyhow, thanks for putting cool stuff like this into Elasticsearch and hope you’ll find ways to make significant terms into an even more precise and flexible tool than it is today.

Snakkes!

André

Search Nuggets by

Blogging about search as the solution

How Elasticsearch calculates significant terms

Related Posts

2 response to: «How Elasticsearch calculates significant terms»

Leave a response