<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; lexical analysis</title>
	<atom:link href="http://blog.comperiosearch.com/blog/tag/lexical-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>How Elasticsearch calculates significant terms</title>
		<link>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/#comments</comments>
		<pubDate>Wed, 10 Jun 2015 11:02:28 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[aggregations]]></category>
		<category><![CDATA[lexical analysis]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[significant terms]]></category>
		<category><![CDATA[word analysis]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3785</guid>
		<description><![CDATA[Many of you who use Elasticsearch may have used the significant terms aggregation and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tend to be kept rather vague, however, and couched in terms like &#8220;magic&#8221; and the &#8220;uncommonly common&#8221;. This is unfortunate since developing informative [...]]]></description>
				<content:encoded><![CDATA[<div id="attachment_3823" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/uncommonlycommon.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/uncommonlycommon-300x187.png" alt="The &quot;uncommonly common&quot;" width="300" height="187" class="size-medium wp-image-3823" /></a><p class="wp-caption-text">The magic of the &#8220;uncommonly common&#8221;.</p></div>
<p>Many of you who use Elasticsearch may have used the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html" title="significant terms">significant terms aggregation</a> and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tend to be kept rather vague, however, and couched in terms like &#8220;magic&#8221; and the &#8220;uncommonly common&#8221;. This is unfortunate, since developing informative analyses based on this aggregation requires some adaptation to the underlying documents, especially in the face of less structured text. Significant terms seems especially susceptible to garbage in &#8211; garbage out effects, and developing a robust analysis requires some understanding of the underlying data. In this blog post we will take a look at the default relevance score used by the significant terms aggregation, the mysteriously named JLH score, as it is implemented in Elasticsearch 1.5. This score was developed specifically for this aggregation, and experience shows that it tends to be the most effective one available in Elasticsearch at this point.</p>
<p>The JLH relevance scoring function is not given in the documentation. A quick dive into the code, however, turns up the following scoring function.</p>
<img src='http://s0.wp.com/latex.php?latex=++JLH+%3D+%5Cleft%5C%7B%5Cbegin%7Bmatrix%7D++%28p_%7Bfore%7D+-+p_%7Bback%7D%29%5Cfrac%7Bp_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+%26+p_%7Bfore%7D+-+p_%7Bback%7D+%3E+0+%5C%5C++0++%26+elsewhere++%5Cend%7Bmatrix%7D%5Cright.++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  JLH = \left\{\begin{matrix}  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} &amp; p_{fore} - p_{back} &gt; 0 \\  0  &amp; elsewhere  \end{matrix}\right.  ' title='  JLH = \left\{\begin{matrix}  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} &amp; p_{fore} - p_{back} &gt; 0 \\  0  &amp; elsewhere  \end{matrix}\right.  ' class='latex' />
<p>Here <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> is the frequency of the term in the foreground (or query) document set, while <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is the term frequency in the background document set, which by default is the whole index.</p>
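<p>To make the definition concrete, here is a minimal Python sketch of the score. The function name <em>jlh_score</em> and its four count parameters are hypothetical and chosen for illustration; it assumes both frequencies are computed as the fraction of documents in each set that contain the term, and it is not the Elasticsearch code itself.</p>

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """JLH score from raw document counts (illustrative helper, not an Elasticsearch API)."""
    # Assumption: frequencies are the share of documents containing the term.
    p_fore = fg_count / fg_total
    p_back = bg_count / bg_total
    diff = p_fore - p_back
    if diff > 0:
        # p_back is non-zero here as long as the background set contains the foreground set.
        return diff * (p_fore / p_back)
    # Terms that are not more frequent in the foreground score zero.
    return 0.0


# Made-up counts: a term in 50 of 100 foreground documents and 500 of 100000 background documents.
print(jlh_score(50, 100, 500, 100000))
# A term that is rarer in the foreground than in the background.
print(jlh_score(1, 100, 5000, 100000))
```

<p>The second call lands in the zero branch because the term is less frequent in the foreground than in the background.</p>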
<p>Expanding the formula gives us the following which is quadratic in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />.</p>
<img src='http://s0.wp.com/latex.php?latex=++%28p_%7Bfore%7D+-+p_%7Bback%7D%29%5Cfrac%7Bp_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+%3D+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+-+p_%7Bfore%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}  ' title='  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}  ' class='latex' />
<p>By keeping <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> fixed and keeping in mind that both it and <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> are positive, we get the following function plot. Note that <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is unnaturally large for illustration purposes.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-pb-fixed.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-pb-fixed-300x206.png" alt="JLH-pb-fixed" width="300" height="206" class="alignnone size-medium wp-image-3792"></a></p>
<p>On the face of it this looks bad for a scoring function. It can be undesirable that it changes sign, but more troublesome is the fact that this function is not monotonically increasing.</p>
<p>The gradient of the function:</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cnabla+JLH%28p_%7Bfore%7D%2C+p_%7Bback%7D%29+%3D+%5Cleft%28%5Cfrac%7B2+p_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+-+1+%2C+-%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%5E2%7D%5Cright%29++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back}} - 1 , -\frac{p_{fore}^2}{p_{back}^2}\right)  ' title='  \nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back}} - 1 , -\frac{p_{fore}^2}{p_{back}^2}\right)  ' class='latex' />
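<p>Setting the gradient to zero, we see from the second coordinate that the JLH does not have a minimum, but only approaches one as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> approach zero, where the function is undefined. The second coordinate is always negative, so the score always decreases as <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> grows, while the first coordinate shows us where the function is not increasing in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />.</p>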
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++%5Cfrac%7B2+p_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D++-+1+%26+%3C+0+%5C%5C++p_%7Bfore%7D+%26+%3C+%5Cfrac%7B1%7D%7B2%7Dp_%7Bback%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  \frac{2 p_{fore}}{p_{back}}  - 1 &amp; &lt; 0 \\  p_{fore} &amp; &lt; \frac{1}{2}p_{back}  \end{aligned}  ' title='  \begin{aligned}  \frac{2 p_{fore}}{p_{back}}  - 1 &amp; &lt; 0 \\  p_{fore} &amp; &lt; \frac{1}{2}p_{back}  \end{aligned}  ' class='latex' />
<p>Fortunately the decreasing part of the function lies in an area where <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D+-+p_%7Bback%7D+%3C+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore} - p_{back} &lt; 0' title='p_{fore} - p_{back} &lt; 0' class='latex' /> and the JLH score is explicitly defined as zero. The expanded formula is a parabola in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> that is symmetric around its minimum at <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7Dp_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\frac{1}{2}p_{back}' title='\frac{1}{2}p_{back}' class='latex' /> and has its roots at 0 and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />, so the entire area where the score is below zero also lies in this region.</p>
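<p>The claim about the negative region can be checked numerically. The sketch below evaluates the unclamped expression on a grid of foreground frequencies; the helper name <em>raw_jlh</em>, the grid and the background frequency of 1/5 are illustrative assumptions, and exact fractions are used so that the boundary at p_fore = p_back is not blurred by rounding.</p>

```python
from fractions import Fraction


def raw_jlh(p_fore, p_back):
    """Unclamped score p_fore**2 / p_back - p_fore (illustrative helper)."""
    return p_fore ** 2 / p_back - p_fore


p_back = Fraction(1, 5)  # unnaturally large, as in the plot
grid = [Fraction(i, 100) for i in range(1, 101)]  # p_fore from 0.01 to 1.00

# Grid points where the unclamped score drops below zero.
negative = [p for p in grid if 0 > raw_jlh(p, p_back)]
# Grid point with the lowest unclamped score.
lowest = min(grid, key=lambda p: raw_jlh(p, p_back))

print("largest p_fore with a negative score:", max(negative))
print("p_fore with the lowest score:", lowest)
```

<p>On this grid every negative value sits below p_back, and the lowest value sits at half of p_back, which is the region the clamped score maps to zero.</p>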
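<p>With this it seems sensible to drop the linear term of the JLH score and use only the quadratic part. The two scores differ by exactly <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />, so the rankings are similar whenever <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> is much larger than <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />, although the order of closely scored terms can change.</p>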
<img src='http://s0.wp.com/latex.php?latex=++JLH_%7Bmod%7D+%3D+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  JLH_{mod} = \frac{p_{fore}^2}{p_{back}}  ' title='  JLH_{mod} = \frac{p_{fore}^2}{p_{back}}  ' class='latex' />
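<p>A small side-by-side comparison helps when judging how close the simplified score stays to the original. The term names and frequency pairs below are made up for illustration, and <em>jlh</em> and <em>jlh_mod</em> are hypothetical helper names.</p>

```python
def jlh(p_fore, p_back):
    """Original score, clamped to zero when p_fore does not exceed p_back."""
    diff = p_fore - p_back
    return diff * p_fore / p_back if diff > 0 else 0.0


def jlh_mod(p_fore, p_back):
    """Simplified score that keeps only the quadratic part."""
    return p_fore ** 2 / p_back


# Hypothetical (p_fore, p_back) pairs, not taken from real data.
terms = {"alpha": (0.5, 0.25), "beta": (0.3, 0.1), "gamma": (0.05, 0.2)}

for name, (pf, pb) in terms.items():
    print(name, round(jlh(pf, pb), 4), round(jlh_mod(pf, pb), 4))

# Term names ordered from highest to lowest score under each function.
by_jlh = sorted(terms, key=lambda t: jlh(*terms[t]), reverse=True)
by_mod = sorted(terms, key=lambda t: jlh_mod(*terms[t]), reverse=True)
print(by_jlh)
print(by_mod)
```

<p>For terms with p_fore above p_back the two scores differ by exactly p_fore. In this made-up sample that is enough to swap alpha and beta, so the two orderings are worth comparing on real data. The simplified score also gives gamma a small positive value although its foreground frequency is below its background frequency, so the zero clamp still has to be applied separately.</p>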
<p>Looking at the level sets for the JLH score there is a quadratic relationship between the <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />. Solving for a fixed level <img src='http://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> we get:</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++JLH+%3D+%26+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+-+p_%7Bfore%7D+%3D+k+%5C%5C+++%26+p_%7Bfore%7D%5E2+-+p_%7Bback%7D+%5Ccdot+p_%7Bfore%7D+-+k%5Ccdot+p_%7Bback%7D++%3D+0+%5C%5C+++%26+p_%7Bfore%7D+%3D+%5Cfrac%7Bp_%7Bback%7D%7D%7B2%7D+%5Cpm+%5Cfrac%7B%5Csqrt%7Bp_%7Bback%7D%5E2+%2B+4+%5Ccdot+k+%5Ccdot+p_%7Bback%7D%7D%7D%7B2%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\   &amp; p_{fore}^2 - p_{back} \cdot p_{fore} - k\cdot p_{back}  = 0 \\   &amp; p_{fore} = \frac{p_{back}}{2} \pm \frac{\sqrt{p_{back}^2 + 4 \cdot k \cdot p_{back}}}{2}  \end{aligned}  ' title='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\   &amp; p_{fore}^2 - p_{back} \cdot p_{fore} - k\cdot p_{back}  = 0 \\   &amp; p_{fore} = \frac{p_{back}}{2} \pm \frac{\sqrt{p_{back}^2 + 4 \cdot k \cdot p_{back}}}{2}  \end{aligned}  ' class='latex' />
<p>Here the negative root lies outside the domain of the function.<br />
This is far easier to see in the simplified formula.</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++JLH_%7Bmod%7D+%3D+%26+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+%3D+k+%5C%5C+++%26+p_%7Bfore%7D+%3D+%5Csqrt%7Bk+%5Ccdot+p_%7Bback%7D%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  JLH_{mod} = &amp; \frac{p_{fore}^2}{p_{back}} = k \\   &amp; p_{fore} = \sqrt{k \cdot p_{back}}  \end{aligned}  ' title='  \begin{aligned}  JLH_{mod} = &amp; \frac{p_{fore}^2}{p_{back}} = k \\   &amp; p_{fore} = \sqrt{k \cdot p_{back}}  \end{aligned}  ' class='latex' />
<p>An increase in <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> must be offset by approximately a square root increase in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> to  retain the same score.</p>
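<p>The level sets can also be traced numerically. Multiplying JLH = k through by p_back gives the quadratic p_fore^2 - p_back * p_fore - k * p_back = 0, and the sketch below takes its positive root next to the square-root form of the simplified score. The function names and the sample background frequencies are assumptions made for illustration.</p>

```python
import math


def p_fore_for_level(k, p_back):
    """Positive root of p_fore**2 - p_back * p_fore - k * p_back = 0."""
    return (p_back + math.sqrt(p_back ** 2 + 4 * k * p_back)) / 2


def p_fore_for_level_mod(k, p_back):
    """Level set of the simplified score p_fore**2 / p_back = k."""
    return math.sqrt(k * p_back)


k = 1.0  # hypothetical target score
for p_back in (0.001, 0.004, 0.016):
    exact = p_fore_for_level(k, p_back)
    approx = p_fore_for_level_mod(k, p_back)
    # Plugging the exact root back into the score should reproduce k.
    recovered = exact ** 2 / p_back - exact
    print(p_back, exact, approx, recovered)
```

<p>Each fourfold increase in p_back doubles the p_fore required by the simplified formula, and the exact root stays close to it while p_back is small.</p>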
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-contour.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-contour-300x209.png" alt="JLH-contour" width="300" height="209" class="alignnone size-medium wp-image-3791"></a></p>
<p>As we see, the score increases sharply as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> increases, in a quadratic manner against <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />. As <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> becomes small compared to <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />, the growth goes from linear in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> to squared.</p>
<p>Finally a 3D plot of the score function.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-3d.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-3d-300x203.png" alt="JLH-3d" width="300" height="203" class="alignnone size-medium wp-image-3790"></a></p>
<p>So what can we take away from all this? I think the main practical consideration is the squared relationship between <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />, which means that once there is a significant difference between the two, <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> will dominate the score ranking. The <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> factor primarily makes the score sensitive when this factor is small, and for reasonably similar <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> the <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> decides the ranking. There are some obvious consequences of this which would be interesting to explore in real data. First, you would like to have a large background document set if you want more fine-grained sensitivity to background frequency. Second, foreground frequencies can dominate the score to such an extent that peculiarities of the implementation may show up in the significant terms ranking, which we will look at in more detail as we try to apply the significant terms aggregation to single documents.</p>
<p>The results and visualizations in this blog post are also available as an <a href="https://github.com/andrely/ipython-notebooks/blob/master/JLH%20score%20characteristics.ipynb" title="JLH score characteristics">IPython notebook</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
