<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; dynamic rank tuning</title>
	<atom:link href="http://blog.comperiosearch.com/blog/tag/dynamic-rank-tuning/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Dynamic search ranking using Elasticsearch, Neo4j and Piwik</title>
		<link>http://blog.comperiosearch.com/blog/2014/02/05/dynamic-search-ranking-using-elasticsearch-neo4j-and-piwik/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/02/05/dynamic-search-ranking-using-elasticsearch-neo4j-and-piwik/#comments</comments>
		<pubDate>Wed, 05 Feb 2014 14:49:52 +0000</pubDate>
		<dc:creator><![CDATA[Christian Rieck]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[dynamic rank tuning]]></category>
		<category><![CDATA[dynamic search ranking]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[fast]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[Piwik]]></category>
		<category><![CDATA[rank tuning]]></category>
		<category><![CDATA[ranking]]></category>
		<category><![CDATA[search ranking]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=1957</guid>
		<description><![CDATA[Getting the correct result at the top of your search results isn’t easy. Anyone working within search quickly realizes this. Tuning the underlying ranking model is a job that just doesn’t end. There is an entire profession about search engine optimization, making sure your site gets as high as possible on Google (and Bing, I [...]]]></description>
				<content:encoded><![CDATA[<div>
<p>Getting the correct result at the top of your search results isn’t easy. Anyone working within search quickly realizes this. Tuning the underlying ranking model is a job that just doesn’t end. There is an entire profession devoted to search engine optimization, making sure your site ranks as high as possible on Google (and Bing, I guess). If it is not the top result on Google, it is somehow your fault and not Google&#8217;s.<span id="more-1957"></span></p>
</div>
<div>
<p><strong>Nobody optimizes for an internal enterprise search solution</strong></p>
</div>
<div>
<p>If your document is not the top result in the internal search solution, it is somehow the search engine&#8217;s fault, not yours. There is no link cardinality on a file system. All the metadata is wrong, and the document your user is trying to find doesn’t even contain the words the user remembers it containing; the end result is that the target document is not found. As a result, trust in the enterprise search diminishes and soon you are left without users. Let’s see how we can use <a title="Piwik" href="http://piwik.org">Piwik</a>, <a title="neo4j" href="http://www.neo4j.org">neo4j</a> and <a title="Elasticsearch" href="http://www.elasticsearch.org">Elasticsearch</a> to remedy this. (Yes, you can use <a title="Solr" href="http://lucene.apache.org/solr/">Solr</a> if you want.)</p>
</div>
<div>
<p>This post is made up of three parts. First I’ll talk about gathering the necessary data. Then we’ll tackle getting the ‘right’ documents to the top of your search, and lastly we’ll see if we can expand documents with words that your users recall them by but that are not part of the documents themselves. The journey is based on the work performed on Comperio’s internal search, at the moment implemented on an old Fast ESP installation.</p>
</div>
<div>
<p><strong>Gathering data</strong></p>
</div>
<div>
<p>First you need to know what your users are searching for and what they end up clicking on. We use Piwik, an open source web analytics platform, for this. It shows us the searches, the modifications to those searches and whether the users ended up clicking on anything they thought was exciting. For a while we only used this for statistics, since Piwik offered better insight than the built-in query statistics in Fast ESP. Here is an example of one search session:</p>
</div>
<div></div>
<div>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/piwik.png"><img class="alignnone size-full wp-image-1968" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/piwik.png" alt="" width="411" height="179" /></a></p>
<p>We see a user entering the site, querying ‘rank order words’ and clicking on a document. Then the same search is executed again, so it is reasonable to conclude that the clicked document did not contain the wanted information. Lastly, the user searches for ‘boost position term’. Sadly the session does not end with a click, so I guess our search couldn’t deliver. :( [1]</p>
</div>
<div>
<p>In their current form, the statistics aren’t very useful. But what would happen if we took these chains of activities and created a graph? We used neo4j for this. A small Java program was written to download the Piwik history as an XML file and insert it into a newly created neo4j database.</p>
</div>
<div>
<p>The nodes are either the start of a session, a search or a document. They are linked by relationships such as CLICKED, SEARCHED and RETURNED_FROM. Since a neo4j database isn’t very screenshot-friendly, here is a part of the graph as rendered by <a title="Gephi" href="https://gephi.org">Gephi</a>:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/chinese.png"><img class="alignnone size-full wp-image-1964" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/chinese.png" alt="" width="411" height="141" /></a></p>
</div>
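<div>
<p>To make the graph structure a bit more concrete, here is a minimal Python sketch of how one session could be turned into a chain of nodes and relationships. It is not the Java program mentioned above. The <code>(kind, value)</code> action format, the function name and the exact rule for when RETURNED_FROM is used are assumptions made for illustration, and the edges are kept in plain Python tuples instead of a real neo4j database.</p>
</div>

```python
from collections import namedtuple

# A labelled edge between two nodes; each node is a (node_type, value) pair.
Edge = namedtuple("Edge", ["source", "relationship", "target"])


def session_to_edges(session_id, actions):
    """Turn one ordered session into a chain of labelled edges.

    `actions` is a list of (kind, value) pairs where kind is "search" or
    "click". How SEARCHED / CLICKED / RETURNED_FROM connect the nodes is an
    assumption, not taken from the original program.
    """
    edges = []
    previous = ("session", session_id)
    for kind, value in actions:
        if kind == "search":
            node = ("search", value)
            # Assumed rule: a search right after a click means the user
            # came back from that document.
            if previous[0] == "document":
                relationship = "RETURNED_FROM"
            else:
                relationship = "SEARCHED"
        elif kind == "click":
            node = ("document", value)
            relationship = "CLICKED"
        else:
            raise ValueError("unknown action kind: " + kind)
        edges.append(Edge(previous, relationship, node))
        previous = node
    return edges


# Hypothetical session loosely modelled on the 'chinese' example.
edges = session_to_edges("S361", [
    ("search", "chinese"),
    ("click", "mail-archive"),
    ("search", "chinese als"),
])
for edge in edges:
    print(edge.source, "-[" + edge.relationship + "]->", edge.target)
```

<div>
<p>In a real setup each edge would become a relationship in neo4j, and identical search terms or documents from different sessions would share a node. That sharing is what makes the webs grow.</p>
</div>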
<div></div>
<div>
<p>We see someone looking for help with Chinese query suggestions. S361 marks the beginning of this session and the first search term was ‘chinese’. They then clicked a link for an internal mail archive before refining their search to ‘chinese als’ and so forth. Relationships that mark where a user backtracked are not shown. That was an isolated little island. The more central documents and search terms at your company will create bigger webs.</p>
</div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/web.png"><img class="alignnone size-full wp-image-1971" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/web.png" alt="" width="605" height="368" /></a></p>
</div>
<div></div>
<div>
<p>Seeing your search history organized like this should give you an urge to dive in and explore. It is really interesting, fun and recommended!</p>
</div>
<div>
<p><strong>Finding popular documents</strong></p>
</div>
<div>
<p>The simplest way of finding the popular documents is to track search term -&gt; clicks directly. It is also the most common way of doing it. That wouldn’t utilize our fancy new graph now, would it? Since we can run queries against the database, let’s get all search sessions of 8 or fewer actions that resulted in a click on document X:</p>
<p>&nbsp;</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/query.png"><img class="alignnone size-full wp-image-1969" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/query.png" alt="" width="605" height="40" /></a></p>
</div>
<div></div>
<div>
<p>(Small disclaimer: As my neo4j skills are very rudimentary there might be more efficient ways of doing this.)</p>
</div>
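<div>
<p>For readers without a neo4j instance at hand, the same selection can be sketched in plain Python. This is a stand-in for the idea, not the query shown in the screenshot. It assumes each session is an ordered list of <code>(kind, value)</code> actions and reads ‘8 or fewer actions’ as ‘the click on document X happens within the first 8 actions’. The function name and document ids are made up.</p>
</div>

```python
def sessions_leading_to(sessions, document_id, max_actions=8):
    """Return the action chains that end in a click on `document_id`.

    For every session, the actions up to and including the first click on
    the document are kept, provided that click happens within `max_actions`
    actions. Sessions that never reach the document are skipped.
    """
    matches = []
    for actions in sessions:
        for position, (kind, value) in enumerate(actions, start=1):
            if position > max_actions:
                break  # the document was not reached quickly enough
            if kind == "click" and value == document_id:
                matches.append(actions[:position])
                break
    return matches


# Hypothetical sessions; "doc-42" plays the role of document X.
sessions = [
    [("search", "vpn"), ("click", "doc-42")],
    [("search", "remote access"), ("click", "doc-7"),
     ("search", "vpn"), ("click", "doc-42")],
    [("search", "printer"), ("click", "doc-9")],
]
paths = sessions_leading_to(sessions, "doc-42")
print(len(paths), "sessions ended up at doc-42")
```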
<div>
<p>Now we iterate over all sessions and give a score to each search term. The closer it is to the clicked document, the higher the score it gets. Sum the scores across all sessions. After doing that you get a score indicating how ‘close’ a search term is to any document. This is example data for the single-word search term ‘vpn’:</p>
<p>&nbsp;</p>
</div>
<div><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/json.png"><img class="alignnone size-full wp-image-1966" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/json.png" alt="" width="605" height="66" /></a></div>
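<div>
<p>The scoring step can be sketched roughly like this. The weighting function is an assumption: the sketch gives a search term <code>1 / distance</code> points, where distance 1 is the search right before the click, and only counts searches (not intermediate clicks) when measuring distance. Any function that decreases with distance would fit the description above.</p>
</div>

```python
from collections import defaultdict


def score_terms(paths):
    """Score search terms by how close they were to the final click.

    `paths` is a list of action chains, each an ordered list of
    (kind, value) pairs whose last element is the click on the target
    document. Scores are summed across all chains.
    """
    scores = defaultdict(float)
    for path in paths:
        searches = [value for kind, value in path[:-1] if kind == "search"]
        # Walk backwards from the click: the last search has distance 1.
        for distance, term in enumerate(reversed(searches), start=1):
            scores[term] += 1.0 / distance  # assumed weighting
    return dict(scores)


# Hypothetical chains that all end in a click on the same document.
example_paths = [
    [("search", "vpn"), ("click", "doc-42")],
    [("search", "remote access"), ("search", "vpn"), ("click", "doc-42")],
]
scores = score_terms(example_paths)
print(scores)
```

<div>
<p>Running this per document gives the term-to-document scores that the next step compares against a threshold.</p>
</div>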
<div>
<p>&nbsp;</p>
<p>When the score passes a threshold, we add the search-document pair to an Elasticsearch index. For every search executed against our search solution, we first check Elasticsearch to see whether the term is boosted. For ‘vpn’ the search logs state:</p>
</div>
<div></div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/log.png"><img class="alignnone size-full wp-image-1967" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/log.png" alt="" width="605" height="50" /></a></p>
</div>
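<div>
<p>Here is a minimal sketch of the threshold and lookup logic, with a plain Python dictionary standing in for the Elasticsearch index. The function names, the score layout, the threshold value and the lower-casing of queries are assumptions for illustration. The cap on the number of documents per term is a parameter, defaulting to three.</p>
</div>

```python
def build_boost_index(term_scores, threshold, top_n=3):
    """Keep, per search term, the best-scoring documents above a threshold.

    `term_scores` maps term -> {document_id: score}. The result maps each
    term to at most `top_n` document ids, highest score first.
    """
    index = {}
    for term, documents in term_scores.items():
        qualifying = [(score, doc) for doc, score in documents.items()
                      if score >= threshold]
        # Highest score first; ties are broken by document id for stability.
        qualifying.sort(key=lambda pair: (-pair[0], pair[1]))
        if qualifying:
            index[term] = [doc for _, doc in qualifying[:top_n]]
    return index


def boosted_documents(index, query):
    """Look up which documents, if any, should be boosted for a query."""
    return index.get(query.strip().lower(), [])


# Hypothetical scores; the real values come from the graph analysis.
term_scores = {
    "vpn": {"doc-a": 5.0, "doc-b": 3.5, "doc-c": 2.0,
            "doc-d": 2.5, "doc-e": 0.5},
    "printer": {"doc-f": 0.5},
}
boost_index = build_boost_index(term_scores, threshold=2.0)
print(boosted_documents(boost_index, " VPN "))
```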
<div>
<p>&nbsp;</p>
<p>We can see how three documents are boosted for ‘vpn’. (By choice we only boost the top three). Using Fast ESP we wrap the original query with boosts for those specific documents.</p>
</div>
<div></div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/fql.png"><img class="alignnone size-full wp-image-1965" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/fql.png" alt="" width="546" height="179" /></a></p>
</div>
<div>
<p>&nbsp;</p>
<p>In FAST ESP, as well as in SharePoint Search 2013, the beloved xrank operator is your friend. In a Lucene-based search application, use boost queries for this.</p>
</div>
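<div>
<p>For the Lucene-based case, this is roughly what wrapping a query with document boosts could look like as an Elasticsearch request body, built here as a Python dictionary. Treat it as a sketch under assumptions: the function name and boost value are made up, the user query is passed through a <code>query_string</code> clause, and the boosted documents are assumed to be addressable by their Elasticsearch ids. It is not a translation of the FQL shown above.</p>
</div>

```python
import json


def wrap_with_boosts(user_query, boosted_ids, boost=10.0):
    """Wrap a user query so that specific documents rank higher.

    The original query goes into `must`, so it still decides which
    documents match. The boosted ids go into `should`, which only adds to
    the score of documents that already match.
    """
    bool_query = {"must": [{"query_string": {"query": user_query}}]}
    if boosted_ids:
        bool_query["should"] = [
            {"ids": {"values": list(boosted_ids), "boost": boost}}
        ]
    return {"query": {"bool": bool_query}}


# Hypothetical document ids for the 'vpn' example.
body = wrap_with_boosts("vpn", ["doc-a", "doc-b", "doc-d"])
print(json.dumps(body, indent=2))
```

<div>
<p>Note that a boosted document still has to match the original query to be returned. Boosting changes the order of the hits, not which documents are recalled.</p>
</div>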
<div>
<p>The search result returns the popular hits (only one shown here) at the top:</p>
</div>
<div></div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/topdoc.png"><img class="alignnone size-full wp-image-1970" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/topdoc.png" alt="" width="605" height="105" /></a></p>
</div>
<div>
<p>&nbsp;</p>
<p>The ugly star and cheesy feedback are my rather blunt way of telling the users that things happened behind the scenes and that their actions will affect future searches. Currently there is no way of giving negative feedback to say ‘no, this is actually not a good hit’. Oh well.</p>
</div>
<div>
<p>As a bonus, all terms that result in boosted documents are, as far as we know, smart things to search for and free of spelling errors. Therefore, all such terms are added to a second Elasticsearch index that we base our query completion on. (As a side note – if misspelled terms appear often enough to pass the threshold, they could be part of your organization’s tribal language. If the users choose to spell the term “definately” so often that it “makes the cut”, then the system should adapt to that.)</p>
</div>
<div>
<p><strong>Expanding documents to increase recall</strong></p>
</div>
<div>
<p>Often a user thinks of one document and searches for what, to them, identifies the document. That term might or might not be present in the document itself. If it isn’t, the document is not returned and the user becomes sad. Hopefully they alter their search and continue to look. Should they end up at their document, we have the tools needed to remedy the situation. Here is a concrete example:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/arch.png"><img class="alignnone size-full wp-image-1963" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/arch.png" alt="" width="594" height="217" /></a></p>
</div>
<div></div>
<div>
<p>Here we can see that the node marked 1 could be tagged with ‘sort order refiner entries’ or at least ‘refiner’, a term used twice when trying to find this document. (As an interesting side note, if you observe a lot of ‘sort X’ followed by ‘sort Y’ you might consider adding a synonym between X and Y.) If a term or phrase is used often enough across different sessions, we save it to an Elasticsearch index. Each time a document is indexed, we look up the document in our index and add any popular search terms to a low-ranking field. This guarantees that the document is recalled, but it will not automatically top the results for those queries. This is a two-step process. If your search engine supports partial updates of documents, go with that.</p>
</div>
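<div>
<p>The index-time expansion can be sketched as follows. A dictionary stands in for the Elasticsearch index of popular terms, and the field name <code>popular_search_terms</code>, the function name and the document layout are hypothetical. Giving that field a low weight in the ranking model is left to the search engine configuration and is not shown.</p>
</div>

```python
def expand_document(document, popular_terms, field="popular_search_terms"):
    """Return a copy of `document` with popular search terms attached.

    `popular_terms` maps document id -> list of terms that users often
    searched for before ending up at that document. Documents without any
    popular terms are returned unchanged.
    """
    terms = popular_terms.get(document["id"], [])
    if not terms:
        return document
    expanded = dict(document)  # leave the caller's document untouched
    expanded[field] = sorted(set(terms))
    return expanded


# Hypothetical lookup data and document.
popular_terms = {"doc-1": ["refiner", "sort order refiner entries", "refiner"]}
original = {"id": "doc-1", "title": "Configuring navigators"}
expanded = expand_document(original, popular_terms)
print(expanded["popular_search_terms"])
```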
<div>
<p>Before we added this last step, we noticed that for some searches we boosted documents that weren’t recalled and thus were never displayed to the user, even though we knew they were good hits!</p>
</div>
<div>
<p><strong>Closing words</strong></p>
</div>
<div>
<p>As a first step towards dynamic ranking this has shown good results. As long as your search engine supports query time boosting you can implement this.</p>
</div>
<div>
<p><strong>By the way</strong></p>
</div>
<div>
<p>It should be noted that SharePoint will actually do some of this for you. It comes with an interface meant to be used by an end user (unlike all the other search engines I’ve seen), and its UI has event listeners on all links, tracking what you do. This is fed into a database and the data does affect ranking. As far as I know, only the last search term before a click is associated with the clicked link.</p>
</div>
<div>
<p>[1] One scenario that Piwik and click tracking do not pick up is when the sought information is found in the returned teasers. Search sessions that don’t end in a click might in fact have a happy ending.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/02/05/dynamic-search-ranking-using-elasticsearch-neo4j-and-piwik/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
