<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; Christian Rieck</title>
	<atom:link href="http://blog.comperiosearch.com/blog/author/cmrieck/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Dynamic search ranking using Elasticsearch, Neo4j and Piwik</title>
		<link>http://blog.comperiosearch.com/blog/2014/02/05/dynamic-search-ranking-using-elasticsearch-neo4j-and-piwik/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/02/05/dynamic-search-ranking-using-elasticsearch-neo4j-and-piwik/#comments</comments>
		<pubDate>Wed, 05 Feb 2014 14:49:52 +0000</pubDate>
		<dc:creator><![CDATA[Christian Rieck]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[dynamic rank tuning]]></category>
		<category><![CDATA[dynamic search ranking]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[fast]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[Piwik]]></category>
		<category><![CDATA[rank tuning]]></category>
		<category><![CDATA[ranking]]></category>
		<category><![CDATA[search ranking]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=1957</guid>
		<description><![CDATA[Getting the correct result at the top of your search results isn’t easy. Anyone working within search quickly realizes this. Tuning the underlying ranking model is a job that just doesn’t end. There is an entire profession about search engine optimization, making sure your site gets as high as possible on Google (and Bing, I [...]]]></description>
				<content:encoded><![CDATA[<div>
<p>Getting the correct result to the top of your search results isn’t easy. Anyone working within search quickly realizes this. Tuning the underlying ranking model is a job that just doesn’t end. There is an entire profession, search engine optimization, devoted to making sure your site ranks as high as possible on Google (and Bing, I guess). If your site is not the top result on Google, it is somehow your fault and not Google&#8217;s.<span id="more-1957"></span></p>
</div>
<div>
<p><strong>Nobody optimizes for an internal enterprise search solution</strong></p>
</div>
<div>
<p>If your document is not the top result in the internal search solution, it is somehow the search engine&#8217;s fault, not yours. There is no link cardinality on a file system. All the metadata is wrong, and the document your user is trying to find doesn’t even contain the words the user remembers it containing; the end result is that the target document is not found. Trust in the enterprise search solution diminishes, and soon you are left without users. Let’s see how we can use <a title="Piwik" href="http://piwik.org">Piwik</a>, <a title="neo4j" href="http://www.neo4j.org">neo4j</a> and <a title="Elasticsearch" href="http://www.elasticsearch.org">Elasticsearch</a> to remedy this. (Yes, you can use <a title="Solr" href="http://lucene.apache.org/solr/">Solr</a> if you want.)</p>
</div>
<div>
<p>This post is made up of three parts. First I’ll talk about gathering the necessary data. Then we’ll tackle getting the ‘right’ documents to the top of your search results, and lastly we’ll see if we can expand documents with words your users recall them by that are not part of the documents themselves. The journey is based on the work performed on Comperio’s internal search, at the moment implemented on an old FAST ESP installation.</p>
</div>
<div>
<p><strong>Gathering data</strong></p>
</div>
<div>
<p>First you need to know what your users are searching for and what they end up clicking on. We use Piwik, an open source web analytics platform, for this. It lets us see the searches, the modifications made to those searches, and whether users ended up clicking on anything they found interesting. For a while we only used this for statistics, since Piwik offered better insight than the built-in query statistics in FAST ESP. Here is an example of one search session:</p>
</div>
<div></div>
<div>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/piwik.png"><img class="alignnone size-full wp-image-1968" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/piwik.png" alt="" width="411" height="179" /></a></p>
<p>We see a user entering the site, querying ‘rank order words’ and clicking on a document. Then the same search is executed again, so it is reasonable to conclude that the clicked document did not contain the wanted information. Lastly ‘boost position term’ is searched. Sadly the session does not end with a click, so I guess our search couldn’t deliver. :( [1]</p>
</div>
<div>
<p>In their current form, the statistics aren’t very useful. But what would happen if we took these chains of activities and created a graph? We used neo4j for this. A small Java program was written to download the Piwik history as an XML file and insert it into a newly created neo4j database.</p>
</div>
<div>
<p>The nodes are either the start of a session, a search or a document. They are linked by relationships such as CLICKED, SEARCHED and RETURNED_FROM. Since a neo4j database isn’t very screenshot friendly, here is a part of the graph as rendered by <a title="Gephi" href="https://gephi.org">Gephi</a>:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/chinese.png"><img class="alignnone size-full wp-image-1964" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/chinese.png" alt="" width="411" height="141" /></a></p>
</div>
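<p>To make the structure concrete, here is a small Python sketch (not the original Java program; the function and data shapes are my own invention) of how one session’s ordered actions could be turned into such nodes and relationships:</p>

```python
# Hypothetical sketch (not the original Java program): turn one Piwik
# session's ordered actions into graph nodes and typed relationships.
def session_to_graph(session_id, actions):
    """actions: ordered list of ("search", term) / ("click", doc) tuples."""
    nodes = {session_id}
    edges = []
    prev, last_doc = session_id, None
    for kind, value in actions:
        nodes.add(value)
        if kind == "search":
            # Searching again right after a click is a backtrack.
            rel = "RETURNED_FROM" if prev == last_doc else "SEARCHED"
            edges.append((prev, rel, value))
        else:  # "click"
            edges.append((prev, "CLICKED", value))
            last_doc = value
        prev = value
    return nodes, edges

# The 'chinese' session from the figure above, heavily simplified.
nodes, edges = session_to_graph("S361", [
    ("search", "chinese"),
    ("click", "mail-archive"),
    ("search", "chinese als"),
])
```

In a real implementation each triple would of course become a node pair and relationship in neo4j rather than an in-memory tuple.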
<div></div>
<div>
<p>We see someone looking for help with Chinese query suggestions. S361 marks the beginning of the session, and the first search term was ‘chinese’. The user then clicked a link to an internal mail archive before refining the search to ‘chinese als’ and so forth. (Relationships showing when a user backtracked are omitted here.) This was an isolated little island; the more central documents and search terms at your company will create bigger webs.</p>
</div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/web.png"><img class="alignnone size-full wp-image-1971" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/web.png" alt="" width="605" height="368" /></a></p>
</div>
<div></div>
<div>
<p>Seeing your search history organized like this should give you an urge to dive in and explore. It is really interesting, fun and highly recommended!</p>
</div>
<div>
<p><strong>Finding popular documents</strong></p>
</div>
<div>
<p>The simplest way of finding the popular documents is to track search term -&gt; click pairs directly, and it is also the most common way of doing it. But that wouldn’t utilize our fancy new graph, now would it? Since we can run queries against the database, let’s get all search sessions of eight or fewer actions that resulted in a click on document X:</p>
<p>&nbsp;</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/query.png"><img class="alignnone size-full wp-image-1969" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/query.png" alt="" width="605" height="40" /></a></p>
</div>
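<p>For readers who can’t make out the screenshot, the same question can be asked of a plain in-memory representation. This Python stand-in (not the actual graph query) filters sessions of eight or fewer actions that contain a click on the target document:</p>

```python
# Plain-Python stand-in for the graph query in the screenshot:
# sessions of eight or fewer actions that clicked document X.
def sessions_clicking(sessions, target_doc, max_actions=8):
    return [s for s in sessions
            if len(s["actions"]) <= max_actions
            and ("click", target_doc) in s["actions"]]

hits = sessions_clicking([
    {"id": "S1", "actions": [("search", "vpn"), ("click", "vpn-guide")]},
    {"id": "S2", "actions": [("search", "vpn"), ("click", "other-doc")]},
], "vpn-guide")
```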
<div></div>
<div>
<p>(A small disclaimer: as my neo4j skills are very rudimentary, there might be more efficient ways of doing this.)</p>
</div>
<div>
<p>Now we iterate over all sessions and give a score to each search term: the closer it is to the clicked document, the higher the score. Sum the scores across all sessions and you get a number indicating how ‘close’ a search term is to any given document. This is example data for the single-word search term ‘vpn’:</p>
<p>&nbsp;</p>
</div>
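<p>The scoring might be sketched like this in Python. The exact scoring function we used isn’t shown here, so the linear decay with distance below is an assumption, not the production formula:</p>

```python
from collections import defaultdict

# Assumed scoring: each search term before the click earns
# 1 / (distance to the click); sums are taken across sessions.
def score_terms(sessions, clicked_doc):
    scores = defaultdict(float)
    for actions in sessions:
        try:
            click_pos = actions.index(("click", clicked_doc))
        except ValueError:
            continue  # this session never clicked the document
        for pos, (kind, value) in enumerate(actions[:click_pos]):
            if kind == "search":
                scores[value] += 1.0 / (click_pos - pos)
    return dict(scores)

scores = score_terms([
    [("search", "vpn"), ("click", "vpn-doc")],
    [("search", "remote access"), ("search", "vpn"), ("click", "vpn-doc")],
    [("search", "printer"), ("click", "printer-doc")],
], "vpn-doc")
```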
<div><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/json.png"><img class="alignnone size-full wp-image-1966" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/json.png" alt="" width="605" height="66" /></a></div>
<div>
<p>&nbsp;</p>
<p>When the score passes a threshold, we add the search-document pair to an Elasticsearch index. For every query executed against our search solution, we first check Elasticsearch to see if the term is boosted. For ‘vpn’ the search logs state:</p>
</div>
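<p>A sketch of the two halves of this step. The index name, field names and threshold value are my own inventions; only the shape of the idea matters:</p>

```python
# Hypothetical sketch of the threshold + lookup step. The index name
# ('boosts'), field names and THRESHOLD value are my own inventions.
THRESHOLD = 1.5

def boost_entry(term, scored_docs):
    """Keep the top three documents whose summed score passes the
    threshold, best first; this pair gets indexed into Elasticsearch."""
    docs = sorted((d for d, s in scored_docs.items() if s >= THRESHOLD),
                  key=scored_docs.get, reverse=True)
    return {"term": term, "boosted_docs": docs[:3]}

def lookup_request(term):
    # Request we would issue before every search to see if the term
    # has boosted documents attached.
    return ("/boosts/_search",
            {"query": {"term": {"term": term}}, "size": 1})

entry = boost_entry("vpn", {"doc-a": 2.0, "doc-b": 1.7,
                            "doc-c": 1.6, "doc-d": 0.4})
path, lookup = lookup_request("vpn")
```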
<div></div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/log.png"><img class="alignnone size-full wp-image-1967" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/log.png" alt="" width="605" height="50" /></a></p>
</div>
<div>
<p>&nbsp;</p>
<p>We can see that three documents are boosted for ‘vpn’ (by choice, we only boost the top three). Using FAST ESP we wrap the original query with boosts for those specific documents.</p>
</div>
<div></div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/fql.png"><img class="alignnone size-full wp-image-1965" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/fql.png" alt="" width="546" height="179" /></a></p>
</div>
<div>
<p>&nbsp;</p>
<p>In FAST ESP, as well as in SharePoint Search 2013, the beloved xrank operator is your friend. In a Lucene-based search application, use boost queries for this.</p>
</div>
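<p>In Elasticsearch, a sketch of the equivalent wrapping is a bool query whose should clause boosts the popular document ids. Field names and the boost value here are assumptions, not our production query:</p>

```python
# Sketch of wrapping the user's query with document boosts in
# Elasticsearch; field names and the boost value are assumptions.
def wrap_with_boosts(user_query, boosted_ids, boost=10.0):
    return {
        "query": {
            "bool": {
                # The original query still decides what matches...
                "must": [{"query_string": {"query": user_query}}],
                # ...while the boosted ids float to the top.
                "should": [{"ids": {"values": boosted_ids, "boost": boost}}],
            }
        }
    }

body = wrap_with_boosts("vpn", ["doc-a", "doc-b", "doc-c"])
```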
<div>
<p>The search now returns the popular hits (only one shown here) at the top:</p>
</div>
<div></div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/topdoc.png"><img class="alignnone size-full wp-image-1970" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/topdoc.png" alt="" width="605" height="105" /></a></p>
</div>
<div>
<p>&nbsp;</p>
<p>The ugly star and cheesy feedback text are me trying to tell the users, rather bluntly, that things happened behind the scenes and that their actions will affect future searches. Currently there is no way of giving negative feedback to say ‘no, this is actually not a good hit’. Oh well.</p>
</div>
<div>
<p>As a bonus, all terms that result in boosted documents are, as far as we know, smart things to search for and free of spelling errors. Therefore all such terms are added to a second Elasticsearch index that we base our query completion on. (As a side note: if misspelled terms appear often enough to pass the threshold, they could be part of your organization’s tribal language. If the users choose to spell a term “definately” so often that it makes the cut, then the system should adapt to that.)</p>
</div>
<div>
<p><strong>Expanding documents to increase recall</strong></p>
</div>
<div>
<p>Often a user thinks of one document and searches for what, to them, identifies that document. The term might or might not be present in the document itself. If it isn’t, the document is not returned and the user becomes sad. Hopefully they alter their search and continue to look. Should they end up at their document, we have the tools needed to remedy the situation. Here is a concrete example:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/arch.png"><img class="alignnone size-full wp-image-1963" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/arch.png" alt="" width="594" height="217" /></a></p>
</div>
<div></div>
<div>
<p>Here we can see that the node marked 1 might be tagged with ‘sort order refiner entries’, or at least ‘refiner’, a term used twice while trying to find this document. (As an interesting side note, if you observe a lot of ‘sort X’ followed by ‘sort Y’ you might consider adding a synonym between X and Y.) If a term or phrase is used often enough across different sessions, we save it to an Elasticsearch index. Each time a document is indexed we look the document up in that index and add any popular search terms to a low-ranking field. This guarantees recall of the document, but it will not automatically put it at the top of the results for those queries. Note that this is a two-step process; if your search engine supports partial updates of documents, go with that instead.</p>
</div>
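<p>If partial updates are available, the expansion can be as small as merging the popular terms into the low-ranking field. An Elasticsearch-flavored sketch, where the field name <code>user_terms</code> is my invention:</p>

```python
# Sketch of a partial-update body (sent to the Elasticsearch update
# endpoint for the document); the field name 'user_terms' is mine.
def expansion_update(popular_terms):
    return {"doc": {"user_terms": " ".join(sorted(set(popular_terms)))}}

body = expansion_update(["refiner", "sort order refiner entries"])
```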
<div>
<p>Before adding this last step we noticed that for some searches we boosted documents that didn’t get recalled and thus were never displayed to the user, even though we knew they were good hits!</p>
</div>
<div>
<p><strong>Closing words</strong></p>
</div>
<div>
<p>As a first step towards dynamic ranking, this has shown good results. As long as your search engine supports query-time boosting, you can implement it.</p>
</div>
<div>
<p><strong>By the way</strong></p>
</div>
<div>
<p>It should be noted that SharePoint will actually do some of this for you. It comes with an interface meant for end users (unlike every other search engine I’ve seen), and the UI contains event listeners on all links, tracking what you do. This is fed into a database, and the data does affect ranking. As far as I know, though, only the last search term before a click is associated with the clicked link.</p>
</div>
<div>
<p>[1] One scenario that Piwik and click tracking do not pick up is when the sought information is found in the returned teasers. Search sessions that don’t end in a click might in fact have a happy ending.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/02/05/dynamic-search-ranking-using-elasticsearch-neo4j-and-piwik/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Making your Fast Search for SharePoint-life easier with ElasticSearch.</title>
		<link>http://blog.comperiosearch.com/blog/2012/11/13/making-your-fast-search-for-sharepoint-life-easier-with-elasticsearch/</link>
		<comments>http://blog.comperiosearch.com/blog/2012/11/13/making-your-fast-search-for-sharepoint-life-easier-with-elasticsearch/#comments</comments>
		<pubDate>Tue, 13 Nov 2012 11:26:56 +0000</pubDate>
		<dc:creator><![CDATA[Christian Rieck]]></dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[crawl log]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[fast]]></category>
		<category><![CDATA[sharepoint]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=1155</guid>
		<description><![CDATA[Last week we were trying to track down a document that, based on customer feedback, must have been lost somewhere between the content source and the index. By turning on the Fast Pipeline stage FFDdumper and doing a full crawl we determined that the document at least had made it to the FAST pipeline. Our [...]]]></description>
				<content:encoded><![CDATA[<p>Last week we were trying to track down a document that, based on customer feedback, must have been lost somewhere between the content source and the index. By turning on the FAST pipeline stage FFDdumper and doing a full crawl, we determined that the document had at least made it to the FAST pipeline. Our third-party connector did its job, and the blame lay with either SharePoint or FAST. The next step was to inspect SharePoint’s crawl logs. They look like this:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2012/11/crawllog2.png"><img class="alignnone size-full wp-image-1161" src="http://blog.comperiosearch.com/wp-content/uploads/2012/11/crawllog2.png" alt="" width="1002" height="57" /></a></p>
<p>In this example I have simply crawled the c: drive on my virtual machine with SharePoint to generate some entries in the crawl log. 269 errors aren’t that many; you could click the link and inspect the log manually in a reasonable amount of time. Viewing the logs will let you inspect 50 documents at a time.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2012/11/crawllog1.png"><img class="alignnone  wp-image-1160" src="http://blog.comperiosearch.com/wp-content/uploads/2012/11/crawllog1-1024x259.png" alt="" width="614" height="155" /></a></p>
<p>&nbsp;</p>
<p>This gets boring really fast, so we started using the API for extracting crawl logs from SharePoint [3] programmatically. Because of the API’s deprecated status, the documentation is a bit on the thin side. Luckily there are some examples in the blogosphere [1,2] that helped us along, and we could soon dump the crawl logs to file and ctrl-f for our missing document. Of course, the script could have found the log entry and shown only the relevant one. However, with a large crawl log it can take hours to extract all the log entries, so it is better to dump them all to file in case you need to look for a second or third file later.</p>
<p>Using grep, findstr or ctrl-f for searching feels wrong when you work as a search consultant. The natural instinct when something large should be searched through is to index it. One option was to index the logs back into SharePoint through BDC (risking a nasty loop of reading and creating logs) or to use the old FAST API directly, bypassing SharePoint. Both felt a little heavy-duty for this one-off job, and that’s when we brought ElasticSearch into play. ElasticSearch is a fairly new search engine built on Lucene, sometimes referred to as “the new kid on the block” in the open source community, where there is some friendly rivalry between the Solr and ElasticSearch camps. It is a 20MB download and is ready to index documents mere seconds after you unzip the package. To index a document you simply POST some JSON to a REST endpoint, and that is it. With so many libraries offering JSON serialization and REST calls, building the world’s simplest connector doesn’t take long. We ported the old PowerShell script to C# and added 30 lines of code; with the removal of the file-creation code we actually reduced the total line count. The indexing-specific part of the script looks like this:</p>
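<p>The original is C#, but the idea fits in a few lines of any language. Here is a hypothetical Python version that only builds the URL and JSON body for one crawl-log entry (index, type and field names are invented, and the HTTP call itself is omitted):</p>

```python
import json

# Hypothetical Python version of the indexing step: build the REST URL
# and JSON body for one crawl-log entry. Index/type/field names are
# invented, and the HTTP call itself is omitted.
def index_request(entry, host="http://localhost:9200"):
    url = "%s/crawllog/entry/%s" % (host, entry["id"])
    body = json.dumps({
        "url": entry["url"],
        "error_id": entry["error_id"],
        "message": entry["message"],
    })
    return url, body  # e.g. feed these to urllib.request.Request(...)

url, body = index_request({"id": 1, "url": "file:///c:/test.txt",
                           "error_id": 5, "message": "Access denied"})
```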
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2012/11/script.png"><img class="alignnone  wp-image-1159" src="http://blog.comperiosearch.com/wp-content/uploads/2012/11/script-1024x480.png" alt="" width="614" height="288" /></a></p>
<p>All we do is populate a little DTO, serialize it and push it to ElasticSearch. The indexing takes very little time; in fact, fetching the logs from SharePoint is the slowest part of the connector. Notice that the call to index the document isn’t even done asynchronously. After indexing the logs, ElasticSearch-head [4] is the easiest way to have a look. Its “Structured Query” tab lets you create queries without any prior knowledge of the query syntax. As an example, here I have queried for all failing txt files in the log. (Some parts of the screenshot were cropped to make the image fit the page.)</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2012/11/elasticresult.png"><img class="alignnone  wp-image-1157" src="http://blog.comperiosearch.com/wp-content/uploads/2012/11/elasticresult-1024x249.png" alt="" width="614" height="149" /></a></p>
<p>If you have a lot of failures in your SharePoint crawl log, it makes sense to start with the most common cause of error. In the API each log entry has a property called ErrorId, although StatusCode would be a more fitting name (in PowerShell it seems to be called ErrorId, and in C#, MessageId). So to find the most common cause of errors, one should look at the distribution of ErrorIds. In FAST Search for SharePoint this could be done by placing the ErrorIds in a managed property and configuring it as a refiner, effortlessly producing the distribution. In ElasticSearch you specify refiners, called facets, at query time. In my limited test case the distribution looks like this:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2012/11/Facets.png"><img class="alignnone  wp-image-1158" src="http://blog.comperiosearch.com/wp-content/uploads/2012/11/Facets.png" alt="" width="485" height="514" /></a></p>
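<p>From memory, a terms facet over the ErrorId field looks roughly like this in the 2012-era facet syntax (modern ElasticSearch versions use aggregations instead, and the field name <code>error_id</code> is assumed; treat this as a sketch):</p>

```python
# Sketch of a 2012-era terms facet over the ErrorId field (modern
# versions use aggregations); the field name 'error_id' is assumed.
def error_distribution_query():
    return {
        "query": {"match_all": {}},
        "size": 0,  # we only want the facet counts, not hits
        "facets": {"by_error": {"terms": {"field": "error_id",
                                          "size": 20}}},
    }

query = error_distribution_query()
```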
<p>It is clear that ErrorId 0 is the most common, followed by 5 and 748. What the specific ErrorIds mean is unfortunately not documented; the easiest approach is to query for log entries with a given ErrorId and look at the error descriptions. It turns out that ErrorId 0 means OK and 1 means deleted, while all failures that happen in the FAST pipeline are wrapped with ErrorId 11. One of the ErrorIds is simply a statement from SharePoint saying that a crawl rule was honored.<br />
Using this as a dashboard and thinking beyond a sample crawl, you could easily envision this being a tool for checking:</p>
<ul>
<li>How often and when a document was crawled</li>
<li>Whether the crawler experiences issues at certain times of the day</li>
<li>What crawl rates the crawler can sustain in the long run</li>
<li>Which file suffixes are most prone to errors</li>
<li>Which top-level folders are the most troublesome to index</li>
</ul>
<p>&nbsp;</p>
<p>The code linked below should be seen as a proof of concept. You are free to download it, and it will work, but some parameters on the SharePoint API calls can surely be set better. ElasticSearch provides a bulk-update API that should be used when indexing large datasets, and if you are running ElasticSearch on a non-standard port you need to change the code. Also note the complete lack of error handling.</p>
<p>What about that missing file? Turns out it was there all along. The customer was searching for the filename and didn’t recognize the second hit that was displaying his document’s title, not filename. Oh well.</p>
<p>1: http://blogs.msdn.com/b/spses/archive/2011/06/22/exporting-sharepoint-2010-search-crawl-logs.aspx<br />
2: http://blogs.msdn.com/b/russmax/archive/2012/01/28/sharepoint-powershell-script-series-part-5-exporting-the-crawl-log-to-a-csv-file.aspx<br />
3: http://msdn.microsoft.com/en-us/library/ms514229.aspx<br />
4: http://mobz.github.com/elasticsearch-head<br />
5: http://www.elasticsearch.org</p>
<p>&nbsp;</p>
<p>The script can be found here: <a href="http://blog.comperiosearch.com/wp-content/uploads/2012/11/Program.txt">Program.cs</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2012/11/13/making-your-fast-search-for-sharepoint-life-easier-with-elasticsearch/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
