Search Nuggets » aggregations

How Elasticsearch calculates significant terms

André Lynum — Wed, 10 Jun 2015 11:02:28 +0000

The magic of the “uncommonly common”.

Many of you who use Elasticsearch may have used the significant terms aggregation and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tend to be kept rather vague, however, and couched in terms like “magic” and the “uncommonly common”. This is unfortunate, since developing informative analyses based on this aggregation requires some adaptation to the underlying documents, especially in the face of less structured text. Significant terms seems especially susceptible to garbage in, garbage out effects, and developing a robust analysis requires some understanding of the underlying data. In this blog post we will take a look at the default relevance score used by the significant terms aggregation, the mysteriously named JLH score, as it is implemented in Elasticsearch 1.5. This score was developed specifically for this aggregation, and experience shows that it tends to be the most effective one available in Elasticsearch at this point.

The JLH relevance scoring function is not given in the documentation. A quick dive into the code, however, reveals the following scoring function.

$$ JLH = \left\{\begin{matrix} (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} & p_{fore} - p_{back} > 0 \\ 0 & elsewhere \end{matrix}\right. $$

Here $p_{fore}$ is the frequency of the term in the foreground (or query) document set, while $p_{back}$ is the term frequency in the background document set, which by default is the whole index.
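As a minimal sketch of the formula above (written from the formula, not copied from the Elasticsearch source), the score can be expressed in a few lines of Python. The helper `jlh_from_counts` and the document counts in the usage line are assumptions made up for illustration.

```python
def jlh_score(p_fore: float, p_back: float) -> float:
    """JLH score for a term, following the piecewise formula above.

    p_fore: fraction of foreground documents containing the term.
    p_back: fraction of background documents containing the term (> 0).
    """
    if p_back <= 0:
        raise ValueError("p_back must be positive")
    diff = p_fore - p_back
    if diff <= 0:
        # Terms that are not more frequent in the foreground score zero.
        return 0.0
    return diff * (p_fore / p_back)


def jlh_from_counts(fore_hits, fore_size, back_hits, back_size):
    # Hypothetical helper: converts raw document counts to frequencies.
    return jlh_score(fore_hits / fore_size, back_hits / back_size)


# Made-up counts: 50 of 1000 foreground docs, 100 of 100000 background docs.
print(jlh_from_counts(50, 1000, 100, 100000))
```

With these made-up counts the score comes out at roughly 2.45, while any term with $p_{fore} \le p_{back}$ scores exactly zero.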

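Expanding the formula gives us the following, which is quadratic in $p_{fore}$.

$$ (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore} $$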

By keeping $p_{back}$ fixed and keeping in mind that both it and $p_{fore}$ are positive we get the following function plot. Note that $p_{back}$ is unnaturally large for illustration purposes.

On the face of it this looks bad for a scoring function. It can be undesirable that it changes sign, but more troublesome is the fact that this function is not monotonically increasing.
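To see this behaviour without the plot, the expanded expression $p_{fore}^2/p_{back} - p_{fore}$ can be tabulated for a fixed background frequency. This is only a sketch; the value of `p_back` is an arbitrary, unnaturally large choice made for illustration.

```python
def jlh_unclamped(p_fore, p_back):
    # Expanded formula without the "zero elsewhere" branch.
    return p_fore ** 2 / p_back - p_fore


p_back = 0.4  # unnaturally large, for illustration only
for p_fore in (0.1, 0.2, 0.3, 0.4, 0.6, 0.8):
    print(f"p_fore={p_fore:.1f}  score={jlh_unclamped(p_fore, p_back):+.3f}")
```

The values first fall until $p_{fore} = p_{back}/2$, then rise again, crossing zero at $p_{fore} = p_{back}$.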

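The gradient of the function:

$$ \nabla JLH = \left( \frac{\partial JLH}{\partial p_{fore}}, \frac{\partial JLH}{\partial p_{back}} \right) = \left( \frac{2 p_{fore}}{p_{back}} - 1, \; -\frac{p_{fore}^2}{p_{back}^2} \right) $$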

Setting the gradient to zero, we see by looking at the second coordinate that the JLH does not have a minimum, but approaches one as $p_{fore}$ and $p_{back}$ approach zero, where the function is undefined. While the second coordinate is always negative, so the score always falls as $p_{back}$ grows, the first coordinate shows us where the function is not increasing in $p_{fore}$.
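The partial derivatives of the expanded score $p_{fore}^2/p_{back} - p_{fore}$ can be sanity-checked numerically. The sketch below compares the analytic gradient with central finite differences; the evaluation point and step size are arbitrary choices for illustration.

```python
def f(p_fore, p_back):
    # Expanded (unclamped) JLH expression.
    return p_fore ** 2 / p_back - p_fore


def analytic_gradient(p_fore, p_back):
    # (d/dp_fore, d/dp_back) of the expanded expression.
    return (2 * p_fore / p_back - 1, -(p_fore ** 2) / p_back ** 2)


def numeric_gradient(p_fore, p_back, h=1e-6):
    # Central finite differences in each coordinate.
    d_fore = (f(p_fore + h, p_back) - f(p_fore - h, p_back)) / (2 * h)
    d_back = (f(p_fore, p_back + h) - f(p_fore, p_back - h)) / (2 * h)
    return (d_fore, d_back)


print(analytic_gradient(0.3, 0.1))
print(numeric_gradient(0.3, 0.1))
```

The two results should agree to several decimal places, and the first coordinate changes sign at $p_{fore} = p_{back}/2$.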

Fortunately the decreasing part of the function, $p_{fore} < p_{back}/2$, lies in an area where $p_{fore} - p_{back} < 0$ and the JLH score is explicitly defined as zero. By symmetry of the parabola around its minimum at $p_{fore} = p_{back}/2$, where the first coordinate of the gradient is zero, we also see that the entire area where the score is below zero is in this region.

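With this it seems sensible to drop the linear term of the JLH score and use only the quadratic part, $p_{fore}^2 / p_{back}$. This will result in nearly the same ranking, since the dropped term $p_{fore}$ is small next to $p_{fore}^2 / p_{back}$ whenever $p_{fore}$ is much larger than $p_{back}$, with a somewhat steeper increase in score as $p_{fore}$ increases.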
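A small comparison illustrates the effect of dropping the linear term. The term names and frequencies below are made up for illustration, so this is a sketch rather than evidence from real data.

```python
def jlh(p_fore, p_back):
    diff = p_fore - p_back
    return diff * p_fore / p_back if diff > 0 else 0.0


def jlh_simplified(p_fore, p_back):
    # Quadratic part only; note it has no "zero elsewhere" branch.
    return p_fore ** 2 / p_back


# Made-up (p_fore, p_back) pairs, purely for illustration.
terms = {
    "alpha": (0.05, 0.001),
    "beta": (0.02, 0.0005),
    "gamma": (0.10, 0.05),
    "delta": (0.01, 0.02),
}


def ranking(score):
    # Term names ordered from highest to lowest score.
    return sorted(terms, key=lambda t: score(*terms[t]), reverse=True)


print(ranking(jlh))
print(ranking(jlh_simplified))
```

For these values both functions produce the same ordering. The orderings can differ for terms whose $p_{fore}$ is not much larger than $p_{back}$, where the dropped linear term is not negligible.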

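Looking at the level sets for the JLH score there is a quadratic relationship between $p_{fore}$ and $p_{back}$. Solving for a fixed level $k$ we get:

$$ p_{fore} = \frac{p_{back}}{2} \pm \sqrt{\frac{p_{back}^2}{4} + k \, p_{back}} $$

where the negative branch lies outside the domain on which the function is defined. The relationship is far easier to see in the simplified formula:

$$ p_{fore} = \sqrt{k \, p_{back}} $$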

An increase in $p_{back}$ must be offset by approximately a square root increase in $p_{fore}$ to retain the same score.
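The square root relationship can be checked by solving $p_{fore}^2/p_{back} - p_{fore} = k$ for $p_{fore}$ (positive root) and comparing it with the simplified solution $\sqrt{k \, p_{back}}$. The score level `k` and the background frequencies are arbitrary values picked for this sketch.

```python
import math


def fore_for_level(k, p_back):
    """p_fore giving JLH score k at this p_back (positive root)."""
    return p_back / 2 + math.sqrt(p_back ** 2 / 4 + k * p_back)


def fore_for_level_simplified(k, p_back):
    """Same for the simplified score p_fore**2 / p_back."""
    return math.sqrt(k * p_back)


k = 1.0  # arbitrary score level chosen for illustration
for p_back in (0.0001, 0.0004, 0.0016):
    full = fore_for_level(k, p_back)
    simple = fore_for_level_simplified(k, p_back)
    print(f"p_back={p_back:.4f}  p_fore={full:.5f}  simplified={simple:.5f}")
```

Each fourfold increase in $p_{back}$ doubles the $p_{fore}$ required by the simplified formula, and the full formula stays close to it while $p_{back}$ is small.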

As we see, the score increases sharply as $p_{fore}$ increases, in a quadratic manner against $p_{back}$. As $p_{back}$ becomes small compared to $p_{fore}$ the growth goes from linear in $p_{fore}$ to squared.

Finally a 3D plot of the score function.

So what can we take away from all this? I think the main practical consideration is the squared relationship between $p_{fore}$ and $p_{back}$, which means that once there is a significant difference between the two, $p_{fore}$ will dominate the score ranking. The $p_{back}$ factor primarily makes the score sensitive when this factor is small, and for reasonably similar $p_{back}$ it is $p_{fore}$ that decides the ranking. There are some obvious consequences of this which would be interesting to explore in real data. First, you would like to have a large background document set if you want more fine-grained sensitivity to background frequency. Second, foreground frequencies can dominate the score to such an extent that peculiarities of the implementation may show up in the significant terms ranking, which we will look at in more detail as we try to apply the significant terms aggregation to single documents.

The results and visualizations in this blog post are also available as an IPython notebook.

Elastic{ON}15: Day two

Christoffer Vig — Thu, 19 Mar 2015 20:59:41 +0000

March 11, 2015

Keynote

Fighting the crowds to find a seat for the keynote on day 2 of elastic{ON}15, we were blocked by a USB stick with the curious caption Microsoft (heart) Linux. Things have certainly changed.

Microsoft

The keynote, led by Elastic SVP of sales Aaron Katz, included Pablo Castro of Microsoft, who was keen to explain how this probably isn’t so far from the truth. Elasticsearch is used internally in several Microsoft products alongside Linux and other open source software, and this is a huge change from the Microsoft we knew around five years ago. Pablo revealed some details about how elasticsearch is used as a data storage and search platform in MSN, Microsoft Dynamics and Azure Search. Microsoft truly has gone through some fundamental changes lately, embracing open source both internally and externally. We see this as a demonstration of the power of open source and the huge value Elastic(search) brings to many organizations. As Jordan Sissel said in the keynote yesterday, “If a user has a problem, it is a bug”. This is a philosophical stance towards a conception of software as an enabler of creativity and growth, in contrast to viewing software as a fixed product packaged for sale.

Goldman Sachs

Microsoft’s contribution was in the middle part of the keynote. The first part was a discussion with Don Duet, managing director of Goldman Sachs. Goldman Sachs provides financial services on a global scale, and has been at the forefront of technology since its inception in 1869. They were an early adopter of Elasticsearch since it was an easy-to-use search and analytics tool for big data. Goldman Sachs is now using elasticsearch extensively as a key part of their technological stack.

NASA

The most mind-blowing part of the keynote was the last one, held by two chaps from the Jet Propulsion Laboratory team at NASA, Ricky Ma and Don Isla. They first showed their awesome internal search with previews and built-in rank tuning. Then they talked about the Mars Curiosity rover, a robot planted on Mars which runs around taking samples and selfies. It constantly sends data back to Earth, where the JPL team analyzes the operations of the rover. Elasticsearch is naturally at the center of this interplanetary operation, nothing less.

Mars Curiosity Rover Selfie

The remainder of the day contained sessions across the same three tracks as the first day. In addition, five tracks of birds of a feather or “lounge” sessions were held where people gathered in smaller groups to discuss various topics. Needless to say, the breadth of the program meant we were stretched thin. We chose to focus on three topics that are of particular importance to our customers: aggregations, security & Shield, and resiliency.

More aggregations

Adrien Grand & Colin Goodheart-Smithe did a deep dive into the details of aggregations and how they are computed, in particular how to tune them and what that means for execution complexity. A key point is the approximations that are employed to compute some of the aggregations, which involve certain trade-offs of accuracy for speed. Aggregations are a very powerful feature, but they require some planning to be feasible and efficient.

Security/Shield

Uri Boness talked about Shield and the current state of authentication & authorization. He provided some pointers to what is on the roadmap for the coming releases. Unfortunately, there do not appear to be any concrete plans for providing built-in document level security. This is a sought-after feature that would certainly make the product more interesting in many enterprise settings. Then again, there are companies who provide connector frameworks that include security solutions for elasticsearch. We had a chat with some of them at the conference, including Enonic, SearchBlox and Search Technologies.

Facebook

Peter Vulgaris from Facebook explained how they are using elasticsearch. To me, the story resembled Microsoft’s. Facebook has heaps of data, and lots of use cases for it. Once they started to use elasticsearch it was widely adopted in the company and the amount of data indexed grew ever larger which forced them to think more closely about how they manage their clusters.

Resiliency

Elasticsearch is a distributed system, and as such shares the same potential issues as other distributed systems. Boaz Leskes & Igor Motov explained the measures that have been undertaken in order to avoid problems such as “split-brain” syndrome. This is when a cluster is confused about what node should be considered the master. Data safety and security are important features of Elasticsearch and there is a continuous effort in place in these areas.

Lucene

Stepping back to day 1 and the Lucene session featuring the mighty Robert Muir, we learned that Lucene version 5 includes a lot of improvements, especially performance-wise: better compression at both indexing and query time enables faster execution and reduces resource consumption. Efforts have also been made in the Lucene core to merge query and filter as two sides of the same coin. After all, a query is just a filter with a relevance score. On another note, Lucene will now handle caching of queries by itself.

Wrapping it up

Elastic{ON}15 stands as a confirmation of the attitudes that were essential in the creation of the elasticsearch project. The visions that guided the early development are still valid today, except the scale is larger. The recent emphasis on stability, security and resiliency will welcome a new wave of users and developers.

At the same time there is a continuous exploration and development into big data related analytics but with the speed and agility we have come to expect from Elasticsearch.

Thanks for this year, looking forward to the next!

Elastic{ON}15: Day one

Christoffer Vig — Wed, 11 Mar 2015 16:07:48 +0000

March 10, 2015

At Comperio we have been speculating for a while now that Elasticsearch might just drop search from their name. With Elasticsearch spearheading the expansion of search into analytics and all sorts of content and data driven applications, such a change made sense to us. What the name would be, however, we had no idea – ElasticStash, KibanElastic StashElasticLog – none of these really rolled off the tongue like a proper brand.

More surprising is the Elasticsearch move into the cloud space by acquiring Found. A big and heartfelt congratulations to our Norwegian colleagues from us at Comperio. Found has built and delivered an innovative and solid product and we look forward to seeing them build something even better as a part of Elastic.

Elasticsearch is renamed to Elastic, and Found is no longer just Found, but Found by Elastic. The opening keynote held by CEO Steven Schuurman and Shay Banon was a tour of triumph through the history of Elastic, detailing how the company has grown in an organic, natural manner into what it is today. Kibana and Logstash started as separate projects but were soon integrated into Elastic. Shay and Steven explained how old roadmaps for the development of Elastic included plans to create CloudES, search as a cloud service. CloudES was never created, due to all the other pressing issues. Simultaneously, the Norwegian company Found made great strides with their cloud search offering, and an acquisition became a very natural fit.

Elastic{ON} is the first conference devoted entirely to the Elastic family of products. The sessions consist on one hand of presentations by developers and employees of Elastic; on the other there is “ELK in the wild”, showcasing customer use cases, including Verizon, GitHub, Facebook and more.

On day one the sessions about core elasticsearch, Lucene, Kibana and Logstash were of particular interest to us.

Elasticsearch

The session about “Recent developments in elasticsearch 2.0” held by Clinton Gormley and Simon Willnauer revealed a host of interesting new features in the upcoming 2.0 release. There is a very high focus on stability, and on making sure that releases do not contain bugs. To illustrate this, Clinton showed graphs comparing the number of lines of code to lines of tests, where the latter was rising sharply in the latest releases. It was also interesting to note that the number of lines of code has been reduced recently due to refactoring and other improvements to the code base.

Among the interesting new features are a new “reducer” step for aggregations, allowing calculations to be done on top of aggregated results, and a Changes API which helps manage changes to the index. The Changes API will be central in creating other features, for example update by query. A typical use case involves logging search results, where the Changes API will allow updating information about click activity in the same log entry as the one containing the query.

There will also be a Reindex API that simplifies the development cycle when you have to refeed an entire index because you need to change a mapping or field type.

Kibana

Rashid Khan went through the motivations behind the development of Kibana 4, where support for aggregations and a product that is easier to work with and extend make it a fitting platform for building data visualization tools. This was followed by “The Contributor’s Guide to the Kibana Galaxy” by Spencer Alger, who demoed how to set up the development environment for Kibana 4 using npm, grunt and bower, the web development standard toolset of today (or was it yesterday?).

Logstash

Logstash creator Jordan Sissel presented the new features of Logstash 1.5, and what to expect in future versions. 1.5 introduces a new plugin system, and to the great relief of all Windows users out there, the issues regarding file locking on rolling log files have been resolved! The roadmap also aims to vastly improve the reliability of Logstash: no more losing documents in planned or unplanned outages. In addition there are plans to add event persistence and various API management tools. As a consequence of the river technology being deprecated, Logstash will take the role of the document processing framework that those of us who come from FAST ESP have missed for some time now. So in effect, all rivers (including JDBC) will be ported to Logstash.

Aggregations

Mark Harwood presented a novel take on optimizing index creation for aggregations in the session “Building Entity Centric Indexes”. You may have tried to run some fancy aggregations, only to have elasticsearch die from out of memory errors. Avoiding this often takes some insight into the architecture to structure your aggregations in the best possible manner. Mark essentially showed how to move some of the aggregation to indexing time rather than query time. The original use case was a customer who needed to know the average session length for the users of his website. Figuring that out involved running through the whole index, sorting by session id, picking the timestamp of the first item and subtracting from the second, a lot of operations with an enormous consumption of resources. Mark approaches the problems in a creative and mathematical manner, and it is always inspiring to attend his presentations. It will be interesting to see whether the Changes API mentioned above will deliver functionality that can be used to improve aggregated data.
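The general idea of moving aggregation work to indexing time can be sketched in plain Python. This is not Mark's implementation and does not use any Elasticsearch API; the event log, the field names and the assumption that session length is the span between the first and last event of a session are all made up for illustration.

```python
# Hypothetical event log: (session_id, unix_timestamp) pairs in arbitrary order.
events = [
    ("s1", 100), ("s2", 105), ("s1", 160),
    ("s2", 110), ("s1", 130), ("s3", 200),
]


def build_session_entities(events):
    """Fold raw events into one summary record per session (the 'entity')."""
    sessions = {}
    for session_id, ts in events:
        entity = sessions.setdefault(
            session_id, {"first": ts, "last": ts, "events": 0}
        )
        entity["first"] = min(entity["first"], ts)
        entity["last"] = max(entity["last"], ts)
        entity["events"] += 1
    return sessions


def average_session_length(sessions):
    # Assumes session length = last timestamp minus first timestamp.
    durations = [s["last"] - s["first"] for s in sessions.values()]
    return sum(durations) / len(durations) if durations else 0.0


sessions = build_session_entities(events)
print(average_session_length(sessions))
```

Once the per-session records exist, the average becomes a cheap computation over a much smaller set of documents instead of a sort over every raw event.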

.NET

The deep dive into the .NET clients with Martijn Laarman showed how to use a strongly typed language such as C# with elasticsearch. Yes, it is actually possible, and it looked very good. There is a low-level client that just connects to the API, where you have to do all the parsing yourself, and a high-level client called NEST that builds on top of it, offering a strongly typed query DSL with an almost 1 to 1 mapping to the elasticsearch DSL. Particularly nifty was the covariant result handling, where you can specify the type of results you need back, considering a search result from elasticsearch can contain many types.

Looking forward to day 2!