<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; Elasticsearch</title>
	<atom:link href="http://blog.comperiosearch.com/blog/category/elasticsearch/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Experimenting with Open Source Web Crawlers</title>
		<link>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/#comments</comments>
		<pubDate>Fri, 29 Apr 2016 11:03:42 +0000</pubDate>
		<dc:creator><![CDATA[Mridu Agarwal]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OpenWebSpider]]></category>
		<category><![CDATA[Scrapy]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4080</guid>
		<description><![CDATA[Whether you want to do market research or gather financial risk information or just get news about your favorite footballer from various news sites, web scraping has many uses. In my quest to learn more about web crawling and scraping, I decided to test a couple of Open Source Web Crawlers which were not [...]]]></description>
				<content:encoded><![CDATA[<p lang="en-US">Whether you want to do market research, gather financial risk information or just get news about your favorite footballer from various news sites, web scraping has many uses.</p>
<p lang="en-US">In my quest to learn know more about web crawling and scraping , I decided to test couple of Open Source Web Crawlers which were not only easily available but quite powerful as well. In this article I am mostly going to cover their basic features and how easy they are to start with.</p>
<p lang="en-US">If you are like one of those persons who likes to quickly get started while learning something, I would suggest that you try <a href="http://www.openwebspider.org/">OpenWebSpider</a> first.</p>
<p lang="en-US">It is a simple web browser based open source crawler and search engine which is simple to install and use and is very good for those who are trying to get acquainted to web crawling . It stores webpages in MySql or MongoDb. I used MySql for my testing purpose. You can follow the steps <a href="http://www.openwebspider.org/documentation/openwebspider-js/">here</a> to install it. It&#8217;s pretty simple and basic.</p>
<p lang="en-US">So, once you have installed everything , you just need to open a web-browser at <a href="http://127.0.0.1:9999/">http://127.0.0.1:9999/</a> and you are ready to crawl and search. Just check your database settings, type the Url of the site you want to crawl and within couple of minutes, you have all the data you need. You can even search it going to the search tab and typing in your query. Whoa! That was quick and compact and needless to say you don’t need any programming skills to crawl it.</p>
<p lang="en-US">If you are trying to create an off-line copy of your data or your very own mini Wikipedia, I think go for this as it’s the easiest way to do it.</p>
<p lang="en-US">Following are some screen shots:</p>
<p lang="en-US"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS1.png"><img class="alignleft wp-image-4083 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS1.png" alt="OpenWebSpider" width="613" height="438" /></a></p>
<p lang="en-US"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS2.png"><img class="alignleft wp-image-4086 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS2.png" alt="OpenSearchWeb" width="611" height="441" /></a></p>
<p lang="en-US" style="text-align: left"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS3.png"><img class="alignleft size-full wp-image-4087" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS3.png" alt="OpenSearchWeb" width="611" height="441" /></a></p>
<p lang="en-US" style="text-align: left">You can also see the this Search engine demo <a href="http://lab.openwebspider.org/search_engine/">here</a>, before actually getting started.</p>
<p lang="en-US" style="text-align: left">Ok, after getting my hands on into web crawling, I was curious to do  more sophisticated stuff like extracting topics from a web site where I do not have any RSS feed or API. Extracting this structured data could be quite important to many business scenarios where you are trying to follow competitor&#8217;s product news or gather data for business intelligence. I decided to use <a href="http://scrapy.org/">Scrapy</a> for this experiment.</p>
<p lang="en-US" style="text-align: left">The good thing about Scrapy is that it is not only fast and simple, but very extensible as well. While installing it on my windows environment, I had few hiccups mainly because of the different compatible version of python but in the end, once you get it, it&#8217;s very simple(Isn&#8217;t that how you feel anyways , once things works ? Anyways, forget it! :D). Follow these links, if you are having trouble installing Scrapy like me:</p>
<p lang="en-US" style="text-align: left"><a href="https://github.com/scrapy/scrapy/wiki/How-to-Install-Scrapy-0.14-in-a-64-bit-Windows-7-Environment">https://github.com/scrapy/scrapy/wiki/How-to-Install-Scrapy-0.14-in-a-64-bit-Windows-7-Environment</a></p>
<p lang="en-US" style="text-align: left"><a href="http://doc.scrapy.org/en/latest/intro/install.html#intro-install">http://doc.scrapy.org/en/latest/intro/install.html#intro-install</a></p>
<p lang="en-US" style="text-align: left">After installing, you need to create a Scrapy project. Since we are doing more customized stuff than just crawling the entire website, this requires more effort and knowledge of programming skills and sometime browser tools to understand the HTML DOM. You can follow <a href="http://doc.scrapy.org/en/latest/intro/overview.html">this</a> link to get started with you first Scrapy project .Once you have crawled the data that you need, it would be interesting to feed this data into a search engine. I have also been looking for open source web crawlers for Elastic Search and this looked like the perfect opportunity. Scrapy provides integration with Elastic Search out of the box , which is awesome. You just need to install the Elastic Search module for Scrapy(of course Elastic Search should be running somewhere) and configure the Item Pipeline for Scrapy. Follow <a href="http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html">this</a> link for the step by step guide. Once done, you have the fully integrated crawler and search system!</p>
<p lang="en-US" style="text-align: left">I crawled <a href="http://primehealthchannel.com">http://primehealthchannel.com</a> and created an index named &#8220;healthitems&#8221; in Scrapy.</p>
<p lang="en-US" style="text-align: left">To search the elastic search index, I am using Chrome extension <span style="font-weight: bold">Sense</span> to send queries to Elastic Search, and this is how it looks</p>
<p lang="en-US" style="text-align: left">GET /scrapy/healthitems/_search</p>
<p style="text-align: left"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/ES1.png"><img class="alignleft wp-image-4082 size-large" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/ES1-1024x597.png" alt="Elastic Search" width="1024" height="597" /></a></p>
<p lang="en-US" style="text-align: left">I hope you had fun reading this and now wants to try some of your own cool ideas . Do let us know how you used it and which crawler you like the most!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analysing Solr logs with Logstash</title>
		<link>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/#comments</comments>
		<pubDate>Sun, 20 Sep 2015 22:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[grok]]></category>
		<category><![CDATA[logs]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3934</guid>
		<description><![CDATA[Analysing Solr logs with Logstash Although I usually write about and work with Apache Solr, I also use the ELK stack on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my previous posts. If you need some more background info on the ELK [...]]]></description>
				<content:encoded><![CDATA[<h1>Analysing Solr logs with Logstash</h1>
<p>Although I usually write about and work with <a href="http://lucene.apache.org/solr/">Apache Solr</a>, I also use the <a href="https://www.elastic.co/downloads">ELK stack</a> on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my <a href="http://blog.comperiosearch.com/blog/author/sebm/">previous posts</a>. If you need some more background info on the ELK stack, both <a href="http://blog.comperiosearch.com/blog/author/cvig/">Christoffer</a> and <a href="http://blog.comperiosearch.com/blog/author/alynum/">André</a> have written many great posts on various ELK subjects. The most common use for the stack is data analysis. In our case, Solr search log analysis.</p>
<p>As a little side note for the truly devoted Solr users, an ELK stack alternative exists with <a href="http://lucidworks.com/fusion/silk/">SiLK</a>. I highly recommend checking out Lucidworks&#8217; various blog posts on <a href="http://lucidworks.com/blog/">Solr and search in general</a>.</p>
<h2>Some background</h2>
<p>On an existing search project I use the ELK stack to ingest, analyse and visualise logs from Comperio&#8217;s search middleware application.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2088/09/search_logs.png"><img class="aligncenter size-medium wp-image-3942" src="http://blog.comperiosearch.com/wp-content/uploads/2088/09/search_logs-300x157.png" alt="Search Logs Dashboard" width="300" height="157" /></a><br />
Although this gave us a great view of user query behaviour, Solr logs a great deal more detailed information. I wanted to capture indexing events, errors and searches with all their parameters, not just the query string.</p>
<h2>Let&#8217;s get started</h2>
<p>I&#8217;m going to assume you already have a running Solr installation. You will, however, need to download <a href="https://www.elastic.co/products/elasticsearch">Elasticsearch</a> and <a href="https://www.elastic.co/products/logstash">Logstash</a> and unpack them. Before we start Elasticsearch, I recommend installing these plugins:</p>
<ul>
<li><a href="http://mobz.github.io/elasticsearch-head/">Head</a></li>
<li><a href="https://www.elastic.co/guide/en/marvel/current/_installation.html">Marvel</a></li>
</ul>
<p>Head is a cluster health monitoring tool. Marvel we&#8217;ll only need for the bundled developer console, Sense. To disable Marvel&#8217;s other capabilities, add this line to ~/elasticsearch/config/elasticsearch.yml:</p><pre class="crayon-plain-tag">marvel.agent.enabled: false</pre><p>Start Elasticsearch with this command:</p><pre class="crayon-plain-tag">~/elasticsearch-[version]/bin/elasticsearch</pre><p>Navigate to <a href="http://localhost:9200/">http://localhost:9200/</a> to confirm that Elasticsearch is running. Check <a href="http://localhost:9200/_plugin/head">http://localhost:9200/_plugin/head</a> and <a href="http://localhost:9200/_plugin/marvel/sense/index.html">http://localhost:9200/_plugin/marvel/sense/index.html</a> to verify the plugins installed correctly.</p>
<h2>The anatomy of a Logstash config</h2>
<hr />
<h3>Update 21/09/15</h3>
<p>I have since greatly simplified the multiline portions of the Logstash configs. Use instead this filter section: <script src="https://gist.github.com/41ca2c34c50d0d9d8e82.js?file=solr-filter.conf"></script>The rest of the original article contents are unchanged for comparison&#8217;s sake.</p>
<hr />
<p>All Logstash configs share three main building blocks. It starts with the Input stage, which defines what the data source is and how to access it. Next is the Filter stage, which carries out data processing and extraction. Finally, the Output stage tells Logstash where to send the processed data. Let&#8217;s start with the basics, the input and output stages:</p><pre class="crayon-plain-tag">input {
  file {
    path =&gt; "~/solr.log"
  }
}

filter {}

output {
  # Send directly to local Elasticsearch
  elasticsearch_http {
    host =&gt; "localhost"
    template =&gt; "~/logstash/bin/logstash_solr_template.json"
    index =&gt; "solr-%{+YYYY.MM.dd}"
    template_overwrite =&gt; true
  }
}</pre><p>This is one of the simpler input/output configs. We read a file at a given location and stream its raw contents to an Elasticsearch instance. Take a look at the <a href="https://www.elastic.co/guide/en/logstash/current/input-plugins.html">input</a> and <a href="https://www.elastic.co/guide/en/logstash/current/output-plugins.html">output</a> plugins&#8217; documentation for more details and default values. The index setting causes Logstash to create a new index every day with a name generated from the provided pattern. The template option tells Logstash what kind of field mapping and settings to use when creating the Elasticsearch indices. You can find the template file I used <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-template-json">here</a>.</p>
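<p>As an aside on that index pattern: <code>solr-%{+YYYY.MM.dd}</code> is a sprintf-style reference to each event&#8217;s @timestamp, so every event is routed to an index named for its date. A tiny Python sketch of the naming logic (illustrative only, not Logstash code):</p>

```python
from datetime import datetime, timezone

def daily_index(event_time, prefix="solr-"):
    # Mimics Logstash's "solr-%{+YYYY.MM.dd}": each event goes to an
    # index named for the date of its @timestamp field.
    return prefix + event_time.strftime("%Y.%m.%d")

daily_index(datetime(2015, 9, 7, 15, 40, 34, tzinfo=timezone.utc))
# 'solr-2015.09.07'
```

<p>Daily indices make it cheap to expire old log data: you drop whole indices rather than deleting individual documents.</p>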
<p>To process the Solr logs, we&#8217;ll use the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html">grok</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html">mutate</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-multiline.html">multiline</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-drop.html">drop</a> and <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html">kv</a> filter plugins.</p>
<ul>
<li>Grok is a regexp based parsing stage primarily used to match strings and extract parts. There are a number of default patterns described on the grok documentation page. While building your grok expressions, the <a href="https://grokdebug.herokuapp.com/">grok debugger app</a> is particularly helpful. Be mindful though that some of the escaping syntax isn&#8217;t always the same in the app as what the Logstash config expects.</li>
<li>We need the multiline plugin to link stacktraces to their initial error message.</li>
<li>The kv, aka key value, plugin will help us extract the parameters from Solr indexing and search events.</li>
<li>We use mutate to add and remove tags along the way.</li>
<li>And finally, drop to drop any events we don&#8217;t want to keep.</li>
</ul>
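<p>Of these, the kv stage is the easiest to picture. Conceptually it does something like the following Python sketch (a simplification, not the plugin&#8217;s implementation):</p>

```python
def kv_split(params, field_split="&", value_split="="):
    # Conceptually what the kv filter does: split a raw params string
    # into individual key/value fields on the event.
    fields = {}
    for pair in params.split(field_split):
        if value_split in pair:
            key, value = pair.split(value_split, 1)
            fields[key] = value
    return fields

kv_split("literal.source=epifile&literal.id=epifile_211027")
# {'literal.source': 'epifile', 'literal.id': 'epifile_211027'}
```
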
<h2>The <del>hard</del> fun part</h2>
<p>Let&#8217;s dive into the filter stage now. Take a look at the <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-logstash-conf">config file</a> I&#8217;m using. The Grok patterns may appear a bit daunting, especially if you&#8217;re not very familiar with regexps and the default Grok patterns, but don&#8217;t worry! Let&#8217;s break it down.</p>
<p>The first section extracts the log event&#8217;s severity and timestamp into their own fields, &#8216;level&#8217; and &#8216;LogTime&#8217;:</p><pre class="crayon-plain-tag">grok {
    match =&gt; { "message" =&gt; "%{WORD:level}.+?- %{DATA:LogTime};" }
    tag_on_failure =&gt; []
}</pre><p>So, given this line from my <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-log">example log file</a>:</p><pre class="crayon-plain-tag">INFO  - 2015-09-07 15:40:34.535; org.apache.solr.update.processor.LogUpdateProcessor; [sintef_main] webapp=/ path=/update/extract params={literal.source=epifile&amp;literal.epi_file_title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.id=epifile_211278&amp;literal.epifileid_s=211278&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;literal.filesource_s=SiteFile} {} 0 65</pre><p>We&#8217;d extract</p><pre class="crayon-plain-tag">{ "level": "INFO", "LogTime": "2015-09-07 15:40:34.535"}</pre><p>In the template file I linked earlier, you&#8217;ll notice configuration for the LogTime field. Here we define for Elasticsearch a valid DateTime format. We need to do this so that Kibana recognises the field as one we can use for temporal analyses. Otherwise the only timestamp field we&#8217;d have would contain the time at which the logs were processed and stored in Elasticsearch. Although not a problem in a real-time log analysis system, if you have old logs you want to parse you will need to define this separate timestamp field. As an additional side note, you&#8217;ll notice I use</p><pre class="crayon-plain-tag">tag_on_failure =&gt; []</pre><p>in most of my Grok stages. The default value is &#8220;_grokparsefailure&#8221;, which I don&#8217;t need in a production system. Custom failure and success tags are very helpful for debugging your Logstash configs.</p>
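<p>Under the hood, Grok patterns are just named regular expressions. The same extraction can be sketched in plain Python, with WORD and DATA replaced by rough regex equivalents (an illustration, not Grok&#8217;s exact patterns):</p>

```python
import re

# %{WORD:level}.+?- %{DATA:LogTime}; rewritten as a plain regex:
# WORD ~ \w+ and DATA ~ a lazy .*?, with Grok's named captures
# becoming Python named groups.
pattern = re.compile(r"(?P<level>\w+).+?- (?P<LogTime>.*?);")

line = ("INFO  - 2015-09-07 15:40:34.535; "
        "org.apache.solr.update.processor.LogUpdateProcessor; ...")
m = pattern.match(line)
m.groupdict()
# {'level': 'INFO', 'LogTime': '2015-09-07 15:40:34.535'}
```
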
<p>The next little section combines commit messages into a single event. The first entry in the example log file is such a commit message, split over three lines.</p><pre class="crayon-plain-tag"># Combine commit events into single message
  multiline {
      pattern =&gt; "^\t(commit\{)"
      what =&gt; "previous"
  }</pre><p>Now we come to a major section for handling general INFO level messages.</p><pre class="crayon-plain-tag"># INFO level events treated differently than ERROR
  if "INFO" in [level] {
    grok {
      match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\}.+?\{%{WORD:action}=\[%{DATA:docId}"
      }
      tag_on_failure =&gt; []
    }
    if [params] {
      kv {
        field_split =&gt; "&amp;"
        source =&gt; "params"
      }
    } else {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?commits"  
        }
        tag_on_failure =&gt; [ "drop" ]
        add_field =&gt; {
          "action" =&gt; "commit"
        }
      }
      if "drop" in [tags] {
        drop {}
      }
    }
  }</pre><p>This filter will only run on INFO level messages, due to the conditional at its beginning. The first Grok stage matches log events similar to the one above. The key fields we extract are the Solr collection/core, the endpoint we hit (e.g. update/extract), the parameters supplied by the HTTP request, the action (e.g. add or delete) and finally the document ID. If the Grok succeeded in extracting a params field, we run the key value stage, splitting on ampersands to extract each HTTP parameter. This is what a resulting document&#8217;s extracted contents look like when stored in Elasticsearch:</p><pre class="crayon-plain-tag">{
  "level": "INFO",
  "LogTime": "2015-09-07 15:40:18.938",
  "collection": "sintef_main",
  "endpoint": "/update/extract",
  "params":     "literal.source=epifile&amp;literal.epi_file_title=A05100_Tass5+Trondheim.pdf&amp;literal.title=A05100_Tass5+Trondheim.pdf&amp;literal.id=epifile_211027&amp;literal.epifileid_s=211027&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;literal.filesource_s=SiteFile",
  "action": "add",
  "docId": "epifile_211027",
  "version": "1511661994131849216",
  "literal.source": "epifile",
  "literal.epi_file_title": "A05100_Tass5+Trondheim.pdf",
  "literal.title": "A05100_Tass5+Trondheim.pdf",
  "literal.id": "epifile_211027",
  "literal.epifileid_s": "211027",
  "literal.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "stream.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "literal.filesource_s": "SiteFile"
}</pre><p>If the Grok did not extract a params field, I want to identify possible commit messages with the following Grok. If this one fails we tag messages with &#8220;drop&#8221;. Finally, any messages tagged with &#8220;drop&#8221; are dropped from the pipeline. I specifically created these Grok patterns to match indexing and commit messages as I already track queries at the middleware layer in our stack. If you want to track queries at the Solr level, simply use this pattern:</p><pre class="crayon-plain-tag">.+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\} hits=%{INT:hits} status=%{INT:status} QTime=%{INT:queryTime}</pre><p>The next section handles ERROR level messages:</p><pre class="crayon-plain-tag"># Error event implies stack trace, which requires multiline parsing
  if "ERROR" in [level] {
    multiline {
      pattern =&gt; "^\s"
      what =&gt; "previous"
      add_tag =&gt; [ "multiline_pre" ]
    }
    multiline {
        pattern =&gt; "^Caused by"
        what =&gt; "previous"
        add_tag =&gt; [ "multiline_post" ]
    }
    if "multiline_post" in [tags] {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+%{DATA:reason}(\n\t)((.+?Caused by: ((([a-zA-Z]+(\.|;|:))+) )+)%{DATA:reason}(\n\t))+"
        }
        tag_on_failure =&gt; []
      }
    }
  }</pre><p>Given a stack trace (there are a few in the example log file), this stage first combines all the lines of the stack trace into a single message. It then extracts the first and the last causes, the assumption being that the first message is the high-level failure message and the last one the actual underlying cause.</p>
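<p>The folding behaviour of these multiline stages can be sketched roughly like this (a simplification in Python, not the plugin&#8217;s code): continuation lines that start with whitespace or &#8220;Caused by&#8221; are appended to the previous event, which is what <code>what =&gt; "previous"</code> expresses.</p>

```python
def join_stacktrace_lines(lines):
    # Roughly what the two multiline stages do for ERROR events:
    # lines starting with whitespace or "Caused by" belong to the
    # previous event ("what => previous").
    events = []
    for line in lines:
        if events and (line[:1] in (" ", "\t") or line.startswith("Caused by")):
            events[-1] += "\n" + line
        else:
            events.append(line)
    return events

log = [
    "ERROR - 2015-09-07 15:41:02.100; org.apache.solr...; Request failed",
    "\tat org.apache.solr.core.SolrCore.execute(SolrCore.java:1966)",
    "Caused by: java.io.IOException: connection reset",
    "INFO  - 2015-09-07 15:42:00.000; next event",
]
len(join_stacktrace_lines(log))
# 2
```
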
<p>Finally, I drop any empty lines and clean up temporary tags:</p><pre class="crayon-plain-tag"># Remove intermediate tags, and multiline added randomly by multiline stage
  mutate {
      remove_tag =&gt; [ "multiline_pre", "multiline_post", "multiline" ]
  }
  # Drop empty lines
  if [message] =~ /^\s*$/ {
    drop {}
  }</pre><p>To check you have successfully processed your Solr logs, open up the Sense plugin and run this query:</p><pre class="crayon-plain-tag"># aggregate on level
GET solr-*/_search
{
  "query": {
    "match_all": {}
  },
  "size": 10,
  "aggs": {
    "action": {
      "terms": {
        "field": "level",
        "size": 10
      }
    }
  }
}</pre><p>You should get back all your processed log events along with an aggregation on event severity.</p>
<h2>Conclusion</h2>
<p>Solr logs contain a great deal of useful information. With the ELK stack you can extract, store, analyse and visualise this data. I hope I&#8217;ve given you some helpful tips on how to start doing so! If you run into any problems, please get in touch in the comments below.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Elasticsearch calculates significant terms</title>
		<link>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/#comments</comments>
		<pubDate>Wed, 10 Jun 2015 11:02:28 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[aggregations]]></category>
		<category><![CDATA[lexical analysis]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[significant terms]]></category>
		<category><![CDATA[word analysis]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3785</guid>
		<description><![CDATA[Many of you who use Elasticsearch may have used the significant terms aggregation and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tend to be kept rather vague, however, and couched in terms like &#8220;magic&#8221; and the uncommonly common. This is unfortunate since developing informative [...]]]></description>
				<content:encoded><![CDATA[<div id="attachment_3823" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/uncommonlycommon.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/uncommonlycommon-300x187.png" alt="The &quot;unvommonly common&quot;" width="300" height="187" class="size-medium wp-image-3823" /></a><p class="wp-caption-text">The magic of the &#8220;uncommonly common&#8221;.</p></div>
<p>Many of you who use Elasticsearch may have used the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html" title="significant terms">significant terms aggregation</a> and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tend to be kept rather vague, however, and couched in terms like &#8220;magic&#8221; and the &#8220;uncommonly common&#8221;. This is unfortunate, since developing informative analyses based on this aggregation requires some adaptation to the underlying documents, especially in the face of less structured text. Significant terms seems especially susceptible to garbage in &#8211; garbage out effects, and developing a robust analysis requires some understanding of the underlying data. In this blog post we will take a look at the default relevance score used by the significant terms aggregation, the mysteriously named JLH score, as it is implemented in Elasticsearch 1.5. This score was developed especially for this aggregation, and experience shows that it tends to be the most effective one available in Elasticsearch at this point.</p>
<p>The JLH relevance scoring function is not given in the documentation. A quick dive into the code, however, reveals the following scoring function:</p>
<img src='http://s0.wp.com/latex.php?latex=++JLH+%3D+%5Cleft%5C%7B%5Cbegin%7Bmatrix%7D++%28p_%7Bfore%7D+-+p_%7Bback%7D%29%5Cfrac%7Bp_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+%26+p_%7Bfore%7D+-+p_%7Bback%7D+%3E+0+%5C%5C++0++%26+elsewhere++%5Cend%7Bmatrix%7D%5Cright.++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  JLH = \left\{\begin{matrix}  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} &amp; p_{fore} - p_{back} &gt; 0 \\  0  &amp; elsewhere  \end{matrix}\right.  ' title='  JLH = \left\{\begin{matrix}  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} &amp; p_{fore} - p_{back} &gt; 0 \\  0  &amp; elsewhere  \end{matrix}\right.  ' class='latex' />
<p>Here the <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> is the frequency of the term in the foreground (or query) document set, while <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is the term frequency in the background document set which by default is the whole index.</p>
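<p>Written out as code, the piecewise definition above is simply the following (a direct transcription of the formula, not Elasticsearch&#8217;s source):</p>

```python
def jlh(p_fore, p_back):
    # (p_fore - p_back) * (p_fore / p_back) when the term is more
    # frequent in the foreground set, 0 elsewhere.
    if p_back == 0 or p_fore <= p_back:
        return 0.0
    return (p_fore - p_back) * (p_fore / p_back)

jlh(0.10, 0.01)  # a term ten times more frequent in the foreground
# 0.9 (approximately)
```

<p>The score rewards both the absolute gain in frequency (the difference) and the relative gain (the ratio), which is what makes rare-but-concentrated terms stand out.</p>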
<p>Expanding the formula gives us the following which is quadratic in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />.</p>
<img src='http://s0.wp.com/latex.php?latex=++%28p_%7Bfore%7D+-+p_%7Bback%7D%29%5Cfrac%7Bp_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+%3D+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+-+p_%7Bfore%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}  ' title='  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}  ' class='latex' />
<p>By keeping <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> fixed and keeping in mind that both it and <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> are positive, we get the following function plot. Note that <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is unnaturally large for illustration purposes.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-pb-fixed.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-pb-fixed-300x206.png" alt="JLH-pb-fixed" width="300" height="206" class="alignnone size-medium wp-image-3792"></a></p>
<p>On the face of it this looks bad for a scoring function. The sign change can be undesirable, but more troublesome is the fact that the function is not monotonically increasing.</p>
<p>The gradient of the function:</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cnabla+JLH%28p_%7Bfore%7D%2C+p_%7Bback%7D%29+%3D+%5Cleft%28%5Cfrac%7B2+p_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+-+1+%2C+-%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%5E2%7D%5Cright%29++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back}} - 1 , -\frac{p_{fore}^2}{p_{back}^2}\right)  ' title='  \nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back}} - 1 , -\frac{p_{fore}^2}{p_{back}^2}\right)  ' class='latex' />
<p>Setting the gradient to zero, we see by looking at the second coordinate that the JLH does not have a minimum, but approaches one as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> approach zero, where the function is undefined. While the second coordinate is always negative, the first coordinate shows us where the function is not increasing.</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++%5Cfrac%7B2+p_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D++-+1+%26+%3C+0+%5C%5C++p_%7Bfore%7D+%26+%3C+%5Cfrac%7B1%7D%7B2%7Dp_%7Bback%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  \frac{2 p_{fore}}{p_{back}}  - 1 &amp; &lt; 0 \\  p_{fore} &amp; &lt; \frac{1}{2}p_{back}  \end{aligned}  ' title='  \begin{aligned}  \frac{2 p_{fore}}{p_{back}}  - 1 &amp; &lt; 0 \\  p_{fore} &amp; &lt; \frac{1}{2}p_{back}  \end{aligned}  ' class='latex' />
<p>Fortunately, the decreasing part of the function lies in the region where <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D+-+p_%7Bback%7D+%3C+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore} - p_{back} &lt; 0' title='p_{fore} - p_{back} &lt; 0' class='latex' /> and the JLH score is explicitly defined as zero. By the symmetry of the quadratic around its minimum at <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7Dp_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\frac{1}{2}p_{back}' title='\frac{1}{2}p_{back}' class='latex' />, where the first coordinate of the gradient vanishes, we also see that the entire area where the score is below zero falls within this region.</p>
<p>With this in mind, it seems sensible to drop the linear term of the JLH score and keep only the quadratic part. This results in the same ranking, with a slightly less steep increase in score as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> increases.</p>
<img src='http://s0.wp.com/latex.php?latex=++JLH_%7Bmod%7D+%3D+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  JLH_{mod} = \frac{p_{fore}^2}{p_{back}}  ' title='  JLH_{mod} = \frac{p_{fore}^2}{p_{back}}  ' class='latex' />
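<p>To make the two variants concrete, here is a minimal Python sketch of both scores (my own illustration, not Elasticsearch&#8217;s implementation; it also applies the rule that terms which are not more frequent in the foreground score zero):</p>

```python
def jlh(p_fore, p_back):
    """Original JLH score: (p_fore - p_back) * (p_fore / p_back),
    which expands to p_fore^2 / p_back - p_fore. Terms whose
    foreground frequency does not exceed the background score zero."""
    if p_back <= 0 or p_fore <= p_back:
        return 0.0
    return (p_fore - p_back) * (p_fore / p_back)

def jlh_mod(p_fore, p_back):
    """Simplified score: only the quadratic part p_fore^2 / p_back."""
    if p_back <= 0 or p_fore <= p_back:
        return 0.0
    return p_fore ** 2 / p_back

# Sample (p_fore, p_back) pairs for three candidate terms:
pairs = [(0.10, 0.01), (0.20, 0.05), (0.30, 0.15)]
```

For these sample frequencies both functions order the three terms the same way, with the score dominated by how far the foreground frequency outruns the background.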
<p>Looking at the level sets of the JLH score, there is a quadratic relationship between <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />. Solving for a fixed level <img src='http://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> we get:</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++JLH+%3D+%26+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+-+p_%7Bfore%7D+%3D+k+%5C%5C+++%26+p_%7Bfore%7D%5E2+-+p_%7Bback%7D%5Ccdot+p_%7Bfore%7D+-+k%5Ccdot+p_%7Bback%7D++%3D+0+%5C%5C+++%26+p_%7Bfore%7D+%3D+%5Cfrac%7Bp_%7Bback%7D%7D%7B2%7D+%5Cpm+%5Cfrac%7B%5Csqrt%7Bp_%7Bback%7D%5E2+%2B+4+%5Ccdot+k+%5Ccdot+p_%7Bback%7D%7D%7D%7B2%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\   &amp; p_{fore}^2 - p_{back}\cdot p_{fore} - k\cdot p_{back}  = 0 \\   &amp; p_{fore} = \frac{p_{back}}{2} \pm \frac{\sqrt{p_{back}^2 + 4 \cdot k \cdot p_{back}}}{2}  \end{aligned}  ' title='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\   &amp; p_{fore}^2 - p_{back}\cdot p_{fore} - k\cdot p_{back}  = 0 \\   &amp; p_{fore} = \frac{p_{back}}{2} \pm \frac{\sqrt{p_{back}^2 + 4 \cdot k \cdot p_{back}}}{2}  \end{aligned}  ' class='latex' />
<p>The negative root lies outside the function&#8217;s domain.<br />
This is far easier to see in the simplified formula.</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++JLH+%3D+%26+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+%3D+k+%5C%5C+++%26+p_%7Bfore%7D+%3D+%5Csqrt%7Bk+%5Ccdot+p_%7Bback%7D%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} = k \\   &amp; p_{fore} = \sqrt{k \cdot p_{back}}  \end{aligned}  ' title='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} = k \\   &amp; p_{fore} = \sqrt{k \cdot p_{back}}  \end{aligned}  ' class='latex' />
<p>An increase in <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> must be offset by approximately a square root increase in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> to  retain the same score.</p>
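<p>A quick numerical check of this square-root trade-off, as a small Python sketch with illustrative numbers (using the simplified score):</p>

```python
from math import sqrt, isclose

def jlh_mod(p_fore, p_back):
    # Simplified JLH score: just the quadratic part p_fore^2 / p_back
    return p_fore ** 2 / p_back

# Pick the level set running through (p_fore, p_back) = (0.10, 0.01):
k = jlh_mod(0.10, 0.01)

# Quadrupling the background frequency only requires doubling the
# foreground frequency (a square-root scaled increase) to stay on
# the same level set:
same_level = isclose(jlh_mod(0.20, 0.04), k)

# Equivalently, p_fore = sqrt(k * p_back) recovers the foreground frequency:
recovered = sqrt(k * 0.01)
```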
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-contour.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-contour-300x209.png" alt="JLH-contour" width="300" height="209" class="alignnone size-medium wp-image-3791"></a></p>
<p>As we can see, the score increases sharply, growing quadratically in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> relative to <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />. As <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> becomes small compared to <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />, the growth goes from linear in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> to quadratic.</p>
<p>Finally, a 3D plot of the score function.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-3d.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-3d-300x203.png" alt="JLH-3d" width="300" height="203" class="alignnone size-medium wp-image-3790"></a></p>
<p>So what can we take away from all this? I think the main practical consideration is the squared relationship between <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />, which means that once there is a significant difference between the two, <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> will dominate the score ranking. The <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> factor primarily makes the score sensitive when it is small, and for reasonably similar <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> values, <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> decides the ranking. There are some obvious consequences of this which would be interesting to explore in real data. First, you would want a large background document set if you want more fine-grained sensitivity to background frequency. Second, foreground frequencies can dominate the score to such an extent that peculiarities of the implementation may show up in the significant terms ranking, which we will look at in more detail as we try to apply the significant terms aggregation to single documents.</p>
<p>The results and visualizations in this blog post are also available as an <a href="https://github.com/andrely/ipython-notebooks/blob/master/JLH%20score%20characteristics.ipynb" title="JLH score characteristics">IPython notebook</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Analyzing web server logs with Elasticsearch in the cloud</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/#comments</comments>
		<pubDate>Tue, 26 May 2015 21:12:34 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[found by elastic]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3702</guid>
		<description><![CDATA[Using Logstash and Kibana on Found by Elastic, Part 1 This is part one of a two post blog series, aiming to demonstrate how to feed logs from IIS into Elasticsearch and Kibana via Logstash, using the hosted services provided by Found by Elastic. This post will deal with setting up the basic functionality and [...]]]></description>
				<content:encoded><![CDATA[<h2>Using Logstash and Kibana on Found by Elastic, Part 1</h2>
<p>This is part one of a two post blog series, aiming to demonstrate how to feed logs from IIS into Elasticsearch and Kibana via Logstash, using the hosted services provided by Found by Elastic. This post will deal with setting up the basic functionality and securing connections. Part 2 will show how to configure Logstash to read from IIS log files, and how to use Kibana 4 to visualize web traffic. Originally published on the <a href="https://www.found.no/foundation/analyzing-weblogs-with-elasticsearch/">Elastic Blog</a><br />
<span id="more-3702"></span></p>
<h4>Getting the Bits</h4>
<p>For this demo I will be running Logstash and Kibana from my Windows laptop.<br />
If you want to follow along, download and extract Logstash 1.5.RC4 or later, and Kibana 4.0.2 or later from <a href="https://www.elastic.co/downloads">https://www.elastic.co/downloads</a>.</p>
<h4>Creating an Elasticsearch Cluster</h4>
<p>Creating a new trial cluster in Found is just a matter of logging in and pressing a button. It takes a few seconds until the cluster is ready, and a screen with some basic information on how to connect pops up. We need the address for the HTTPS endpoint, so copy that out.</p>
<h4>Configuring Logstash</h4>
<p>Now, with the brand new SSL connection option in Logstash, connecting to Found is as simple as this Logstash configuration:</p><pre class="crayon-plain-tag">input { stdin{} }

output {
  elasticsearch {
    protocol =&gt; http
    host =&gt; REPLACE_WITH_FOUND_CLUSTER_HOSTNAME
    port =&gt; "9243" # Check the port also
    ssl =&gt; true
  }

  stdout { codec =&gt; rubydebug }
}</pre><p>&nbsp;</p>
<p>Save the file as found.conf</p>
<p>Start up Logstash using</p><pre class="crayon-plain-tag">bin\logstash.bat agent --verbose -f found.conf</pre><p>You should see a message similar to</p><pre class="crayon-plain-tag">Create client to elasticsearch server on `https://....foundcluster.com:9243`: {:level=&gt;:info}</pre><p>Once you see &#8220;Logstash startup completed&#8221;, type in your favorite test term on the terminal. Mine is &#8220;fisk&#8221; so I type that.<br />
You should see output on your screen showing what Logstash intends to pass on to elasticsearch.</p>
<p>We want to make sure this actually hits the cloud, so open a browser window and paste the HTTPS link from before, append <code>/_search</code> to the URL and hit enter.<br />
You should now see the search results from your newly created Elasticsearch cluster, containing the favorite term you just typed in. We have a functioning connection from Logstash on our machine to Elasticsearch in the cloud! Congratulations!</p>
<h4>Configuring Kibana 4</h4>
<p>Kibana 4 comes with a built-in webserver. The configuration is done in a kibana.yml file in the config directory. Connecting to Elasticsearch in the cloud comes down to inserting the address of the Elasticsearch instance.</p><pre class="crayon-plain-tag"># The Elasticsearch instance to use for all your queries.
elasticsearch_url: `https://....foundcluster.com:9243`</pre><p>Of course, we need to verify that this really works, so we open up Kibana on <a href="http://localhost:5601">http://localhost:5601</a>, select the Logstash index template, with the @timestamp data field as suggested, and open up the discover panel. Now, if less than 15 minutes have passed since you inserted your favorite test term in Logstash (previous step), you should see it already. Otherwise, change the date range by clicking on the selector in the top right corner.</p>
<p><img class="alignleft" src="https://raw.githubusercontent.com/babadofar/MyOwnRepo/master/images/kibanatest.png" alt="Kibana test" width="1090"  /></p>
<h4>Locking it down</h4>
<p>Found by Elastic has worked hard to make the previous steps easy. We created an Elasticsearch cluster, fed data into it and displayed it in Kibana in less than 5 minutes. We must have forgotten something!? And yes, of course! Something about security. We made sure to use secure connections with SSL, and the address generated for our cluster contains a 32 character long, randomly generated list of characters, which is pretty hard to guess. Should, however, the address slip out of our hands, hackers could easily delete our entire cluster. And we don’t want that to happen. So let’s see how we can make everything work when we add some basic security measures.</p>
<h4>Access Control Lists</h4>
<p>Found by Elastic has support for access control lists, where you can set up lists of usernames and passwords, with lists of rules that deny/allow access to various paths within Elasticsearch. This makes it easy to create a &#8220;read only&#8221; user, for instance, by creating a user with a rule that only allows access to the <code>/_search</code> path. Found by Elastic has a sample configuration with users searchonly and readwrite. We will use these as a starting point, but first we need to figure out what Kibana needs.</p>
<h4>Kibana 4 Security</h4>
<p>Kibana 4 stores its configuration in a special index, by default named &#8220;.kibana&#8221;. The Kibana webserver needs write access to this index. In addition, all Kibana users need write access to this index, for storing dashboards, visualizations and searches, and read access to all the indices that it will query. More details about the access demands of Kibana 4 can be found on the <a href="http://www.elastic.co/guide/en/shield/current/_shield_with_kibana_4.html">elastic blog</a>.</p>
<p>For this demo, we will simply copy the “readwrite” user from the sample twice, naming one kibanaserver, the other kibanauser. Then we set the access control list in Found:</p><pre class="crayon-plain-tag"># Allow everything for the readwrite, kibanauser and kibanaserver users
- paths: ['.*']
  conditions:
    - basic_auth:
        users:
          - readwrite
          - kibanauser
          - kibanaserver
    - ssl:
        require: true
  action: allow</pre><p>Press save and the changes take effect immediately. Try to reload Kibana at <a href="http://localhost:5601">http://localhost:5601</a>; you should now be denied access.</p>
<p>Open up the kibana.yml file from before and modify it:</p><pre class="crayon-plain-tag"># If your Elasticsearch is protected with basic auth, these are the user credentials
# used by the Kibana server to perform maintenance on the kibana_index at startup. Your Kibana
# users will still need to authenticate with Elasticsearch (which is proxied through
# the Kibana server)
kibana_elasticsearch_username: kibanaserver
kibana_elasticsearch_password: `KIBANASERVER_USER_PASSWORD`</pre><p>Stop and start Kibana for the settings to take effect.<br />
Now when Kibana starts up, you will be presented with a login box for HTTP authentication.<br />
Type in kibanauser as the username, along with its password. You should now again be presented with the Discover screen, showing the previously entered favorite test term. Again, you may have to expand the time range to see your entry.</p>
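<p>For the curious, HTTP Basic authentication is nothing more than a base64-encoded username:password pair in the Authorization header. A small Python sketch (the password here is made up for illustration):</p>

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header value HTTP Basic auth sends:
    'Basic ' + base64('username:password')."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

# Hypothetical password for the kibanauser from the ACL above:
header = basic_auth_header("kibanauser", "s3cret")
```

This is also why SSL matters: without it, the header is trivially decodable by anyone on the wire.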
<h4>Logstash Security</h4>
<p>Logstash will also need to supply credentials when connecting to Found by Elastic. We reuse the permissions from the readwrite user once again, this time under the name &#8220;logstash&#8221;.<br />
It is simply a matter of supplying the username and password in the configuration file.</p><pre class="crayon-plain-tag">output {
  elasticsearch {
    ….
    user =&gt; "logstash"
    password =&gt; `LOGSTASH_USER_PASSWORD`
  }
}</pre><p></p>
<h4>Wrapping it up</h4>
<p>This has been a short dive into Logstash and Kibana with Found by Elastic. The recent changes done in order to support the Shield plugin for Elasticsearch, Logstash and Kibana, make it very easy to use the secure features of Found by Elastic. In the next post we will look into feeding logs from IIS into Elasticsearch via Logstash, and visualizing the most used query terms in Kibana.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to develop Logstash configuration files</title>
		<link>http://blog.comperiosearch.com/blog/2015/04/10/how-to-develop-logstash-configuration-files/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/04/10/how-to-develop-logstash-configuration-files/#comments</comments>
		<pubDate>Fri, 10 Apr 2015 12:06:17 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[elastic]]></category>
		<category><![CDATA[logs]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3471</guid>
		<description><![CDATA[Installing logstash is easy. Problems arrive only once you have to configure it. This post will reveal some of the tricks the ELK team at Comperio has found helpful. Write configuration on the command line using the -e flag If you want to test simple filter configurations, you can enter it straight on the command [...]]]></description>
				<content:encoded><![CDATA[<p>Installing logstash is easy. Problems arrive only once you have to configure it. This post will reveal some of the tricks the ELK team at Comperio has found helpful.</p>
<h4><span id="more-3471"></span>Write configuration on the command line using the -e flag</h4>
<p>If you want to test simple filter configurations, you can enter them straight on the command line using the -e flag.</p><pre class="crayon-plain-tag">bin\logstash.bat  agent  -e 'filter{mutate{add_field =&gt; {"fish" =&gt; "salmon"}}}'</pre><p>After starting logstash with the -e flag, simply type your test input into the console. (The defaults for input and output are stdin and stdout, so you don&#8217;t have to specify them.)</p>
<h4>Test syntax with --configtest</h4>
<p>After modifying the configuration, you can have logstash check the syntax of the file by using the --configtest (or -t) flag on the command line.</p>
<h4>Use stdin and stdout in the config file</h4>
<p>If your filter configurations are more involved, you can use input stdin and output stdout. If you need to pass a json object into logstash, you can specify codec json on the input.</p><pre class="crayon-plain-tag">input { stdin { codec =&gt; json } }

filter {
    if ![clicked] {
        mutate  {
            add_field =&gt; ["clicked", false]
        }
    }
}

output { stdout { codec =&gt; json }}</pre><p></p>
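<p>The same add-a-default-field logic can be sketched in plain Python, which is handy when reasoning about what the filter does to each JSON event (my own illustration, not part of Logstash):</p>

```python
import json

def add_default_clicked(event):
    # Same idea as the conditional above: only add the field when it
    # is missing; an existing value is left untouched.
    if "clicked" not in event:
        event["clicked"] = False
    return event

# One event parsed by the json codec from stdin:
event = add_default_clicked(json.loads('{"@message": "fisk"}'))
```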
<h4> Use output stdout with codec =&gt; rubydebug<img class="alignright size-medium wp-image-3472" src="http://blog.comperiosearch.com/wp-content/uploads/2015/04/rubydebyg-300x106.png" alt="rubydebyg" width="300" height="106" /></h4>
<p>Using the rubydebug codec prints a nicely formatted object to the console.</p>
<h4>Use the --verbose or --debug command line flags</h4>
<p>If you want to see more details regarding what logstash is really doing, start it up using the --verbose or --debug flags. Be aware that this slows down processing speed greatly!</p>
<h4>Send logstash output to a log file.</h4>
<p>Using the -l "logfile.log" command line flag to logstash will store output to a file. Just watch your disk space; in combination with the --verbose or --debug flags these files can become humongous.</p>
<h4>When using file input: delete .sincedb files in your $HOME directory</h4>
<p>The file input plugin stores information about how far logstash has progressed in processing the files in .sincedb files in the user&#8217;s $HOME directory. If you want to re-process your logs, you have to delete these files.</p>
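<p>That cleanup can be scripted; here is a small Python sketch (this assumes the default sincedb location in $HOME; adjust it if you have set sincedb_path explicitly):</p>

```python
import os
from pathlib import Path

def delete_sincedb_files(home=None):
    """Remove Logstash's .sincedb* bookkeeping files so the file input
    re-reads the logs from the beginning on the next run."""
    home = Path(home or os.path.expanduser("~"))
    removed = []
    for sincedb in home.glob(".sincedb*"):
        sincedb.unlink()
        removed.append(sincedb.name)
    return removed
```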
<h4>Use the generator input stage</h4>
<p>You can add text lines you want to run through the filter and output stages directly in the config file by using the generator input plugin.</p><pre class="crayon-plain-tag">input {
  generator{
    lines =&gt; [
      '{"@message":"fisk"}',
      '{"@message": {"fisk":true}}',
      '{"notMessage": {"fisk":true}}',
      '{"@message": {"clicked":true}}'
      ]
    codec =&gt; "json"
    count =&gt; 5
  }
}</pre><p></p>
<h4>Use mutate add_tag after each successful stage.</h4>
<p>If you are developing configuration on a live system, adding tags after each stage makes it easy to find the log events in Kibana/Elasticsearch.</p><pre class="crayon-plain-tag">filter {
  mutate {
    add_tag =&gt; "before conditional"
  }
  if [@message][clicked] {
    mutate {
      add_tag =&gt; "already had it clicked here"
    }
  } else {
      mutate {
        add_field  =&gt; [ "[@message][clicked]", false]
    }
  }
  mutate {
    add_tag =&gt; "after conditional"
  }
}</pre><p></p>
<h4>Developing grok filters with the grok debugger app</h4>
<p>The grok filter comes with a range of prebuilt patterns, but you will find the need to develop your own pretty soon. That&#8217;s when you open your browser to <a title="https://grokdebug.herokuapp.com/" href="https://grokdebug.herokuapp.com/">https://grokdebug.herokuapp.com/</a>. Paste in a representative line from your log, and you can start testing out matching patterns. There is also a discover mode that will try to figure out some fields for you.</p>
<p>The grok constructor, <a title="http://grokconstructor.appspot.com/do/construction" href="http://grokconstructor.appspot.com/do/construction">http://grokconstructor.appspot.com/do/construction</a>  offers an incremental mode, which I have found quite helpful to work with. You can paste in a selection of log lines, and it will offer a range of possibilities you can choose from, trying to match one field at a time.</p>
<h4>SISO</h4>
<p>If possible, pre-format logs so Logstash has less work to do. If you have the option to output logs as valid json, you don&#8217;t need grok filters since all the fields are already there.</p>
<p>&nbsp;</p>
<p>This has been a short runthrough of the tips and tricks we remember having used. If you know any other nice ways to develop Logstash configurations, please comment below.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/04/10/how-to-develop-logstash-configuration-files/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Replacing FAST ESP with Elasticsearch at Posten</title>
		<link>http://blog.comperiosearch.com/blog/2015/03/20/elasticsearch-at-posten/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/03/20/elasticsearch-at-posten/#comments</comments>
		<pubDate>Fri, 20 Mar 2015 10:00:52 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Comperio]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[elastic]]></category>
		<category><![CDATA[fast]]></category>
		<category><![CDATA[geosearch]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[posten]]></category>
		<category><![CDATA[tilbudssok]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3364</guid>
		<description><![CDATA[First, some background A few years ago Comperio launched a nifty service for Posten Norge, Norway&#8217;s postal service. Through the service, retail companies can upload their catalogues and seasonal flyers to make the products listed within searchable. Although the catalogue handling and processing is also very interesting, we&#8217;re going to focus on the search side [...]]]></description>
				<content:encoded><![CDATA[<h2>First, some background</h2>
<p>A few years ago Comperio launched a nifty service for <a title="Posten Norge" href="http://www.posten.no/">Posten Norge</a>, Norway&#8217;s postal service. Through the service, retail companies can upload their catalogues and seasonal flyers to make the products listed within searchable. Although the catalogue handling and processing is also very interesting, we&#8217;re going to focus on the search side of things in this post. As Comperio has a long relationship and a great deal of experience with <a title="FAST ESP" href="http://blog.comperiosearch.com/blog/2012/07/30/comperio-still-likes-fast-esp/">FAST ESP</a>, this first iteration of Posten&#8217;s <a title="Tilbudssok" href="http://tilbudssok.posten.no/">Tilbudssok</a> used it as the search backend. It also incorporated Comperio Front, our search middleware product, which recently <a title="Comperio Front" href="http://blog.comperiosearch.com/blog/2015/02/16/front-5-released/">had a big release</a>.</p>
<h2>Newer is better</h2>
<p>Unfortunately, FAST ESP is getting on a bit and as a result Tilbudssok has been limited by what we can coax out of it. To ensure we provide the best possible search solution we decided it was time to upgrade and chose <a title="Elasticsearch" href="https://www.elastic.co/products">Elasticsearch</a> as the best candidate. If you are unfamiliar with Elasticsearch, take a moment to browse our other <a title="Elasticsearch blog posts" href="http://blog.comperiosearch.com/blog/tag/elasticsearch/">blog posts</a> on the subject. The resulting project had three main requirements:</p>
<ul>
<li>Replace Fast ESP with Elasticsearch while otherwise maintaining as much of the existing architecture as possible</li>
<li>Add geodata to products such that a user could find the nearest store where they were available</li>
<li>Set up sexy log analysis with <a title="Logstash" href="https://www.elastic.co/products/logstash">Logstash</a> and <a title="Kibana" href="https://www.elastic.co/products/kibana">Kibana</a></li>
</ul>
<h2>Data Sources, Ingestion and Processing</h2>
<p>The data source for the search system is a MySQL database populated with catalogue and product data. A separate Comperio system generates this data when Posten&#8217;s customers upload PDFs of their brochures i.e. we also fully own the entire data generation process.</p>
<p>The FAST ESP based solution made use of FAST&#8217;s JDBC connector to feed data directly to the search index. Inspired by <a title="Elasticsearch: Indexing SQL databases. The easy way." href="http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/">Christoffer&#8217;s blog post</a>, we made use of the <a title="Elasticsearch JDBC River Plugin" href="https://github.com/jprante/elasticsearch-river-jdbc">JDBC plugin for Elasticsearch</a>. This allowed us to use the same SQL statements to feed Elasticsearch. It took us no more than a couple of hours, including some time wrestling with field mappings, to populate our Elasticsearch index with the same data as the FAST one.</p>
<p>We then needed to add store geodata to the index. As mentioned earlier, we completely own the data flow. We simply extended our existing catalogue/product uploader system to include a store uploader service. Google&#8217;s <a title="Google Geocoder" href="https://code.google.com/p/geocoder-java/">geocoder</a> handled converting addresses into coordinates for use with Elasticsearch&#8217;s geo distance sorting. We now had store data in our database. An extra JDBC river and another round of mapping wrestling got that same data into the Elasticsearch index.</p>
<h2>Our approach</h2>
<p>Before the conversion to Elasticsearch, the Posten system architecture was typical of most Comperio projects. Users interact with a Java based frontend web application. This in turn sends queries to Comperio&#8217;s search abstraction layer, <a title="Comperio Front" href="http://blog.comperiosearch.com/blog/2015/02/16/front-5-released/">Comperio Front</a>. This formats requests such that the system&#8217;s search engine, in our case FAST ESP, can understand them. Upon receiving a response from the search engine, Front then formats it into a frontend friendly format i.e. JSON or XML depending on developer preference.</p>
<p>&nbsp;</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/tilbudssok_architecture.png"><img class="size-medium wp-image-3422 aligncenter" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/tilbudssok_architecture-300x145.png" alt="Generic Search Architecture" width="300" height="145" /></a></p>
<p>Unfortunately, when we started the project, Front&#8217;s Elasticsearch adapter was still a bit immature. It also felt a bit like overkill to include it when Elasticsearch has such a <a href="http://www.elastic.co/guide/en/elasticsearch/client/java-api/current/">robust Java API</a> already. I saw an opportunity to reduce the system&#8217;s complexity and learn more about interacting with Elasticsearch&#8217;s Java API and took it. With what I learnt, we could later beef up Front&#8217;s Elasticsearch adapter for future projects.</p>
<p>As a side note, we briefly flirted with the idea of replacing the entire frontend with a <a href="http://blog.comperiosearch.com/blog/2013/10/24/instant-search-with-angularjs-and-elasticsearch/">hipstery Javascript/Node.js ecosystem</a>. It was trivial to throw together a working system very quickly but in the interest of maintaining existing architecture and trying to keep project run time down we opted to stick with the existing Java based MVC framework.</p>
<p>After a few rounds of Googling, struggling with documentation and finally simply diving into the code, I was able to piece together the bits of the Elasticsearch Java API puzzle. It is a joy to work with! There are builder classes for pretty much everything. All of our queries start with a basic SearchRequestBuilder. Depending on the scenario, we can then modify this SRB with various flavours of QueryBuilders, FilterBuilders, SortBuilders and AggregationBuilders to handle every potential use case. Here is a greatly simplified example of a filtered search with aggregates:</p>
<script src="https://gist.github.com/92772945f5281df54c3b.js?file=SRBExample"></script>
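<p>If you are not using the Java client, it helps to remember that such a builder chain boils down to a plain JSON request body. Here is a rough Python sketch of an equivalent filtered query with a terms aggregation; the field names are invented for illustration, and the &#8220;filtered&#8221; syntax is the Elasticsearch 1.x DSL in use at the time:</p>

```python
import json

# Hypothetical field names for illustration; the "filtered" query is
# the Elasticsearch 1.x DSL that was current when this was written.
request_body = {
    "query": {
        "filtered": {
            "query": {"match": {"product_name": "ski jacket"}},
            "filter": {"term": {"retailer": "sportsshop"}},
        }
    },
    "aggs": {
        "retailers": {"terms": {"field": "retailer"}}
    },
    "sort": [{"price": {"order": "asc"}}],
    "size": 10,
}

# This is the JSON document a SearchRequestBuilder ultimately sends:
payload = json.dumps(request_body)
```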
<h2>Logstash and Kibana</h2>
<p>With our Elasticsearch based system up and ready to roll, the next step was to fulfil our sexy query logging project requirement. This raised an interesting question. Where are the query logs? As it turns out, (please contact us if we&#8217;re wrong), the only query logging available is something called <a title="Slow Log" href="http://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-slowlog.html">slow logging</a>. It is a shard level log where you can set thresholds for the query or fetch phase of the execution. We found this log severely lacking in basic details such as hit count and actual query parameters. It seemed like we could only track query time and the query string.</p>
<p>Rather than fight with this slow log, we implemented our own custom logger in our web app to log salient parts of the search request and response. To make our lives easier everything is logged as JSON. This makes hooking up with <a title="Logstash" href="http://logstash.net/">Logstash</a> trivial, as our logstash config reveals:</p>
<script src="https://gist.github.com/43e3603bd75fd549a582.js?file=logstashconf"></script>
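<p>Our logger itself is internal, but the idea is simple enough to sketch in Python: serialize the salient fields of each search as one JSON object per line, which a Logstash json codec can consume directly (field names here are illustrative, not our production schema):</p>

```python
import io
import json
import time

def log_query(out, query, hit_count, took_ms):
    """Write one search event as a single JSON line.
    Illustrative field names, not our production schema."""
    entry = {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "query": query,
        "hits": hit_count,
        "took_ms": took_ms,
    }
    out.write(json.dumps(entry) + "\n")

# Log one search to an in-memory buffer for demonstration:
buf = io.StringIO()
log_query(buf, "fisk", 42, 7)
```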
<p><a title="Kibana 4" href="http://blog.comperiosearch.com/blog/2015/02/09/kibana-4-beer-analytics-engine/">Kibana 4</a>, the latest version of Elastic&#8217;s log visualisation suite, was released in February, around the same time as we were wrapping up our logging logic. We had been planning on using Kibana 3, but this was a perfect opportunity to learn how to use version 4 and create some awesome dashboards for our customer:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_query.png"><img class="aligncenter size-medium wp-image-3444" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_query-300x169.png" alt="kibana_query" width="300" height="169" /></a></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_ams.png"><img class="aligncenter size-medium wp-image-3443" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_ams-300x135.png" alt="kibana_ams" width="300" height="135" /></a></p>
<p>Kibana 4 is wonderful to work with and will generate so much extra value for Posten and their customers.</p>
<h2>Conclusion</h2>
<ul>
<li>Although the Elasticsearch Java API itself is well rounded and complete, its documentation can be a bit frustrating. But this is why we write blog posts to share our experiences!</li>
<li>Once we got past the initial learning curve, we were able to create an awesome Elasticsearch Java API toolbox.</li>
<li>We were severely disappointed with the built in query logging. I hope to extract our custom logger and make it more generic so everyone else can use it too.</li>
<li>The Google Maps API is fun and super easy to work with.</li>
</ul>
<p>Rivers as a data ingestion tool have long been marked for deprecation. When we next want to upgrade our Elasticsearch version we will need to replace them entirely with some other tool. Although Logstash is touted as Elasticsearch&#8217;s main equivalent of a connector framework, it currently lacks classic Enterprise search data source connectors. <a title="Apache Manifold" href="http://manifoldcf.apache.org/">Apache Manifold</a> is a mature open source connector framework that would cover our needs. The latest release has not been tested with the latest version of Elasticsearch, but it supports versions 1.1-3.</p>
<p>Once the solution goes live, during April, Kibana will really come into its own as we get more and more data.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/03/20/elasticsearch-at-posten/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Elastic{ON}15: Day two</title>
		<link>http://blog.comperiosearch.com/blog/2015/03/19/elasticon15-day-two/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/03/19/elasticon15-day-two/#comments</comments>
		<pubDate>Thu, 19 Mar 2015 20:59:41 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[aggregations]]></category>
		<category><![CDATA[elastic]]></category>
		<category><![CDATA[Elasticon]]></category>
		<category><![CDATA[facebook]]></category>
		<category><![CDATA[goldman sachs]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[mars]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[nasa]]></category>
		<category><![CDATA[resiliency]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[shield]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3411</guid>
		<description><![CDATA[March 11, 2015 Keynote Fighting the crowds to find a seat for the keynote at Day 2 at elastic{ON}15 we were blocked by a USB stick with the curious caption  Microsoft (heart) Linux. Things have certainly changed. Microsoft The keynote, led by Elastic SVP of sales Aaron Katz, included Pablo Castro of Microsoft who was [...]]]></description>
				<content:encoded><![CDATA[<h6>March 11, 2015</h6>
<h4>Keynote</h4>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/msheartlinux.jpg"><img class="alignright size-medium wp-image-3412" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/msheartlinux-300x118.jpg" alt="msheartlinux" width="300" height="118" /></a>Fighting the crowds to find a seat for the keynote on Day 2 of elastic{ON}15, we were blocked by a USB stick with the curious caption Microsoft (heart) Linux. Things have certainly changed.</p>
<p><span id="more-3411"></span></p>
<h5>Microsoft</h5>
<p>The keynote, led by Elastic SVP of sales Aaron Katz, included Pablo Castro of Microsoft, who was keen to explain how this probably isn&#8217;t so far from the truth. Elasticsearch is used internally in several Microsoft products, alongside Linux and other open source software, and this is a huge change from the Microsoft we knew around five years ago. Pablo revealed some details about how Elasticsearch is used as a data storage and search platform in MSN, Microsoft Dynamics and Azure Search. Microsoft has truly gone through some fundamental changes lately, embracing open source both internally and externally. We see this as a demonstration of the power of open source and the huge value Elastic(search) brings to many organizations. As Jordan Sissel said in the keynote yesterday: &#8220;If a user has a problem, it is a bug&#8221;. This is a philosophical stance that treats software as an enabler of creativity and growth, in contrast to viewing software as a fixed product packaged for sale.</p>
<h5>Goldman Sachs</h5>
<p>Microsoft&#8217;s contribution was the middle part of the keynote. The first part was a discussion with Don Duet, managing director of Goldman Sachs. Goldman Sachs provides financial services on a global scale, and has been at the forefront of technology since its inception in 1869. They were an early adopter of Elasticsearch as an easy-to-use search and analytics tool for big data, and are now using it extensively as a key part of their technology stack.</p>
<h5>NASA</h5>
<p>The most mind blowing part of the keynote was the last one, held by two chaps from the Jet Propulsion Laboratory team at NASA, Ricky Ma and Don Isla. They first showed their awesome internal search with previews and built in rank tuning. Then they talked about the Mars Curiosity rover, a robot planted on Mars which runs around taking samples and selfies. It constantly sends data back to Earth, where the JPL team analyzes the operations of the rover. Elasticsearch is naturally at the center of this interplanetary operation, nothing less.</p>
<div style="width: 352px" class="wp-caption alignright"><img src="http://i.imgur.com/UACwKNR.jpg" alt="It definitely takes better selfies than me" width="342" height="240" /><p class="wp-caption-text">Mars Curiosity Rover Selfie</p></div>
<p>The remainder of the day contained sessions across the same three tracks as the first day. In addition, five tracks of birds of a feather or &#8220;lounge&#8221; sessions were held where people gathered in smaller groups to discuss various topics. Needless to say, the breadth of the program meant we were stretched thin. We chose to focus on three topics that are of particular importance to our customers: aggregations, security &amp; Shield, and resiliency.</p>
<h4>More aggregations</h4>
<p>Adrien Grand &amp; Colin Goodheart-Smithe did a deep dive into the details of aggregations, how they are computed, and in particular how to tune them and manage their execution complexity. A key point is the approximations that are employed to compute some of the aggregations, which trade a little accuracy for speed. Aggregations are a very powerful feature, but they require some planning to be feasible and efficient.</p>
<h4><b>Security/Shield</b></h4>
<p>Uri Boness talked about Shield and the current state of authentication &amp; authorization, and provided some pointers to what is on the roadmap for the coming releases. Unfortunately, there do not appear to be any concrete plans for providing built in document level security. This is a sought after feature that would certainly make the product more interesting in many enterprise settings. Then again, there are companies who provide connector frameworks that include security solutions for Elasticsearch. We had a chat with some of them at the conference, including Enonic, SearchBlox and Search Technologies.</p>
<h4><b>Facebook</b></h4>
<p>Peter Vulgaris from Facebook explained how they are using Elasticsearch. To me, the story resembled Microsoft&#8217;s: Facebook has heaps of data and lots of use cases for it. Once they started to use Elasticsearch it was widely adopted in the company, and the amount of data indexed grew ever larger, which forced them to think more closely about how they manage their clusters.</p>
<p>&nbsp;</p>
<h4><b>Resiliency</b></h4>
<p>Elasticsearch is a distributed system, and as such shares the same potential issues as other distributed systems. Boaz Leskes &amp; Igor Motov explained the measures that have been undertaken in order to avoid problems such as “split-brain” syndrome. This is when a cluster is confused about what node should be considered the master. Data safety and security are important features of Elasticsearch and there is a continuous effort in place in these areas.</p>
<p>&nbsp;</p>
<h4><b>Lucene</b></h4>
<p>Stepping back to day 1 and the Lucene session featuring the mighty Robert Muir, we learned that Lucene version 5 includes a lot of improvements, especially performance wise: better compression at both indexing and query time enables faster execution and reduces resource consumption. Efforts have also been made in the Lucene core to merge query and filter as two sides of the same coin; after all, a query is just a filter with a relevance score. On another note, Lucene will now handle caching of queries by itself.</p>
<h4><b>Wrapping it up</b></h4>
<p>Elastic{ON}15 stands as a confirmation of the attitudes that were essential in the creation of the Elasticsearch project. The visions that guided the early development are still valid today, except the scale is larger. The recent emphasis on stability, security and resiliency will welcome a new wave of users and developers.</p>
<p>At the same time there is a continuous exploration and development into big data related analytics but with the speed and agility we have come to expect from Elasticsearch.</p>
<p>Thanks for this year, looking forward to the next!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/03/19/elasticon15-day-two/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Elastic{ON}15: Day one</title>
		<link>http://blog.comperiosearch.com/blog/2015/03/11/elasticon15-day-one/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/03/11/elasticon15-day-one/#comments</comments>
		<pubDate>Wed, 11 Mar 2015 16:07:48 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[.net]]></category>
		<category><![CDATA[aggregations]]></category>
		<category><![CDATA[Elasticon]]></category>
		<category><![CDATA[found]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[san francisco]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3393</guid>
		<description><![CDATA[March 10, 2015 At Comperio we have been speculating for a while now that Elasticsearch might just drop search from their name. With Elasticsearch spearheading the expansion of search into analytics and all sorts of content and data driven applications such a change made sense to us. What the name would be we had no [...]]]></description>
				<content:encoded><![CDATA[<h6>March 10, 2015<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_1112452cropped.jpg"><img class="alignright size-medium wp-image-3396" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_1112452cropped-300x140.jpg" alt="IMG_20150310_111245~2cropped" width="300" height="140" /></a></h6>
<p>At Comperio we have been speculating for a while now that Elasticsearch might just drop search from their name. With Elasticsearch spearheading the expansion of search into analytics and all sorts of content and data driven applications, such a change made sense to us. What the name would be, however, we had no idea &#8211; ElasticStash, KibanElastic, StashElasticLog &#8211; none of these really rolled off the tongue like a proper brand.</p>
<p>More surprising was Elasticsearch&#8217;s move into the cloud space by acquiring Found. A big and heartfelt congratulations to our Norwegian colleagues from us at Comperio. Found has built and delivered an innovative and solid product, and we look forward to seeing them build something even better as a part of Elastic.</p>
<p>Elasticsearch has been renamed to Elastic, and Found is no longer just Found, but Found by Elastic. The opening keynote, held by CEO Steven Schuurman and Shay Banon, was a tour of triumph through the history of Elastic, detailing how the company has grown in an organic, natural manner into what it is today. Kibana and Logstash started as separate projects but were soon integrated into Elastic. Shay and Steven explained how old roadmaps for the development of Elastic included plans to create CloudES, search as a cloud service. CloudES was never created, due to all the other pressing issues. Simultaneously, the Norwegian company Found made great strides with their cloud search offering, and an acquisition became a very natural fit.</p>
<p>Elastic{ON} is the first conference devoted entirely to the Elastic family of products. The sessions consist, on one hand, of presentations by developers and employees of Elastic; on the other, of “ELK in the wild” talks showcasing customer use cases from Verizon, Github, Facebook and more.</p>
<p>On day one the sessions about core elasticsearch, Lucene, Kibana and Logstash were of particular interest to us.</p>
<h4><strong>Elasticsearch</strong></h4>
<p>The session &#8220;Recent developments in elasticsearch 2.0&#8221;, held by Clinton Gormley and Simon Willnauer, revealed a host of interesting new features in the upcoming 2.0 release. There is a very high focus on stability and on making sure that releases ship with as few bugs as possible. To illustrate this, Clinton showed graphs relating the number of lines of code to the number of lines of tests, where the latter has been rising sharply in the latest releases. It was also interesting to note that the number of lines of code has recently been reduced due to refactoring and other improvements to the code base.</p>
<p>Among the interesting new features are a new &#8220;reducer&#8221; step for aggregations, allowing calculations to be done on top of aggregated results, and a Changes API which helps manage changes to the index. The Changes API will be central in creating other features, for example update by query. A typical use case involves logging search results, where the Changes API will allow updating information about click activity in the same log entry as the one containing the query.</p>
<p>There will also be a Reindex API that simplifies the development cycle when you have to refeed an entire index because you need to change a mapping or field type.</p>
<h4>Kibana</h4>
<p>Rashid Khan went through the motivations behind the development of Kibana 4: support for aggregations, together with making the product easier to work with and to extend, turns it into a fitting platform for building data visualization tools. This was followed by &#8220;The Contributor&#8217;s Guide to the Kibana Galaxy&#8221; by Spencer Alger, who demoed how to set up the development environment for Kibana 4 using npm, grunt and bower &#8211; the web development standard toolset of today (or was it yesterday?)</p>
<h4>Logstash</h4>
<p>Logstash creator Jordan Sissel presented the new features of Logstash 1.5 and what to expect in future versions. 1.5 introduces a new plugin system, and to the great relief of all Windows users out there, the issues regarding file locking on rolling log files have been resolved! The roadmap also aims to vastly improve the reliability of Logstash: no more losing documents in planned or unplanned outages. In addition, there are plans to add event persistence and various API management tools. As a consequence of the river technology being deprecated, Logstash will take on the role of document processing framework that those of us who come from FAST ESP have missed for some time now. So in effect, all rivers (including JDBC) will be ported to Logstash.</p>
<h4>Aggregations</h4>
<p>Mark Harwood presented a novel take on optimizing index creation for aggregations in the session &#8220;Building Entity Centric Indexes&#8221;. You may have tried to run some fancy aggregations, only to have Elasticsearch dying from out of memory errors. Avoiding this often takes some insight into the architecture in order to structure your aggregations in the best possible manner. Mark essentially showed how to move some of the aggregation work to indexing time rather than query time. The original use case was a customer who needed to know the average session length for the users of his website. Figuring that out involved running through the whole index, sorting by session id, picking the timestamp of the first item and subtracting it from the second &#8211; a lot of operations with an enormous consumption of resources. Mark approaches problems in a creative and mathematical manner, and it is always inspiring to attend his presentations. It will be interesting to see whether the Changes API mentioned above will deliver functionality that can be used to improve aggregated data.</p>
<h4>.NET</h4>
<p>The deep dive into the .NET clients with Martijn Laarman showed how to use a strongly typed language such as C# with Elasticsearch. Yes, it is actually possible, and it looked very good. There is a low-level client that just connects to the API, where you have to do all the parsing yourself, and a high-level client called NEST building on top of that, offering a strongly typed query DSL with an almost 1-to-1 mapping to the Elasticsearch DSL. Particularly nifty was the covariant result handling, where you can specify the types of results you want back, considering that a search result from Elasticsearch can contain many types.</p>
<p>Looking forward to day 2!<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_213606.jpg"><img class="alignright size-medium wp-image-3391" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_213606-300x222.jpg" alt="IMG_20150310_213606" width="300" height="222" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/03/11/elasticon15-day-one/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Comperio goes to Elasticon</title>
		<link>http://blog.comperiosearch.com/blog/2015/02/27/comperio-goes-elasticon/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/02/27/comperio-goes-elasticon/#comments</comments>
		<pubDate>Fri, 27 Feb 2015 14:24:38 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[elasticon elk elasticsearch kibana analytics]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3358</guid>
		<description><![CDATA[Elasticon, the first Elasticsearch user conference, is coming in a couple of weeks. Hosted in San-Francisco, the agenda promises a lot of interesting use cases and in-depth information about Elasticsearch and the ELK (Elasticsearch, Logstash, Kibana) analytics stack. It is an 11 hour plane trip away, but here at Comperio we consider Elasticsearch one of [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/elasticon.png"><img class="alignnone size-medium wp-image-3359" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/elasticon-300x120.png" alt="elasticon" width="300" height="120" /></a></p>
<p />
<a href="http://elasticon.com">Elasticon</a>, the first Elasticsearch user conference, is coming up in a couple of weeks. Hosted in San Francisco, the agenda promises a lot of interesting use cases and in-depth information about Elasticsearch and the ELK (Elasticsearch, Logstash, Kibana) analytics stack. It is an 11 hour plane trip away, but here at Comperio we consider Elasticsearch one of the most exciting developments in search today. Not only is it a scalable and flexible search platform, but it is also at the forefront of combining data analytics, text mining and information retrieval in a single scalable and cohesive platform. So it doesn&#8217;t matter that Elasticon is nearly on the other side of the planet; we can&#8217;t really miss this opportunity to get on top of the latest developments surrounding Elasticsearch.</p>
<p />
<p>At Comperio we&#8217;re happy to see that search is becoming so much more than it used to be. Elasticsearch has proven to be a platform that not only does search well, but also integrates documents with data in a way that enables information oriented applications. At the front of this wave of new applications is the ELK stack, which makes it easy to build a complete pipeline for analytics. Search analytics, system monitoring and web analytics are all areas where a realtime reporting platform can be built on top of ELK. The data analytics capabilities of Elasticsearch also allow us to build information insight driven applications for our customers, combining offline and realtime text analysis and data aggregation with web based visualization built on D3.js. Especially the real time queries have made it possible for users to drill down into and compare specific pieces of data in an exploratory manner. These new applications require not only a fast and scalable technology core, but also solid insight into search technology and an ability to modify it for specific requirements.</p>
<p />
At Elasticon it is no surprise that there is a huge focus on all aspects of ELK. We&#8217;re looking forward to seeing how other companies are adopting the ELK stack in their projects, and to getting new ideas for how we can help our customers bring out the value in their data with the ELK software. There are also some sessions on Shield, the new access security subsystem for Elasticsearch. This is a long asked for component which will surely make integration easier in many projects. We&#8217;re also looking forward to the sessions on Elasticsearch and ELK internals, as we are now looking into several extensions that we&#8217;d like to implement in Elasticsearch.</p>
<p />
Among the sessions, we&#8217;re looking forward to the ELK use cases, especially <em>&#8220;Tackling Security Logs with the ELK&#8221;</em> and <em>&#8220;The ELK &amp; The Eagle: Search &amp; Analytics for the US Government&#8221;</em>; ELK internals sessions such as <em>&#8220;The Contributor&#8217;s Guide to the Kibana&#8221;</em> and <em>&#8220;Life of an Event in Logstash&#8221;</em>; and <em>&#8220;Elasticsearch Architecture: Amusing Algorithms and Details on Data Structures&#8221;</em>. Building data centric applications is also covered in <em>&#8220;Using Elasticsearch to Unlock an Analytical Goldmine&#8221;</em>, <em>&#8220;Navigating Through the World&#8217;s Encyclopedia&#8221;</em> and <em>&#8220;The ELK Stack for Time Series Data&#8221;</em>, which we hope can give us some fresh viewpoints on using Elasticsearch in our projects related to intelligence and data mining.</p>
<p />
Hope to see some of you in San Francisco :)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/02/27/comperio-goes-elasticon/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Kibana 4 &#8211; the beer analytics engine</title>
		<link>http://blog.comperiosearch.com/blog/2015/02/09/kibana-4-beer-analytics-engine/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/02/09/kibana-4-beer-analytics-engine/#comments</comments>
		<pubDate>Mon, 09 Feb 2015 00:20:36 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3226</guid>
		<description><![CDATA[Kibana 4 is a great tool for analyzing data. Vinmonopolet, the Norwegian government owned alcoholic beverage retail monopoly, makes their list of products available online in an easily digestible csv format. So, what beer should I buy next? Kibana will soon tell me. Kibana 4 is a data visualization and analytics tool for elasticsearch. Kibana [...]]]></description>
				<content:encoded><![CDATA[<p>Kibana 4 is a great tool for analyzing data. Vinmonopolet, the Norwegian government owned alcoholic beverage retail monopoly, makes their list of products available online in an <a href="http://www.vinmonopolet.no/artikkel/om-vinmonopolet/datadeling">easily digestible csv format</a>. So, what beer should I buy next? Kibana will soon tell me.</p>
<p><span id="more-3226"></span></p>
<p>Kibana 4 is a data visualization and analytics tool for elasticsearch. Kibana 4 was launched in February 2015, and builds on top of Kibana 3, incorporating user feedback and recent developments in elasticsearch, the most mind blowing being the support for aggregations. Aggregations are like facets/navigators/refiners on steroids, with a lot of advanced options for data drill-down. But no matter how easy a tool is to use, it only gets interesting once we have some questions that need to be answered. So what I want to know is:</p>
<h4>1. What beer gives the most value for money?</h4>
<h4>2. What is the most Belgian of Belgian beers?</h4>
<h4>3. Which of the most Belgian beers give the most value for money?</h4>
<p>The dataset from Vinmonopolet does not contain the important metric &#8220;price pr. unit of alcohol&#8221;. So to begin with, we need to add that. It could have been done in Excel, or as part of preprocessing, but since this post isn&#8217;t about how to get data indexed in Elasticsearch, we use a nice new feature of Kibana that lets you add calculated fields.</p>
<p>In the Settings -&gt; Indices section, there is an option to create a Scripted Field.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/scriptedfield2.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/scriptedfield2.png" alt="scriptedfield2" width="426" height="309" class="alignright size-full wp-image-3352" /></a></p>
<p>The field for price pr. unit of alcohol is added as a calculation in the scripted field, flooring the number to the nearest integer. Scripting is done using Lucene Expressions, after some vulnerabilities were discovered with Groovy as the scripting language (this changed between the RC and the final release of Kibana).</p>
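<p>As a sketch, assuming the dataset&#8217;s columns for price, volume in litres and alcohol percentage are named Pris, Volum and Alkohol (verify against your own index mapping), the scripted field could be a Lucene expression along these lines:</p>

```
floor(doc['Pris'].value / (doc['Volum'].value * doc['Alkohol'].value / 100))
```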
<h3>What beer gives the most value for money?</h3>
<p>Now we can create a nice little bar chart in Kibana, using the minimum pricePrAlcohol as the Y-axis and the bottom terms of Varenavn as the X-axis.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/minPriceBeers2.png"><img class="alignright size-full wp-image-3273" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/minPriceBeers2.png" alt="minPriceBeers" width="583" height="491" /></a></p>
<p>&nbsp;</p>
<p>The chart reveals that the beer with the best alcohol/price ratio is <a href="http://www.sheltonbrothers.com/beers/mikkeller-arh-hvad/">Mikkeller Årh Hvad?!</a> A very nice beer, I had it last week. Mikkeller is a Danish brewery, but they brew most of their beer in Belgium, so this is actually a Belgian beer.</p>
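<p>Behind the scenes, this chart corresponds to a terms aggregation ordered ascending by a min sub-aggregation. A minimal sketch of the request body, written here as a Python dict (field names from the post, sizes assumed):</p>

```python
# Terms aggregation on product name, ordered by the minimum
# price-per-alcohol-unit sub-aggregation: the "bottom terms" of the X-axis.
value_for_money_agg = {
    "size": 0,  # aggregation results only, no hits
    "aggs": {
        "by_product": {
            "terms": {
                "field": "Varenavn",
                "size": 10,
                "order": {"min_price": "asc"},  # cheapest alcohol first
            },
            "aggs": {
                "min_price": {"min": {"field": "pricePrAlcohol"}}
            },
        }
    },
}
```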
<h3>What is the most Belgian of Belgian beers?</h3>
<p>Next up I want to figure out what is the most Belgian of Belgian beers. Most of the products in Vinmonopolet&#8217;s catalogue have entries for &#8220;Smak&#8221;, or &#8220;Taste&#8221;. Let&#8217;s put the significant terms aggregation to work on &#8220;Smak&#8221; and see what falls out.</p>
<p><img class="alignright wp-image-3294 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/beersigtermspie-293x300.png" alt="beersigtermspie" width="293" height="300" /><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/beersigtermslegend.png"><img class="alignright wp-image-3293 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/beersigtermslegend.png" alt="beersigtermslegend" width="96" height="439" /></a></p>
<p>The pie chart shows countries in the inner circle and significant terms in the outer circle. The largest slice belongs to Norwegian beers, as shown in the legend on the right. In Kibana, you can also hover over legend entries to highlight the selection in the pie chart &#8211; a very nice feature, especially for the colourly challenged population that is unable to match colors. Kibana allows drill down by clicking on pie slices, and you can see the data table and other details by clicking on the small arrow at the bottom.</p>
<p>The most significant terms for Belgian beers according to this query are &#8220;bread&#8221;, &#8220;yeast&#8221;, &#8220;malt&#8221; and &#8220;malty&#8221;. That&#8217;s hardly surprising, since this is beer; we should expect something a little more specific. The significant terms aggregation returns terms that are more frequent in a foreground selection compared to a background selection. In our case, we select products of type beer from Belgium, and the background is by default the contents of the entire index &#8211; in other words, the complete product catalog from Vinmonopolet. This catalog contains a vast amount of wine, liquor and other irrelevant items. Since we are really only interested in the significant terms of Belgian beers compared to other beers, we can add a custom parameter to select the background manually. Paste this into the JSON input of the advanced section.</p><pre class="crayon-plain-tag">{
    "background_filter": {
        "term": {
            "Varetype": "Øl"
        }
    }
}</pre><p>Using this filter, the significant terms for Belgian beers are &#8220;impact, plum, lemon, bread&#8221;.</p>
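<p>Put together, the trick amounts to a significant_terms aggregation whose foreground is Belgian beers and whose background is narrowed to all beers. A sketch of the full request body as a Python dict (field names as used above, under the assumption that Land, Smak and Varetype are not analyzed away):</p>

```python
# Foreground: Belgian beers. Background: all beers (via background_filter),
# instead of the default background of the whole catalogue.
significant_taste_agg = {
    "query": {
        "bool": {
            "must": [
                {"term": {"Varetype": "Øl"}},   # beers...
                {"term": {"Land": "Belgia"}},   # ...from Belgium
            ]
        }
    },
    "aggs": {
        "taste_terms": {
            "significant_terms": {
                "field": "Smak",
                "background_filter": {"term": {"Varetype": "Øl"}},
            }
        }
    },
}
```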
<p>&nbsp;</p>
<p>What beers actually match these descriptions? Some suggestions can be revealed through nesting an aggregation on product name, on top of the one we already have.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeersWithSigTerms.png"><img class="alignright size-full wp-image-3275" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeersWithSigTerms.png" alt="belgianbeersWithSigTerms" width="827" height="487" /></a></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>The non-colourly-challenged may easily see that Het Anker Lucifer matches both &#8220;anslag&#8221; (impact) and &#8220;sitrus&#8221; (lemon). Some beers match two terms, others match one, none match all four terms. Ideally, the most Belgian of Belgian beers should contain all the most significant terms. The significant terms are &#8220;impact,bread,lemon,plum&#8221; (&#8220;anslag,brødbakst,sitrus,plomme&#8221;). Typing this as a Lucene query into the Discover tab in Kibana:</p><pre class="crayon-plain-tag">Land:Belgia AND Smak:sitrus,plomme,br&oslash;dbakst,anslag AND Varetype:&Oslash;l</pre><p>returns &#8220;<span style="color: #444444;">Silly Green Killer IPA&#8221; as result number 1, with Smak: &#8220;<strong>Fruktig <mark>anslag</mark> med <mark>sitrus</mark>, korn, humle og <mark>brødbakst</mark>. Lang, frisk avslutning.&#8221;</strong> (&#8220;Fruity impact with lemon, grain, hops and bread. Long, fresh finish.&#8221;), containing three of the terms: impact, lemon and bread. Since no beer contains all four terms, we can hereby pronounce the winner of the most Belgian of all Belgian beers according to the Vinmonopolet catalogue (and a ridiculous significant terms trick): Silly Green Killer IPA! Congratulations! </span></p>
<h3>Which of the most Belgian beers give the most value for money?</h3>
<p>The previous investigation did not take economic considerations into account. Using the Line Chart and reusing the saved search from the previous query, I add the minimum pricePrAlcohol as the Y-axis and set the X-axis to a terms aggregation on Varenavn (product name), bumping it up to 52 entries to make sure it contains all the results. The graph shows all beers containing at least one of our sought-after terms. Silly Green Killer IPA can be found in the upper quarter of the table, with a price per alcohol unit of 27.51 NOK. Abbaye de Rocs Bruin comes in as the winner at the bottom edge of the scale, at just 13.43 NOK per alcohol unit, its Smak field containing only the term &#8220;sitrus&#8221; (lemon).</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeerPriceDist.png"><img class="aligncenter wp-image-3302 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeerPriceDist-e1423440118217.png" alt="belgianbeerPriceDist" width="700" height="361" /></a></p>
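<p>The chart above corresponds roughly to a terms aggregation on product name with a min metric sub-aggregation, sketched here. The field names are the ones used in this post; pricePrAlcohol is the computed price-per-alcohol-unit field, and the size of 52 matches the bucket count chosen in Kibana.</p><pre class="crayon-plain-tag">{
    "size": 0,
    "aggs": {
        "beers": {
            "terms": { "field": "Varenavn", "size": 52 },
            "aggs": {
                "min_price_per_alcohol": {
                    "min": { "field": "pricePrAlcohol" }
                }
            }
        }
    }
}</pre>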
<p>It would be nice to see which terms each beer contains, to enable a qualified judgement. Kibana allows you to split the display into several graphs. I will use this together with the filter aggregation to show one graph for each of the significant terms.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeerswithsigtermsAndAlc.png"><img class="alignleft wp-image-3303" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeerswithsigtermsAndAlc.png" alt="belgianbeerswithsigtermsAndAlc" width="701" height="405" /></a></p>
<p>The graphs are, from top to bottom: sitrus (lemon), brødbakst (bread), anslag (impact), plomme (plum). The colors indicate alcohol content.</p>
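<p>In the underlying Elasticsearch request, this kind of split maps to a filters aggregation with one named filter per significant term, roughly as sketched below. The four term filters are the significant terms found above, and the Smak field name is from this dataset.</p><pre class="crayon-plain-tag">{
    "aggs": {
        "by_term": {
            "filters": {
                "filters": {
                    "sitrus": { "term": { "Smak": "sitrus" } },
                    "brødbakst": { "term": { "Smak": "brødbakst" } },
                    "anslag": { "term": { "Smak": "anslag" } },
                    "plomme": { "term": { "Smak": "plomme" } }
                }
            }
        }
    }
}</pre>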
<p>In this post, I have tried to show how you can use Kibana 4 and Elasticsearch for data exploration and analysis. Please use the comment form below or contact me if you have any questions. If you enjoyed this article, why don&#8217;t you give me a <a href="https://untappd.com/user/Babadofar">toast on Untappd</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/02/09/kibana-4-beer-analytics-engine/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>
