<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; grok</title>
	<atom:link href="http://blog.comperiosearch.com/blog/tag/grok/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Analysing Solr logs with Logstash</title>
		<link>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/#comments</comments>
		<pubDate>Sun, 20 Sep 2015 22:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[grok]]></category>
		<category><![CDATA[logs]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3934</guid>
		<description><![CDATA[Analysing Solr logs with Logstash Although I usually write about and work with Apache Solr, I also use the ELK stack on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my previous posts. If you need some more background info on the ELK [...]]]></description>
				<content:encoded><![CDATA[<h1>Analysing Solr logs with Logstash</h1>
<p>Although I usually write about and work with <a href="http://lucene.apache.org/solr/">Apache Solr</a>, I also use the <a href="https://www.elastic.co/downloads">ELK stack</a> on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my <a href="http://blog.comperiosearch.com/blog/author/sebm/">previous posts</a>. If you need some more background info on the ELK stack, both <a href="http://blog.comperiosearch.com/blog/author/cvig/">Christoffer</a> and <a href="http://blog.comperiosearch.com/blog/author/alynum/">André</a> have written many great posts on various ELK subjects. The most common use for the stack is data analysis; in our case, Solr search log analysis.</p>
<p>As a little side note for the truly devoted Solr users, an ELK stack alternative exists with <a href="http://lucidworks.com/fusion/silk/">SiLK</a>. I highly recommend checking out Lucidworks&#8217; various blog posts on <a href="http://lucidworks.com/blog/">Solr and search in general</a>.</p>
<h2>Some background</h2>
<p>On an existing search project I use the ELK stack to ingest, analyse and visualise logs from Comperio&#8217;s search middleware application.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2088/09/search_logs.png"><img class="aligncenter size-medium wp-image-3942" src="http://blog.comperiosearch.com/wp-content/uploads/2088/09/search_logs-300x157.png" alt="Search Logs Dashboard" width="300" height="157" /></a><br />
Although this gave us a great view of user query behaviour, Solr logs far more detailed information. I wanted to log indexing events, errors and searches with all of their parameters, not just the query string.</p>
<h2>Let&#8217;s get started</h2>
<p>I&#8217;m going to assume you already have a running Solr installation. You will, however, need to download <a href="https://www.elastic.co/products/elasticsearch">Elasticsearch</a> and <a href="https://www.elastic.co/products/logstash">Logstash</a> and unpack them. Before we start Elasticsearch, I recommend installing these plugins:</p>
<ul>
<li><a href="http://mobz.github.io/elasticsearch-head/">Head</a></li>
<li><a href="https://www.elastic.co/guide/en/marvel/current/_installation.html">Marvel</a></li>
</ul>
<p>Head is a cluster health monitoring tool. We&#8217;ll only need Marvel for its bundled developer console, Sense. To disable Marvel&#8217;s other capabilities, add this line to ~/elasticsearch/config/elasticsearch.yml:</p><pre class="crayon-plain-tag">marvel.agent.enabled: false</pre><p>Start Elasticsearch with this command:</p><pre class="crayon-plain-tag">~/elasticsearch-[version]/bin/elasticsearch</pre><p>Navigate to <a href="http://localhost:9200/">http://localhost:9200/</a> to confirm that Elasticsearch is running. Check <a href="http://localhost:9200/_plugin/head">http://localhost:9200/_plugin/head</a> and <a href="http://localhost:9200/_plugin/marvel/sense/index.html">http://localhost:9200/_plugin/marvel/sense/index.html</a> to verify the plugins installed correctly.</p>
<h2>The anatomy of a Logstash config</h2>
<hr />
<h3>Update 21/09/15</h3>
<p>I have since greatly simplified the multiline portions of the Logstash configs. Use this filter section instead: <script src="https://gist.github.com/41ca2c34c50d0d9d8e82.js?file=solr-filter.conf"></script>The rest of the original article contents are unchanged for comparison&#8217;s sake.</p>
<hr />
<p>All Logstash configs share three main building blocks. First is the Input stage, which defines what the data source is and how to access it. Next is the Filter stage, which carries out data processing and extraction. Finally, the Output stage tells Logstash where to send the processed data. Let&#8217;s start with the basics, the input and output stages:</p><pre class="crayon-plain-tag">input {
  file {
    path =&gt; "~/solr.log"
  }
}

filter {}

output {
  # Send directly to local Elasticsearch
  elasticsearch_http {
    host =&gt; "localhost"
    template =&gt; "~/logstash/bin/logstash_solr_template.json"
    index =&gt; "solr-%{+YYYY.MM.dd}"
    template_overwrite =&gt; true
  }
}</pre><p>This is one of the simpler input/output configs. We read a file at a given location and stream its raw contents to an Elasticsearch instance. Take a look at the <a href="https://www.elastic.co/guide/en/logstash/current/input-plugins.html">input</a> and <a href="https://www.elastic.co/guide/en/logstash/current/output-plugins.html">output</a> plugins&#8217; documentation for more details and default values. The index setting causes Logstash to create a new index every day, with a name generated from the provided pattern. The template option tells Logstash what field mappings and settings to use when creating the Elasticsearch indices. You can find the template file I used <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-template-json">here</a>.</p>
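<p>The index pattern expands to the event&#8217;s date, so each day&#8217;s events land in an index of their own. A minimal Python sketch of the resulting naming scheme (the helper name and sample date are mine, for illustration):</p>

```python
from datetime import date

def solr_index_name(day: date) -> str:
    # Mirror Logstash's index => "solr-%{+YYYY.MM.dd}" naming
    return day.strftime("solr-%Y.%m.%d")

print(solr_index_name(date(2015, 9, 7)))  # solr-2015.09.07
```

<p>Daily indices make it cheap to expire old log data: you drop whole indices rather than deleting individual documents.</p>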
<p>To process the Solr logs, we&#8217;ll use the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html">grok</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html">mutate</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-multiline.html">multiline</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-drop.html">drop</a> and <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html">kv</a> filter plugins.</p>
<ul>
<li>Grok is a regexp-based parsing stage primarily used to match strings and extract their parts. There are a number of default patterns described on the grok documentation page. While building your grok expressions, the <a href="https://grokdebug.herokuapp.com/">grok debugger app</a> is particularly helpful. Be mindful, though, that the app&#8217;s escaping syntax isn&#8217;t always the same as what the Logstash config expects.</li>
<li>We need the multiline plugin to link stack traces to their initial error message.</li>
<li>The kv (key-value) plugin will help us extract the parameters from Solr indexing and search events.</li>
<li>We use mutate to add and remove tags along the way.</li>
<li>And finally, drop to drop any events we don&#8217;t want to keep.</li>
</ul>
<h2>The <del>hard</del> fun part</h2>
<p>Let&#8217;s dive into the filter stage now. Take a look at the <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-logstash-conf">config file</a> I&#8217;m using. The Grok patterns may appear a bit daunting, especially if you&#8217;re not very familiar with regexps and the default Grok patterns, but don&#8217;t worry! Let&#8217;s break it down.</p>
<p>The first section extracts the log event&#8217;s severity and timestamp into their own fields, &#8216;level&#8217; and &#8216;LogTime&#8217;:</p><pre class="crayon-plain-tag">grok {
    match =&gt; { "message" =&gt; "%{WORD:level}.+?- %{DATA:LogTime};" }
      tag_on_failure =&gt; []
  }</pre><p>So, given this line from my <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-log">example log file</a></p><pre class="crayon-plain-tag">INFO  - 2015-09-07 15:40:34.535; org.apache.solr.update.processor.LogUpdateProcessor; [sintef_main] webapp=/ path=/update/extract params={literal.source=epifile&amp;literal.epi_file_title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.id=epifile_211278&amp;literal.epifileid_s=211278&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;literal.filesource_s=SiteFile} {} 0 65</pre><p>We&#8217;d extract</p><pre class="crayon-plain-tag">{ "level": "INFO", "LogTime": "2015-09-07 15:40:34.535" }</pre><p>In the template file I linked earlier, you&#8217;ll notice configuration for the LogTime field. Here we define a valid DateTime format for Elasticsearch. We need this so that Kibana recognises the field as one we can use for temporal analysis. Otherwise, the only timestamp field we&#8217;d have would contain the time at which the logs were processed and stored in Elasticsearch. That&#8217;s not a problem in a real-time log analysis system, but if you have old logs to parse you will need this separate timestamp field. As an additional side note, you&#8217;ll notice I use</p><pre class="crayon-plain-tag">tag_on_failure =&gt; []</pre><p>in most of my Grok stages. The default value is &#8220;_grokparsefailure&#8221;, which I don&#8217;t need in a production system. Custom failure and success tags are very helpful when debugging your Logstash configs.</p>
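<p>Under the hood, grok patterns compile to named-capture regexes: %{WORD} is roughly \w+ and %{DATA} a lazy .*?. As a sanity check outside Logstash, here is my rough Python translation of the level/LogTime pattern above (an approximation, not Logstash&#8217;s exact expansion):</p>

```python
import re

# Approximation of the grok pattern "%{WORD:level}.+?- %{DATA:LogTime};"
LOG_HEAD = re.compile(r"(?P<level>\w+).+?- (?P<LogTime>.*?);")

line = ("INFO  - 2015-09-07 15:40:34.535; "
        "org.apache.solr.update.processor.LogUpdateProcessor; [sintef_main] ...")
m = LOG_HEAD.match(line)
print(m.group("level"), m.group("LogTime"))  # INFO 2015-09-07 15:40:34.535
```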
<p>The next little section combines commit messages into a single line. The first event in the example log file shows such a commit message split over three lines.</p><pre class="crayon-plain-tag"># Combine commit events into single message
  multiline {
      pattern =&gt; "^\t(commit\{)"
      what =&gt; "previous"
    }</pre><p>Now we come to a major section for handling general INFO level messages.</p><pre class="crayon-plain-tag"># INFO level events treated differently than ERROR
  if "INFO" in [level] {
    grok {
      match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\}.+?\{%{WORD:action}=\[%{DATA:docId}"
        }
        tag_on_failure =&gt; []  
    }
    if [params] {
      kv {
        field_split =&gt; "&amp;"
        source =&gt; "params"
      }
    } else {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?commits"  
        }
        tag_on_failure =&gt; [ "drop" ]
        add_field =&gt; {
          "action" =&gt; "commit"
        }
      }
      if "drop" in [tags] {
        drop {}
      }
    }
  }</pre><p>This filter only runs on INFO level messages, due to the conditional at its beginning. The first Grok stage matches log events similar to the one above. The key fields we extract are the Solr collection/core, the endpoint we hit (e.g. /update/extract), the parameters supplied by the HTTP request, the action (e.g. add or delete) and finally the document ID. If the Grok succeeded in extracting a params field, we run the key-value stage, splitting on ampersands to extract each HTTP parameter. This is what a resulting document&#8217;s extracted contents look like when stored in Elasticsearch:</p><pre class="crayon-plain-tag">{
  "level": "INFO",
  "LogTime": "2015-09-07 15:40:18.938",
  "collection": "sintef_main",
  "endpoint": "/update/extract",
  "params":     "literal.source=epifile&amp;literal.epi_file_title=A05100_Tass5+Trondheim.pdf&amp;literal.title=A05100_Tass5+Trondheim.pdf&amp;literal.id=epifile_211027&amp;literal.epifileid_s=211027&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;literal.filesource_s=SiteFile",
  "action": "add",
  "docId": "epifile_211027",
  "version": "1511661994131849216",
  "literal.source": "epifile",
  "literal.epi_file_title": "A05100_Tass5+Trondheim.pdf",
  "literal.title": "A05100_Tass5+Trondheim.pdf",
  "literal.id": "epifile_211027",
  "literal.epifileid_s": "211027",
  "literal.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "stream.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "literal.filesource_s": "SiteFile"
}</pre><p>If the Grok did not extract a params field, I want to identify possible commit messages with the following Grok. If this one fails, we tag the message with &#8220;drop&#8221;. Finally, any messages tagged with &#8220;drop&#8221; are dropped from the pipeline. I specifically created these Grok patterns to match indexing and commit messages, as I already track queries at the middleware layer in our stack. If you want to track queries at the Solr level, simply use this pattern:</p><pre class="crayon-plain-tag">.+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\} hits=%{INT:hits} status=%{INT:status} QTime=%{INT:queryTime}</pre><p>The next section handles ERROR level messages:</p><pre class="crayon-plain-tag"># Error event implies stack trace, which requires multiline parsing
  if "ERROR" in [level] {
    multiline {
      pattern =&gt; "^\s"
      what =&gt; "previous"
      add_tag =&gt; [ "multiline_pre" ]
    }
    multiline {
        pattern =&gt; "^Caused by"
        what =&gt; "previous"
        add_tag =&gt; [ "multiline_post" ]
    }
    if "multiline_post" in [tags] {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+%{DATA:reason}(\n\t)((.+?Caused by: ((([a-zA-Z]+(\.|;|:))+) )+)%{DATA:reason}(\n\t))+"
        }
        tag_on_failure =&gt; []
      }
    }
  }</pre><p>Given a stack trace (there are a few in the example log file), this stage first combines all of its lines into a single message, then extracts the first and last causes. The assumption is that the first message is the high-level failure and the last one the actual underlying cause.</p>
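<p>To make the intent concrete, here is an illustrative Python sketch of that first/last-cause logic (the stack trace is synthetic, and the string handling is my simplification of what the grok pattern does):</p>

```python
# Synthetic example of a multiline-combined ERROR event
combined = "\n".join([
    "ERROR - 2015-09-07 15:41:02.123; org.apache.solr.common.SolrException; "
    "Error opening new searcher",
    "\tat org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1565)",
    "Caused by: java.io.IOException: Map failed",
    "\tat sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:907)",
])

lines = combined.splitlines()
high_level = lines[0]  # the initial high-level failure message
root_cause = [l for l in lines if l.startswith("Caused by:")][-1]  # deepest cause
print(root_cause)  # Caused by: java.io.IOException: Map failed
```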
<p>Finally, I drop any empty lines and clean up temporary tags:</p><pre class="crayon-plain-tag"># Remove intermediate tags, and the "multiline" tag added by the multiline stage
  mutate {
      remove_tag =&gt; [ "multiline_pre", "multiline_post", "multiline" ]
  }
  # Drop empty lines
  if [message] =~ /^\s*$/ {
    drop {}
  }</pre><p>To check that you&#8217;ve successfully processed your Solr logs, open up the Sense plugin and run this query:</p><pre class="crayon-plain-tag"># aggregate on level
GET solr-*/_search
{
  "query": {
    "match_all": {}
  },
  "size": 10,
  "aggs": {
    "action": {
      "terms": {
        "field": "level",
        "size": 10
      }
    }
  }
}</pre><p>You should get back all your processed log events along with an aggregation on event severity.</p>
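<p>Conceptually, the terms aggregation buckets events by each distinct value of the level field and counts them, much like this Python sketch (the sample severities are made up):</p>

```python
from collections import Counter

# Hypothetical severities extracted from processed log events
levels = ["INFO", "INFO", "ERROR", "INFO", "WARN", "ERROR"]
buckets = Counter(levels).most_common()
print(buckets)  # [('INFO', 3), ('ERROR', 2), ('WARN', 1)]
```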
<h2>Conclusion</h2>
<p>Solr logs contain a great deal of useful information. With the ELK stack you can extract, store, analyse and visualise this data. I hope I&#8217;ve given you some helpful tips on how to start doing so! If you run into any problems, please get in touch in the comments below.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
