<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; logstash</title>
	<atom:link href="http://blog.comperiosearch.com/blog/tag/logstash/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>ELK stack deployment with Ansible</title>
		<link>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/#comments</comments>
		<pubDate>Thu, 26 Nov 2015 09:59:38 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[ansible]]></category>
		<category><![CDATA[deployment]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[elk]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3999</guid>
		<description><![CDATA[As human beings, we like to believe that each and every one of us is a special individual, and not easily replaceable. That may be fine, but please, don’t fall into the habit of treating your computer the same way. Ansible is a free software platform for configuring and managing computers, and I’ve been using [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright" src="http://www.ansible.com/hs-fs/hub/330046/file-767051897-png/Official_Logos/ansible_circleA_red.png?t=1448391213471" alt="" width="251" height="251" />As human beings, we like to believe that each and every one of us is a special individual, and not easily replaceable. That may be fine, but please, don’t fall into the habit of treating your computer the same way.</p>
<p><span id="more-3999"></span></p>
<p><a href="https://en.wikipedia.org/wiki/Ansible_(software)"><b>Ansible</b> </a>is a <a href="https://en.wikipedia.org/wiki/Free_software">free software</a> platform for configuring and managing computers, and I’ve been using it a lot lately to manage the ELK stack. Elasticsearch, Logstash and Kibana.</p>
<p>I can define a list of servers I want to manage in a config file &#8211; the so-called inventory:</p><pre class="crayon-plain-tag">[elasticsearch-master]
es-master1.mydomain.com
es-master2.mydomain.com
es-master3.mydomain.com

[elasticsearch-data]
elk-data1.mydomain.com
elk-data2.mydomain.com
elk-data3.mydomain.com

[kibana]
kibana.mydomain.com

[logstash]
logstash.mydomain.com</pre><p>And define the roles for the servers in another YAML config file &#8211; the so-called playbook:</p><pre class="crayon-plain-tag">- hosts: elasticsearch-master
  roles:
    - ansible-elasticsearch

- hosts: elasticsearch-data
  roles:
    - ansible-elasticsearch

- hosts: logstash
  roles:
    - ansible-logstash

- hosts: kibana
  roles:
    - ansible-kibana</pre><p>&nbsp;</p>
<p>Each group of servers may have their own files containing configuration variables.</p><pre class="crayon-plain-tag">elasticsearch_version: 2.1.0
elasticsearch_node_master: false
elasticsearch_heap_size: 1g</pre><p>&nbsp;</p>
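<p>With the inventory, playbook and variable files in place, a single command applies the whole configuration. A minimal sketch (assuming the inventory file is called hosts and the playbook site.yml &#8211; both names are just examples):</p><pre class="crayon-plain-tag"># apply the playbook to every host in the inventory
ansible-playbook -i hosts site.yml

# or limit the run to one group, e.g. only the Kibana servers
ansible-playbook -i hosts site.yml --limit kibana</pre>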
<p>Ansible is used for configuring the ELK stack vagrant box at <a href="https://github.com/comperiosearch/vagrant-elk-box-ansible">https://github.com/comperiosearch/vagrant-elk-box-ansible</a>, which was recently upgraded with Elasticsearch 2.1, Kibana 4.3 and Logstash 2.1.</p>
<p>The same set of Ansible roles can be applied when the configuration needs to move into production, by applying another set of variable files with modified host names, certificates and such. There are several ways to do this.</p>
<p><b>How does it work?</b></p>
<p>Ansible is agent-less. This means you do not install anything (an agent) on the machines you control. Ansible only needs to be installed on the controlling machine (Linux/OS X) and connects to the managed machines (there is even some support for Windows) using SSH. The only requirement on the managed machines is Python.</p>
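<p>A quick way to verify that SSH and Python are in place on the managed machines is Ansible&#8217;s ping module (a sketch, using the inventory file from above):</p><pre class="crayon-plain-tag">ansible all -i hosts -m ping</pre>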
<p>Happy ansibling!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analysing Solr logs with Logstash</title>
		<link>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/#comments</comments>
		<pubDate>Sun, 20 Sep 2015 22:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[grok]]></category>
		<category><![CDATA[logs]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3934</guid>
		<description><![CDATA[Analysing Solr logs with Logstash Although I usually write about and work with Apache Solr, I also use the ELK stack on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my previous posts. If you need some more background info on the ELK [...]]]></description>
				<content:encoded><![CDATA[<h1>Analysing Solr logs with Logstash</h1>
<p>Although I usually write about and work with <a href="http://lucene.apache.org/solr/">Apache Solr</a>, I also use the <a href="https://www.elastic.co/downloads">ELK stack</a> on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my <a href="http://blog.comperiosearch.com/blog/author/sebm/">previous posts</a>. If you need some more background info on the ELK stack, both <a href="http://blog.comperiosearch.com/blog/author/cvig/">Christoffer</a> and <a href="http://blog.comperiosearch.com/blog/author/alynum/">André</a> have written many great posts on various ELK subjects. The most common use for the stack is data analysis. In our case, Solr search log analysis.</p>
<p>As a little side note for the truly devoted Solr users, an ELK stack alternative exists with <a href="http://lucidworks.com/fusion/silk/">SiLK</a>. I highly recommend checking out Lucidworks&#8217; various blog posts on <a href="http://lucidworks.com/blog/">Solr and search in general</a>.</p>
<h2>Some background</h2>
<p>On an existing search project I use the ELK stack to ingest, analyse and visualise logs from Comperio&#8217;s search middleware application.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2088/09/search_logs.png"><img class="aligncenter size-medium wp-image-3942" src="http://blog.comperiosearch.com/wp-content/uploads/2088/09/search_logs-300x157.png" alt="Search Logs Dashboard" width="300" height="157" /></a><br />
Although this gave us a great view of user query behaviour, Solr logs a great deal more detailed information. I wanted to log indexing events, errors and searches with all of their parameters, not just the query string.</p>
<h2>Let&#8217;s get started</h2>
<p>I&#8217;m going to assume you already have a running Solr installation. You will, however, need to download <a href="https://www.elastic.co/products/elasticsearch">Elasticsearch</a> and <a href="https://www.elastic.co/products/logstash">Logstash</a> and unpack them. Before we start Elasticsearch, I recommend installing these plugins:</p>
<ul>
<li><a href="http://mobz.github.io/elasticsearch-head/">Head</a></li>
<li><a href="https://www.elastic.co/guide/en/marvel/current/_installation.html">Marvel</a></li>
</ul>
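<p>On Elasticsearch 1.x these can typically be installed with the bundled plugin script; a sketch (exact commands and plugin versions depend on your Elasticsearch version):</p><pre class="crayon-plain-tag">~/elasticsearch-[version]/bin/plugin --install mobz/elasticsearch-head
~/elasticsearch-[version]/bin/plugin --install elasticsearch/marvel/latest</pre>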
<p>Head is a cluster health monitoring tool. We&#8217;ll only need Marvel for the bundled developer console, Sense. To disable Marvel&#8217;s other capabilities, add this line to ~/elasticsearch/config/elasticsearch.yml:</p><pre class="crayon-plain-tag">marvel.agent.enabled: false</pre><p>Start Elasticsearch with this command:</p><pre class="crayon-plain-tag">~/elasticsearch-[version]/bin/elasticsearch</pre><p>Navigate to <a href="http://localhost:9200/">http://localhost:9200/</a> to confirm that Elasticsearch is running. Check <a href="http://localhost:9200/_plugin/head">http://localhost:9200/_plugin/head</a> and <a href="http://localhost:9200/_plugin/marvel/sense/index.html">http://localhost:9200/_plugin/marvel/sense/index.html</a> to verify the plugins installed correctly.</p>
<h2>The anatomy of a Logstash config</h2>
<hr />
<h3>Update 21/09/15</h3>
<p>I have since greatly simplified the multiline portions of the Logstash configs. Use instead this filter section: <script src="https://gist.github.com/41ca2c34c50d0d9d8e82.js?file=solr-filter.conf"></script>The rest of the original article contents are unchanged for comparison&#8217;s sake.</p>
<hr />
<p>All Logstash configs share three main building blocks. It starts with the Input stage, which defines what the data source is and how to access it. Next is the Filter stage, which carries out data processing and extraction. Finally, the Output stage tells Logstash where to send the processed data. Let&#8217;s start with the basics: the input and output stages.</p><pre class="crayon-plain-tag">input {
  file {
    path =&gt; "~/solr.log"
  }
}

filter {}

output {
  # Send directly to local Elasticsearch
  elasticsearch_http {
    host =&gt; "localhost"
    template =&gt; "~/logstash/bin/logstash_solr_template.json"
    index =&gt; "solr-%{+YYYY.MM.dd}"
    template_overwrite =&gt; true
  }
}</pre><p>This is one of the simpler input/output configs. We read a file at a given location and stream its raw contents to an Elasticsearch instance. Take a look at the <a href="https://www.elastic.co/guide/en/logstash/current/input-plugins.html">input</a> and <a href="https://www.elastic.co/guide/en/logstash/current/output-plugins.html">output</a> plugins&#8217; documentation for more details and default values. The index setting causes Logstash to create a new index every day with a name generated from the provided pattern. The template option tells Logstash what kind of field mapping and settings to use when creating the Elasticsearch indices. You can find the template file I used <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-template-json">here</a>.</p>
<p>To process the Solr logs, we&#8217;ll use the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html">grok</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html">mutate</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-multiline.html">multiline</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-drop.html">drop</a> and <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html">kv</a> filter plugins.</p>
<ul>
<li>Grok is a regexp based parsing stage primarily used to match strings and extract parts. There are a number of default patterns described on the grok documentation page. While building your grok expressions, the <a href="https://grokdebug.herokuapp.com/">grok debugger app</a> is particularly helpful. Be mindful though that some of the escaping syntax isn&#8217;t always the same in the app as what the Logstash config expects.</li>
<li>We need the multiline plugin to link stacktraces to their initial error message.</li>
<li>The kv, aka key value, plugin will help us extract the parameters from Solr indexing and search events.</li>
<li>We use mutate to add and remove tags along the way.</li>
<li>And finally, drop to drop any events we don&#8217;t want to keep.</li>
</ul>
<p>&nbsp;</p>
<h2>The <del>hard</del> fun part</h2>
<p>Let&#8217;s dive into the filter stage now. Take a look at the <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-logstash-conf">config file</a> I&#8217;m using. The Grok patterns may appear a bit daunting, especially if you&#8217;re not very familiar with regexp and the default Grok patterns, but don&#8217;t worry! Let&#8217;s break it down.</p>
<p>The first section extracts the log event&#8217;s severity and timestamp into their own fields, &#8216;level&#8217; and &#8216;LogTime&#8217;:</p><pre class="crayon-plain-tag">grok {
    match =&gt; { "message" =&gt; "%{WORD:level}.+?- %{DATA:LogTime};" }
      tag_on_failure =&gt; []
  }</pre><p>So, given this line from my <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-log">example log file</a>:</p><pre class="crayon-plain-tag">INFO  - 2015-09-07 15:40:34.535; org.apache.solr.update.processor.LogUpdateProcessor; [sintef_main] webapp=/ path=/update/extract params={literal.source=epifile&amp;literal.epi_file_title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.id=epifile_211278&amp;literal.epifileid_s=211278&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;literal.filesource_s=SiteFile} {} 0 65</pre><p>We&#8217;d extract</p><pre class="crayon-plain-tag">{ "level": "INFO", "LogTime":"2015-09-07 15:40:34.535"}</pre><p>In the template file I linked earlier, you&#8217;ll notice configuration for the LogTime field. Here we define a valid DateTime format for Elasticsearch. We need to do this so that Kibana recognises the field as one we can use for temporal analyses. Otherwise the only timestamp field we&#8217;d have would contain the time at which the logs were processed and stored in Elasticsearch. Although this is not a problem in a real-time log analysis system, if you have old logs you want to parse you will need to define this separate timestamp field. As an additional side note, you&#8217;ll notice I use</p><pre class="crayon-plain-tag">tag_on_failure =&gt; []</pre><p>in most of my Grok stages. The default value is &#8220;_grokparsefailure&#8221;, which I don&#8217;t need in a production system. Custom failure and success tags are very helpful for debugging your Logstash configs.</p>
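<p>As a side note, an alternative (not used in the config above) is to let Logstash itself promote LogTime to the event timestamp with a date filter; a sketch:</p><pre class="crayon-plain-tag">date {
  match =&gt; [ "LogTime", "yyyy-MM-dd HH:mm:ss.SSS" ]
  # on success the parsed value replaces @timestamp
}</pre>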
<p>The next little section combines commit messages into a single line. The first event in the example log file is an example of such a commit message split over three lines.</p><pre class="crayon-plain-tag"># Combine commit events into single message
  multiline {
      pattern =&gt; "^\t(commit\{)"
      what =&gt; "previous"
    }</pre><p>Now we come to a major section for handling general INFO level messages.</p><pre class="crayon-plain-tag"># INFO level events treated differently than ERROR
  if "INFO" in [level] {
    grok {
      match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\}.+?\{%{WORD:action}=\[%{DATA:docId}"
        }
        tag_on_failure =&gt; []  
    }
    if [params] {
      kv {
        field_split =&gt; "&amp;"
        source =&gt; "params"
      }
    } else {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?commits"  
        }
        tag_on_failure =&gt; [ "drop" ]
        add_field =&gt; {
          "action" =&gt; "commit"
        }
      }
      if "drop" in [tags] {
        drop {}
      }
    }
  }</pre><p>This filter will only run on INFO level messages, due to the conditional at its beginning. The first Grok stage matches log events similar to the one above. The key fields we extract are the Solr collection/core, the endpoint we hit (e.g. update/extract), the parameters supplied by the HTTP request, the action (e.g. add or delete) and finally the document ID. If the Grok succeeded in extracting a params field, we run the key value stage, splitting on ampersands to extract each HTTP parameter. This is how a resulting document&#8217;s extracted contents look when stored in Elasticsearch:</p><pre class="crayon-plain-tag">{
  "level": "INFO",
  "LogTime": "2015-09-07 15:40:18.938",
  "collection": "sintef_main",
  "endpoint": "/update/extract",
  "params":     "literal.source=epifile&amp;literal.epi_file_title=A05100_Tass5+Trondheim.pdf&amp;literal.title=A05100_Tass5+Trondheim.pdf&amp;literal.id=epifile_211027&amp;literal.epifileid_s=211027&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;literal.filesource_s=SiteFile",
  "action": "add",
  "docId": "epifile_211027",
  "version": "1511661994131849216",
  "literal.source": "epifile",
  "literal.epi_file_title": "A05100_Tass5+Trondheim.pdf",
  "literal.title": "A05100_Tass5+Trondheim.pdf",
  "literal.id": "epifile_211027",
  "literal.epifileid_s": "211027",
  "literal.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "stream.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "literal.filesource_s": "SiteFile"
}</pre><p>If the Grok did not extract a params field, I want to identify possible commit messages with the following Grok. If this one fails, we tag messages with &#8220;drop&#8221;. Finally, any messages tagged with &#8220;drop&#8221; are dropped from the pipeline. I specifically created these Grok patterns to match indexing and commit messages as I already track queries at the middleware layer in our stack. If you want to track queries at the Solr level, simply use this pattern:</p><pre class="crayon-plain-tag">.+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\} hits=%{INT:hits} status=%{INT:status} QTime=%{INT:queryTime}</pre><p>The next section handles ERROR level messages:</p><pre class="crayon-plain-tag"># Error event implies stack trace, which requires multiline parsing
  if "ERROR" in [level] {
    multiline {
      pattern =&gt; "^\s"
      what =&gt; "previous"
      add_tag =&gt; [ "multiline_pre" ]
    }
    multiline {
        pattern =&gt; "^Caused by"
        what =&gt; "previous"
        add_tag =&gt; [ "multiline_post" ]
    }
    if "multiline_post" in [tags] {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+%{DATA:reason}(\n\t)((.+?Caused by: ((([a-zA-Z]+(\.|;|:))+) )+)%{DATA:reason}(\n\t))+"
        }
        tag_on_failure =&gt; []
      }
    }
  }</pre><p>Given a stack trace (there are a few in the example log file), this stage first combines all the lines of the stack trace into a single message. It then extracts the first and the last causes, the assumption being that the first message is the high-level failure message and the last one the actual underlying cause.</p>
<p>Finally, I drop any empty lines and clean up temporary tags:</p><pre class="crayon-plain-tag"># Remove intermediate tags, and multiline added randomly by multiline stage
  mutate {
      remove_tag =&gt; [ "multiline_pre", "multiline_post", "multiline" ]
  }
  # Drop empty lines
  if [message] =~ /^\s*$/ {
    drop {}
  }</pre><p>To check that you have successfully processed your Solr logs, open up the Sense plugin and run this query:</p><pre class="crayon-plain-tag"># aggregate on level
GET solr-*/_search
{
  "query": {
    "match_all": {}
  },
  "size": 10,
  "aggs": {
    "action": {
      "terms": {
        "field": "level",
        "size": 10
      }
    }
  }
}</pre><p>You should get back all your processed log events along with an aggregation on event severity.</p>
<h2>Conclusion</h2>
<p>Solr logs contain a great deal of useful information. With the ELK stack you can extract, store, analyse and visualise this data. I hope I&#8217;ve given you some helpful tips on how to start doing so! If you run into any problems, please get in touch in the comments below.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analyzing web server logs with Elasticsearch in the cloud</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/#comments</comments>
		<pubDate>Tue, 26 May 2015 21:12:34 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[found by elastic]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3702</guid>
		<description><![CDATA[Using Logstash and Kibana on Found by Elastic, Part 1 This is part one of a two post blog series, aiming to demonstrate how to feed logs from IIS into Elasticsearch and Kibana via Logstash, using the hosted services provided by Found by Elastic. This post will deal with setting up the basic functionality and [...]]]></description>
				<content:encoded><![CDATA[<h2>Using Logstash and Kibana on Found by Elastic, Part 1</h2>
<p>This is part one of a two-post blog series, aiming to demonstrate how to feed logs from IIS into Elasticsearch and Kibana via Logstash, using the hosted services provided by Found by Elastic. This post will deal with setting up the basic functionality and securing connections. Part 2 will show how to configure Logstash to read from IIS log files, and how to use Kibana 4 to visualize web traffic. Originally published on the <a href="https://www.found.no/foundation/analyzing-weblogs-with-elasticsearch/">Elastic Blog</a>.<br />
<span id="more-3702"></span></p>
<h4>Getting the Bits</h4>
<p>For this demo I will be running Logstash and Kibana from my Windows laptop.<br />
If you want to follow along, download and extract Logstash 1.5 RC4 or later, and Kibana 4.0.2 or later from <a href="https://www.elastic.co/downloads">https://www.elastic.co/downloads</a>.</p>
<h4>Creating an Elasticsearch Cluster</h4>
<p>Creating a new trial cluster in Found is just a matter of logging in and pressing a button. It takes a few seconds until the cluster is ready, and a screen with some basic information on how to connect pops up. We need the address for the HTTPS endpoint, so copy that out.</p>
<h4>Configuring Logstash</h4>
<p>Now, with the brand new SSL connection option in Logstash, connecting to Found is as simple as this Logstash configuration:</p><pre class="crayon-plain-tag">input { stdin{} }

output {
  elasticsearch {
    protocol =&gt; "http"
    host =&gt; "REPLACE_WITH_FOUND_CLUSTER_HOSTNAME"
    port =&gt; "9243" # Check the port also
    ssl =&gt; true
  }

  stdout { codec =&gt; rubydebug }
}</pre><p>&nbsp;</p>
<p>Save the file as found.conf</p>
<p>Start up Logstash using</p><pre class="crayon-plain-tag">bin\logstash.bat agent --verbose -f found.conf</pre><p>You should see a message similar to</p><pre class="crayon-plain-tag">Create client to elasticsearch server on `https://....foundcluster.com:9243`: {:level=&gt;:info}</pre><p>Once you see &#8220;Logstash startup completed&#8221;, type in your favorite test term on the terminal. Mine is &#8220;fisk&#8221; so I type that.<br />
You should see output on your screen showing what Logstash intends to pass on to elasticsearch.</p>
<p>We want to make sure this actually hits the cloud, so open a browser window and paste the HTTPS link from before, append <code>/_search</code> to the URL and hit enter.<br />
You should now see the search results from your newly created Elasticsearch cluster, containing the favorite term you just typed in. We have a functioning connection from Logstash on our machine to Elasticsearch in the cloud! Congratulations!</p>
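<p>The same check can be done from a terminal; a sketch with curl (replace the host with your own Found cluster endpoint):</p><pre class="crayon-plain-tag">curl "https://REPLACE_WITH_FOUND_CLUSTER_HOSTNAME:9243/_search?pretty"</pre>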
<h4>Configuring Kibana 4</h4>
<p>Kibana 4 comes with a built-in webserver. The configuration is done in a kibana.yml file in the config directory. Connecting to Elasticsearch in the cloud comes down to inserting the address of the Elasticsearch instance.</p><pre class="crayon-plain-tag"># The Elasticsearch instance to use for all your queries.
elasticsearch_url: "https://....foundcluster.com:9243"</pre><p>Of course, we need to verify that this really works, so we open up Kibana on <a href="http://localhost:5601">http://localhost:5601</a>, select the Logstash index pattern, with the @timestamp field as suggested, and open up the Discover panel. Now, if less than 15 minutes have passed since you inserted your favorite test term in Logstash (previous step), you should see it already. Otherwise, change the date range by clicking on the selector in the top right corner.</p>
<p><img class="alignleft" src="https://raw.githubusercontent.com/babadofar/MyOwnRepo/master/images/kibanatest.png" alt="Kibana test" width="1090"  /></p>
<h4>Locking it down</h4>
<p>Found by Elastic has worked hard to make the previous steps easy. We created an Elasticsearch cluster, fed data into it and displayed it in Kibana in less than 5 minutes. We must have forgotten something!? And yes, of course! Something about security. We made sure to use secure connections with SSL, and the address generated for our cluster contains a 32-character-long, randomly generated string, which is pretty hard to guess. Should, however, the address slip out of our hands, hackers could easily delete our entire cluster. And we don’t want that to happen. So let’s see how we can make everything work when we add some basic security measures.</p>
<h4>Access Control Lists</h4>
<p>Found by Elastic has support for access control lists, where you can set up lists of usernames and passwords, with lists of rules that deny/allow access to various paths within Elasticsearch. This makes it easy to create a &#8220;read only&#8221; user, for instance, by creating a user with a rule that only allows access to the <code>/_search</code> path. Found by Elastic has a sample configuration with users searchonly and readwrite. We will use these as a starting point, but first we need to figure out what Kibana needs.</p>
<h4>Kibana 4 Security</h4>
<p>Kibana 4 stores its configuration in a special index, by default named &#8220;.kibana&#8221;. The Kibana webserver needs write access to this index. In addition, all Kibana users need write access to this index, for storing dashboards, visualizations and searches, and read access to all the indices that it will query. More details about the access demands of Kibana 4 can be found on the <a href="http://www.elastic.co/guide/en/shield/current/_shield_with_kibana_4.html">elastic blog</a>.</p>
<p>For this demo, we will simply copy the “readwrite” user from the sample twice, naming one kibanaserver, the other kibanauser.</p><pre class="crayon-plain-tag"># Setting Access Control in Found
# Allow everything for the readwrite-user, kibanauser and kibanaserver
- paths: ['.*']
  conditions:
    - basic_auth:
        users:
          - readwrite
          - kibanauser
          - kibanaserver
    - ssl:
        require: true
  action: allow</pre><p>Press save and the changes are immediately effective. Try to reload Kibana at <a href="http://localhost:5601">http://localhost:5601</a>; you should be denied access.</p>
<p>Open up the kibana.yml file from before and modify it:</p><pre class="crayon-plain-tag"># If your Elasticsearch is protected with basic auth, this is the user credentials
# used by the Kibana server to perform maintenance on the kibana_index at startup. Your Kibana
# users will still need to authenticate with Elasticsearch (which is proxied through
# the Kibana server)
kibana_elasticsearch_username: kibanaserver
kibana_elasticsearch_password: "KIBANASERVER_USER_PASSWORD"</pre><p>Stop and start Kibana to apply the settings.<br />
Now when Kibana starts up, you will be presented with a login box for HTTP authentication.<br />
Type in kibanauser as the username, along with its password. You should now again be presented with the Discover screen, showing the previously entered favorite test term. Again, you may have to expand the time range to see your entry.</p>
<h4>Logstash Security</h4>
<p>Logstash will also need to supply credentials when connecting to Found by Elastic. We reuse the permissions from the readwrite user once again, this time using the name &#8220;logstash&#8221;.<br />
It is simply a matter of supplying the username and password in the configuration file.</p><pre class="crayon-plain-tag">output {
  elasticsearch {
    ….
    user =&gt; "logstash"
    password =&gt; "LOGSTASH_USER_PASSWORD"
  }
}</pre><p></p>
<h4>Wrapping it up</h4>
<p>This has been a short dive into Logstash and Kibana with Found by Elastic. The recent changes made to support the Shield plugin across Elasticsearch, Logstash and Kibana make it very easy to use the security features of Found by Elastic. In the next post we will look into feeding logs from IIS into Elasticsearch via Logstash, and visualizing the most used query terms in Kibana.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New version of Comperio FRONT.NET</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/13/ny-versjon-av-comperio-front-net/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/13/ny-versjon-av-comperio-front-net/#comments</comments>
		<pubDate>Wed, 13 May 2015 10:24:54 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Comperio Front]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3661</guid>
		<description><![CDATA[Over the years, Comperio has delivered more than 100 search projects. The ideas, sweat and experience gathered from this work have crystallised into our in-house software for search applications: FRONT. Earlier this spring we launched version 5 of the Java FRONT; this time it is the somewhat younger cousin, Comperio FRONT.NET, that has been on the operating table. The main features of [...]]]></description>
				<content:encoded><![CDATA[<p>Over the years, Comperio has delivered more than 100 search projects. The ideas, sweat and experience gathered from this work have crystallised into our in-house software for search applications: FRONT. Earlier this spring we launched version 5 of the Java FRONT; this time it is the somewhat younger cousin, Comperio FRONT.NET, that has been on the operating table. The main features of the new version are new search adapters, improved stability and performance, and improved logging.<br />
<span id="more-3661"></span></p>
<h4>Search middleware</h4>
<p>FRONT.NET operates as middleware and lets you configure search business logic independently of both the search engine and the presentation layer. FRONT.NET is built to fetch and combine information from different sources, and could well be called a search orchestrator.</p>
<p>FRONT.NET lets you separate business logic from application logic. Applications that need search functionality do not have to concern themselves with complicated query expressions; they simply send query terms over to FRONT.NET. If you need to narrow the search, you can pass along filters such as user information, location, department and the like. FRONT takes care of the complex queries.</p>
<h4>Search engine independence</h4>
<p>FRONT.NET offers a common format for queries and search results. The data format from FRONT is the same regardless of whether the engine behind it is SharePoint, ESP or Solr. Today FRONT.NET has adapters for FAST ESP, SharePoint 2010 and 2013, Elasticsearch, Solr and Google Search Appliance. This makes it easy to combine results from different search engines. If you want to swap out the search engine, it need not mean changes to your application, since it is only a matter of swapping out the search adapter in FRONT.NET. New adapters are developed as soon as we see the need arise.</p>
<h4>Elasticsearch adapter</h4>
<p>Elasticsearch is a search engine in rapid growth. In developing the Elasticsearch adapter we have been able to draw on NEST, the official .NET client for Elasticsearch. Elasticsearch offers enormous flexibility in how queries can be expressed, with support for nested boolean expressions and dynamic ranking functions. In developing the adapter we chose to minimise complexity in FRONT by delegating these capabilities to Elasticsearch via search templates. This preserves the flexibility while keeping the existing APIs and programming interfaces.</p>
<h4>Google Search Appliance Adapter</h4>
<p>Last year Comperio became a Google partner, and we have now developed a FRONT.NET adapter for Google&#8217;s intranet search engine, Google Search Appliance, or GSA for short. GSA offers simple integration with a range of different sources, the search interface is easy to work with, and the adapter supports all common search operations.</p>
<h4>Logging</h4>
<p>To develop a good search solution, it is essential to have access to good search logs that reveal how the search application is used.<br />
FRONT.NET has recently gained the ability to log directly to Logstash. Logstash combined with Elasticsearch and Kibana gives you a powerful tool for data analysis.</p>
<h4>FRONTD</h4>
<p>Version 5 of FRONT.NET runs as a stand-alone service on Windows.<br />
Earlier versions ran as a web application under IIS (Internet Information Services), but we find that running stand-alone gives us simplified administration as well as improved stability and performance.</p>
<h4>Microsoft, .NET and the road ahead</h4>
<p>The Microsoft and .NET world is developing at a furious pace at the moment, not least through Microsoft&#8217;s new and warmly welcomed opening towards open source. We very much like the idea of cross-platform .NET, and the next version of FRONT.NET will hopefully run just as well on OS X and Linux as it does on Windows.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/13/ny-versjon-av-comperio-front-net/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to develop Logstash configuration files</title>
		<link>http://blog.comperiosearch.com/blog/2015/04/10/how-to-develop-logstash-configuration-files/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/04/10/how-to-develop-logstash-configuration-files/#comments</comments>
		<pubDate>Fri, 10 Apr 2015 12:06:17 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[elastic]]></category>
		<category><![CDATA[logs]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3471</guid>
		<description><![CDATA[Installing logstash is easy. Problems arrive only once you have to configure it. This post will reveal some of the tricks the ELK team at Comperio has found helpful. Write configuration on the command line using the -e flag If you want to test simple filter configurations, you can enter it straight on the command [...]]]></description>
				<content:encoded><![CDATA[<p>Installing logstash is easy. Problems arrive only once you have to configure it. This post will reveal some of the tricks the ELK team at Comperio has found helpful.</p>
<h4><span id="more-3471"></span>Write configuration on the command line using the -e flag</h4>
<p>If you want to test simple filter configurations, you can enter them straight on the command line using the -e flag.</p><pre class="crayon-plain-tag">bin\logstash.bat agent -e 'filter{mutate{add_field =&gt; {"fish" =&gt; "salmon"}}}'</pre><p>After starting Logstash with the -e flag, simply type your test input into the console. (The defaults for input and output are stdin and stdout, so you don’t have to specify them.)</p>
<h4>Test syntax with --configtest</h4>
<p>After modifying the configuration, you can make Logstash check the syntax of the file by using the --configtest (or -t) flag on the command line.</p>
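<p>For example, to check the syntax of a config file without running it (a sketch; myconfig.conf is just an example file name):</p><pre class="crayon-plain-tag">bin\logstash.bat agent --configtest -f myconfig.conf</pre>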
<h4>Use stdin and stdout in the config file</h4>
<p>If your filter configurations are more involved, you can use input stdin and output stdout. If you need to pass a json object into logstash, you can specify codec json on the input.</p><pre class="crayon-plain-tag">input { stdin { codec =&gt; json } }

filter {
    if ![clicked] {
        mutate  {
            add_field =&gt; ["clicked", false]
        }
    }
}

output { stdout { codec =&gt; json }}</pre><p></p>
<h4> Use output stdout with codec =&gt; rubydebug<img class="alignright size-medium wp-image-3472" src="http://blog.comperiosearch.com/wp-content/uploads/2015/04/rubydebyg-300x106.png" alt="rubydebyg" width="300" height="106" /></h4>
<p>Using the rubydebug codec prints out a pretty, human-readable object on the console.</p>
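<p>A single event printed with rubydebug looks roughly like this (illustrative output only; the exact fields depend on your inputs and filters):</p><pre class="crayon-plain-tag">{
       "message" =&gt; "fisk",
      "@version" =&gt; "1",
    "@timestamp" =&gt; "2015-04-10T12:00:00.000Z",
          "host" =&gt; "myhost"
}</pre>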
<h4>Use the --verbose or --debug command line flags</h4>
<p>If you want to see more details regarding what Logstash is really doing, start it up using the --verbose or --debug flags. Be aware that this slows down processing speed greatly!</p>
<h4>Send logstash output to a log file.</h4>
<p>Using the -l “logfile.log” command line flag to logstash will store output to a file. Just watch your diskspace, in particular in combination with the &#8211;verbose flags these files can be humongous.</p>
<h4>When using file input: delete .sincedb files in your $HOME directory</h4>
<p>The file input plugin stores information about how far Logstash has progressed in processing the files, in .sincedb files in the user&#8217;s $HOME directory. If you want to re-process your logs, you have to delete these files.</p>
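<p>On Linux/OS X this is a one-liner, and during development you can also point the file input&#8217;s sincedb_path option at /dev/null so no state is kept between runs. A sketch (the log path is just an example):</p><pre class="crayon-plain-tag"># remove the bookkeeping files so Logstash re-reads the logs from the start
rm ~/.sincedb_*

# or keep no state at all while developing
input {
  file {
    path =&gt; "/var/log/mylog.log"
    sincedb_path =&gt; "/dev/null"
    start_position =&gt; "beginning"
  }
}</pre>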
<h4>Use the generator input stage</h4>
<p>You can add text lines you want to run through the filter and output stages directly in the config file by using the generator input plugin.</p><pre class="crayon-plain-tag">input {
  generator{
    lines =&gt; [
      '{"@message":"fisk"}',
      '{"@message": {"fisk":true}}',
      '{"notMessage": {"fisk":true}}',
      '{"@message": {"clicked":true}}'
      ]
    codec =&gt; "json"
    count =&gt; 5
  }
}</pre><p></p>
<h4>Use mutate add_tag after each successful stage.</h4>
<p>If you are developing configuration on a live system, adding tags after each stage makes it easy to search for the log events in Kibana/Elasticsearch.</p><pre class="crayon-plain-tag">filter {
  mutate {
    add_tag =&gt; "before conditional"
  }
  if [@message][clicked] {
    mutate {
      add_tag =&gt; "already had it clicked here"
    }
  } else {
      mutate {
        add_field  =&gt; [ "[@message][clicked]", false]
    }
  }
  mutate {
    add_tag =&gt; "after conditional"
  }
}</pre><p></p>
<h4>Developing grok filters with the grok debugger app</h4>
<p>The grok filter comes with a range of prebuilt patterns, but you will find the need to develop your own pretty soon. That&#8217;s when you open your browser to <a title="https://grokdebug.herokuapp.com/" href="https://grokdebug.herokuapp.com/">https://grokdebug.herokuapp.com/</a>. Paste in a representative line from your log, and you can start testing out matching patterns. There is also a discover mode that will try to figure out some fields for you.</p>
<p>The grok constructor, <a title="http://grokconstructor.appspot.com/do/construction" href="http://grokconstructor.appspot.com/do/construction">http://grokconstructor.appspot.com/do/construction</a>, offers an incremental mode, which I have found quite helpful to work with. You can paste in a selection of log lines, and it will offer a range of possibilities you can choose from, trying to match one field at a time.</p>
<h4> SISO</h4>
<p>If possible, pre-format logs so Logstash has less work to do. If you have the option to output logs as valid JSON, you don&#8217;t need grok filters since all the fields are already there.</p>
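<p>As a minimal sketch (the file path is just an example), a json codec on the input is all that is needed when the application already writes one JSON object per line:</p><pre class="crayon-plain-tag">input {
  file {
    path =&gt; "/var/log/myapp/search.log"
    codec =&gt; "json"
  }
}

output { stdout { codec =&gt; rubydebug } }</pre>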
<p>&nbsp;</p>
<p>This has been a short run-through of the tips and tricks we remember having used. If you know any other nice ways to develop Logstash configurations, please comment below.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/04/10/how-to-develop-logstash-configuration-files/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Replacing FAST ESP with Elasticsearch at Posten</title>
		<link>http://blog.comperiosearch.com/blog/2015/03/20/elasticsearch-at-posten/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/03/20/elasticsearch-at-posten/#comments</comments>
		<pubDate>Fri, 20 Mar 2015 10:00:52 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Comperio]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[elastic]]></category>
		<category><![CDATA[fast]]></category>
		<category><![CDATA[geosearch]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[posten]]></category>
		<category><![CDATA[tilbudssok]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3364</guid>
		<description><![CDATA[First, some background A few years ago Comperio launched a nifty service for Posten Norge, Norway&#8217;s postal service. Through the service, retail companies can upload their catalogues and seasonal flyers to make the products listed within searchable. Although the catalogue handling and processing is also very interesting, we&#8217;re going to focus on the search side [...]]]></description>
				<content:encoded><![CDATA[<h2>First, some background</h2>
<p>A few years ago Comperio launched a nifty service for <a title="Posten Norge" href="http://www.posten.no/">Posten Norge</a>, Norway&#8217;s postal service. Through the service, retail companies can upload their catalogues and seasonal flyers to make the products listed within searchable. Although the catalogue handling and processing is also very interesting, we&#8217;re going to focus on the search side of things in this post. As Comperio has a long relationship and a great deal of experience with <a title="FAST ESP" href="http://blog.comperiosearch.com/blog/2012/07/30/comperio-still-likes-fast-esp/">FAST ESP</a>, this first iteration of Posten&#8217;s <a title="Tilbudssok" href="http://tilbudssok.posten.no/">Tilbudssok</a> used it as the search backend. It also incorporated Comperio Front, our search middleware product, which recently <a title="Comperio Front" href="http://blog.comperiosearch.com/blog/2015/02/16/front-5-released/">had a big release</a>.</p>
<h2>Newer is better</h2>
<p>Unfortunately, FAST ESP is getting on a bit and as a result Tilbudssok has been limited by what we can coax out of it. To ensure we provide the best possible search solution we decided it was time to upgrade and chose <a title="Elasticsearch" href="https://www.elastic.co/products">Elasticsearch</a> as the best candidate. If you are unfamiliar with Elasticsearch, take a moment to browse our other <a title="Elasticsearch blog posts" href="http://blog.comperiosearch.com/blog/tag/elasticsearch/">blog posts</a> on the subject. The resulting project had three main requirements:</p>
<ul>
<li>Replace Fast ESP with Elasticsearch while otherwise maintaining as much of the existing architecture as possible</li>
<li>Add geodata to products such that a user could find the nearest store where they were available</li>
<li>Setup sexy log analysis with <a title="Logstash" href="https://www.elastic.co/products/logstash">Logstash</a> and <a title="Kibana" href="https://www.elastic.co/products/kibana">Kibana</a></li>
</ul>
<p>&nbsp;</p>
<h2>Data Sources, Ingestion and Processing</h2>
<p>The data source for the search system is a MySQL database populated with catalogue and product data. A separate Comperio system generates this data when Posten&#8217;s customers upload PDFs of their brochures i.e. we also fully own the entire data generation process.</p>
<p>The FAST ESP based solution made use of FAST&#8217;s JDBC connector to feed data directly to the search index. Inspired by <a title="Elasticsearch: Indexing SQL databases. The easy way." href="http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/">Christoffer&#8217;s blog post</a>, we made use of the <a title="Elasticsearch JDBC River Plugin" href="https://github.com/jprante/elasticsearch-river-jdbc">JDBC plugin for Elasticsearch</a>. This allowed us to use the same SQL statements to feed Elasticsearch. It took us no more than a couple of hours, including some time wrestling with field mappings, to populate our Elasticsearch index with the same data as the FAST one.</p>
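<p>For reference, a river definition for this plugin is just a document PUT into the _river index; a simplified sketch (connection details, index name and SQL are hypothetical):</p><pre class="crayon-plain-tag">curl -XPUT 'http://localhost:9200/_river/product_river/_meta' -d '{
  "type": "jdbc",
  "jdbc": {
    "url": "jdbc:mysql://localhost:3306/catalogue",
    "user": "feeder",
    "password": "secret",
    "sql": "select id as _id, * from products",
    "index": "products"
  }
}'</pre>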
<p>We then needed to add store geodata to the index. As mentioned earlier, we completely own the data flow. We simply extended our existing catalogue/product uploader system to include a store uploader service. Google&#8217;s <a title="Google Geocoder" href="https://code.google.com/p/geocoder-java/">geocoder</a> handled converting addresses to coordinates for use with Elasticsearch&#8217;s geo distance sorting. We now had store data in our database. An extra JDBC river and another round of mapping wrestling got that same data into the Elasticsearch index.</p>
<h2>Our approach</h2>
<p>Before the conversion to Elasticsearch, the Posten system architecture was typical of most Comperio projects. Users interact with a Java based frontend web application. This in turn sends queries to Comperio&#8217;s search abstraction layer, <a title="Comperio Front" href="http://blog.comperiosearch.com/blog/2015/02/16/front-5-released/">Comperio Front</a>. This formats requests such that the system&#8217;s search engine, in our case FAST ESP, can understand them. Upon receiving a response from the search engine, Front then formats it into a frontend friendly format i.e. JSON or XML depending on developer preference.</p>
<p>&nbsp;</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/tilbudssok_architecture.png"><img class="size-medium wp-image-3422 aligncenter" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/tilbudssok_architecture-300x145.png" alt="Generic Search Architecture" width="300" height="145" /></a></p>
<p>Unfortunately, when we started the project, Front&#8217;s Elasticsearch adapter was still a bit immature. It also felt a bit like overkill to include it when Elasticsearch has such a <a href="http://www.elastic.co/guide/en/elasticsearch/client/java-api/current/">robust Java API</a> already. I saw an opportunity to reduce the system&#8217;s complexity and learn more about interacting with Elasticsearch&#8217;s Java API and took it. With what I learnt, we could later beef up Front&#8217;s Elasticsearch adapter for future projects.</p>
<p>As a side note, we briefly flirted with the idea of replacing the entire frontend with a <a href="http://blog.comperiosearch.com/blog/2013/10/24/instant-search-with-angularjs-and-elasticsearch/">hipstery Javascript/Node.js ecosystem</a>. It was trivial to throw together a working system very quickly but in the interest of maintaining existing architecture and trying to keep project run time down we opted to stick with the existing Java based MVC framework.</p>
<p>After a few rounds of Googling, struggling with documentation and finally simply diving into the code, I was able to piece together the bits of the Elasticsearch Java API puzzle. It is a joy to work with! There are builder classes for pretty much everything. All of our queries start with a basic SearchRequestBuilder. Depending on the scenario, we can then modify this SRB with various flavours of QueryBuilders, FilterBuilders, SortBuilders and AggregationBuilders to handle every potential use case. Here is a greatly simplified example of a filtered search with aggregates:</p>
<script src="https://gist.github.com/92772945f5281df54c3b.js?file=SRBExample"></script>
<h2>Logstash and Kibana</h2>
<p>With our Elasticsearch based system up and ready to roll, the next step was to fulfil our sexy query logging project requirement. This raised an interesting question. Where are the query logs? As it turns out (please contact us if we&#8217;re wrong), the only query logging available is something called <a title="Slow Log" href="http://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-slowlog.html">slow logging</a>. It is a shard level log where you can set thresholds for the query or fetch phase of the execution. We found this log severely lacking in basic details such as hit count and actual query parameters. It seemed like we could only track query time and the query string.</p>
<p>Rather than fight with this slow log, we implemented our own custom logger in our web app to log salient parts of the search request and response. To make our lives easier everything is logged as JSON. This makes hooking up with <a title="Logstash" href="http://logstash.net/">Logstash</a> trivial, as our logstash config reveals:</p>
<script src="https://gist.github.com/43e3603bd75fd549a582.js?file=logstashconf"></script>
<p><a title="Kibana 4" href="http://blog.comperiosearch.com/blog/2015/02/09/kibana-4-beer-analytics-engine/">Kibana 4</a>, the latest version of Elastic&#8217;s log visualisation suite, was released in February, around the same time as we were wrapping up our logging logic. We had been planning on using Kibana 3, but this was a perfect opportunity to learn how to use version 4 and create some awesome dashboards for our customer:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_query.png"><img class="aligncenter size-medium wp-image-3444" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_query-300x169.png" alt="kibana_query" width="300" height="169" /></a></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_ams.png"><img class="aligncenter size-medium wp-image-3443" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_ams-300x135.png" alt="kibana_ams" width="300" height="135" /></a></p>
<p>Kibana 4 is wonderful to work with and will generate so much extra value for Posten and their customers.</p>
<h2>Conclusion</h2>
<ul>
<li>Although the Elasticsearch Java API itself is well rounded and complete, its documentation can be a bit frustrating. But this is why we write blog posts to share our experiences!</li>
<li>Once we got past the initial learning curve, we were able to create an awesome Elasticsearch Java API toolbox</li>
<li>We were severely disappointed with the built in query logging. I hope to extract our custom logger and make it more generic so everyone else can use it too.</li>
<li>The Google Maps API is fun and super easy to work with</li>
</ul>
<p>Rivers as a data ingestion tool have long been marked for deprecation. When we next want to upgrade our Elasticsearch version we will need to replace them entirely with some other tool. Although Logstash is touted as Elasticsearch&#8217;s main equivalent of a connector framework, it currently lacks classic Enterprise search data source connectors. <a title="Apache Manifold" href="http://manifoldcf.apache.org/">Apache Manifold</a> is a mature open source connector framework that would cover our needs. The latest release has not been tested with the latest version of Elasticsearch, but it supports versions 1.1-3.</p>
<p>Once the solution goes live, during April, Kibana will really come into its own as we get more and more data.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/03/20/elasticsearch-at-posten/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Elastic{ON}15: Day one</title>
		<link>http://blog.comperiosearch.com/blog/2015/03/11/elasticon15-day-one/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/03/11/elasticon15-day-one/#comments</comments>
		<pubDate>Wed, 11 Mar 2015 16:07:48 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[.net]]></category>
		<category><![CDATA[aggregations]]></category>
		<category><![CDATA[Elasticon]]></category>
		<category><![CDATA[found]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[san francisco]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3393</guid>
		<description><![CDATA[March 10, 2015 At Comperio we have been speculating for a while now that Elasticsearch might just drop search from their name. With Elasticsearch spearheading the expansion of search into analytics and all sorts of content and data driven applications such a change made sense to us. What the name would be we had no [...]]]></description>
				<content:encoded><![CDATA[<h6>March 10, 2015<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_1112452cropped.jpg"><img class="alignright size-medium wp-image-3396" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_1112452cropped-300x140.jpg" alt="IMG_20150310_111245~2cropped" width="300" height="140" /></a></h6>
<p>At Comperio we have been speculating for a while now that Elasticsearch might just drop search from their name. With Elasticsearch spearheading the expansion of search into analytics and all sorts of content and data driven applications, such a change made sense to us. What the name would be, however, we had no idea &#8211; ElasticStash, KibanElastic, StashElasticLog &#8211; none of these really rolled off the tongue like a proper brand.</p>
<p>More surprising is the Elasticsearch move into the cloud space by acquiring Found. A big and heartfelt congratulations to our Norwegian colleagues from us at Comperio. Found has built and delivered an innovative and solid product and we look forward to seeing them build something even better as a part of Elastic.</p>
<p>Elasticsearch is renamed to Elastic, and Found is no longer just Found, but Found by Elastic. The opening keynote held by CEO Steven Schuurman and Shay Banon was a tour of triumph through the history of Elastic, detailing how the company has grown in an organic, natural manner into what it is today. Kibana and Logstash started as separate projects but were soon integrated into Elastic. Shay and Steven explained how old roadmaps for the development of Elastic included plans to create CloudES, search as a cloud service. CloudES was never created, due to all the other pressing issues. Simultaneously, the Norwegian company Found made great strides with their cloud search offering, and an acquisition became a very natural fit.</p>
<p>Elastic{ON} is the first conference devoted entirely to the Elastic family of products. The sessions consist, on the one hand, of presentations by developers and employees of Elastic; on the other, there is “ELK in the wild”, showcasing customer use cases from Verizon, GitHub, Facebook and more.</p>
<p>On day one the sessions about core Elasticsearch, Lucene, Kibana and Logstash were of particular interest to us.</p>
<h4><strong>Elasticsearch</strong></h4>
<p>The session about “Recent developments in elasticsearch 2.0” held by Clinton Gormley and Simon Wilnauer revealed a host of interesting new features in the upcoming 2.0 release. There is a very high focus on stability, and making sure that no releases should contain bugs. To illustrate this Clinton showed graphs relating the number of lines of code compared to lines of tests, where the latter was rising sharply in the latest releases. It was also interesting to note that the number of lines of code has been reduced recently due to refactoring and other improvements to the code base.</p>
<p>Among the interesting new features are a new “reducer” step for aggregations, allowing calculations to be done on top of aggregated results, and a Changes API that helps manage changes to the index. The Changes API will be central to other planned features such as update by query. A typical use case is search-log analysis, where the Changes API will allow click activity to be recorded in the same log entry as the one containing the query.</p>
<p>There will also be a Reindex API that simplifies the development cycle when you have to refeed an entire index because you need to change a mapping or field type.</p>
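<p>As a rough, hedged sketch of how such a reindex call might look (the index names here are placeholders, and the exact request shape may well change before the final release):</p><pre class="crayon-plain-tag">curl -XPOST "http://localhost:9200/_reindex" -d '
{
  "source": { "index": "logs-old" },
  "dest":   { "index": "logs-new" }
}'</pre>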
<h4>Kibana</h4>
<p>Rashid Khan went through the motivations behind the development of Kibana 4: support for aggregations, together with making the product easier to work with and to extend, really turns it into a fitting platform for building data visualization tools. This was followed by “The Contributor&#8217;s Guide to the Kibana Galaxy” by Spencer Alger, who demoed how to set up the development environment for Kibana 4 using npm, grunt and bower &#8211; the standard web development toolset of today (or was it yesterday?).</p>
<h4>Logstash</h4>
<p>Logstash creator Jordan Sissel presented the new features of Logstash 1.5 and what to expect in future versions. 1.5 introduces a new plugin system, and, to the great relief of all Windows users out there, the issues regarding file locking on rolling log files have been resolved! The roadmap also aims to vastly improve the reliability of Logstash: no more losing documents in planned or unplanned outages. In addition, there are plans to add event persistence and various API management tools. With the river technology being deprecated, Logstash will take on the role of document processing framework that those of us who come from FAST ESP have missed for some time now. In effect, all rivers (including JDBC) will be ported to Logstash.</p>
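<p>As a small illustration of what the new plugin system means in practice &#8211; the plugin name and command syntax below are our assumption and may differ between releases &#8211; installing an extra input such as the JDBC one should be a one-liner from the Logstash home directory:</p><pre class="crayon-plain-tag"># install the JDBC input plugin (plugin name assumed)
bin/plugin install logstash-input-jdbc

# list the plugins that are currently installed
bin/plugin list</pre>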
<h4>Aggregations</h4>
<p>Mark Harwood presented a novel take on optimizing index creation for aggregations in the session “Building Entity Centric Indexes”. You may have tried to run some fancy aggregations, only to have elasticsearch die with out-of-memory errors. Avoiding this often takes some insight into the architecture so that you can structure your aggregations in the best possible manner. Mark essentially showed how to move some of the aggregation work to indexing time rather than query time. The original use case was a customer who needed to know the average session length for the users of his website. Figuring that out meant running through the whole index, sorting by session id, and subtracting the timestamp of the first event in each session from that of the last &#8211; a lot of operations with an enormous consumption of resources. Mark approaches such problems in a creative and mathematical manner, and it is always inspiring to attend his presentations. It will be interesting to see whether the Changes API mentioned above will deliver functionality that can be used to improve this kind of aggregated data.</p>
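<p>To make the idea concrete, here is a minimal sketch of the entity-centric approach (index and field names are ours for illustration, not Mark&#8217;s): one document per session with the length precomputed at indexing time, so the query-time work shrinks to a cheap single-field aggregation:</p><pre class="crayon-plain-tag"># one document per session, with the session length computed when indexing
curl -XPUT "http://localhost:9200/sessions/session/abc123" -d '
{
  "session_id": "abc123",
  "first_event": "2015-03-10T10:00:00Z",
  "last_event": "2015-03-10T10:07:30Z",
  "session_length_seconds": 450
}'

# average session length is now an aggregation over a single numeric field
curl -XGET "http://localhost:9200/sessions/_search" -d '
{
  "size": 0,
  "aggs": {
    "avg_session_length": { "avg": { "field": "session_length_seconds" } }
  }
}'</pre>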
<h4>.NET</h4>
<p>The deep dive into the .NET clients with Martijn Laarman showed how to use a strongly typed language such as C# with elasticsearch. Yes, it is actually possible, and it looked very good. There is a low-level client that just connects to the API, where you have to do all the parsing yourself, and a high-level client called NEST built on top of it, offering a strongly typed query DSL with an almost 1-to-1 mapping to the elasticsearch DSL. Particularly nifty was the covariant result handling, where you can specify the types of results you want back, since a search result from elasticsearch can contain many types.</p>
<p>Looking forward to day 2!<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_213606.jpg"><img class="alignright size-medium wp-image-3391" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_213606-300x222.jpg" alt="IMG_20150310_213606" width="300" height="222" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/03/11/elasticon15-day-one/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ELK in one (Vagrant) box</title>
		<link>http://blog.comperiosearch.com/blog/2014/08/14/elk-one-vagrant-box/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/08/14/elk-one-vagrant-box/#comments</comments>
		<pubDate>Thu, 14 Aug 2014 14:06:18 +0000</pubDate>
		<dc:creator><![CDATA[Murhaf Fares]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[elk]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[puppet]]></category>
		<category><![CDATA[vagrant]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2813</guid>
		<description><![CDATA[In this blog post we introduce a Vagrant box to easily create configurable and reproducible development environments for ELK (Elasticsearch, Logstash and Kibana). At Comperio, we mainly use this box for query log analysis using the ELK stack. In case you don’t know, Vagrant is a free and open-source software that combines VirtualBox (a virtualization [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright size-medium wp-image-2828" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk_vagrant_chilling1-300x300.png" alt="elk_vagrant_chilling" width="300" height="300" /></p>
<p>In this blog post we introduce a Vagrant box to easily create configurable and reproducible development environments for ELK (Elasticsearch, Logstash and Kibana). At Comperio, we mainly use this box for query log analysis using the ELK stack.<br />
In case you don’t know, <a href="http://www.vagrantup.com/">Vagrant</a> is free and open-source software that combines VirtualBox (virtualization software) with configuration management software such as Puppet and Chef.</p>
<p><strong>ELK stack up and running in two commands</strong></p>
<blockquote><p>$ git clone https://github.com/comperiosearch/vagrant-elk-box.git<br />
$ vagrant up</p></blockquote>
<p>By cloning this <a href="https://github.com/comperiosearch/vagrant-elk-box">GitHub repo</a> and then typing “vagrant up”, you will be installing Elasticsearch, Logstash, Kibana and nginx (the latter used to serve Kibana).</p>
<p>Elasticsearch will be running on port 9200, as usual, which is forwarded to the host machine. As for Kibana, it will be served on port 5601 (also accessible from the host OS).</p>
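<p>Once the box is up, a quick way to verify both services from the host OS (assuming the default forwarded ports) is:</p><pre class="crayon-plain-tag"># Elasticsearch answers on the forwarded port 9200
curl http://localhost:9200

# Kibana is served by nginx on the forwarded port 5601
curl -I http://localhost:5601</pre>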
<p><strong>How does it work?</strong><br />
As mentioned above, Vagrant is a wrapper around VirtualBox and some configuration management software. In our box, we use pure shell scripting and Puppet to configure the ELK stack.<br />
There are two essential configuration files in this box: <a href="https://github.com/comperiosearch/vagrant-elk-box/blob/master/Vagrantfile">Vagrantfile</a> and the Puppet manifest <a href="https://github.com/comperiosearch/vagrant-elk-box/blob/master/manifests/default.pp">default.pp</a>.<br />
Vagrantfile includes the settings of the virtual box such as operating system, memory size, number of CPUs, forwarded ports, etc…<br />
<script src="https://gist.github.com/0ba6fa7ecece4fdac1ff.js?file=Vagrantfile"></script></p>
<p>Vagrantfile also includes a shell script that installs, among other things, the official Puppet modules for <a href="https://github.com/elasticsearch/puppet-elasticsearch">elasticsearch</a> and <a href="https://github.com/elasticsearch/puppet-logstash">logstash</a>. By using that shell script we stay away from git submodules which were used in <a href="https://github.com/comperiosearch/vagrant-elasticsearch-box">another Vagrant image</a> we made earlier for elasticsearch.<br />
<script src="https://gist.github.com/fb50e0cfcdee2e14898a.js?file=Vagrantfile"></script></p>
<p>In the Puppet manifest, default.pp, we define what version of elasticsearch to install and make sure that it is running as a service.<br />
<script src="https://gist.github.com/3abbe1b3aee8ecbe1b9e.js?file=default.pp"></script></p>
<p>We do the same for logstash and additionally link the default logstash configuration file to <a href="https://github.com/comperiosearch/vagrant-elk-box/blob/master/confs/logstash/logstash.conf">this file</a> under /Vagrant/confs/logstash which is shared with the host OS. Finally, we install nginx and Kibana, and configure Kibana to run on port 5601 (by linking the nginx conf file to <a href="https://github.com/comperiosearch/vagrant-elk-box/blob/master/confs/nginx/default">this file</a> in the Vagrant directory also).</p>
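<p>If you later edit default.pp or any of the shared configuration files, there is no need to rebuild the box; re-running the provisioners and checking the services from inside the box should be enough (the service names below assume the packages installed by the official Puppet modules):</p><pre class="crayon-plain-tag"># apply the provisioning again after editing manifests or configs
vagrant provision

# log into the box and check that the services are running
vagrant ssh
sudo service elasticsearch status
sudo service logstash status</pre>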
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/08/14/elk-one-vagrant-box/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>SharePoint ULS log analysis using ELK</title>
		<link>http://blog.comperiosearch.com/blog/2014/08/01/sharepoint-log-analysis-using-elk/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/08/01/sharepoint-log-analysis-using-elk/#comments</comments>
		<pubDate>Fri, 01 Aug 2014 11:31:06 +0000</pubDate>
		<dc:creator><![CDATA[Madalina Rogoz]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[log analysis]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[sharepoint]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2775</guid>
		<description><![CDATA[E is for Elasticsearch Elasticsearch is an open source search and analytics engine that extends the limits of full-text search through a robust set of APIs and DSLs, to deliver a flexible and almost limitless search experience. L is for Logstash One of the most popular open source log parser solutions on the market, Logstash has the possibility of reading any data source [...]]]></description>
				<content:encoded><![CDATA[<h3>E is for Elasticsearch</h3>
<p><a href="http://www.elasticsearch.org/">Elasticsearch</a> is an open source search and analytics engine that extends the limits of full-text search through a robust set of APIs and DSLs, to deliver a flexible and almost limitless search experience.</p>
<h3>L is for Logstash</h3>
<p>One of the most popular open source log parser solutions on the market, <a href="http://logstash.net/">Logstash</a> can read virtually any data source and extract the data as JSON; it is easy to use and up and running in minutes.</p>
<h3>K is for Kibana</h3>
<p>A data visualization engine, <a href="http://www.elasticsearch.org/overview/kibana/">Kibana</a> allows the user to create custom dashboards and to analyze Elasticsearch data on-the-fly and in real-time.</p>
<h3>Getting set up</h3>
<p>To start using this technology, you just need to <a href="http://www.elasticsearch.org/overview/elkdownloads/">install</a> the three above-mentioned components, which in practice means downloading and unzipping three archive files.</p>
<p>The data flow is this: the log files are text files residing in a folder. Logstash uses a configuration file to read the logs and parse all the entries. The parsed data is sent to Elasticsearch for storage. Once there, it can easily be read and displayed by Kibana.</p>
<p><img class="alignnone size-full wp-image-2779" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk004.jpg" alt="elk004" width="608" height="107" /></p>
<h3>Parsing SharePoint ULS log files with Logstash</h3>
<p>We will now focus on the simplest and most straightforward way of getting this to work, without any additional configuration or settings. Our goal is to open Kibana and be able to configure some charts that will help us visualize and explore what type of entries we have in the SharePoint ULS logs, and to be able to search the logs for interesting entries.</p>
<p>To begin, we need some ULS log files from SharePoint that will be placed in a folder on the server (I am working on a Windows Server virtual environment) where we are testing the ELK stack. My ULS logs are located here: C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\14\LOGS</p>
<p>As an example, the first line in one of my log files looks like this:</p><pre class="crayon-plain-tag">05/06/2014 10:20:20.85    wsstracing.exe (0x0900)    0x0928    SharePoint Foundation    Tracing Controller Service    5152    Information    Tracing Service started.</pre><p>The next step is to build the configuration file. This is a text file with a .config extension, located by default in the Logstash folder. The starting point for the content of this file would be:</p><pre class="crayon-plain-tag">input {
  file {
    type =&gt; "sharepointlog"
    path =&gt; ["[folder where the logs reside]/*.log"]
    start_position =&gt; "beginning"
    codec =&gt; "plain"
  }
}
filter {
}
output {
  elasticsearch {
    embedded =&gt; true
  }
}</pre><p>The Input section defines the location of the logs and some reading parameters, like the position where Logstash will start parsing the files. The Output section defines the destination of the parsed data, in our case the Elasticsearch instance installed on the same server.</p>
<p>Now for the important part: the Filter section. The Filter section contains one or more GROK patterns that Logstash uses to identify the format of the log entries. There are many types of entries, but since we are focusing on the event type and message, we have to parse all the parameters up to the message part in order to get what we need.</p>
<p>The documentation is pretty detailed when it comes to GROK, and a <a href="http://grokdebug.herokuapp.com/">pattern debugger website</a> with a GROK testing engine is available online, so you can develop and test your patterns before actually running them in Logstash.</p>
<p>So this is what I came up with for the SharePoint ULS logs:</p><pre class="crayon-plain-tag">filter {
  if [type] == "sharepointlog" {
    grok {
      match =&gt; [ "message",
        "(?&lt;parsedtime&gt;%{MONTHNUM}/%{MONTHDAY}/%{YEAR} %{HOUR}:%{MINUTE}:%{SECOND}) \t%{DATA:process} \(%{DATA:processcode}\)(\s*)\t%{DATA:tid}(\s*)\t(?&lt;area&gt;.*)(\s*)\t(?&lt;category&gt;.*)(\s*)\t%{WORD:eventID}(\s*)\t%{WORD:level}(\s*)\t%{DATA:eventmessage}\t%{UUID:CorrelationID}"]
      match =&gt; [ "message",
        "(?&lt;parsedtime&gt;%{MONTHNUM}/%{MONTHDAY}/%{YEAR} %{HOUR}:%{MINUTE}:%{SECOND}) \t%{DATA:process} \(%{DATA:processcode}\)(\s*)\t%{DATA:tid}(\s*)\t(?&lt;area&gt;.*)(\s*)\t(?&lt;category&gt;.*)(\s*)\t%{WORD:eventID}(\s*)\t%{WORD:level}(\s*)\t%{DATA:eventmessage}"]
      match =&gt; [ "message",
        "(?&lt;parsedtime&gt;%{MONTHNUM}/%{MONTHDAY}/%{YEAR} %{HOUR}:%{MINUTE}:%{SECOND})%{GREEDYDATA}\t%{DATA:process} \(%{DATA:processcode}\)(\s*)\t%{DATA:tid}(\s*)\t(?&lt;area&gt;.*)(\s*)\t(?&lt;category&gt;.*)(\s*)\t%{WORD:eventID}(\s*)\t%{WORD:level}(\s*)\t%{DATA:eventmessage}"]
    }
    date {
      match =&gt; ["parsedtime", "MM/dd/YYYY HH:mm:ss.SSS"]
    }
  }
}</pre>
<h3>Logstash in action</h3>
<p>All that&#8217;s left to do is to get Logstash going and see what comes out. Run the following on the command line:</p><pre class="crayon-plain-tag">logstash.bat agent -f "sharepoint.conf"</pre><p>This runs Logstash as an agent, so it will monitor the file or folder you specify in the input section of the config for changes. If you are indexing a folder where files appear periodically, you don&#8217;t need to worry about restarting the process; it will keep going on its own.</p>
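<p>While the agent is running, you can check from a second command line that documents are actually arriving in Elasticsearch &#8211; the first call counts what has been indexed so far, the second returns a single parsed entry so you can inspect the GROK fields (this assumes the embedded Elasticsearch listens on the default port 9200 and that Logstash writes to its default daily logstash-* indexes):</p><pre class="crayon-plain-tag">curl "http://localhost:9200/logstash-*/_count?pretty"
curl "http://localhost:9200/logstash-*/_search?size=1&amp;pretty"</pre>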
<h3>Kibana time</h3>
<p>Now let&#8217;s create a new dashboard in Kibana and see what was indexed. The most straightforward panel type is Histogram. Make no changes to the default settings of this panel (Chart value = count, Time field = @timestamp) and you should see something similar to this:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk005.jpg"><img class="alignnone wp-image-2780 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk005-300x125.jpg" alt="elk005" width="300" height="125" /></a></p>
<p><span class="TextRun SCX61371348" style="color: #000000" xml:lang="EN-US"><span class="NormalTextRun SCX61371348">To get some more relevant information, we can add some pie charts and let them display other properties that we have mapped, for example ‘process’ or ‘</span></span><span class="TextRun SCX61371348" style="color: #000000" xml:lang="EN-US"><span class="NormalTextRun SCX61371348">area</span></span><span class="TextRun SCX61371348" style="color: #000000" xml:lang="EN-US"><span class="NormalTextRun SCX61371348">’. </span></span><span class="LineBreakBlob BlobObject SCX61371348" style="color: #000000"><span class="SCX61371348"> </span><br class="SCX61371348" /></span></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk001.jpg"><img class="alignnone size-medium wp-image-2776" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk001-300x68.jpg" alt="elk001" width="300" height="68" /></a></p>
<p><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="SpellingError SCX4542470">Now</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">let&#8217;s</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">turn</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">this</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">up</span><span class="NormalTextRun SCX4542470"> a </span><span class="SpellingError SCX4542470">notch</span><span class="NormalTextRun SCX4542470">:</span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">t</span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="SpellingError SCX4542470">hrough</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">Kibana</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">we</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">can</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">take</span><span class="NormalTextRun SCX4542470"> a look at the </span><span class="SpellingError SCX4542470">err</span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="SpellingError SCX4542470">ors</span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="NormalTextRun SCX4542470"> in the SharePoint logs. </span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="SpellingError SCX4542470">Create</span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="NormalTextRun SCX4542470"> a </span><span class="SpellingError SCX4542470">pie</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">chart</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">that</span><span class="NormalTextRun SCX4542470"> displays the &#8220;</span><span class="SpellingError SCX4542470">level</span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="NormalTextRun SCX4542470">&#8221; </span><span class="SpellingError SCX4542470">field</span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="NormalTextRun SCX4542470">. 
By</span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">clicking</span><span class="NormalTextRun SCX4542470"> on the &#8220;</span><span class="SpellingError SCX4542470">Unexpected</span><span class="NormalTextRun SCX4542470">&#8221; slice in </span><span class="SpellingError SCX4542470">this</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">chart</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">you</span><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">will</span><span class="NormalTextRun SCX4542470"> filter all the </span><span class="SpellingError SCX4542470">dashboard</span><span class="NormalTextRun SCX4542470"> on </span><span class="SpellingError SCX4542470">this</span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="NormalTextRun SCX4542470"> </span><span class="SpellingError SCX4542470">value</span></span><span class="TextRun SCX4542470" style="color: #000000" xml:lang="SV-SE"><span class="NormalTextRun SCX4542470">. </span></span><span class="EOP SCX4542470" style="color: #000000"> </span></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk002.jpg"><img class="alignnone size-medium wp-image-2777" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk002-300x151.jpg" alt="elk002" width="300" height="151" /></a></p>
<p>Kibana will automatically refresh the page; the filter itself will be displayed in the &#8220;Filter&#8221; row, and all you will see are the &#8220;Unexpected&#8221; events. Time to turn to the help of a Table chart: by displaying the columns you select in the Fields section of this chart, you can view and sort the log entries for a more detailed analysis of the unexpected events.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk003.jpg"><img class="alignnone size-medium wp-image-2778" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk003-300x74.jpg" alt="elk003" width="300" height="74" /></a></p>
<p><span class="TextRun SCX126625747" style="color: #000000" xml:lang="SV-SE"><span class="NormalTextRun SCX126625747">As the </span></span><span class="TextRun SCX126625747" style="color: #000000" xml:lang="SV-SE"><span class="SpellingError SCX126625747">Logstash</span><span class="NormalTextRun SCX126625747"> process </span><span class="SpellingError SCX126625747">runs</span><span class="NormalTextRun SCX126625747"> as an agent, </span><span class="SpellingError SCX126625747">you</span><span class="NormalTextRun SCX126625747"> </span><span class="SpellingError SCX126625747">can</span><span class="NormalTextRun SCX126625747"> monitor the SharePoint events in </span><span class="SpellingError SCX126625747">real-</span></span><span class="TextRun SCX126625747" style="color: #000000" xml:lang="SV-SE"><span class="SpellingError SCX126625747">time</span><span class="NormalTextRun SCX126625747">!</span></span><span class="EOP SCX126625747" style="color: #000000"> So there you have it, SharePoint log analysis using ELK.</span></p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/08/01/sharepoint-log-analysis-using-elk/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Elasticsearch Visits Comperio</title>
		<link>http://blog.comperiosearch.com/blog/2014/04/04/elasticsearch-visits-comperio/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/04/04/elasticsearch-visits-comperio/#comments</comments>
		<pubDate>Fri, 04 Apr 2014 08:28:04 +0000</pubDate>
		<dc:creator><![CDATA[Fergus McDowall]]></dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2171</guid>
		<description><![CDATA[Yesterday the legendary Shay Banon, inventor of Elasticsearch, and Arie Chapman dropped in to Comperio’s Oslo office on their way to the Oslo Elasticsearch Meetup to talk about what’s hot in Elasticsearch v1.x. Shay gave the team the lowdown on the latest functionality, and Arie outlined interesting customers and use-cases. Shay also talked about how [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/04/rsz_bilde4.jpg"><img class="alignright size-medium wp-image-2181" title="rsz_bilde" src="http://blog.comperiosearch.com/wp-content/uploads/2014/04/rsz_bilde4-300x225.jpg" alt="" width="300" height="225" /></a>Yesterday the legendary <a href="https://twitter.com/kimchy">Shay Banon</a>, inventor of Elasticsearch, and <a href="https://twitter.com/ArieChapman">Arie Chapman</a> dropped in to Comperio’s Oslo office on their way to the Oslo Elasticsearch Meetup to talk about <a href="http://www.elasticsearch.org/">what’s hot in Elasticsearch v1.x.</a></p>
<p>Shay gave the team the lowdown on the latest functionality, and Arie outlined interesting customers and use-cases. Shay also talked about how <a href="http://www.elasticsearch.org/overview/kibana/">Kibana</a> and <a href="http://logstash.net/">Logstash</a> can make pushing data in and out of indexes easier.</p>
<p>Comperio is really interested in the opportunities that Elasticsearch opens up for visualizing large datasets, particularly those generated by large distributed electronic systems. We will definitely be following up on these opportunities with our customers, and hope to bounce some ideas off the Elasticsearch guys again soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/04/04/elasticsearch-visits-comperio/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
