<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; Christoffer Vig</title>
	<atom:link href="http://blog.comperiosearch.com/blog/author/cvig/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>ELK stack deployment with Ansible</title>
		<link>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/#comments</comments>
		<pubDate>Thu, 26 Nov 2015 09:59:38 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[ansible]]></category>
		<category><![CDATA[deployment]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[elk]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3999</guid>
		<description><![CDATA[As human beings, we like to believe that each and every one of us is a special individual, and not easily replaceable. That may be fine, but please, don’t fall into the habit of treating your computer the same way. Ansible is a free software platform for configuring and managing computers, and I’ve been using [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright" src="http://www.ansible.com/hs-fs/hub/330046/file-767051897-png/Official_Logos/ansible_circleA_red.png?t=1448391213471" alt="" width="251" height="251" />As human beings, we like to believe that each and every one of us is a special individual, and not easily replaceable. That may be fine, but please, don’t fall into the habit of treating your computer the same way.</p>
<p><span id="more-3999"></span></p>
<p><a href="https://en.wikipedia.org/wiki/Ansible_(software)"><b>Ansible</b></a> is a <a href="https://en.wikipedia.org/wiki/Free_software">free software</a> platform for configuring and managing computers, and I’ve been using it a lot lately to manage the ELK stack: Elasticsearch, Logstash and Kibana.</p>
<p>I can define the list of servers I want to manage in an INI-style config file &#8211; the so-called inventory:</p><pre class="crayon-plain-tag">[elasticsearch-master]
es-master1.mydomain.com
es-master2.mydomain.com
es-master3.mydomain.com

[elasticsearch-data]
elk-data1.mydomain.com
elk-data2.mydomain.com
elk-data3.mydomain.com

[logstash]
logstash.mydomain.com

[kibana]
kibana.mydomain.com</pre><p>I then define the roles for the servers in a YAML config file &#8211; the so-called playbook:</p><pre class="crayon-plain-tag">- hosts: elasticsearch-master
  roles:
    - ansible-elasticsearch

- hosts: elasticsearch-data
  roles:
    - ansible-elasticsearch

- hosts: logstash
  roles:
    - ansible-logstash

- hosts: kibana
  roles:
    - ansible-kibana</pre><p>&nbsp;</p>
<p>Each group of servers may have its own files containing configuration variables.</p><pre class="crayon-plain-tag">elasticsearch_version: 2.1.0
elasticsearch_node_master: false
elasticsearch_heap_size: 1g</pre><p>&nbsp;</p>
<p>Ansible is used for configuring the ELK stack vagrant box at <a href="https://github.com/comperiosearch/vagrant-elk-box-ansible">https://github.com/comperiosearch/vagrant-elk-box-ansible</a>, which was recently upgraded with Elasticsearch 2.1, Kibana 4.3 and Logstash 2.1</p>
<p>The same set of Ansible roles can be applied when the configuration needs to move into production, by applying another set of variable files with modified host names, certificates and such. There are several ways to do this.</p>
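<p>As a sketch of one common way to do it (the directory and file names here are illustrative, not from my actual setup), you can keep one inventory per environment, each with its own variable files:</p><pre class="crayon-plain-tag">inventories/
  staging/hosts
  staging/group_vars/elasticsearch-data.yml
  production/hosts
  production/group_vars/elasticsearch-data.yml</pre><p>and then point the same playbook at the environment you want:</p><pre class="crayon-plain-tag">ansible-playbook -i inventories/staging/hosts site.yml
ansible-playbook -i inventories/production/hosts site.yml</pre>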
<p><b>How does it work?</b></p>
<p>Ansible is agent-less: you do not install anything (an agent) on the machines you manage. Ansible only needs to be installed on the controlling machine (Linux/OS X) and connects to the managed machines over SSH (there is even some support for Windows targets). The only requirement on the managed machines is Python.</p>
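<p>A quick way to see this in action (assuming your inventory file is named hosts) is Ansible&#8217;s ad-hoc mode, which connects to every host over SSH and runs the ping module &#8211; no agent involved:</p><pre class="crayon-plain-tag">ansible all -i hosts -m ping</pre><p>Each managed machine should answer with &#8220;pong&#8221; if SSH access and Python are in place.</p>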
<p>Happy ansibling!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Elasticsearch: Shield protected Kibana with Active Directory</title>
		<link>http://blog.comperiosearch.com/blog/2015/08/21/elasticsearch-security-shield/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/08/21/elasticsearch-security-shield/#comments</comments>
		<pubDate>Fri, 21 Aug 2015 14:26:45 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[enterprise]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3245</guid>
		<description><![CDATA[Elasticsearch easily stores terabytes of data, but how can you make sure users only see the data they should? This post will explore how to use Shield, a plugin for Elasticsearch, to authenticate users with Active Directory. Elasticsearch will by default allow anyone access to all data. The Shield plugin allows locking down Elasticsearch using authentication [...]]]></description>
				<content:encoded><![CDATA[<p>Elasticsearch easily stores terabytes of data, but how can you make sure users only see the data they should? This post will explore how to use Shield, a plugin for Elasticsearch, to authenticate users with Active Directory.</p>
<p><span id="more-3245"></span><br />
<a title="NO TRESPASSING" href="https://www.flickr.com/photos/mike2099/2058021162/in/photolist-48RTZu-4ttdcn-4YPqqU-5WbRAP-8rYugF-XsCao-ftZ1hL-dpmFB-dqyeUE-bjV3VY-bEMba3-bEMb6w-84YCqg-rf5Yk1-8Yjaj3-chg68s-4KDN1M-4KDMWF-5MfWjA-tCJt6J-8nxBiZ-6YsUyh-KfDRK-54uLmy-bv1Pv-oChdLk-pL3X8t-4RTTjd-dhfUPn-cEkCFY-czjXiE-m1zThD-dzESFD-oj2KUM-c16MV-72dTxS-g4Yky4-kK9YR-p6DYnY-5HJvrX-8aovPQ-dhfVkP-bwB8c-gFzTXk-7zd9iF-eua6KC-2gzEc-8nxtcH-2gzEb-fnp3zH" data-flickr-embed="true"><img src="https://farm3.staticflickr.com/2059/2058021162_ed7b6e8d72_b.jpg" alt="NO TRESPASSING" width="600" /></a><script src="//embedr.flickr.com/assets/client-code.js" async="" charset="utf-8"></script></p>
<p>Elasticsearch will by default allow anyone access to all data. The <a href="https://www.elastic.co/guide/en/shield/current/introduction.html">Shield</a> plugin allows locking down Elasticsearch using authentication from the internal esusers realm, Active Directory (AD) or LDAP. Using AD, you can map groups defined in your Windows domain to roles in Elasticsearch. For instance, you can allow people in the Fishery department access only to fish indexes, and give complete control to anyone in the IT department.</p>
<p>To use Shield in production you have to buy an Elasticsearch subscription; however, you get a 30-day trial when installing the license manager. So let&#8217;s hurry up and see how this works out in Kibana.</p>
<p>&nbsp;</p>
<p>In this post, we will install Shield and connect to Active Directory (AD) for authentication. After having made sure we can authenticate with AD, we will add SSL encryption everywhere possible. We will add authentication for the Kibana server using the built-in esusers realm, and if time allows at the end, we will create two user groups, each with access to its own index, and check how it all looks in Kibana 4.</p>
<p>&nbsp;</p>
<h3>Prerequisites</h3>
<p>You will need a previously installed Elasticsearch and Kibana. The most recent versions should work; I have used Elasticsearch 1.7 and Kibana 4.1.1. If you need a machine to test on, I can personally recommend the vagrant-elk-box you can find <a href="https://github.com/comperiosearch/vagrant-elk-box-ansible">here</a>. <strong>The following guide assumes the file locations of the vagrant-elk-box</strong>; if you installed differently, you will probably know where to look. Ask an adult for help.</p>
<p>For Active Directory, you need to be on a domain that uses Active Directory. That would probably mean some kind of Windows work environment.</p>
<p>&nbsp;</p>
<h4>Installing Shield</h4>
<p>If you&#8217;re on the vagrant box, begin the lesson by entering it using the commands</p><pre class="crayon-plain-tag">vagrant up
vagrant ssh</pre><p>&nbsp;</p>
<p>Install the license manager</p><pre class="crayon-plain-tag"> sudo /usr/share/elasticsearch/bin/plugin -i elasticsearch/license/latest</pre><p>Install Shield</p><pre class="crayon-plain-tag"> sudo /usr/share/elasticsearch/bin/plugin -i elasticsearch/shield/latest</pre><p>Restart Elasticsearch (service elasticsearch restart).</p>
<p>Check the logs; you should find some information about when your Shield license will expire (logfile location: /var/log/elasticsearch/vagrant-es.log).</p>
<h4>Integrating Active Directory</h4>
<p>The next step involves figuring out a thing or two about your Active Directory configuration. First of all, you need to know its address. On your Windows machine, open cmd.exe and type</p><pre class="crayon-plain-tag">set LOGONSERVER</pre><p>The name of your AD server should pop back. Add a section similar to the following to the elasticsearch.yml file (at /etc/elasticsearch/elasticsearch.yml)</p><pre class="crayon-plain-tag">shield.authc.realms:
  active_directory:
    type: active_directory
    domain_name: superdomain.com
    unmapped_groups_as_roles: true
    url: ldap://ad.superdomain.com</pre><p>Type the address of your AD server into the url: field (where it says url: ldap://ad.superdomain.com). If your logonserver is ad.cnn.com, the line should read url: ldap://ad.cnn.com.</p>
<p>Also, you need to figure out your domain name and type it in correctly.</p>
<p>NB: Be careful with the indenting! Elasticsearch cares a lot about correct indenting, and may even refuse to start without telling you why if you make a mistake.</p>
<h5>Finding the Correct name for the Active Directory group</h5>
<p>The next step involves figuring out the name of the group you wish to grant access to. You may have called your group &#8220;Fishermen&#8221;, but that is probably not exactly what it&#8217;s called in AD.</p>
<p>Microsoft has a very simple and nice tool called <a href="https://technet.microsoft.com/en-us/library/bb963907.aspx">Active Directory Explorer</a>. Open the tool and enter the address you just found from LOGONSERVER (remember? it&#8217;s only 10 lines above).</p>
<p>You may have to click and explore a little to find the group you want. Once you find it, you need the value of its &#8220;distinguishedName&#8221; attribute. You can double-click on it and copy the value out from the &#8220;Object&#8221;.</p>
<p>This is an example from my AD</p><pre class="crayon-plain-tag">CN=Rolle IT,OU=Groups,OU=Oslo,OU=Comperiosearch,DC=comperiosearch,DC=com</pre><p>This value represents the group we want to map to a role in Elasticsearch.</p>
<p>Open the file /etc/elasticsearch/shield/role-mapping.yml. It should look similar to this</p><pre class="crayon-plain-tag"># Role mapping configuration file which has elasticsearch roles as keys
# that map to one or more user or group distinguished names

#roleA:   this is an elasticsearch role
#  - groupA-DN  this is a group distinguished name
#  - groupB-DN
#  - user1-DN   this is the full user distinguished name
power_user:
  - "CN=Rolle IT,OU=Groups,OU=Oslo,OU=Comperiosearch,DC=comperiosearch,DC=com"
#user:
# - "cn=admins,dc=example,dc=com" 
# - "cn=John Doe,cn=other users,dc=example,dc=com"</pre><p>I have uncommented the line with &#8220;power_user:&#8221; and added a line below containing the distinguishedName from above.</p>
<p>After restarting Elasticsearch, anyone in the &#8220;Rolle IT&#8221; group should be able to log in (and, for now, nobody else).</p>
<p>To test it out, open <a href="http://localhost:9200">http://localhost:9200</a> in your browser. You should be presented with a login box where you can type in your username/password. In case of failure, check out the elasticsearch logs (at /var/log/elasticsearch/vagrant-es.log).</p>
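<p>If you prefer the command line, the same check can be done with curl and basic auth (myuser is a placeholder &#8211; replace it with an account from the AD group you mapped):</p><pre class="crayon-plain-tag">curl -u myuser http://localhost:9200/</pre><p>curl will prompt for the password; a JSON response with cluster information means authentication succeeded, while a 401 means it did not.</p>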
<p>If you were able to log in, that means Active Directory authentication works. Congratulations! You deserve a refreshment. Some strong coffee will go down well with the next sections, where we add encrypted communication everywhere we can.</p>
<h3>SSL  - Elasticsearch</h3>
<p>Authentication and encrypted communication go hand in hand. Without SSL, the username and password are transferred in plaintext on the wire. For this demo we will use self-signed certificates. Keytool comes with Java and is used to handle certificates for Elasticsearch. The following command will generate a self-signed certificate and put it in a JKS file named self-signed.jks (swap out $password with your preferred password):</p><pre class="crayon-plain-tag">keytool -genkey -keyalg RSA -alias selfsigned -keystore self-signed.jks -keypass $password -storepass $password -validity 360 -keysize 2048 -dname "CN=localhost, OU=orgUnit, O=org, L=city, S=state, C=NO"</pre><p>Copy the keystore into /etc/elasticsearch/</p>
<p>Modify  /etc/elasticsearch/elasticsearch.yml by adding the following lines:</p><pre class="crayon-plain-tag">shield.ssl.keystore.path: /etc/elasticsearch/self-signed.jks
shield.ssl.keystore.password: $password
shield.ssl.hostname_verification: false
shield.transport.ssl: true
shield.http.ssl: true</pre><p>(use the same password as you used when creating the self-signed certificate)</p>
<p>Restart Elasticsearch again, and watch the logs for failures.</p>
<p>Try to open https://localhost:9200 in your browser (NB: httpS not http)</p>
<div id="attachment_3905" style="width: 310px" class="wp-caption alignright"><img class="wp-image-3905 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2015/08/your-connection-is-not-private-e1440146932126-300x181.png" alt="your connection is not private" width="300" height="181" /><p class="wp-caption-text">https://localhost:9200</p></div>
<p>You should see a screen warning you that something is wrong with the connection. This is a good sign! It means your certificate is actually working! For production use you could use your own CA or buy a proper certificate, either of which will avoid the ugly warning screen.</p>
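<p>You can also verify the HTTPS endpoint from the command line. The -k flag tells curl to accept our self-signed certificate (myuser is again a placeholder for your own account):</p><pre class="crayon-plain-tag">curl -k -u myuser https://localhost:9200/_cluster/health</pre>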
<h4>SSL &#8211; Active directory</h4>
<p>Our current method of connecting to Active Directory is unencrypted &#8211; we need to enable SSL for the AD connections.</p>
<p>1. Fetch the certificate from your Active Directory server (replace ldap.example.com with the LOGONSERVER from above; 636 is the default LDAPS port)</p><pre class="crayon-plain-tag">echo | openssl s_client -connect ldap.example.com:636 2&gt;/dev/null | openssl x509 &gt; ldap.crt</pre><p>2. Import the certificate into your keystore (located at /etc/elasticsearch/)</p><pre class="crayon-plain-tag">keytool -importcert -keystore self-signed.jks -file ldap.crt -alias ldap</pre><p>&nbsp;</p>
<p>3. Modify AD url in elasticsearch.yml<br />
change the line</p><pre class="crayon-plain-tag">url: ldap://ad.superdomain.com</pre><p>to</p><pre class="crayon-plain-tag">url: ldaps://ad.superdomain.com</pre><p>Restart Elasticsearch and check the logs for failures.</p>
<h4>Kibana authentication with esusers</h4>
<p>With Elasticsearch locked down by Shield, no other services can search or post data either &#8211; including Kibana and Logstash.</p>
<p>Active Directory is great, but I&#8217;m not sure I want to use it for letting the Kibana server talk to Elasticsearch. Instead we can use Shield&#8217;s built-in user management system, esusers. Elasticsearch comes with a set of predefined roles, including roles for Logstash, the Kibana4 server and Kibana4 users (see /etc/elasticsearch/shield/roles.yml on the vagrant-elk box if you&#8217;re still on that one).</p>
<p>Add a new kibana4_server user, granting it the role kibana4_server, using this command:</p><pre class="crayon-plain-tag">cd /usr/share/elasticsearch/bin/shield  
./esusers useradd kibana4_server -p secret -r kibana4_server</pre><p></p>
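<p>To verify the user was created, you can list the users known to the esusers realm:</p><pre class="crayon-plain-tag">./esusers list</pre><p>The new kibana4_server user should show up together with its role.</p>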
<h4>Adding esusers realm</h4>
<p>The esusers realm is the default one, and does not need to be configured if it is the only realm you use. Since we added the Active Directory realm, however, we must now list both realms in the elasticsearch.yml file from above.</p>
<p>It should end up looking like this</p><pre class="crayon-plain-tag">shield.authc.realms:
  esusers:
    type: esusers
    order: 0
  active_directory:
    order: 1
    type: active_directory
    domain_name: superdomain.com
    unmapped_groups_as_roles: true
    url: ldap://ad.superdomain.com</pre><p>The order parameter defines the order in which Elasticsearch tries the various authentication realms.</p>
<h4>Allowing Kibana to access Elasticsearch</h4>
<p>Kibana must be informed of the new user we just created. You will find the Kibana configuration file at /opt/kibana/config/kibana.yml.</p>
<p>Add the username and password you just created. You also need to change the Elasticsearch address to use https.</p><pre class="crayon-plain-tag"># The Elasticsearch instance to use for all your queries.
elasticsearch_url: "https://localhost:9200"

# If your Elasticsearch is protected with basic auth, these are the user credentials
# used by the Kibana server to perform maintenance on the kibana_index at startup. Your Kibana
# users will still need to authenticate with Elasticsearch (which is proxied through
# the Kibana server)
kibana_elasticsearch_username: kibana4_server
kibana_elasticsearch_password: secret</pre><p>Restart Kibana and Elasticsearch, and watch the logs for any errors. Try opening Kibana at http://localhost:5601 and type in your username and password. Provided you&#8217;re in the group you granted access to earlier, you should be able to log in.</p>
<h4>Creating SSL for Kibana</h4>
<p>Once you have enabled authorization for Elasticsearch, you really need to set up SSL certificates for Kibana as well. This is also configured in kibana.yml:</p><pre class="crayon-plain-tag">verify_ssl: false
# SSL for outgoing requests from the Kibana Server (PEM formatted)
ssl_key_file: "kibana_ssl_key_file"
ssl_cert_file: "kibana_ssl_cert_file"</pre><p>You can create a self-signed key and cert file for kibana using the following command:</p><pre class="crayon-plain-tag">openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes</pre><p>&nbsp;</p>
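<p>The key.pem and cert.pem files generated above then go into the ssl_key_file and ssl_cert_file settings &#8211; for example (the paths here are just an illustration, put the files wherever suits you):</p><pre class="crayon-plain-tag">ssl_key_file: /opt/kibana/config/key.pem
ssl_cert_file: /opt/kibana/config/cert.pem</pre>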
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/08/kibana-auth.png"><img class="alignright size-medium wp-image-3920" src="http://blog.comperiosearch.com/wp-content/uploads/2015/08/kibana-auth-300x200.png" alt="kibana auth" width="300" height="200" /></a></p>
<h4>Configuring AD groups for Kibana access</h4>
<p>Unfortunately, this part of the post is going to be very sketchy, as we are desperately running out of time. This blog post is much too long already.</p>
<p>Elasticsearch already comes with a list of predefined roles, among which you can find the kibana4 role. The kibana4 role allows read/write access to the .kibana index, in addition to search and read access to all indexes. We want to limit access to just one index for each AD group: the fishery group shall only access the fishery index, and the finance group shall only access the finance index. We can create roles that limit access to one index by copying the kibana4 role, giving it an appropriate name and changing the index: &#8216;*&#8217; section to map to only the preferred index.</p>
<p>The final step involves mapping the Elasticsearch role into an AD role. This is done in the role_mapping.yml file, as mentioned above.</p>
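<p>As a rough, untested sketch of what this could look like (the names and index patterns are invented for illustration &#8211; check the Shield documentation for the exact privilege syntax in your version), a fishery role in /etc/elasticsearch/shield/roles.yml might be:</p><pre class="crayon-plain-tag">fishery_user:
  indices:
    '.kibana': all
    'fishery*': read</pre><p>with a matching entry in role_mapping.yml pointing at the AD group:</p><pre class="crayon-plain-tag">fishery_user:
  - "CN=Fishery,OU=Groups,DC=superdomain,DC=com"</pre>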
<p>Only joking of course, that wasn&#8217;t the last step. The last step is restarting Elasticsearch, and checking the logs for failures as you try to log in.</p>
<p>&nbsp;</p>
<h3>Securing Elasticsearch</h3>
<p>Shield brings enterprise authentication to Elasticsearch. You can easily manage access to various parts of Elasticsearch management and data by using Active Directory groups.</p>
<p>This has been a short dive into the possibilities &#8211; make sure to contact Comperio if you need help creating a solution with Elasticsearch and Shield.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/08/21/elasticsearch-security-shield/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Impressions from Berlin Buzzwords 2015</title>
		<link>http://blog.comperiosearch.com/blog/2015/06/08/impressions-from-berlin-buzzwords-2015/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/06/08/impressions-from-berlin-buzzwords-2015/#comments</comments>
		<pubDate>Mon, 08 Jun 2015 13:34:53 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Apache Flink]]></category>
		<category><![CDATA[bbuzz]]></category>
		<category><![CDATA[berlin buzzwords]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Kafka]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3720</guid>
		<description><![CDATA[May 31 &#8211; June 3 2015 Stream processing, Internet of things, Real time analytics, Big data, Recommendations, Machine learning. Berlin Buzzwords undoubtedly lives up to its name by presenting the frontlines of data technology trends. The conference is focused on three core concepts &#8211; search, data and scale, bringing together a diverse range of people [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/andre-bbuzz-beyond-significant-terms.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/andre-bbuzz-beyond-significant-terms-300x194.png" alt="andre-bbuzz-beyond-significant-terms" width="300" height="194" class="alignright size-medium wp-image-3741" /></a>May 31 &#8211; June 3 2015</p>
<p>Stream processing, Internet of things, Real time analytics, Big data, Recommendations, Machine learning. <a href="http://berlinbuzzwords.de/">Berlin Buzzwords</a> undoubtedly lives up to its name by presenting the frontlines of data technology trends.<br />
<span id="more-3720"></span><br />
The conference is focused on three core concepts &#8211; search, data and scale &#8211; bringing together a diverse range of people, with presentations touching the perimeter of the buzzword range.<br />
Berlin Buzzwords kicked off on Sunday evening with a Barcamp; Monday and Tuesday were full conference days, while Wednesday was filled with hackathons and workshops.</p>
<h3>Comperio</h3>
<p>Comperio was one of the many companies sponsoring the conference, and came to Berlin with two speakers. André Lynum talked about “Beyond Significant Terms” &#8211; a deep dive into how to utilize Elasticsearch’s built-in indexes and APIs for improved lexical analysis, topic management and trend information. André’s talk went far beyond what the well-known Elasticsearch significant terms aggregation provides. Christoffer Vig captured a spot on the informal Open Stage, giving a funny and off-kilter presentation and demo of the analytics and visualization capabilities of Kibana 4, based on a beer product catalogue.</p>
<h3>The talks</h3>
<p>Many people attended the comparison of Solr and Elasticsearch performance &#038; scalability with Radu Gheorghe &#038; Rafał Kuć from Sematext. This was a fast-paced run-through of how they were able to create tests reproducing the same conditions on both search engines. Elasticsearch outperformed Solr on text search using Wikipedia data, while, surprisingly, Solr outperformed Elasticsearch on aggregations. Solr has recently started catching up with Elasticsearch on nested aggregations, and perhaps the improved performance comes as a result of a slimmed-down implementation? It will be very interesting to follow the development of both platforms into the future, and as consumers of the products we see that competition is a good thing, driving innovation and performance.</p>
<p>Two other interesting technical talks were Adrien Grand&#8217;s explanation of some of the algorithms behind Elasticsearch&#8217;s aggregations and Ted Dunning&#8217;s presentation of the t-digest algorithm. Both offered a window into how approximations can yield fast algorithms for complex statistics with provable bounds, which the speakers managed to keep approachable to the casual listener.</p>
<h3>SQL?</h3>
<p>Another theme threatening to return from the basement was how to properly support SQL-style joins in search engines. Real-life use cases sometimes demand objects with relations. The stock answer from the NoSQL world is to denormalize your data before inserting it, but Lucene/Elasticsearch/Solr did get limited join support a while ago. Taking this further, Mikhail Khludnev showed how the new Global Ordinal Join aims to provide a join with improved performance.</p>
<h3>Talking the talk</h3>
<p>As search consultants, one of our main challenges at Comperio is communicating about technical topics with customers who need to connect those topics to their own competence and background. Ellen Friedman from MapR explained how such communication can be beneficial to almost any team or team member, and shared some experiences and ideas on how you can try this at home. At its core it boils down to understanding and describing your technical work across several layers, and showing respect for the perspective and background of your conversation partner.<br />
She also shared a very funny parrot joke. Not going to reveal that one here &#8211; watch the video if you’d like a good laugh.</p>
<h3>Hackathon</h3>
<p>Comperio also attended the Apache Flink workshop hosted at Google’s offices in Berlin by the talented developers at data Artisans. Apache Flink is in some ways similar to Apache Spark and other recent distributed computing frameworks, and is an alternative to Hadoop&#8217;s MapReduce component. It represents a novel approach to data processing, modelling all data as streams and exposing both batch and stream APIs. Apache Flink has a built-in optimizer that manages memory, network traffic and processing power, leaving the developer free to implement core functionality in Java, Scala or Python.</p>
<h3>The buzz</h3>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/berlinbuzzwordsLogo.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/berlinbuzzwordsLogo-300x176.png" alt="berlinbuzzwordsLogo" width="300" height="176" class="alignright size-small wp-image-3726" /></a><br />
Berlin Buzzwords is a great opportunity to surf the crest of the big data wave with the most interesting people in the field. The city of Berlin, with its sense of being on the edge of new developments, provides the perfect backdrop for a conference on the latest “Buzzwords”. Comperio will certainly be back next year.</p>
<p>Videos from most talks are available at <a href="https://www.youtube.com/playlist?list=PLq-odUc2x7i-_qWWixXHZ6w-MxyLxEC7s">youtube.com</a></p>
<p><b>Beyond significant terms</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/yYFFlyHPGlg?feature=oembed" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></p>
<p><b>Algorithms and data-structures that power Lucene and Elasticsearch</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/eQ-rXP-D80U?feature=oembed" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></p>
<p><b>Practical t-digest Applications</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/CR4-aVvjE6A?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p><b>Talk the Talk: How to Communicate with the Non-Coder</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/Je-X850t_L8?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p><b>Side by Side with Elasticsearch &#038; Solr part 2</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/01mXpZ0F-_o?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/06/08/impressions-from-berlin-buzzwords-2015/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analyzing web server logs with Elasticsearch in the cloud</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/#comments</comments>
		<pubDate>Tue, 26 May 2015 21:12:34 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[found by elastic]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3702</guid>
		<description><![CDATA[Using Logstash and Kibana on Found by Elastic, Part 1 This is part one of a two post blog series, aiming to demonstrate how to feed logs from IIS into Elasticsearch and Kibana via Logstash, using the hosted services provided by Found by Elastic. This post will deal with setting up the basic functionality and [...]]]></description>
				<content:encoded><![CDATA[<h2>Using Logstash and Kibana on Found by Elastic, Part 1</h2>
<p>This is part one of a two-post blog series, aiming to demonstrate how to feed logs from IIS into Elasticsearch and Kibana via Logstash, using the hosted services provided by Found by Elastic. This post will deal with setting up the basic functionality and securing connections. Part 2 will show how to configure Logstash to read from IIS log files, and how to use Kibana 4 to visualize web traffic. Originally published on the <a href="https://www.found.no/foundation/analyzing-weblogs-with-elasticsearch/">Elastic Blog</a>.<br />
<span id="more-3702"></span></p>
<h4>Getting the Bits</h4>
<p>For this demo I will be running Logstash and Kibana from my Windows laptop.<br />
If you want to follow along, download and extract Logstash 1.5.RC4 or later, and Kibana 4.0.2 or later from <a href="https://www.elastic.co/downloads">https://www.elastic.co/downloads</a>.</p>
<h4>Creating an Elasticsearch Cluster</h4>
<p>Creating a new trial cluster in Found is just a matter of logging in and pressing a button. It takes a few seconds until the cluster is ready, and a screen with some basic information on how to connect pops up. We need the address for the HTTPS endpoint, so copy that out.</p>
<h4>Configuring Logstash</h4>
<p>Now, with the brand new SSL connection option in Logstash, connecting to Found is as simple as this Logstash configuration:</p><pre class="crayon-plain-tag">input { stdin{} }

output {
  elasticsearch {
    protocol =&gt; "http"
    host =&gt; "REPLACE_WITH_FOUND_CLUSTER_HOSTNAME"
    port =&gt; "9243" # check that this matches your cluster endpoint
    ssl =&gt; true
  }

  stdout { codec =&gt; rubydebug }
}</pre><p>&nbsp;</p>
<p>Save the file as found.conf</p>
<p>Start up Logstash using</p><pre class="crayon-plain-tag">bin\logstash.bat agent --verbose -f found.conf</pre><p>You should see a message similar to</p><pre class="crayon-plain-tag">Create client to elasticsearch server on `https://....foundcluster.com:9243`: {:level=&gt;:info}</pre><p>Once you see &#8220;Logstash startup completed&#8221;, type in your favorite test term on the terminal. Mine is &#8220;fisk&#8221;, so I type that.<br />
You should see output on your screen showing what Logstash intends to pass on to elasticsearch.</p>
<p>We want to make sure this actually hits the cloud, so open a browser window and paste the HTTPS link from before, append <code>/_search</code> to the URL and hit enter.<br />
You should now see the search results from your newly created Elasticsearch cluster, containing the favorite term you just typed in. We have a functioning connection from Logstash on our machine to Elasticsearch in the cloud! Congratulations!</p>
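<p>The same check can be done from a terminal. A sketch (the cluster address is the elided one from before; this only works while the cluster is still wide open, before access control is enabled):</p><pre class="crayon-plain-tag">curl "https://....foundcluster.com:9243/_search?q=fisk"</pre>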
<h4>Configuring Kibana 4</h4>
<p>Kibana 4 comes with a built-in webserver. The configuration is done in a kibana.yml file in the config directory. Connecting to Elasticsearch in the cloud comes down to inserting the address of the Elasticsearch instance.</p><pre class="crayon-plain-tag"># The Elasticsearch instance to use for all your queries.
elasticsearch_url: "https://....foundcluster.com:9243"</pre><p>Of course, we need to verify that this really works, so we open up Kibana at <a href="http://localhost:5601">http://localhost:5601</a>, select the Logstash index template with the @timestamp field as suggested, and open the Discover panel. Now, if less than 15 minutes have passed since you inserted your favorite test term into Logstash (previous step), you should see it already. Otherwise, change the date range by clicking the selector in the top right corner.</p>
<p><img class="alignleft" src="https://raw.githubusercontent.com/babadofar/MyOwnRepo/master/images/kibanatest.png" alt="Kibana test" width="1090"  /></p>
<h4>Locking it down</h4>
<p>Found by Elastic has worked hard to make the previous steps easy. We created an Elasticsearch cluster, fed data into it and displayed it in Kibana in less than 5 minutes. We must have forgotten something!? And yes, of course: security. We made sure to use secure connections with SSL, and the address generated for our cluster contains a randomly generated string of 32 characters, which is pretty hard to guess. Should the address slip out of our hands, however, hackers could easily delete our entire cluster, and we don’t want that to happen. So let’s see how we can make everything work when we add some basic security measures.</p>
<h4>Access Control Lists</h4>
<p>Found by Elastic has support for access control lists, where you can set up lists of usernames and passwords, with rules that deny or allow access to various paths within Elasticsearch. This makes it easy to create a &#8220;read only&#8221; user, for instance, by creating a user with a rule that only allows access to the <code>/_search</code> path. Found by Elastic provides a sample configuration with the users searchonly and readwrite. We will use these as a starting point, but first we need to figure out what Kibana needs.</p>
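<p>As an illustration, a read-only user along those lines could look something like this in the access control configuration (a hedged sketch modelled on the sample; the exact syntax may differ in your Found account):</p><pre class="crayon-plain-tag"># Allow the searchonly user to query, and nothing else
- paths: ['.*/_search']
  conditions:
    - basic_auth:
        users:
          - searchonly
    - ssl:
        require: true
  action: allow</pre>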
<h4>Kibana 4 Security</h4>
<p>Kibana 4 stores its configuration in a special index, by default named &#8220;.kibana&#8221;. The Kibana webserver needs write access to this index. In addition, all Kibana users need write access to this index, for storing dashboards, visualizations and searches, and read access to all the indices that it will query. More details about the access demands of Kibana 4 can be found in the <a href="http://www.elastic.co/guide/en/shield/current/_shield_with_kibana_4.html">Shield documentation</a>.</p>
<p>For this demo, we will simply copy the “readwrite” user from the sample twice, naming one kibanaserver and the other kibanauser.</p><pre class="crayon-plain-tag"># Setting access control in Found:
# Allow everything for the readwrite, kibanauser and kibanaserver users
- paths: ['.*']
  conditions:
    - basic_auth:
        users:
          - readwrite
          - kibanauser
          - kibanaserver
    - ssl:
        require: true
  action: allow</pre><p>Press save and the changes take effect immediately. Try to reload Kibana at <a href="http://localhost:5601">http://localhost:5601</a>; you should be denied access.</p>
<p>Open up the kibana.yml file from before and modify it:</p><pre class="crayon-plain-tag"># If your Elasticsearch is protected with basic auth, these are the user credentials
# used by the Kibana server to perform maintenance on the kibana_index at startup. Your Kibana
# users will still need to authenticate with Elasticsearch (which is proxied through
# the Kibana server)
kibana_elasticsearch_username: kibanaserver
kibana_elasticsearch_password: "KIBANASERVER_USER_PASSWORD"</pre><p>Stop and start Kibana for the settings to take effect.<br />
Now when Kibana starts up, you will be presented with a login box for HTTP authentication.<br />
Type in kibanauser as the username, along with its password. You should now again be presented with the Discover screen, showing the previously entered favorite test term. Again, you may have to expand the time range to see your entry.</p>
<h4>Logstash Security</h4>
<p>Logstash will also need to supply credentials when connecting to Found by Elastic. We reuse the permissions of the readwrite user once again, this time under the name &#8220;logstash&#8221;.<br />
It is simply a matter of supplying the username and password in the configuration file.</p><pre class="crayon-plain-tag">output {
  elasticsearch {
    ….
    user =&gt; "logstash"
    password =&gt; "LOGSTASH_USER_PASSWORD"
  }
}</pre><p></p>
<h4>Wrapping it up</h4>
<p>This has been a short dive into Logstash and Kibana with Found by Elastic. The recent changes made to support the Shield plugin for Elasticsearch, Logstash and Kibana make it very easy to use the security features of Found by Elastic. In the next post we will look into feeding logs from IIS into Elasticsearch via Logstash, and visualizing the most used query terms in Kibana.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New version of Comperio FRONT.NET</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/13/ny-versjon-av-comperio-front-net/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/13/ny-versjon-av-comperio-front-net/#comments</comments>
		<pubDate>Wed, 13 May 2015 10:24:54 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Comperio Front]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3661</guid>
		<description><![CDATA[Over the years, Comperio has delivered more than 100 search projects. The ideas, sweat and experience gathered from this work have crystallized into our in-house software for search applications: FRONT. Earlier this spring we launched version 5 of the Java edition of FRONT; this time it is its somewhat younger cousin, Comperio FRONT.NET, that has been on [...]]]></description>
				<content:encoded><![CDATA[<p>Over the years, Comperio has delivered more than 100 search projects. The ideas, sweat and experience gathered from this work have crystallized into our in-house software for search applications: FRONT. Earlier this spring we launched version 5 of the Java edition of FRONT; this time it is its somewhat younger cousin, Comperio FRONT.NET, that has been on the operating table. The main features of the new version are new search adapters, improved stability and performance, and improved logging.<br />
<span id="more-3661"></span></p>
<h4>Middleware for search</h4>
<p>FRONT.NET operates as middleware, letting you configure business logic for search independently of both the search engine and the presentation layer. FRONT.NET is built to fetch and combine information from different sources, and might well be called a search orchestrator.</p>
<p>FRONT.NET lets you separate business logic from application logic. Applications that need search functionality do not have to deal with complicated search expressions; they simply send query terms over to FRONT.NET. If you need to narrow the search, you can pass along filters, such as user information, location, department and the like. FRONT takes care of the complex queries.</p>
<h4>Search engine independence</h4>
<p>FRONT.NET offers a generic format for queries and search results. The data format from FRONT is the same regardless of whether the engine behind it is SharePoint, ESP or Solr. FRONT.NET currently has adapters for Fast ESP, SharePoint 2010 and 2013, Elasticsearch, Solr and Google Search Appliance. This makes it easy to combine results from different search engines. Should you want to replace your search engine, this need not entail any changes to your application, since it is just a matter of switching the search adapter in FRONT.NET. New adapters are developed as soon as we see the need arise.</p>
<h4>Elasticsearch adapter</h4>
<p>Elasticsearch is a search engine in rapid growth. In developing the Elasticsearch adapter we have been able to draw on NEST, the official .NET client for Elasticsearch. Elasticsearch offers enormous flexibility in how queries can be expressed, with support for nested boolean expressions and dynamic ranking functions. When developing the adapter, we chose to minimize the complexity in FRONT by delegating these capabilities to Elasticsearch via search templates. This preserves the flexibility while keeping the APIs and programming interfaces unchanged.</p>
<h4>Google Search Appliance Adapter</h4>
<p>Comperio became a Google partner last year, and we have now developed a FRONT.NET adapter for Google’s intranet search engine, the Google Search Appliance, or GSA for short. The GSA offers easy integration with a wide range of sources, its search interface is straightforward to work with, and the adapter supports all common search operations.</p>
<h4>Logging</h4>
<p>To build a good search solution, it is crucial to have access to good search logs that reveal how the search application is used.<br />
FRONT.NET has recently gained the ability to log directly to Logstash. Logstash combined with Elasticsearch and Kibana gives you a powerful tool for data analysis.</p>
<h4>FRONTD</h4>
<p>Version 5 of FRONT.NET runs as a standalone service on Windows.<br />
Earlier versions ran as a web application under IIS (Internet Information Services), but we find that running standalone gives simpler administration as well as improved stability and performance.</p>
<h4>Microsoft, .NET and the road ahead</h4>
<p>Microsoft and the .NET world are evolving rapidly these days, not least through Microsoft’s new and warmly welcomed opening toward open source. We very much like the idea of a cross-platform .NET, and the next version of FRONT.NET will hopefully run just as well on OS X and Linux as on Microsoft platforms.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/13/ny-versjon-av-comperio-front-net/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>3 steps to Big Data</title>
		<link>http://blog.comperiosearch.com/blog/2015/04/28/3-steg-til-big-data/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/04/28/3-steg-til-big-data/#comments</comments>
		<pubDate>Tue, 28 Apr 2015 13:00:09 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[log]]></category>
		<category><![CDATA[søk]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3609</guid>
		<description><![CDATA[Big data is the third hottest buzzword of our time, but not everyone knows what it is, where to find it, or what to do with it. Big Data is growing up right under the feet of most of us. The digital universe doubles every other year. The internet, mobile and not least the Internet of [...]]]></description>
				<content:encoded><![CDATA[<p><strong>Big data</strong> is <a href="http://www.languagemonitor.com/words-of-the-year-woty/the-top-business-buzzwords-of-global-english-for-2014">the third hottest buzzword</a> of our time, but not everyone knows what it is, where to find it, or what to do with it. Big Data is growing up right under the feet of most of us. The digital universe doubles every other year. The internet, mobile and not least the Internet of Things generate ever more information.</p>
<p>To succeed in business today, you depend on knowing how your users move and being able to adapt your solution accordingly. You can choose to trust the higher powers, like Snåsamannen or Märtha, or you can take the power into your own hands and harvest the insight buried in your organization’s and users’ logs.</p>
<h3><strong>3 steps</strong></h3>
<p>We assume that you have a website and can get hold of its logs. In addition, you need a computer and someone data-savvy, preferably with developer skills.</p>
<p><strong>Here is how to get started:</strong></p>
<ol>
<li><strong>Identify 3 measurable KPIs</strong>.<br />
Suggestions: page views per day, most used query terms, response time per page</li>
<li><strong>Feed the logs into ELK</strong>.<br />
Find the log data and a developer. The developer will figure this out easily.</li>
<li><strong>Visualize the KPIs</strong>.<br />
Hold on to the developer while, together, you look at the data in Kibana and find a suitable graphical presentation.<br/></li>
</ol>
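<p>Step 2 can be sketched as a minimal Logstash configuration. This assumes Apache-style access logs; the file path is hypothetical, and the grok pattern must match your actual log format:</p><pre class="crayon-plain-tag">input {
  file {
    path =&gt; "/var/log/apache2/access.log"
    start_position =&gt; "beginning"
  }
}

filter {
  # Parse standard combined-format access log lines into fields
  grok {
    match =&gt; { "message" =&gt; "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch { host =&gt; "localhost" }
}</pre>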
<div id="attachment_3606" style="width: 310px" class="wp-caption alignnone"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Comperio_bigdata.png"><img class="wp-image-3606 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Comperio_bigdata-300x203.png" alt="Comperio_bigdata" width="300" height="203" /></a><p class="wp-caption-text">Example of a Kibana dashboard</p></div>
<p><strong>KPI</strong></p>
<p>The suggested KPIs are standard metrics for websites. These are numbers any web analytics tool, such as Google Analytics, can already give you today. The difference is that now you are the one putting the graphs together and building the tools; the data belongs to you, and how you choose to combine the information to create insight is entirely up to you. Again: the purpose here is to demonstrate a technique and show off a tool, not to tell you which KPIs you should care about.</p>
<p><strong>ELK</strong></p>
<p><a href="https://www.elastic.co/"><strong>ELK</strong></a>, or the so-called &#8220;ELK stack&#8221;, offers a complete Big Data storage, search and analytics toolset. ELK stands for Elasticsearch, Logstash and Kibana, a collection of open source products developed by the technology company Elastic. The search engine Elasticsearch is the core of the stack, with a focus on developer friendliness and scalability. Logstash feeds data into Elasticsearch, while Kibana offers ad hoc data analysis and beautiful visualizations and graphs.</p>
<p>Netflix, GitHub and Microsoft are examples of giant enterprises that use Elasticsearch at the core of their business.</p>
<p>The platform’s popularity stems from it being easy to get started with, while delivering unmatched search and analytics capabilities. The ELK stack is often mentioned in the same breath as Big Data, since it handles large volumes of data.</p>
<p>&nbsp;</p>
<h3><strong>A start</strong></h3>
<p>The logs of your website probably do not quite qualify as Big Data. The point is that with the toolbox we introduce here, you stand equipped for bigger tasks.</p>
<p>You can get started taking control of your company’s data logs without it requiring major resources. The plan can be made along the way, and simple access to the raw data alone can create new insight as well as new questions and needs.</p>
<p>The same toolbox scales up to search and analysis of large data volumes, such as transaction logs, network traffic, firewall logs, and large-scale internet activity like Twitter, IRC, websites and so on.</p>
<p>The Norwegian search technology company <a href="http://www.comperio.no">Comperio</a> is an Elastic partner, and has many developers who can help you through these three steps. Comperio has worked with search since 2004 and is one of the world’s leading companies in search technology.</p>
<p><strong>Don’t let the Big Data ship sail on its own; take your place at the helm and set course for your own Big Data horizon now!</strong></p>
<p>&nbsp;</p>
<p><em>Read about Comperio’s breakfast seminar <a href="https://www.eventbrite.com/e/comperio-frokost-sk-og-jakten-pa-den-gode-vinen-tickets-16052734160">on how to understand your customers better</a>.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/04/28/3-steg-til-big-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to develop Logstash configuration files</title>
		<link>http://blog.comperiosearch.com/blog/2015/04/10/how-to-develop-logstash-configuration-files/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/04/10/how-to-develop-logstash-configuration-files/#comments</comments>
		<pubDate>Fri, 10 Apr 2015 12:06:17 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[elastic]]></category>
		<category><![CDATA[logs]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3471</guid>
		<description><![CDATA[Installing logstash is easy. Problems arrive only once you have to configure it. This post will reveal some of the tricks the ELK team at Comperio has found helpful. Write configuration on the command line using the -e flag If you want to test simple filter configurations, you can enter it straight on the command [...]]]></description>
				<content:encoded><![CDATA[<p>Installing logstash is easy. Problems arrive only once you have to configure it. This post will reveal some of the tricks the ELK team at Comperio has found helpful.</p>
<h4><span id="more-3471"></span>Write configuration on the command line using the -e flag</h4>
<p>If you want to test simple filter configurations, you can enter them straight on the command line using the -e flag.</p><pre class="crayon-plain-tag">bin\logstash.bat  agent  -e 'filter{mutate{add_field =&gt; {"fish" =&gt; "salmon"}}}'</pre><p>After starting logstash with the -e flag, simply type your test input into the console. (The defaults for input and output are stdin and stdout, so you don’t have to specify them.)</p>
<h4>Test syntax with --configtest</h4>
<p>After modifying the configuration, you can make logstash check the syntax of the file by using the --configtest (or -t) flag on the command line.</p>
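<p>For example (the configuration file name here is hypothetical):</p><pre class="crayon-plain-tag">bin\logstash.bat agent --configtest -f myfilters.conf</pre>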
<h4>Use stdin and stdout in the config file</h4>
<p>If your filter configurations are more involved, you can use input stdin and output stdout. If you need to pass a json object into logstash, you can specify codec json on the input.</p><pre class="crayon-plain-tag">input { stdin { codec =&gt; json } }

filter {
    if ![clicked] {
        mutate  {
            add_field =&gt; ["clicked", false]
        }
    }
}

output { stdout { codec =&gt; json }}</pre><p></p>
<h4> Use output stdout with codec =&gt; rubydebug<img class="alignright size-medium wp-image-3472" src="http://blog.comperiosearch.com/wp-content/uploads/2015/04/rubydebyg-300x106.png" alt="rubydebyg" width="300" height="106" /></h4>
<p>Using codec rubydebug prints each event as a nicely formatted object on the console.</p>
<h4>Use the --verbose or --debug command line flags</h4>
<p>If you want to see more details about what logstash is really doing, start it up using the --verbose or --debug flags. Be aware that this slows down processing speed greatly!</p>
<h4>Send logstash output to a log file</h4>
<p>Using the -l &#8220;logfile.log&#8221; command line flag makes logstash store its output in a file. Just watch your disk space; in particular, in combination with the --verbose or --debug flags, these files can be humongous.</p>
<h4>When using file input: delete .sincedb files in your $HOME directory</h4>
<p>The file input plugin stores information about how far logstash has come in processing the files, in .sincedb files in the user&#8217;s $HOME directory. If you want to re-process your logs, you have to delete these files.</p>
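<p>Alternatively, the file input lets you override where this bookkeeping is stored via the sincedb_path setting. A sketch (the log path is made up; pointing sincedb_path at the null device makes Logstash forget its position between runs):</p><pre class="crayon-plain-tag">input {
  file {
    path =&gt; "C:/logs/mysite/*.log"
    start_position =&gt; "beginning"
    sincedb_path =&gt; "NUL"   # use "/dev/null" on Linux
  }
}</pre>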
<h4>Use the generator input</h4>
<p>You can add text lines you want to run through the filter and output stages directly in the config file by using the generator input plugin.</p><pre class="crayon-plain-tag">input {
  generator{
    lines =&gt; [
      '{"@message":"fisk"}',
      '{"@message": {"fisk":true}}',
      '{"notMessage": {"fisk":true}}',
      '{"@message": {"clicked":true}}'
      ]
    codec =&gt; "json"
    count =&gt; 5
  }
}</pre><p></p>
<h4>Use mutate add_tag after each successful stage</h4>
<p>If you are developing configuration on a live system, adding tags after each stage makes it easy to search for the log events in Kibana/Elasticsearch.</p><pre class="crayon-plain-tag">filter {
  mutate {
    add_tag =&gt; "before conditional"
  }
  if [@message][clicked] {
    mutate {
      add_tag =&gt; "already had it clicked here"
    }
  } else {
      mutate {
        add_field  =&gt; [ "[@message][clicked]", false]
    }
  }
  mutate {
    add_tag =&gt; "after conditional"
  }
}</pre><p></p>
<h4>Developing grok filters with the grok debugger app</h4>
<p>The grok filter comes with a range of prebuilt patterns, but you will find the need to develop your own pretty soon. That&#8217;s when you open your browser at <a title="https://grokdebug.herokuapp.com/" href="https://grokdebug.herokuapp.com/">https://grokdebug.herokuapp.com/</a>. Paste in a representative line from your log, and you can start testing out matching patterns. There is also a discover mode that will try to figure out some fields for you.</p>
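<p>As an illustration of what you might end up with, here is a hedged sketch for a simple timestamped log line (the line and field names are made up; the patterns used are all part of the prebuilt set):</p><pre class="crayon-plain-tag"># Log line: 2015-04-10 12:06:17 INFO fisk
filter {
  grok {
    match =&gt; { "message" =&gt; "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{WORD:term}" }
  }
}</pre>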
<p>The grok constructor, <a title="http://grokconstructor.appspot.com/do/construction" href="http://grokconstructor.appspot.com/do/construction">http://grokconstructor.appspot.com/do/construction</a>, offers an incremental mode, which I have found quite helpful to work with. You can paste in a selection of log lines, and it will offer a range of possibilities to choose from, trying to match one field at a time.</p>
<h4>SISO</h4>
<p>If possible, pre-format logs so Logstash has less work to do. If you have the option to output logs as valid json, you don&#8217;t need grok filters since all the fields are already there.</p>
<p>&nbsp;</p>
<p>This has been a short run-through of the tips and tricks we remember having used. If you know of other nice ways to develop Logstash configurations, please comment below.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/04/10/how-to-develop-logstash-configuration-files/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Elastic{ON}15: Day two</title>
		<link>http://blog.comperiosearch.com/blog/2015/03/19/elasticon15-day-two/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/03/19/elasticon15-day-two/#comments</comments>
		<pubDate>Thu, 19 Mar 2015 20:59:41 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[aggregations]]></category>
		<category><![CDATA[elastic]]></category>
		<category><![CDATA[Elasticon]]></category>
		<category><![CDATA[facebook]]></category>
		<category><![CDATA[goldman sachs]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[mars]]></category>
		<category><![CDATA[Microsoft]]></category>
		<category><![CDATA[nasa]]></category>
		<category><![CDATA[resiliency]]></category>
		<category><![CDATA[security]]></category>
		<category><![CDATA[shield]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3411</guid>
		<description><![CDATA[March 11, 2015 Keynote Fighting the crowds to find a seat for the keynote at Day 2 at elastic{ON}15 we were blocked by a USB stick with the curious caption  Microsoft (heart) Linux. Things have certainly changed. Microsoft The keynote, led by Elastic SVP of sales Aaron Katz, included Pablo Castro of Microsoft who was [...]]]></description>
				<content:encoded><![CDATA[<h6>March 11, 2015</h6>
<h4>Keynote</h4>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/msheartlinux.jpg"><img class="alignright size-medium wp-image-3412" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/msheartlinux-300x118.jpg" alt="msheartlinux" width="300" height="118" /></a>Fighting the crowds to find a seat for the keynote on day 2 of elastic{ON}15, we were greeted by a USB stick with the curious caption Microsoft (heart) Linux. Things have certainly changed.</p>
<p><span id="more-3411"></span></p>
<h5>Microsoft</h5>
<p>The keynote, led by Elastic SVP of sales Aaron Katz, included Pablo Castro of Microsoft, who was keen to explain how this probably isn’t so far from the truth. Elasticsearch is used internally in several Microsoft products, alongside Linux and other open source software, and this is a huge change from the Microsoft we knew around five years ago. Pablo revealed some details about how Elasticsearch is used as a data storage and search platform in MSN, Microsoft Dynamics and Azure Search. Microsoft truly has gone through some fundamental changes lately, embracing open source both internally and externally. We see this as a demonstration of the power of open source and the huge value Elastic(search) brings to many organizations. As Jordan Sissel said in the keynote yesterday, “If a user has a problem, it is a bug”. This is a philosophical stance toward a conception of software as an enabler of creativity and growth, in contrast to viewing software as a fixed product packaged for sale.</p>
<h5>Goldman Sachs</h5>
<p>Microsoft’s contribution was the middle part of the keynote. The first part was a discussion with Don Duet, managing director of Goldman Sachs. Goldman Sachs provides financial services on a global scale, and has been at the forefront of technology since its inception in 1869. They were an early adopter of Elasticsearch, since it was an easy-to-use search and analytics tool for big data. Goldman Sachs is now using Elasticsearch extensively as a key part of their technology stack.</p>
<h5>NASA</h5>
<p>The most mind-blowing part of the keynote was the last one, held by two chaps from the Jet Propulsion Laboratory team at NASA, Ricky Ma and Don Isla. They first showed their awesome internal search with previews and built-in rank tuning. Then they talked about the Mars Curiosity rover, a robot planted on Mars which runs around taking samples and selfies. It constantly sends data back to Earth, where the JPL team analyzes the operations of the rover. Elasticsearch is naturally at the center of this interplanetary operation, nothing less.</p>
<div style="width: 352px" class="wp-caption alignright"><img src="http://i.imgur.com/UACwKNR.jpg" alt="It definitely takes better selfies than me" width="342" height="240" /><p class="wp-caption-text">Mars Curiosity Rover Selfie</p></div>
<p>The remainder of the day contained sessions across the same three tracks as the first day. In addition, five tracks of birds-of-a-feather or &#8220;lounge&#8221; sessions were held, where people gathered in smaller groups to discuss various topics. Needless to say, the breadth of the program meant we were stretched thin. We chose to focus on three topics that are of particular importance to our customers: aggregations, security &amp; Shield, and resiliency.</p>
<h4>More aggregations</h4>
<p>Adrien Grand &amp; Colin Goodheart-Smithe did a deep dive into the details of aggregations and how they are computed, in particular how to tune them and what the results cost in terms of execution complexity. A key point is the approximations that are employed to compute some of the aggregations, which trade a certain amount of accuracy for speed. Aggregations are a very powerful feature, but require some planning to be feasible and efficient.</p>
<h4><b>Security/Shield</b></h4>
<p>Uri Boness talked about Shield and the current state of authentication &amp; authorization, and he provided some pointers to what is on the roadmap for the coming releases. Unfortunately, there do not appear to be any concrete plans for providing built-in document-level security. This is a sought-after feature that would certainly make the product more interesting in many enterprise settings. Then again, there are companies who provide connector frameworks that include security solutions for Elasticsearch. We had a chat with some of them at the conference, including Enonic, SearchBlox and Search Technologies.</p>
<h4><b>Facebook</b></h4>
<p>Peter Vulgaris from Facebook explained how they are using elasticsearch. To me, the story resembled Microsoft’s. Facebook has heaps of data, and lots of use cases for it. Once they started to use elasticsearch it was widely adopted in the company and the amount of data indexed grew ever larger which forced them to think more closely about how they manage their clusters.</p>
<p>&nbsp;</p>
<h4><b>Resiliency</b></h4>
<p>Elasticsearch is a distributed system, and as such shares the same potential issues as other distributed systems. Boaz Leskes &amp; Igor Motov explained the measures that have been undertaken to avoid problems such as the &#8220;split-brain&#8221; syndrome, where a cluster is confused about which node should be considered the master. Data safety and security are important features of Elasticsearch, and there is a continuous effort in place in these areas.</p>
<p>&nbsp;</p>
<h4><b>Lucene</b></h4>
<p>Stepping back to day 1 and the Lucene session featuring the mighty Robert Muir, we learned that Lucene version 5 includes a lot of improvements, especially performance-wise: better compression at both indexing and query time enables faster execution and reduced resource consumption. Efforts have also been made in the Lucene core to merge queries and filters as two sides of the same coin; after all, a query is just a filter with a relevance score. On another note, Lucene will now handle caching of queries by itself.</p>
<h4><b>Wrapping it up</b></h4>
<p>Elastic{ON}15 stands as a confirmation of the attitudes that were essential in the creation of the Elasticsearch project. The visions that guided the early development are still valid today, except that the scale is larger. The recent emphasis on stability, security and resiliency will welcome a new wave of users and developers.</p>
<p>At the same time there is a continuous exploration and development into big data related analytics but with the speed and agility we have come to expect from Elasticsearch.</p>
<p>Thanks for this year, looking forwards to next!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/03/19/elasticon15-day-two/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Elastic{ON}15: Day one</title>
		<link>http://blog.comperiosearch.com/blog/2015/03/11/elasticon15-day-one/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/03/11/elasticon15-day-one/#comments</comments>
		<pubDate>Wed, 11 Mar 2015 16:07:48 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[.net]]></category>
		<category><![CDATA[aggregations]]></category>
		<category><![CDATA[Elasticon]]></category>
		<category><![CDATA[found]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[san francisco]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3393</guid>
		<description><![CDATA[March 10, 2015 At Comperio we have been speculating for a while now that Elasticsearch might just drop search from their name. With Elasticsearch spearheading the expansion of search into analytics and all sorts of content and data driven applications such a change made sense to us. What the name would be we had no [...]]]></description>
				<content:encoded><![CDATA[<h6>March 10, 2015<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_1112452cropped.jpg"><img class="alignright size-medium wp-image-3396" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_1112452cropped-300x140.jpg" alt="IMG_20150310_111245~2cropped" width="300" height="140" /></a></h6>
<p>At Comperio we have been speculating for a while now that Elasticsearch might just drop search from their name. With Elasticsearch spearheading the expansion of search into analytics and all sorts of content- and data-driven applications, such a change made sense to us. What the name would be, however, we had no idea &#8211; ElasticStash, KibanElastic, StashElasticLog &#8211; none of these really rolled off the tongue like a proper brand.</p>
<p>More surprising is Elasticsearch&#8217;s move into the cloud space by acquiring Found. A big and heartfelt congratulations to our Norwegian colleagues from us at Comperio. Found has built and delivered an innovative and solid product, and we look forward to seeing them build something even better as a part of Elastic.</p>
<p>Elasticsearch is renamed to Elastic, and Found is no longer just Found, but Found by Elastic. The opening keynote, held by CEO Steven Schuurman and Shay Banon, was a tour of triumph through the history of Elastic, detailing how the company has grown in an organic, natural manner into what it is today. Kibana and Logstash started as separate projects but were soon integrated into Elastic. Shay and Steven explained how old roadmaps for the development of Elastic included plans to create CloudES, search as a cloud service. CloudES was never created, due to all the other pressing issues. Simultaneously, the Norwegian company Found made great strides with their cloud search offering, and an acquisition became a very natural fit.</p>
<p>Elastic{ON} is the first conference devoted entirely to the Elastic family of products. The sessions consist on the one hand of presentations by developers and employees of Elastic, and on the other of “ELK in the wild” sessions showcasing customer use cases, including Verizon, GitHub, Facebook and more.</p>
<p>On day one the sessions about core elasticsearch, Lucene, Kibana and Logstash were of particular interest to us.</p>
<h4><strong>Elasticsearch</strong></h4>
<p>The session about “Recent developments in elasticsearch 2.0”, held by Clinton Gormley and Simon Willnauer, revealed a host of interesting new features in the upcoming 2.0 release. There is a very strong focus on stability and on making sure that releases are free of bugs. To illustrate this, Clinton showed graphs comparing the number of lines of code to the number of lines of tests, where the latter has been rising sharply in the latest releases. It was also interesting to note that the number of lines of code has been reduced recently due to refactoring and other improvements to the code base.</p>
<p>Among the interesting new features are a new “reducer” step for aggregations, allowing calculations to be done on top of aggregated results, and a Changes API, which helps manage changes to the index. The Changes API will be central in creating other features, for example update by query. A typical use case is logging search results: the Changes API will allow updating information about click activity in the same log entry as the one containing the query.</p>
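<p>As a sketch of what the “reducer” step could look like in request form (the field names, the aggregation names and the exact syntax here are my own assumptions, written out as a Python dict rather than raw JSON):</p>

```python
# A per-day sum is computed first; the "reducer"/pipeline step then runs a
# second calculation on top of the bucketed results. Field names
# ("timestamp", "bytes") and aggregation names are made-up examples.
query = {
    "size": 0,
    "aggs": {
        "per_day": {
            "date_histogram": {"field": "timestamp", "interval": "day"},
            "aggs": {"daily_bytes": {"sum": {"field": "bytes"}}},
        },
        # the second-pass step: average the per-day sums across all buckets
        "avg_daily_bytes": {
            "avg_bucket": {"buckets_path": "per_day>daily_bytes"}
        },
    },
}
```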
<p>There will also be a Reindex API that simplifies the development cycle when you have to refeed an entire index because you need to change a mapping or field type.</p>
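<p>A reindex call could then look something like the sketch below. The index names are made up, and the final request shape was not settled at the time, so treat this purely as an illustration of the idea:</p>

```python
# Hypothetical sketch: copy every document from "logs-v1" into "logs-v2",
# where "logs-v2" has been created with the corrected mapping.
reindex_body = {
    "source": {"index": "logs-v1"},
    "dest": {"index": "logs-v2"},
}
# This body would be POSTed to the reindex endpoint, e.g.
# requests.post("http://localhost:9200/_reindex", json=reindex_body)
```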
<h4>Kibana</h4>
<p>Rashid Khan went through the motivations behind the development of Kibana 4, where support for aggregations, along with making the product easier to work with and to extend, really turns this into a fitting platform for building data visualization tools. This was followed by “The Contributor&#8217;s Guide to the Kibana Galaxy” by Spencer Alger, who demoed how to set up the development environment for Kibana 4 using npm, grunt and bower &#8211; the web development standard toolset of today (or was it yesterday?)</p>
<h4>Logstash</h4>
<p>Logstash creator Jordan Sissel presented the new features of Logstash 1.5, and what to expect in future versions. 1.5 introduces a new plugin system, and to the great relief of all Windows users out there, the issues regarding file locking on rolling log files have been resolved! The roadmap also aims to vastly improve the reliability of Logstash: no more losing documents in planned or unplanned outages. In addition there are plans to add event persistence and various API management tools. As a consequence of the river technology being deprecated, Logstash will take on the role of document processing framework that those of us who come from FAST ESP have missed for some time now. So in effect, all rivers (including JDBC) will be ported to Logstash.</p>
<h4>Aggregations</h4>
<p>Mark Harwood presented a novel take on optimizing index creation for aggregations in the session “Building Entity Centric Indexes”. You may have tried to run some fancy aggregations, only to have elasticsearch die from out-of-memory errors. Avoiding this often takes some insight into the architecture in order to structure your aggregations in the best possible manner. Mark essentially showed how to move some of the aggregation work to indexing time rather than query time. The original use case was a customer who needed to know the average session length for the users of his website. Figuring that out involved running through the whole index, sorting by session id, and subtracting the timestamp of the first event in each session from the last: a lot of operations with an enormous consumption of resources. Mark approaches such problems in a creative and mathematical manner, and it is always inspiring to attend his presentations. It will be interesting to see whether the Changes API mentioned above will deliver functionality that can be used to improve aggregated data.</p>
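<p>The entity-centric idea can be sketched in a few lines: collapse the raw event stream into one document per session at indexing time, so the average session length becomes a cheap metric aggregation instead of a full index scan. The field names here are my own assumptions, not Mark&#8217;s actual schema:</p>

```python
from collections import defaultdict

# Raw click events, one document per event (field names are made up).
events = [
    {"session_id": "s1", "timestamp": 100},
    {"session_id": "s1", "timestamp": 160},
    {"session_id": "s2", "timestamp": 200},
    {"session_id": "s2", "timestamp": 230},
]

# Group the timestamps by session before indexing.
sessions = defaultdict(list)
for event in events:
    sessions[event["session_id"]].append(event["timestamp"])

# One "entity" document per session; an avg aggregation over
# session_length now answers the original question directly.
session_docs = [
    {"session_id": sid, "session_length": max(ts) - min(ts)}
    for sid, ts in sessions.items()
]
```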
<h4>.NET</h4>
<p>The deep dive into the .NET clients with Martijn Laarman showed how to use a strongly typed language such as C# with elasticsearch. Yes, it is actually possible, and it looked very good. There is a low-level client that just connects to the API, where you have to do all the parsing yourself, and a high-level client called NEST built on top of that, offering a strongly typed query DSL with an almost 1-to-1 mapping to the elasticsearch DSL. Particularly nifty was the covariant result handling, where you can specify the type of results you want back, considering that a search result from elasticsearch can contain many types.</p>
<p>Looking forward to day 2!<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_213606.jpg"><img class="alignright size-medium wp-image-3391" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/IMG_20150310_213606-300x222.jpg" alt="IMG_20150310_213606" width="300" height="222" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/03/11/elasticon15-day-one/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Kibana 4 &#8211; the beer analytics engine</title>
		<link>http://blog.comperiosearch.com/blog/2015/02/09/kibana-4-beer-analytics-engine/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/02/09/kibana-4-beer-analytics-engine/#comments</comments>
		<pubDate>Mon, 09 Feb 2015 00:20:36 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3226</guid>
		<description><![CDATA[Kibana 4 is a great tool for analyzing data. Vinmonopolet, the Norwegian government owned alcoholic beverage retail monopoly, makes their list of products available online in an easily digestible csv format. So, what beer should I buy next? Kibana will soon tell me. Kibana 4 is a data visualization and analytics tool for elasticsearch. Kibana [...]]]></description>
				<content:encoded><![CDATA[<p>Kibana 4 is a great tool for analyzing data. Vinmonopolet, the Norwegian government owned alcoholic beverage retail monopoly, makes their list of products available online in an <a href="http://www.vinmonopolet.no/artikkel/om-vinmonopolet/datadeling">easily digestible csv format</a>. So, what beer should I buy next? Kibana will soon tell me.</p>
<p><span id="more-3226"></span></p>
<p>Kibana 4 is a data visualization and analytics tool for elasticsearch. Kibana 4 was launched in February 2015, and builds on top of Kibana 3, incorporating user feedback and recent developments in elasticsearch, the most mind-blowing being the support for aggregations. Aggregations are like facets/navigators/refiners on steroids, with a lot of advanced options for data drill-down. But no matter how easy a tool is to use, it only gets interesting once we have some questions that need to be answered. So what I want to know is:</p>
<h4>1. What beer gives the most value for money?</h4>
<h4>2. What is the most Belgian of Belgian beers?</h4>
<h4>3. Which of the most Belgian beers give the most value for money?</h4>
<p>The dataset from Vinmonopolet does not contain the important metric &#8220;price pr unit of alcohol&#8221;. So to begin with, we need to add that. It could have been done in Excel, or as part of preprocessing, but since this post isn&#8217;t about how to get data indexed in elasticsearch, we use a nice new feature of Kibana that lets you add calculated fields.</p>
<p>In the Settings -&gt; Indices section, there is an option to create a Scripted Field.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/scriptedfield2.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/scriptedfield2.png" alt="scriptedfield2" width="426" height="309" class="alignright size-full wp-image-3352" /></a></p>
<p>The field for price pr. unit of alcohol is added as a calculation in the scripted field, flooring the number to the nearest integer. Scripting is done using Lucene Expressions, after some vulnerabilities were discovered in using Groovy as the scripting language (this changed from the RC to the final release of Kibana).</p>
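<p>Stripped of the Kibana specifics, the scripted field boils down to a calculation like this (the column names and the exact formula are my assumptions about the Vinmonopolet data, shown here in Python rather than Lucene Expression syntax):</p>

```python
from math import floor

# Price divided by the units of alcohol in the bottle, floored to the
# nearest integer, mirroring what the Lucene expression computes.
def price_per_alcohol(price_nok, volume_litres, alcohol_pct):
    units_of_alcohol = volume_litres * alcohol_pct
    return floor(price_nok / units_of_alcohol)
```

<p>For example, a 0.75 l bottle at 8% alcohol costing 89.9 NOK gives <code>price_per_alcohol(89.9, 0.75, 8.0)</code> = 14.</p>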
<h3>What beer gives the most value for money?</h3>
<p>Now we can create a nice little bar chart in Kibana, using the minimum pricePrAlcohol as the Y-axis and the bottom terms of Varenavn as the X-axis.</p>
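<p>For the curious, that chart configuration corresponds roughly to a terms aggregation ordered ascending by a nested min metric, something like this sketch (assuming the scripted field is exposed under the name pricePrAlcohol):</p>

```python
# Terms aggregation on the product name ("Varenavn"), ordered ascending
# by the minimum of pricePrAlcohol, so the cheapest alcohol tops the list.
query = {
    "size": 0,
    "aggs": {
        "beers": {
            "terms": {
                "field": "Varenavn",
                "size": 10,
                "order": {"min_price": "asc"},
            },
            "aggs": {"min_price": {"min": {"field": "pricePrAlcohol"}}},
        }
    },
}
```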
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/minPriceBeers2.png"><img class="alignright size-full wp-image-3273" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/minPriceBeers2.png" alt="minPriceBeers" width="583" height="491" /></a></p>
<p>&nbsp;</p>
<p>The chart reveals that the beer with the best alcohol/price ratio is <a href="http://www.sheltonbrothers.com/beers/mikkeller-arh-hvad/">Mikkeller Årh Hvad?!</a>, a very nice beer; I had it last week. Mikkeller is a Danish brewery, but they brew most of their beer in Belgium, so this is actually a Belgian beer.</p>
<h3>What is the most Belgian of Belgian beers?</h3>
<p>Next up I want to figure out what is the most Belgian of Belgian beers. Most of the products in Vinmonopolet&#8217;s catalogue have entries for &#8220;Smak&#8221;, or &#8220;Taste&#8221;. Let&#8217;s put the significant terms aggregation to work on &#8220;Smak&#8221; and see what falls out.</p>
<p><img class="alignright wp-image-3294 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/beersigtermspie-293x300.png" alt="beersigtermspie" width="293" height="300" /><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/beersigtermslegend.png"><img class="alignright wp-image-3293 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/beersigtermslegend.png" alt="beersigtermslegend" width="96" height="439" /></a></p>
<p>The pie chart shows countries in the inner circle and significant terms in the outer circle. The largest slice belongs to Norwegian beers, as shown in the legend on the right. Using Kibana, you can also hover over the entries to highlight the selection in the pie chart, a very nice feature, especially for the colour-challenged part of the population that is unable to match colors. Kibana allows drill-down by clicking on pie slices, and you can see the data table and other details by clicking on the small arrow at the bottom.</p>
<p>The most significant terms for Belgian beers according to this query are &#8220;bread&#8221;, &#8220;yeast&#8221;, &#8220;malt&#8221; and &#8220;malty&#8221;. That&#8217;s hardly surprising, since this is beer; we should expect something a little more specific. The significant terms aggregation returns terms that are more frequent in a foreground selection compared to a background selection. In our case, we select products of type beer from the country Belgium, and the background is by default the contents of the entire index, or in other words, the complete product catalog from Vinmonopolet. This catalog contains a vast amount of wine, liquor and other irrelevant items. Since we are really only interested in the significant terms of Belgian beers compared to other beers, we can add a custom parameter to select the background manually. Paste this into the JSON input of the advanced section.</p><pre class="crayon-plain-tag">{
    "background_filter": {
        "term": {
            "Varetype": "Øl"
        }
    }
}</pre><p>Using this filter, the significant terms for Belgian beers are &#8220;impact, plum, lemon, bread&#8221;.</p>
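<p>Put together, the request Kibana ends up sending looks roughly like this sketch (Kibana adds its own wrapping, and the field names are taken from the Vinmonopolet CSV, so treat this as an approximation):</p>

```python
# Significant terms on "Smak" (taste), foreground restricted to Belgian
# beers, background narrowed from the whole index to all beer.
query = {
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {"term": {"Land": "Belgia"}},
                {"term": {"Varetype": "Øl"}},
            ]
        }
    },
    "aggs": {
        "belgian_taste": {
            "significant_terms": {
                "field": "Smak",
                "background_filter": {"term": {"Varetype": "Øl"}},
            }
        }
    },
}
```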
<p>&nbsp;</p>
<p>What beers actually match these descriptions? Some suggestions can be revealed through nesting an aggregation on product name, on top of the one we already have.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeersWithSigTerms.png"><img class="alignright size-full wp-image-3275" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeersWithSigTerms.png" alt="belgianbeersWithSigTerms" width="827" height="487" /></a></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>The non-colour-challenged may easily see that Het Anker Lucifer matches both &#8220;anslag&#8221; (impact) and &#8220;sitrus&#8221; (lemon). Some beers match two terms, others match one; none match all four. Ideally, the most Belgian of Belgian beers should contain all the most significant terms. The significant terms are &#8220;impact, bread, lemon, plum&#8221; (&#8220;anslag, brødbakst, sitrus, plomme&#8221;). Typing this as a Lucene query into the Discover tab in Kibana:</p><pre class="crayon-plain-tag">Land:Belgia AND Smak:sitrus,plomme,br&oslash;dbakst,anslag AND Varetype:&Oslash;l</pre><p>returns &#8220;<span style="color: #444444;">Silly Green Killer IPA&#8221; as result number 1, with Smak: &#8220;<strong>Fruktig <mark>anslag</mark> med <mark>sitrus</mark>, korn, humle og <mark>brødbakst</mark>. Lang, frisk avslutning.&#8221; </strong>(&#8220;Fruity impact with lemon, grain, hops and bread. Long, fresh finish.&#8221;), containing three of the terms: impact, lemon and bread. Since no beer contains all four terms, we can hereby pronounce a winner of the most Belgian of all Belgian beers according to the Vinmonopolet catalogue (and a ridiculous significant terms trick): Silly Green Killer IPA! Congratulations! </span></p>
<h3>Which of the most Belgian beers give the most value for money?</h3>
<p>The previous investigation did not take economic considerations into account. Using the Line Chart, I reuse the saved search from the previous query, add the minimum pricePrAlcohol as the Y-axis, and set the X-axis to the terms aggregation for Varenavn (product name), bumping it up to 52 entries to make sure it contains all the results. The graph shows all beers containing at least one of our sought-after terms. The Silly Green Killer IPA can be found in the upper quarter of the table, with a price pr alcohol unit of 27.51. Abbaye de Rocs Bruin comes in as the winner at the bottom edge of the scale, with a whopping 13.43 NOK pr alcohol unit, its Smak field containing only the term &#8220;sitrus&#8221; (lemon).</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeerPriceDist.png"><img class="aligncenter wp-image-3302 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeerPriceDist-e1423440118217.png" alt="belgianbeerPriceDist" width="700" height="361" /></a></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>It would be nice to see which terms each beer contains, to enable a qualified judgement. Kibana allows splitting the display into several graphs. I will use this together with the filter aggregation to show one graph for each of the significant terms.</p>
<p>&nbsp;</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeerswithsigtermsAndAlc.png"><img class="alignleft wp-image-3303" src="http://blog.comperiosearch.com/wp-content/uploads/2015/02/belgianbeerswithsigtermsAndAlc.png" alt="belgianbeerswithsigtermsAndAlc" width="701" height="405" /></a></p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>The graphs are, from top to bottom: sitrus (lemon), brødbakst (bread), anslag (impact), plomme (plum). The colors indicate alcohol content.</p>
<p>In this post, I have tried to show how you can use Kibana 4 and elasticsearch for data exploration and analysis. Please use the comment form below or contact me if you have any questions. If you enjoyed this article, why don&#8217;t you give me a <a href="https://untappd.com/user/Babadofar">toast on Untappd</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/02/09/kibana-4-beer-analytics-engine/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>
