<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; Technology</title>
	<atom:link href="http://blog.comperiosearch.com/blog/category/tech/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Sitevision – improving search with Nutch</title>
		<link>http://blog.comperiosearch.com/blog/2016/06/08/sitevision-forbattra-soket-med-nutch/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/06/08/sitevision-forbattra-soket-med-nutch/#comments</comments>
		<pubDate>Wed, 08 Jun 2016 14:20:07 +0000</pubDate>
		<dc:creator><![CDATA[Jack Thorén]]></dc:creator>
				<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Sitevision]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[CRM]]></category>
		<category><![CDATA[web crawler]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4096</guid>
		<description><![CDATA[One of Sweden's most popular CMS tools is Sitevision, used perhaps primarily by large government agencies and municipalities. These agencies and municipalities probably choose Sitevision because it is very easy for editors and page owners to use and to maintain the information on their pages. This in an environment where [...]]]></description>
				<content:encoded><![CDATA[<p>One of Sweden&#8217;s most popular CMS tools is Sitevision, used perhaps primarily by large government agencies and municipalities. These agencies and municipalities probably choose Sitevision because it is very easy for editors and page owners to use and to maintain the information on their pages, in an environment where the in-house web expertise is perhaps not at the same level as at a larger technology company.</p>
<p>But while we praise the simple user interface, we wish it were possible to build better search functionality. Sure, you can search within a website, and you can even search across other websites if you have set up several websites in the same system. But if you want to search a website or database that lives somewhere else, that is not possible. This is about to change, though: Sitevision will soon introduce Nutch, a highly advanced web crawler built on Hadoop, which in turn is part of a framework for handling very large amounts of data. Nutch together with Solr will lift Sitevision&#8217;s search to new heights.</p>
<p><strong>Below is a diagram of what a site indexing run could look like:</strong></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/06/Jack_blog011.png"><img class="alignnone  wp-image-4098" src="http://blog.comperiosearch.com/wp-content/uploads/2016/06/Jack_blog011.png" alt="Jack_blog01" width="605" height="287" /></a></p>
<p>1. The Injector takes all the URLs in the nutch.txt file and adds them to the CrawlDB, which is a central part of Nutch. The CrawlDB holds information about all known URLs (fetch schedule, fetch status, metadata, …).</p>
<p>2. Based on the data in the CrawlDB, the Generator creates a list of what is to be fetched and places it in a newly created segment directory.</p>
<p>3. In the next step, the Fetcher takes the URLs to be fetched from that list and writes the results back to the segment directory. This step is usually the most time-consuming part.</p>
<p>4. The Parser can now process the content of each web page, for example stripping out all HTML tags. If this crawl is an update or an extension of a previous one (e.g. depth 3), the Updater would add the new data to the CrawlDB as a next step.</p>
<p>5. Before indexing, all links must be inverted by the Link Inverter, which takes into account that it is not the number of outgoing links on a web page that is of interest, but rather the number of incoming links. This is quite similar to how Google PageRank works and is important for the scoring function. The inverted links are stored in the LinkDB.</p>
<p>6-7. Using data from all the available sources (CrawlDB, LinkDB and the segments), the Indexer creates an index and stores it in Solr. For the indexing, the popular Lucene library is used. The user can now search for information about the crawled web pages via Solr.</p>
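<p>The numbered steps above map onto Nutch&#8217;s command-line tools. The sketch below is illustrative rather than taken from Sitevision&#8217;s integration: the directory names, seed directory and ordering follow the standard Nutch 1.x tutorial, and exact CLI flags vary between Nutch versions.</p>

```python
# One Nutch crawl round expressed as a sequence of CLI invocations.
# All paths are illustrative placeholders.

def nutch_round(crawl_dir="crawl", seed_dir="urls"):
    """Return the commands for one inject/generate/fetch/parse/
    updatedb/invertlinks/index round, in order."""
    crawldb = f"{crawl_dir}/crawldb"
    linkdb = f"{crawl_dir}/linkdb"
    segments = f"{crawl_dir}/segments"
    segment = f"{segments}/20160608120000"  # actually created by 'generate'
    return [
        ["bin/nutch", "inject", crawldb, seed_dir],              # 1. Injector
        ["bin/nutch", "generate", crawldb, segments],            # 2. Generator
        ["bin/nutch", "fetch", segment],                         # 3. Fetcher
        ["bin/nutch", "parse", segment],                         # 4. Parser
        ["bin/nutch", "updatedb", crawldb, segment],             # 4. Updater
        ["bin/nutch", "invertlinks", linkdb, "-dir", segments],  # 5. Link Inverter
        ["bin/nutch", "index", crawldb, "-linkdb", linkdb, segment],  # 6-7. Indexer
    ]

# Each command list can be handed to subprocess.run(cmd, check=True).
```

Each step is a separate job, which is why a full crawl is really a loop of such rounds rather than a single command.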
<p><strong>Functionality that comes with Nutch:</strong></p>
<ul>
<li>Indexing of external sources</li>
<li>Automatic categorization</li>
<li>Metadata</li>
<li>Text analysis</li>
<li>Functionality easily extended with plugins</li>
</ul>
<p><strong>Useful links:</strong></p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Apache_Nutch">https://en.wikipedia.org/wiki/Apache_Nutch</a></li>
<li><a href="http://wiki.apache.org/nutch/">http://wiki.apache.org/nutch/</a></li>
<li><a href="http://nutch.apache.org/">http://nutch.apache.org/</a></li>
<li><a href="http://www.sitevision.se/vara-produkter/sitevision.html">http://www.sitevision.se/vara-produkter/sitevision.html</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/06/08/sitevision-forbattra-soket-med-nutch/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Experimenting with Open Source Web Crawlers</title>
		<link>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/#comments</comments>
		<pubDate>Fri, 29 Apr 2016 11:03:42 +0000</pubDate>
		<dc:creator><![CDATA[Mridu Agarwal]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OpenWebSpider]]></category>
		<category><![CDATA[Scrapy]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4080</guid>
		<description><![CDATA[Whether you want to do market research, gather financial risk information, or just get news about your favorite footballer from various news sites, web scraping has many uses. In my quest to learn more about web crawling and scraping, I decided to test a couple of open source web crawlers that were not [...]]]></description>
				<content:encoded><![CDATA[<p lang="en-US">Whether you want to do market research, gather financial risk information, or just get news about your favorite footballer from various news sites, web scraping has many uses.</p>
<p lang="en-US">In my quest to learn more about web crawling and scraping, I decided to test a couple of open source web crawlers that were not only easily available but quite powerful as well. In this article I will mostly cover their basic features and how easy they are to get started with.</p>
<p lang="en-US">If you are the kind of person who likes to get started quickly when learning something, I would suggest that you try <a href="http://www.openwebspider.org/">OpenWebSpider</a> first.</p>
<p lang="en-US">It is a simple browser-based open source crawler and search engine that is easy to install and use, and it is a good fit for anyone getting acquainted with web crawling. It stores web pages in MySQL or MongoDB; I used MySQL for my testing. You can follow the steps <a href="http://www.openwebspider.org/documentation/openwebspider-js/">here</a> to install it. It&#8217;s pretty simple and basic.</p>
<p lang="en-US">Once you have installed everything, you just need to open a web browser at <a href="http://127.0.0.1:9999/">http://127.0.0.1:9999/</a> and you are ready to crawl and search. Just check your database settings, type the URL of the site you want to crawl, and within a couple of minutes you have all the data you need. You can even search it by going to the search tab and typing in your query. Whoa! That was quick and compact, and needless to say you don&#8217;t need any programming skills to do it.</p>
<p lang="en-US">If you are trying to create an offline copy of your data, or your very own mini Wikipedia, I think you should go for this one, as it&#8217;s the easiest way to do it.</p>
<p lang="en-US">Here are some screenshots:</p>
<p lang="en-US"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS1.png"><img class="alignleft wp-image-4083 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS1.png" alt="OpenWebSpider" width="613" height="438" /></a></p>
<p lang="en-US"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS2.png"><img class="alignleft wp-image-4086 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS2.png" alt="OpenSearchWeb" width="611" height="441" /></a></p>
<p lang="en-US" style="text-align: left"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS3.png"><img class="alignleft size-full wp-image-4087" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS3.png" alt="OpenSearchWeb" width="611" height="441" /></a></p>
<p lang="en-US" style="text-align: left">You can also see this search engine demo <a href="http://lab.openwebspider.org/search_engine/">here</a> before actually getting started.</p>
<p lang="en-US" style="text-align: left">OK, after getting my hands dirty with web crawling, I was curious to try more sophisticated things, like extracting topics from a website that offers no RSS feed or API. Extracting this kind of structured data can be quite important in many business scenarios, for example when you are following a competitor&#8217;s product news or gathering data for business intelligence. I decided to use <a href="http://scrapy.org/">Scrapy</a> for this experiment.</p>
<p lang="en-US" style="text-align: left">The good thing about Scrapy is that it is not only fast and simple, but very extensible as well. While installing it in my Windows environment I had a few hiccups, mainly because of finding a compatible version of Python, but in the end, once you get it working, it&#8217;s very simple (isn&#8217;t that how everything feels once it works? Anyway, forget it! :D). Follow these links if you, like me, have trouble installing Scrapy:</p>
<p lang="en-US" style="text-align: left"><a href="https://github.com/scrapy/scrapy/wiki/How-to-Install-Scrapy-0.14-in-a-64-bit-Windows-7-Environment">https://github.com/scrapy/scrapy/wiki/How-to-Install-Scrapy-0.14-in-a-64-bit-Windows-7-Environment</a></p>
<p lang="en-US" style="text-align: left"><a href="http://doc.scrapy.org/en/latest/intro/install.html#intro-install">http://doc.scrapy.org/en/latest/intro/install.html#intro-install</a></p>
<p lang="en-US" style="text-align: left">After installing, you need to create a Scrapy project. Since we are doing something more customized than just crawling an entire website, this requires more effort, some programming skills, and sometimes browser tools to understand the HTML DOM. You can follow <a href="http://doc.scrapy.org/en/latest/intro/overview.html">this</a> link to get started with your first Scrapy project. Once you have crawled the data you need, it is interesting to feed it into a search engine. I had also been looking for open source web crawlers for Elasticsearch, and this looked like the perfect opportunity. Scrapy provides integration with Elasticsearch out of the box, which is awesome. You just need to install the Elasticsearch module for Scrapy (of course, Elasticsearch should be running somewhere) and configure the Item Pipeline for Scrapy. Follow <a href="http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html">this</a> link for a step-by-step guide. Once done, you have a fully integrated crawler and search system!</p>
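<p lang="en-US" style="text-align: left">As a sketch of what that wiring looks like, the project&#8217;s settings.py ends up with something along these lines. The pipeline path and setting names follow the scrapy-elasticsearch module described in the linked guide and may differ between versions; the host, index and unique key are placeholders.</p>

```python
# settings.py (sketch) -- routes scraped items into Elasticsearch via the
# scrapy-elasticsearch pipeline from the linked guide. Adjust names to the
# module version you actually install.
ITEM_PIPELINES = [
    'scrapyelasticsearch.ElasticSearchPipeline',
]
ELASTICSEARCH_SERVER = 'localhost'    # where Elasticsearch is running
ELASTICSEARCH_PORT = 9200
ELASTICSEARCH_INDEX = 'scrapy'        # index name used in the query later in the post
ELASTICSEARCH_TYPE = 'healthitems'
ELASTICSEARCH_UNIQ_KEY = 'url'        # field used to deduplicate items
```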
<p lang="en-US" style="text-align: left">I crawled <a href="http://primehealthchannel.com">http://primehealthchannel.com</a> and created an index named &#8220;healthitems&#8221; in Scrapy.</p>
<p lang="en-US" style="text-align: left">To search the Elasticsearch index, I am using the Chrome extension <span style="font-weight: bold">Sense</span> to send queries to Elasticsearch, and this is how it looks:</p>
<p lang="en-US" style="text-align: left">GET /scrapy/healthitems/_search</p>
<p style="text-align: left"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/ES1.png"><img class="alignleft wp-image-4082 size-large" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/ES1-1024x597.png" alt="Elastic Search" width="1024" height="597" /></a></p>
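<p lang="en-US" style="text-align: left">The same query can be sent with any HTTP client, not just Sense. A small sketch using only the Python standard library (the host and port are assumptions for a local Elasticsearch):</p>

```python
# Build the same search request that Sense sends as
# GET /scrapy/healthitems/_search. Host/port are placeholders.
import json
from urllib import request

def build_search(host, index, doc_type):
    """Return the URL and JSON body for a match_all _search request."""
    url = f"http://{host}/{index}/{doc_type}/_search"
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    return url, body

url, body = build_search("127.0.0.1:9200", "scrapy", "healthitems")
# To execute against a running cluster:
# print(request.urlopen(request.Request(url, data=body)).read())
```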
<p lang="en-US" style="text-align: left">I hope you had fun reading this and now want to try some of your own cool ideas. Do let us know how it goes and which crawler you like the most!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>External data searchable in SharePoint Online</title>
		<link>http://blog.comperiosearch.com/blog/2016/04/26/extern-data-sokbart-i-sharepoint-online/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/04/26/extern-data-sokbart-i-sharepoint-online/#comments</comments>
		<pubDate>Tue, 26 Apr 2016 14:33:45 +0000</pubDate>
		<dc:creator><![CDATA[Joel Lindefors]]></dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Extern data]]></category>
		<category><![CDATA[External data]]></category>
		<category><![CDATA[OData]]></category>
		<category><![CDATA[OData Service]]></category>
		<category><![CDATA[SharePoint Online]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4063</guid>
		<description><![CDATA[In my previous blog post about external data in SharePoint Online, I presented an example of how we can work with external data in SharePoint Online without having to write code. The drawback of that solution was that the data was not searchable, since we used an external content type that was [...]]]></description>
				<content:encoded><![CDATA[<p><strong>In my previous <a href="http://blog.comperiosearch.com/blog/2016/04/26/extern-data-i-sharepoint-online/" target="_blank">blog post</a> about external data in SharePoint Online, I presented an example of how we can work with external data in SharePoint Online without having to write code. The drawback of that solution was that the data was not searchable, since we used an external content type connected to an external list. In this blog post I will present an example of how we can make external data not only available to work with but also searchable in SharePoint Online, with the help of an OData service, a regular SharePoint list and the JavaScript Object Model. What we can do with the data in SharePoint Online depends entirely on which CRUD operations we choose to include in the JavaScript code we write. If we include all the operations, we will be able to read, add, update and delete data in the database.</strong></p>
<p><strong>In the example below I have used an already existing, public OData service. If no such service is available to you, you must create one first. More information on how to create an <a href="https://msdn.microsoft.com/en-us/data/gg601462.aspx" target="_blank">OData service</a> is available via the link. I have also created a regular "Custom List" named "List 1" in my SharePoint Online site, and added a column named "Description" in addition to the existing "Title" column. If you have a scenario where the external database can also be updated from somewhere other than your SharePoint Online list, you will also need something that updates your SharePoint list, for example a <a href="https://blogs.msdn.microsoft.com/richard_dizeregas_blog/2014/04/07/sharepoint-timer-jobs-running-as-windows-azure-web-jobs/." target="_blank">Windows Azure Web Job</a>.</strong></p>
<p>&nbsp;</p>
<p>Choose "File", "New", "Project" in Visual Studio and then select "SharePoint Add-In". Give the project a name and choose SharePoint-hosted.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddSharePointAddIn.png"><img class="size-full wp-image-1453 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddSharePointAddIn.png" alt="AddSharePointAddIn" width="605" height="440" /></a></p>
<p>In Visual Studio we now open the Scripts folder and click App.js.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AppJs.png"><img class="wp-image-1458 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AppJs.png" alt="AppJs" width="334" height="497" /></a></p>
<p>&nbsp;</p>
<p>Add the code snippets below to App.js. Note that in the example below we have only chosen a Read operation, so if you add, edit or delete items, nothing will happen in the database.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/JavaScript.png"><img class="wp-image-1459 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/JavaScript.png" alt="JavaScript" width="653" height="401" /></a></p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/JavaScript01.png"><img class="wp-image-1460 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/JavaScript01.png" alt="JavaScript01" width="498" height="445" /></a></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/Javascript03.png"><img class="alignnone size-medium wp-image-4064" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/Javascript03-300x129.png" alt="Javascript03" width="300" height="129" /></a></p>
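<p>As an illustrative sketch (not the App.js code shown in the screenshots above, which uses the JavaScript Object Model): the Read step boils down to fetching entries from the OData service and mapping them to the list&#8217;s "Title" and "Description" columns. Python is used here purely for illustration, and the OData property names are assumptions for a Northwind-style service.</p>

```python
# Sketch of the Read operation's data flow: OData JSON in,
# SharePoint list item fields out. Property names ("CategoryName",
# "Description") are assumptions for a Northwind-style OData service.
import json
from urllib import request

def to_list_items(odata_json):
    """Map an OData v2 payload ({'d': {'results': [...]}}) to
    Title/Description pairs for the SharePoint list."""
    rows = json.loads(odata_json)["d"]["results"]
    return [{"Title": r.get("CategoryName", ""),
             "Description": r.get("Description", "")} for r in rows]

# Fetching the payload would look like (requires network access):
# payload = request.urlopen(
#     "https://services.odata.org/Northwind/Northwind.svc/Categories?$format=json"
# ).read()
# items = to_list_items(payload)
```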
<p>Now we need to give the app permission to the list. Click AppManifest in the Solution Explorer window in Visual Studio and choose "Permissions". In the example below I have given the app full permission to the entire site collection.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/AppManifest.png"><img class="alignnone size-medium wp-image-4065" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/AppManifest-300x257.png" alt="AppManifest" width="300" height="257" /></a></p>
<p>&nbsp;</p>
<p>Click the "Remote Endpoint" tab and add the host URL of the OData service.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/AppManifest02.png"><img class="alignnone size-full wp-image-4067" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/AppManifest02.png" alt="AppManifest02" width="605" height="490" /></a></p>
<p>&nbsp;</p>
<p>Now we deploy our solution, and when the deployment is done we are ready to go into our SharePoint Online list and see that all the items have arrived.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/Solution2Result.png"><img class="alignnone  wp-image-4068" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/Solution2Result.png" alt="Solution2Result" width="524" height="404" /></a></p>
<p>&nbsp;</p>
<p>We can now also go to our search and look for one of our items. I chose to search for Condiments, and below is the result.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/SearchResult.png"><img class="alignnone  wp-image-4076" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/SearchResult.png" alt="SearchResult" width="578" height="381" /></a></p>
<p>&nbsp;</p>
<p>Email: joel.lindefors@comperiosearch.com</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/04/26/extern-data-sokbart-i-sharepoint-online/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>External data in SharePoint Online</title>
		<link>http://blog.comperiosearch.com/blog/2016/04/26/extern-data-i-sharepoint-online/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/04/26/extern-data-i-sharepoint-online/#comments</comments>
		<pubDate>Tue, 26 Apr 2016 12:20:49 +0000</pubDate>
		<dc:creator><![CDATA[Joel Lindefors]]></dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[Azure SQL Server]]></category>
		<category><![CDATA[Business Connectivity Service]]></category>
		<category><![CDATA[Extern data]]></category>
		<category><![CDATA[Extern innehållstyp]]></category>
		<category><![CDATA[Extern lista]]></category>
		<category><![CDATA[External content type]]></category>
		<category><![CDATA[External list]]></category>
		<category><![CDATA[SharePoint Online]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4053</guid>
		<description><![CDATA[In this blog post we will present an example of how to make data from an Azure SQL Server database available to work with in SharePoint Online by creating a Business Connectivity Service, an external content type and an external list. The advantage of this example is that with configuration alone you can connect a database from Azure to SharePoint Online [...]]]></description>
				<content:encoded><![CDATA[<p><strong style="font-style: inherit">In this blog post we will present an example of how to make data from an Azure SQL Server database available to work with in SharePoint Online by creating a Business Connectivity Service, an external content type and an external list. The advantage of this approach is that with configuration alone you can connect a database from Azure to SharePoint Online, and through SharePoint Online you can update the database (depending on which operations we choose to include in the external content type). This is one of the few ways to bring in external data easily without having to write any code yourself. The drawback is that the external data does not become searchable in SharePoint Online. The example below can also be found on MSDN, with the differences that there they also create the database in Azure SQL Server, and that I could not get it to work until I had added a firewall rule to the Azure SQL Server database.</strong></p>
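<p>A quick way to confirm that the firewall rule is in place, before configuring anything in SharePoint, is to test whether the server accepts TCP connections on port 1433 (SQL Server&#8217;s default). A minimal sketch; the server name is an illustrative placeholder.</p>

```python
# Check whether an Azure SQL Server endpoint accepts TCP connections on
# port 1433 -- a quick sanity check for the firewall rule mentioned above.
# The server name in the example is an illustrative placeholder.
import socket

def reachable(host, port=1433, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (placeholder server name):
# print(reachable("yourserver.database.windows.net"))
```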
<p>&nbsp;</p>
<p>Create the Business Connectivity Service by choosing bcs in SharePoint Admin Center and then "Manage connections to online services".</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddBcs2.png"><img class="wp-image-1391 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddBcs2-1024x548.png" alt="AddBcs" width="746" height="399" /></a></p>
<p>&nbsp;</p>
<p>Click "Add". Enter a name for the connection and the service address, which is the address of the SQL Server in Azure.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddBcs011.png"><img class="wp-image-1392 alignleft" src="http://www.comperio.no/wp-content/uploads/2016/04/AddBcs011.png" alt="AddBcs01" width="493" height="318" /></a></p>
<p>&nbsp;</p>
<p>Create a Secure Store Service ID by clicking Secure Store in SharePoint Admin Center and then choosing "New".</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddSecureStoreID.png"><img class="alignnone  wp-image-1395" src="http://www.comperio.no/wp-content/uploads/2016/04/AddSecureStoreID.png" alt="AddSecureStoreID" width="425" height="414" /></a></p>
<p>&nbsp;</p>
<p>Fill in "Application ID", "Display Name" and "Contact E-Mail". Note that the "Application ID" cannot be changed once it has been created.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddSecureStoreID01.png"><img class="alignnone  wp-image-1397" src="http://www.comperio.no/wp-content/uploads/2016/04/AddSecureStoreID01.png" alt="AddSecureStoreID01" width="405" height="230" /></a></p>
<p>&nbsp;</p>
<p>Add the fields needed to access the data in the "Target Application". By default, the fields "Windows User Name" and "Windows Password" are added, with the field types "User Name" and "Password" attached, where "Password" is masked. You can add other fields as well if needed.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddSecureStoreID02.png"><img class="alignnone  wp-image-1400" src="http://www.comperio.no/wp-content/uploads/2016/04/AddSecureStoreID02.png" alt="AddSecureStoreID02" width="676" height="126" /></a></p>
<p>&nbsp;</p>
<p>Add "administrators" and "members" of the "Target Application".</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddSecureStoreID03.png"><img class="alignnone  wp-image-1402" src="http://www.comperio.no/wp-content/uploads/2016/04/AddSecureStoreID03.png" alt="AddSecureStoreID03" width="522" height="159" /></a></p>
<p>&nbsp;</p>
<p>Click OK. You are now back on the "Secure Store Service" page.</p>
<p>Now that we have created the "Target Application", we need to set the credentials that the Secure Store uses to fetch data from the Azure SQL database. Select the application you just created and click "Set Credentials". Use the same user name and password that were used when your Azure SQL Database was created.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/SetCredSecureStoreApp1.png"><img class="wp-image-1407 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/SetCredSecureStoreApp1.png" alt="SetCredSecureStoreApp" width="488" height="365" /></a></p>
<p>&nbsp;</p>
<p>Now we need to change which credentials should be used for our bcs. We previously specified "User's Identity"; now we will change it to "Credentials stored in SharePoint". Click bcs in SharePoint Admin, choose "Manage connections to online services" and select the name you gave your bcs connection. Click "Properties" in the menu.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/BcsProp.png"><img class="size-full wp-image-1409 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/BcsProp.png" alt="BcsProp" width="247" height="180" /></a></p>
<p>&nbsp;</p>
<p>Change to "Use Credentials Stored in SharePoint". Enter the "Secure Store Application ID" that you specified when you created your "Target Application" in the Secure Store.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/BcsProp01.png"><img class="wp-image-1413 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/BcsProp01.png" alt="BcsProp01" width="476" height="348" /></a></p>
<p>&nbsp;</p>
<p>In SharePoint Admin, click bcs and then "Manage BDC Models and External Content Types".</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtCont.png"><img class="size-full wp-image-1415 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtCont.png" alt="AddExtCont" width="604" height="321" /></a></p>
<p>&nbsp;</p>
<p>Choose "Set Metadata Store Permissions" and select the users. At least one user must have full permissions.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/BdcMetaDataStorPerm.png"><img class="wp-image-1416 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/BdcMetaDataStorPerm.png" alt="BdcMetaDataStorPerm" width="452" height="468" /></a></p>
<p>&nbsp;</p>
<p>Now we will create our external content type in SharePoint Designer. Open SharePoint Designer and select your SharePoint Online site. Click "External Content Types" in the left-hand menu and choose "External Content Type" in the top menu, as shown in the image below.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner.png"><img class="size-full wp-image-1417 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner.png" alt="AddExtContSPDesigner" width="604" height="368" /></a></p>
<p>&nbsp;</p>
<p>Give your content type a name and then click "Discover external data sources and define operations".</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner01.png"><img class="size-full wp-image-1420 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner01.png" alt="AddExtContSPDesigner01" width="602" height="232" /></a></p>
<p>&nbsp;</p>
<p>Click the "Add Connection" button.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner02.png"><img class="size-full wp-image-1421 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner02.png" alt="AddExtContSPDesigner02" width="417" height="160" /></a></p>
<p>&nbsp;</p>
<p>Enter the address of your Azure SQL Server and the name of your Azure SQL database. Choose "Connect with Impersonated Custom Identity".</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner03.png"><img class="size-full wp-image-1422 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner03.png" alt="AddExtContSPDesigner03" width="371" height="243" /></a></p>
<p>&nbsp;</p>
<p>When you click OK, a login dialog appears. Enter the user name and password you specified for accessing the Azure SQL database.</p>
<p>Our database now appears in the "Data Source Explorer". Expand "Tables" and right-click the table you want to run CRUD operations against. Choose which operation(s) you want to be able to perform against the table.</p>
<p>&nbsp;</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner04.png"><img class="wp-image-1423 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner04.png" alt="AddExtContSPDesigner04" width="409" height="327" /></a></p>
<p>&nbsp;</p>
<p>In this example I have chosen to create all the operations, and once they have been created we get the window below. Click "Next".</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner05.png"><img class="size-full wp-image-1425 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner05.png" alt="AddExtContSPDesigner05" width="604" height="350" /></a></p>
<p>&nbsp;</p>
<p>In the "Parameters Configuration" window we may get error messages and warnings. In the example below we got a warning; to clear it, we need to tick "Show in Picker". Then click "Next".</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner06.png"><img class="size-full wp-image-1426 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner06.png" alt="AddExtContSPDesigner06" width="604" height="452" /></a></p>
<p>&nbsp;</p>
<p>In the window that has now opened we will add a filter, so we click "Add Filter Parameter". We use CustomerID to set a limit of 2000 items. To set the limit, we click "Click to Add" next to "Filter".</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner07.png"><img class="size-full wp-image-1427 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner07.png" alt="AddExtContSPDesigner07" width="604" height="452" /></a></p>
<p>&nbsp;</p>
<p>We set ”Filter Type” to ”Limit” and click ”OK”.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner08.png"><img class="size-full wp-image-1428 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner08.png" alt="AddExtContSPDesigner08" width="444" height="339" /></a></p>
<p>&nbsp;</p>
<p>Now we set ”Default Value” to 2000 and then click ”Finish”.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner09.png"><img class="size-full wp-image-1429 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner09.png" alt="AddExtContSPDesigner09" width="604" height="454" /></a></p>
<p>&nbsp;</p>
<p>Click the save button to save your external content type to your SharePoint Online site. When we now go back to SharePoint Admin Center and choose BCS and ”Manage BDC Models and External Content Types”, our connection appears there. Click the drop-down menu and choose ”Set Permissions”.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner10.png"><img class="size-full wp-image-1430 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtContSPDesigner10.png" alt="AddExtContSPDesigner10" width="604" height="135" /></a></p>
<p>&nbsp;</p>
<p>Search for ”All Users” and add both ”All Users (Windows)” and ”All Users (Membership)”. Give both the ”Execute” and ”Selectable in Clients” permissions.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/SetExtContPerm.png"><img class="wp-image-1431 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/SetExtContPerm.png" alt="SetExtContPerm" width="473" height="479" /></a></p>
<p>&nbsp;</p>
<p>Before we can create our external list on our SharePoint Online site, we also need to create a firewall rule for our database in Azure, so that Azure allows SharePoint to reach our database. We need to specify a start IP and an end IP. Because the IP addresses in SharePoint Online are dynamic, we need to enter the start and end IP shown below.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AzureFirewallRule1.png"><img class="size-full wp-image-1442 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AzureFirewallRule1.png" alt="AzureFirewallRule" width="605" height="395" /></a></p>
<p>&nbsp;</p>
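<p>If you would rather script the firewall rule than click through the Azure portal, Azure SQL also exposes the ”sp_set_firewall_rule” stored procedure. Run it against the master database; the rule name below is just an example, and the two placeholder addresses must be replaced with the start and end IP shown in the screenshot above:</p><pre class="crayon-plain-tag">-- Server-level firewall rule for Azure SQL (execute in the master database).
-- 'SharePointOnline' is an example name; replace the placeholder addresses
-- with the start and end IP shown above.
EXEC sp_set_firewall_rule
    @name = N'SharePointOnline',
    @start_ip_address = '0.0.0.0',
    @end_ip_address = '0.0.0.0';</pre>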
<p>Now we can go to ”Site Contents” on our SharePoint Online site and choose ”add an app”. Add an external list and select the external content type we created.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/AddExtList.png"><img class="size-full wp-image-1440 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/AddExtList.png" alt="AddExtList" width="604" height="289" /></a></p>
<p>&nbsp;</p>
<p>Give the list a name and click create. You should now be able to see your external items and, depending on which operations you added, edit, delete and add items in the list.</p>
<p><a style="color: #0069d6" href="http://www.comperio.no/wp-content/uploads/2016/04/Solution1Result.png"><img class="wp-image-1439 alignnone" src="http://www.comperio.no/wp-content/uploads/2016/04/Solution1Result-1024x500.png" alt="Solution1Result" width="999" height="487" /></a></p>
<p>&nbsp;</p>
<p>Email: joel.lindefors@comperiosearch.com</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/04/26/extern-data-i-sharepoint-online/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Content Enrichment Web Service SharePoint 2013 &#8211; Advantages and Challenges</title>
		<link>http://blog.comperiosearch.com/blog/2016/04/26/content-enrichment-web-service-sharepoint-2013/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/04/26/content-enrichment-web-service-sharepoint-2013/#comments</comments>
		<pubDate>Tue, 26 Apr 2016 11:23:22 +0000</pubDate>
		<dc:creator><![CDATA[Mridu Agarwal]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[CEWS]]></category>
		<category><![CDATA[Content Enrichment Web Service]]></category>
		<category><![CDATA[FAST Search for SharePoint]]></category>
		<category><![CDATA[SharePoint 2013 Search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4017</guid>
		<description><![CDATA[If you have worked with search solutions before, you will know that very often there is a need to process data before it can be displayed in search results. This processing might be required to address some of(but not limited to) these common issues: Missing metadata issues Inconsistent metadata issues Cleansing of content Integration of semantic [...]]]></description>
				<content:encoded><![CDATA[<p>If you have worked with search solutions before, you will know that very often there is a need to process data before it can be displayed in search results. This processing might be required to address some of (but not limited to) these common issues:</p>
<ul>
<li>Missing metadata issues</li>
<li>Inconsistent metadata issues</li>
<li>Cleansing of content</li>
<li>Integration of semantic layers/Automatic tagging</li>
<li>Integration with 3rd party service</li>
<li>Merging data from other sources</li>
</ul>
<p><strong>Content Enrichment Web Service</strong> in SharePoint 2013 is a SOAP-based service within the content processing component that can be used to achieve this. The figure below shows a part of the process that takes place in the content processing component of SharePoint search. <img src="https://i-msdn.sec.s-msft.com/dynimg/IC618173.gif" alt="Content enrichment within content processing" width="481" height="286" /></p>
<p>Content Enrichment Web Service in SharePoint 2013 combines the goodness of both <strong>FAST for SharePoint Search</strong> and <strong>SharePoint Search</strong> to offer a whole new set of possibilities, though it has its own challenges. To see an implementation example, check the <a href="https://msdn.microsoft.com/en-us/library/office/jj163982.aspx">MSDN link</a>, which pretty much sums up the basic steps. In this post we are going to look at some of the advantages and challenges of CEWS, coming from a FAST 2010 background:</p>
<p>1. <strong>CEWS is a service and you DON&#8217;T have to deploy it in your SharePoint environment</strong>: Perhaps this is the biggest architectural change from the content processing perspective. What this means is that your code no longer runs in a sandbox environment within <strong>SharePoint Server</strong>. The web service can be hosted anywhere outside your SharePoint server, reducing deployment headaches and the huge number of approvals required to deploy executable files. I can see operations/infrastructure teams/administrators smiling.</p>
<p>2. <strong>The web service processes and returns managed properties, not crawled properties:</strong> Managed properties correspond to what actually gets indexed and displayed in search results. This removes some of the confusion about why you can&#8217;t see the updated results (perhaps you had forgotten to map your crawled property to a managed property, and now you will have to index it AGAIN. Nightmare!).</p>
<p>3. <strong>You can define a trigger to limit the set of items that are processed by the web service:</strong> In FAST 2010, every item had to pass through the pipeline whether you wanted to process it or not, and this check had to be done in the code. A trigger in 2013 lets us define the check outside the code, so that the web service is called only for selected content. If you only want to process a subset of the content, this optimizes overall performance and improves crawling time.</p>
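<p>To make the trigger concrete: the CEWS configuration, including the trigger expression, is registered with PowerShell, following the pattern of the MSDN walkthrough linked above. The sketch below is only a shape, not a drop-in script; the endpoint URL, property names and trigger expression are placeholders:</p><pre class="crayon-plain-tag"># Sketch only: endpoint, properties and trigger expression are placeholders
$ssa = Get-SPEnterpriseSearchServiceApplication
$config = New-SPEnterpriseSearchContentEnrichmentConfiguration
$config.Endpoint = "http://cewshost:818/ContentProcessingEnrichmentService.svc"
$config.InputProperties = "Author", "Filename"
$config.OutputProperties = "Author"
# Only items matching the trigger are sent to the web service
$config.Trigger = 'IsNull("Author")'
Set-SPEnterpriseSearchContentEnrichmentConfiguration -SearchApplication $ssa -ContentEnrichmentConfiguration $config</pre>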
<blockquote><p>So far, so good! But there are certain challenges we need to look at and see how we can overcome them. In fact, this is the most important part when you are architecting your CEWS solution:</p></blockquote>
<p>1. <strong>The content enrichment callout step can only be configured with a single web service endpoint:</strong> Now this sounds very limiting. I have multiple search applications, and earlier I maintained the logic in different solutions. Do I need to combine them all into a single service? What about maintenance and change requests? There are several possible technologies one could consider to solve this, but what I did in my project was to create a WCF routing service and let it route to my multiple web services based on filters. You could also use it to implement load balancing and fault tolerance. In the following example, I have two content sources &#8220;xmlfile&#8221; and &#8220;EpiFileShare&#8221;, and I want two different services, &#8220;xmlsvc&#8221; and &#8220;episvc&#8221;, to process these different sources. This is how I configure the endpoints in my WCF routing service:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/01/router.png"><img class="aligncenter  wp-image-4027" src="http://blog.comperiosearch.com/wp-content/uploads/2016/01/router-1024x278.png" alt="endpoints" width="708" height="192" /></a></p>
<p>2. <strong>Only one condition can be configured for the trigger, but different search applications will require different triggers:</strong> This can again be solved by using WCF routers and filters and configuring separate endpoints for separate triggers. Here I am using the default managed property &#8220;ContentSource&#8221; as a trigger/filter to determine my service endpoint.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/01/rouyer.png"><img class="aligncenter wp-image-4025 " src="http://blog.comperiosearch.com/wp-content/uploads/2016/01/rouyer-1024x286.png" alt="config file" width="737" height="206" /></a></p>
<p>To summarize, I have shown some of the advantages and challenges of the new CEWS architecture in SharePoint 2013 search and how you can overcome them. I hope you now want to try this out soon and share your experience with us.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/04/26/content-enrichment-web-service-sharepoint-2013/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ELK stack deployment with Ansible</title>
		<link>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/#comments</comments>
		<pubDate>Thu, 26 Nov 2015 09:59:38 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[ansible]]></category>
		<category><![CDATA[deployment]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[elk]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3999</guid>
		<description><![CDATA[As human beings, we like to believe that each and every one of us is a special individual, and not easily replaceable. That may be fine, but please, don’t fall into the habit of treating your computer the same way. Ansible is a free software platform for configuring and managing computers, and I’ve been using [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright" src="http://www.ansible.com/hs-fs/hub/330046/file-767051897-png/Official_Logos/ansible_circleA_red.png?t=1448391213471" alt="" width="251" height="251" />As human beings, we like to believe that each and every one of us is a special individual, and not easily replaceable. That may be fine, but please, don’t fall into the habit of treating your computer the same way.</p>
<p><span id="more-3999"></span></p>
<p><a href="https://en.wikipedia.org/wiki/Ansible_(software)"><b>Ansible</b> </a>is a <a href="https://en.wikipedia.org/wiki/Free_software">free software</a> platform for configuring and managing computers, and I&#8217;ve been using it a lot lately to manage the ELK stack: Elasticsearch, Logstash and Kibana.</p>
<p>I can define a list of servers I want to manage in a YAML config file &#8211; the so-called inventory:</p><pre class="crayon-plain-tag">[elasticsearch-master]
es-master1.mydomain.com
es-master2.mydomain.com
es-master3.mydomain.com

[elasticsearch-data]
elk-data1.mydomain.com
elk-data2.mydomain.com
elk-data3.mydomain.com

[logstash]
logstash.mydomain.com

[kibana]
kibana.mydomain.com</pre><p>And define the roles for the servers in another YAML config file &#8211; the so-called playbook:</p><pre class="crayon-plain-tag">- hosts: elasticsearch-master
  roles:
    - ansible-elasticsearch

- hosts: elasticsearch-data
  roles:
    - ansible-elasticsearch

- hosts: logstash
  roles:
    - ansible-logstash

- hosts: kibana
  roles:
    - ansible-kibana</pre><p>&nbsp;</p>
<p>Each group of servers may have their own files containing configuration variables.</p><pre class="crayon-plain-tag">elasticsearch_version: 2.1.0
elasticsearch_node_master: false
elasticsearch_heap_size: 1000m</pre><p>&nbsp;</p>
<p>Ansible is used for configuring the ELK stack vagrant box at <a href="https://github.com/comperiosearch/vagrant-elk-box-ansible">https://github.com/comperiosearch/vagrant-elk-box-ansible</a>, which was recently upgraded with Elasticsearch 2.1, Kibana 4.3 and Logstash 2.1</p>
<p>The same set of Ansible roles can be applied when the configuration needs to move into production, by applying another set of variable files with modified host names, certificates and such. There are several possible ways to do this.</p>
<p><b>How does it work?</b></p>
<p>Ansible is agent-less. This means you do not install anything (an agent) on the machines you control. Ansible only needs to be installed on the controlling machine (Linux/OS X) and connects to the managed machines using SSH (there is even some support for Windows). The only requirement on the managed machines is Python.</p>
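<p>A typical session then looks like the following. I am assuming the inventory and playbook above are saved as &#8220;inventory&#8221; and &#8220;site.yml&#8221; (the post itself does not name the files):</p><pre class="crayon-plain-tag"># Install Ansible on the controlling machine only (for example via pip)
pip install ansible

# Verify SSH connectivity to every host in the inventory
ansible -i inventory all -m ping

# Dry-run first, then apply the roles
ansible-playbook -i inventory site.yml --check
ansible-playbook -i inventory site.yml</pre>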
<p>Happy ansibling!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analysing Solr logs with Logstash</title>
		<link>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/#comments</comments>
		<pubDate>Sun, 20 Sep 2015 22:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[grok]]></category>
		<category><![CDATA[logs]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3934</guid>
		<description><![CDATA[Analysing Solr logs with Logstash Although I usually write about and work with Apache Solr, I also use the ELK stack on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my previous posts. If you need some more background info on the ELK [...]]]></description>
				<content:encoded><![CDATA[<h1>Analysing Solr logs with Logstash</h1>
<p>Although I usually write about and work with <a href="http://lucene.apache.org/solr/">Apache Solr</a>, I also use the <a href="https://www.elastic.co/downloads">ELK stack</a> on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my <a href="http://blog.comperiosearch.com/blog/author/sebm/">previous posts</a>. If you need some more background info on the ELK stack, both <a href="http://blog.comperiosearch.com/blog/author/cvig/">Christoffer</a> and <a href="http://blog.comperiosearch.com/blog/author/alynum/">André</a> have written many great posts on various ELK subjects. The most common use for the stack is data analysis. In our case, Solr search log analysis.</p>
<p>As a little side note for the truly devoted Solr users, an ELK stack alternative exists with <a href="http://lucidworks.com/fusion/silk/">SiLK</a>. I highly recommend checking out Lucidworks&#8217; various blog posts on <a href="http://lucidworks.com/blog/">Solr and search in general</a>.</p>
<h2>Some background</h2>
<p>On an existing search project I use the ELK stack to ingest, analyse and visualise logs from Comperio&#8217;s search middleware application.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2088/09/search_logs.png"><img class="aligncenter size-medium wp-image-3942" src="http://blog.comperiosearch.com/wp-content/uploads/2088/09/search_logs-300x157.png" alt="Search Logs Dashboard" width="300" height="157" /></a><br />
Although this gave us a great view of user query behaviour, Solr logs a great deal more detailed information. I wanted to log indexing events, errors, and searches with all their parameters, in addition to just the query string.</p>
<h2>Let&#8217;s get started</h2>
<p>I&#8217;m going to assume you already have a running Solr installation. You will, however, need to download <a href="https://www.elastic.co/products/elasticsearch">Elasticsearch</a> and <a href="https://www.elastic.co/products/logstash">Logstash</a> and unpack them. Before we start Elasticsearch, I recommend installing these plugins:</p>
<ul>
<li><a href="http://mobz.github.io/elasticsearch-head/">Head</a></li>
<li><a href="https://www.elastic.co/guide/en/marvel/current/_installation.html">Marvel</a></li>
</ul>
<p>Head is a cluster health monitoring tool. Marvel we&#8217;ll only need for the bundled developer console, Sense. To disable Marvel&#8217;s other capabilities, add this line to ~/elasticsearch/config/elasticsearch.yml</p><pre class="crayon-plain-tag">marvel.agent.enabled: false</pre><p>Start elasticsearch with this command:</p><pre class="crayon-plain-tag">~/elasticsearch-[version]/bin/elasticsearch</pre><p>Navigate to <a href="http://localhost:9200/">http://localhost:9200/</a> to confirm that Elasticsearch is running. Check <a href="http://localhost:9200/_plugin/head">http://localhost:9200/_plugin/head</a> and <a href="http://localhost:9200/_plugin/marvel/sense/index.html">http://localhost:9200/_plugin/marvel/sense/index.html</a> to verify the plugins installed correctly.</p>
<h2>The anatomy of a Logstash config</h2>
<hr />
<h3>Update 21/09/15</h3>
<p>I have since greatly simplified the multiline portions of the Logstash configs. Use instead this filter section: <script src="https://gist.github.com/41ca2c34c50d0d9d8e82.js?file=solr-filter.conf"></script>The rest of the original article contents are unchanged for comparison&#8217;s sake.</p>
<hr />
<p>All Logstash configs share three main building blocks. It starts with the Input stage, which defines what the data source is and how to access it. Next is the Filter stage, which carries out data processing and extraction. Finally, the Output stage tells Logstash where to send the processed data. Let&#8217;s start with the basics, the input and output stages:</p><pre class="crayon-plain-tag">input {
  file {
    path =&gt; "~/solr.log"
  }
}

filter {}

output {
  # Send directly to local Elasticsearch
  elasticsearch_http {
    host =&gt; "localhost"
    template =&gt; "~/logstash/bin/logstash_solr_template.json"
    index =&gt; "solr-%{+YYYY.MM.dd}"
    template_overwrite =&gt; true
  }
}</pre><p>This is one of the simpler input/output configs. We read a file at a given location and stream its raw contents to an Elasticsearch instance. Take a look at the <a href="https://www.elastic.co/guide/en/logstash/current/input-plugins.html">input</a> and <a href="https://www.elastic.co/guide/en/logstash/current/output-plugins.html">output</a> plugins&#8217; documentation for more details and default values. The index setting causes Logstash to create a new index every day with a name generated from the provided pattern. The template option tells Logstash what kind of field mapping and settings to use when creating the Elasticsearch indices. You can find the template file I used <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-template-json">here</a>.</p>
<p>To process the Solr logs, we&#8217;ll use the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html">grok</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html">mutate</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-multiline.html">multiline</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-drop.html">drop</a> and <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html">kv</a> filter plugins.</p>
<ul>
<li>Grok is a regexp-based parsing stage primarily used to match strings and extract parts. There are a number of default patterns described on the grok documentation page. While building your grok expressions, the <a href="https://grokdebug.herokuapp.com/">grok debugger app</a> is particularly helpful. Be mindful, though, that some of the escaping syntax isn&#8217;t always the same in the app as what the Logstash config expects.</li>
<li>We need the multiline plugin to link stacktraces to their initial error message.</li>
<li>The kv, aka key value, plugin will help us extract the parameters from Solr indexing and search events.</li>
<li>We use mutate to add and remove tags along the way.</li>
<li>And finally, drop to drop any events we don&#8217;t want to keep.</li>
</ul>
<p>&nbsp;</p>
<h2>The <del>hard</del> fun part</h2>
<p>Let&#8217;s dive into the filter stage now. Take a look at the <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-logstash-conf">config file</a> I&#8217;m using. The Grok patterns may appear a bit daunting, especially if you&#8217;re not very familiar with regexp and the default Grok patterns, but don&#8217;t worry! Let&#8217;s break it down.</p>
<p>The first section extracts the log event&#8217;s severity and timestamp into their own fields, &#8216;level&#8217; and &#8216;LogTime&#8217;:</p><pre class="crayon-plain-tag">grok {
    match =&gt; { "message" =&gt; "%{WORD:level}.+?- %{DATA:LogTime};" }
      tag_on_failure =&gt; []
  }</pre><p>So, given this line from my <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-log">example log file</a></p><pre class="crayon-plain-tag">INFO  - 2015-09-07 15:40:34.535; org.apache.solr.update.processor.LogUpdateProcessor; [sintef_main] webapp=/ path=/update/extract params={literal.source=epifile&amp;literal.epi_file_title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.id=epifile_211278&amp;literal.epifileid_s=211278&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;literal.filesource_s=SiteFile} {} 0 65</pre><p>We&#8217;d extract</p><pre class="crayon-plain-tag">{ "level": "INFO", "LogTime":"2015-09-07 15:40:34.535"}</pre><p>In the template file I linked earlier, you&#8217;ll notice configuration for the LogTime field. Here we define for Elasticsearch a valid DateTime format. We need to do this so that Kibana recognises the field as one we can use for temporal analyses. Otherwise the only timestamp field we&#8217;d have would contain the time at which the logs were processed and stored in Elasticsearch. Although not a problem in a realtime log analyses system, if you have old logs you want to parse you will need to define this separate timestamp field. As an additional sidenote, you&#8217;ll notice I use</p><pre class="crayon-plain-tag">tag_on_failure =&gt; []</pre><p>in most of my Grok stages. The default value is &#8220;_grokparsefailure&#8221;, which I don&#8217;t need in a production system. Custom failure and success tags are very helpful to debug your Logstash configs.</p>
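<p>Under the hood a grok pattern is just a named-group regex. As a sanity check, here is a rough Python equivalent of the severity/timestamp grok above (the pattern is hand-translated, not generated by grok: WORD becomes \w+ and DATA becomes a lazy .*?):</p>

```python
import re

# Hand-translated equivalent of "%{WORD:level}.+?- %{DATA:LogTime};"
pattern = re.compile(r"(?P<level>\w+).+?- (?P<LogTime>.*?);")

line = ("INFO  - 2015-09-07 15:40:34.535; "
        "org.apache.solr.update.processor.LogUpdateProcessor; [sintef_main]")
m = pattern.match(line)
print(m.group("level"))    # INFO
print(m.group("LogTime"))  # 2015-09-07 15:40:34.535
```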
<p>The next little section combines commit messages into a single line. The first event in the example log file is an example of such commit messages split over three lines.</p><pre class="crayon-plain-tag"># Combine commit events into single message
  multiline {
      pattern =&gt; "^\t(commit\{)"
      what =&gt; "previous"
    }</pre><p>Now we come to a major section for handling general INFO level messages.</p><pre class="crayon-plain-tag"># INFO level events treated differently than ERROR
  if "INFO" in [level] {
    grok {
      match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\}.+?\{%{WORD:action}=\[%{DATA:docId}"
        }
        tag_on_failure =&gt; []  
    }
    if [params] {
      kv {
        field_split =&gt; "&amp;"
        source =&gt; "params"
      }
    } else {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?commits"  
        }
        tag_on_failure =&gt; [ "drop" ]
        add_field =&gt; {
          "action" =&gt; "commit"
        }
      }
      if "drop" in [tags] {
        drop {}
      }
    }
  }</pre><p>This filter will only run on INFO level messages, due to the conditional at its beginning. The first Grok stage matches log events similar to the one above. The key fields we extract are the Solr collection/core, the endpoint we hit (e.g. update/extract), the parameters supplied by the HTTP request, the action taken (e.g. add or delete) and finally the document ID. If the Grok succeeded in extracting a params field, we run the key value stage, splitting on ampersand to extract each HTTP parameter. This is what a resulting document&#8217;s extracted contents look like when stored in Elasticsearch:</p><pre class="crayon-plain-tag">{
  "level": "INFO",
  "LogTime": "2015-09-07 15:40:18.938",
  "collection": "sintef_main",
  "endpoint": "/update/extract",
  "params":     "literal.source=epifile&amp;literal.epi_file_title=A05100_Tass5+Trondheim.pdf&amp;literal.title=A05100_Tass5+Trondheim.pdf&amp;literal.id=epifile_211027&amp;literal.epifileid_s=211027&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;literal.filesource_s=SiteFile",
  "action": "add",
  "docId": "epifile_211027",
  "version": "1511661994131849216",
  "literal.source": "epifile",
  "literal.epi_file_title": "A05100_Tass5+Trondheim.pdf",
  "literal.title": "A05100_Tass5+Trondheim.pdf",
  "literal.id": "epifile_211027",
  "literal.epifileid_s": "211027",
  "literal.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "stream.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "literal.filesource_s": "SiteFile"
}</pre><p>If the Grok did not extract a params field, I want to identify possible commit messages with the following Grok. If this one fails we tag messages with &#8220;drop&#8221;. Finally, any messages tagged with &#8220;drop&#8221; are dropped from the pipeline. I specifically created these Grok patterns to match indexing and commit messages as I already track queries at the middleware layer in our stack. If you want to track queries at the Solr level, simply use this pattern:</p><pre class="crayon-plain-tag">.+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\} hits=%{INT:hits} status=%{INT:status} QTime=%{INT:queryTime}</pre><p>The next section handles ERROR level messages:</p><pre class="crayon-plain-tag"># Error event implies stack trace, which requires multiline parsing
  if "ERROR" in [level] {
    multiline {
      pattern =&gt; "^\s"
      what =&gt; "previous"
      add_tag =&gt; [ "multiline_pre" ]
    }
    multiline {
        pattern =&gt; "^Caused by"
        what =&gt; "previous"
        add_tag =&gt; [ "multiline_post" ]
    }
    if "multiline_post" in [tags] {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+%{DATA:reason}(\n\t)((.+?Caused by: ((([a-zA-Z]+(\.|;|:))+) )+)%{DATA:reason}(\n\t))+"
        }
        tag_on_failure =&gt; []
      }
    }
  }</pre><p>Given a stack trace (there are a few in the example log file), this stage first combines all the lines of the stack trace into a single message. It then extracts the first and the last causes, the assumption being that the first message is the high-level failure message and the last one the actual underlying cause.</p>
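<p>Looking back at the INFO branch for a moment: the kv stage with field_split set to &#8220;&amp;&#8221; is conceptually just the split sketched below in Python. The kv plugin itself does not URL-decode values; that step is added here only so the output is readable:</p>

```python
from urllib.parse import unquote_plus

# A shortened version of the params string extracted by the grok stage
params = "literal.source=epifile&literal.id=epifile_211027&literal.filesource_s=SiteFile"

fields = {}
for pair in params.split("&"):           # field_split => "&"
    key, _, value = pair.partition("=")  # kv's default value split is "="
    fields[key] = unquote_plus(value)    # decoding added for readability only
print(fields["literal.id"])  # epifile_211027
```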
<p>Finally, I drop any empty lines and clean up temporary tags:</p><pre class="crayon-plain-tag"># Remove intermediate tags, and multiline added randomly by multiline stage
  mutate {
      remove_tag =&gt; [ "multiline_pre", "multiline_post", "multiline" ]
  }
  # Drop empty lines
  if [message] =~ /^\s*$/ {
    drop {}
  }</pre><p>To check you have successfully processed your Solr logs, open up the Sense plugin and run this query:</p><pre class="crayon-plain-tag"># aggregate on level
GET solr-*/_search
{
  "query": {
    "match_all": {}
  },
  "size": 10,
  "aggs": {
    "action": {
      "terms": {
        "field": "level",
        "size": 10
      }
    }
  }
}</pre><p>You should get back all your processed log events along with an aggregation on event severity.</p>
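<p>Conceptually, the terms aggregation in that query is just a frequency count over the level field. In Python terms (the sample values below are made up, not taken from the example log):</p>

```python
from collections import Counter

# Hypothetical "level" values, standing in for the indexed log events
levels = ["INFO", "INFO", "ERROR", "INFO", "WARN"]

# Equivalent of a terms aggregation on the "level" field
buckets = Counter(levels).most_common()
print(buckets[0])  # ('INFO', 3)
```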
<h2>Conclusion</h2>
<p>Solr logs contain a great deal of useful information. With the ELK stack you can extract, store, analyse and visualise this data. I hope I&#8217;ve given you some helpful tips on how to start doing so! If you run into any problems, please get in touch in the comments below.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Elasticsearch: Shield protected Kibana with Active Directory</title>
		<link>http://blog.comperiosearch.com/blog/2015/08/21/elasticsearch-security-shield/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/08/21/elasticsearch-security-shield/#comments</comments>
		<pubDate>Fri, 21 Aug 2015 14:26:45 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[enterprise]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3245</guid>
		<description><![CDATA[Elasticsearch easily stores terabytes of data, but how can you make sure users only see the data they should? This post will explore how to use Shield, a plugin for Elasticsearch, to authenticate users with Active Directory. Elasticsearch will by default allow anyone access to all data. The Shield plugin allows locking down Elasticsearch using authentication [...]]]></description>
				<content:encoded><![CDATA[<p>Elasticsearch easily stores terabytes of data, but how can you make sure users only see the data they should? This post will explore how to use Shield, a plugin for Elasticsearch, to authenticate users with Active Directory.</p>
<p><span id="more-3245"></span><br />
<a title="NO TRESPASSING" href="https://www.flickr.com/photos/mike2099/2058021162/in/photolist-48RTZu-4ttdcn-4YPqqU-5WbRAP-8rYugF-XsCao-ftZ1hL-dpmFB-dqyeUE-bjV3VY-bEMba3-bEMb6w-84YCqg-rf5Yk1-8Yjaj3-chg68s-4KDN1M-4KDMWF-5MfWjA-tCJt6J-8nxBiZ-6YsUyh-KfDRK-54uLmy-bv1Pv-oChdLk-pL3X8t-4RTTjd-dhfUPn-cEkCFY-czjXiE-m1zThD-dzESFD-oj2KUM-c16MV-72dTxS-g4Yky4-kK9YR-p6DYnY-5HJvrX-8aovPQ-dhfVkP-bwB8c-gFzTXk-7zd9iF-eua6KC-2gzEc-8nxtcH-2gzEb-fnp3zH" data-flickr-embed="true"><img src="https://farm3.staticflickr.com/2059/2058021162_ed7b6e8d72_b.jpg" alt="NO TRESPASSING" width="600" /></a><script src="//embedr.flickr.com/assets/client-code.js" async="" charset="utf-8"></script></p>
<p>Elasticsearch will by default allow anyone access to all data. The <a href="https://www.elastic.co/guide/en/shield/current/introduction.html">Shield</a> plugin allows locking down Elasticsearch using authentication from the internal esusers realm, Active Directory (AD) or LDAP. Using AD, you can map groups defined in your Windows domain to roles in Elasticsearch. For instance, you can allow people in the Fishery department access only to fish-indexes, and give complete control to anyone in the IT department.</p>
<p>To use Shield in production you have to buy an Elasticsearch subscription; however, you get a 30-day trial when you install the license manager. So let&#8217;s hurry up and see how this works out in Kibana.</p>
<p>&nbsp;</p>
<p>In this post, we will install Shield and connect to Active Directory (AD) for authentication. After having made sure we can authenticate with AD, we will add SSL encryption everywhere possible. We will add authentication for the Kibana server using the built-in authentication realm esusers, and if time allows at the end, we will create two user groups, each with access to its own index, and check how it all looks when accessed in Kibana 4.</p>
<p>&nbsp;</p>
<h3>Prerequisites</h3>
<p>You will need a previously installed Elasticsearch and Kibana. The most recent versions should work; I have used Elasticsearch 1.7 and Kibana 4.1.1. If you need a machine to test on, I can personally recommend the vagrant-elk-box you can find <a href="https://github.com/comperiosearch/vagrant-elk-box-ansible">here</a>. <strong>The following guide assumes the file locations of the vagrant-elk-box</strong>; if you install differently, you will probably know where to look. Ask an adult for help.</p>
<p>For Active Directory, you need to be on a domain that uses Active Directory. That would probably mean some kind of Windows work environment.</p>
<p>&nbsp;</p>
<h4>Installing Shield</h4>
<p>If you&#8217;re on the vagrant box you should begin the lesson by entering the vagrant box using the commands</p><pre class="crayon-plain-tag">vagrant up
vagrant ssh</pre><p>&nbsp;</p>
<p>Install the license manager</p><pre class="crayon-plain-tag"> sudo /usr/share/elasticsearch/bin/plugin -i elasticsearch/license/latest</pre><p>Install Shield</p><pre class="crayon-plain-tag"> sudo /usr/share/elasticsearch/bin/plugin -i elasticsearch/shield/latest</pre><p>Restart elasticsearch. (service elasticsearch restart)</p>
<p>Check out the logs; you should find some information regarding when your Shield license will expire (logfile location: /var/log/elasticsearch/vagrant-es.log).</p>
<h4>Integrating Active Directory</h4>
<p>The next step involves figuring out a thing or two about your Active Directory configuration. First of all, you need to know its address. On your Windows machine, open cmd.exe and type</p><pre class="crayon-plain-tag">set LOGONSERVER</pre><p>The name of your AD should pop back. Add a section similar to the following to the elasticsearch.yml file (at /etc/elasticsearch/elasticsearch.yml)</p><pre class="crayon-plain-tag">shield.authc.realms:
  active_directory:
    type: active_directory
    domain_name: superdomain.com
    unmapped_groups_as_roles: true
    url: ldap://ad.superdomain.com</pre><p>Type the address of your AD into the url: field (where it says url: ldap://ad.superdomain.com). If your logon server is ad.cnn.com, you should type in url: ldap://ad.cnn.com</p>
<p>Also, you need to figure out your domain name and type it in correctly.</p>
<p>NB: Be careful with the indentation! Elasticsearch cares a lot about correct indentation, and may even refuse to start without telling you why if you make a mistake.</p>
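<p>Since a stray tab or an odd indent can keep Elasticsearch from starting, it can be worth running a quick sanity check over the snippet before restarting. A minimal sketch in Python (not a full YAML validator, just a tab/indent check; the snippet below is an example, not your real config):</p>

```python
def yaml_indent_problems(text):
    """Flag lines likely to upset a YAML parser: tabs in the
    indentation, or indentation that is not a multiple of two spaces."""
    problems = []
    for no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are fine
        leading = line[: len(line) - len(line.lstrip())]
        if "\t" in leading:
            problems.append((no, "tab character in indentation"))
        elif len(leading) % 2 != 0:
            problems.append((no, "indentation is not a multiple of two spaces"))
    return problems

snippet = """shield.authc.realms:
  active_directory:
    type: active_directory
   domain_name: superdomain.com
"""
print(yaml_indent_problems(snippet))
```

<p>Here the check flags line 4 of the snippet, whose three-space indent is exactly the kind of mistake that makes Elasticsearch refuse to start.</p>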
<h5>Finding the correct name for the Active Directory group</h5>
<p>The next step involves figuring out the name of the group you wish to grant access to. You may have called your group &#8220;Fishermen&#8221;, but that is probably not exactly what it&#8217;s called in AD.</p>
<p>Microsoft has a very simple and nice tool called <a href="https://technet.microsoft.com/en-us/library/bb963907.aspx">Active Directory Explorer</a>. Open the tool and enter the address you just found from the LOGONSERVER (remember? it&#8217;s only 10 lines above).</p>
<p>You may have to click and explore a little to find the group you want. Once you find it, you need the value of the &#8220;distinguishedName&#8221; attribute. You can double-click it and copy the value out from the &#8220;Object&#8221; window.</p>
<p>This is an example from my AD</p><pre class="crayon-plain-tag">CN=Rolle IT,OU=Groups,OU=Oslo,OU=Comperiosearch,DC=comperiosearch,DC=com</pre><p>Now this value represents a group which we want to map to a role in elasticsearch.</p>
<p>Open the file /etc/elasticsearch/shield/role-mapping.yml. It should look similar to this</p><pre class="crayon-plain-tag"># Role mapping configuration file which has elasticsearch roles as keys
# that map to one or more user or group distinguished names

#roleA:   this is an elasticsearch role
#  - groupA-DN  this is a group distinguished name
#  - groupB-DN
#  - user1-DN   this is the full user distinguished name
power_user:
  - "CN=Rolle IT,OU=Groups,OU=Oslo,OU=Comperiosearch,DC=comperiosearch,DC=com"
#user:
# - "cn=admins,dc=example,dc=com" 
# - "cn=John Doe,cn=other users,dc=example,dc=com"</pre><p>I have uncommented the line with &#8220;power_user:&#8221; and added a line below containing the distinguishedName from above.</p>
<p>By restarting elasticsearch, anyone in the &#8220;Rolle IT&#8221; group should now be able to log in (and nobody else (yet)).</p>
<p>To test it out, open <a href="http://localhost:9200">http://localhost:9200</a> in your browser. You should be presented with a login box where you can type in your username/password. In case of failure, check out the elasticsearch logs (at /var/log/elasticsearch/vagrant-es.log).</p>
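<p>You can run the same check from a script instead of the browser. The sketch below only builds a request carrying an HTTP Basic auth header (the same credentials the login box sends); the username and password are placeholders, and actually sending the request of course requires your Elasticsearch instance to be up:</p>

```python
import base64
from urllib.request import Request, urlopen  # urlopen(req) would send it

def basic_auth_request(url, username, password):
    """Build a request carrying HTTP Basic auth credentials, which is
    what Shield expects on the Elasticsearch HTTP endpoint."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    req = Request(url)
    req.add_header("Authorization", f"Basic {token}")
    return req

req = basic_auth_request("http://localhost:9200", "jane", "secret")
print(req.get_header("Authorization"))
```

<p>A 200 response means the realm authenticated you; a 401 means Shield rejected the credentials, and the Elasticsearch log will usually tell you why.</p>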
<p>If you were able to log in, that means Active Directory authentication works. Congratulations! You deserve a refreshment. Some strong coffee will go down well with the next sections, where we add encrypted communications everywhere we can.</p>
<h3>SSL  - Elasticsearch</h3>
<p>Authentication and encrypted communication go hand in hand. Without SSL, username and password are transferred in plaintext on the wire. For this demo we will use self-signed certificates. Keytool comes with Java, and is used to handle certificates for Elasticsearch. The following command will generate a self-signed certificate and put it in a JKS file named self-signed.jks (swap out $password with your preferred password)</p><pre class="crayon-plain-tag">keytool -genkey -keyalg RSA -alias selfsigned -keystore self-signed.jks -keypass $password -storepass $password -validity 360 -keysize 2048 -dname "CN=localhost, OU=orgUnit, O=org, L=city, S=state, C=NO"</pre><p>Copy the certificate into /etc/elasticsearch/</p>
<p>Modify  /etc/elasticsearch/elasticsearch.yml by adding the following lines:</p><pre class="crayon-plain-tag">shield.ssl.keystore.path: /etc/elasticsearch/self-signed.jks
shield.ssl.keystore.password: $password
shield.ssl.hostname_verification: false
shield.transport.ssl: true
shield.http.ssl: true</pre><p>(use the same password as you used when creating the self-signed certificate)</p>
<p>Restart Elasticsearch again, and watch the logs for failures.</p>
<p>Try to open https://localhost:9200 in your browser (NB: httpS not http)</p>
<div id="attachment_3905" style="width: 310px" class="wp-caption alignright"><img class="wp-image-3905 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2015/08/your-connection-is-not-private-e1440146932126-300x181.png" alt="your connection is not private" width="300" height="181" /><p class="wp-caption-text">https://localhost:9200</p></div>
<p>You should see a screen warning you that something is wrong with the connection. This is a good sign! It means your certificate is actually working! For production use you could use your own CA or buy a proper certificate, both of which will avoid the ugly warning screen.</p>
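<p>One consequence of the self-signed certificate: any script talking to https://localhost:9200 must be told to relax verification, just as the browser asked you to. A Python sketch of the client-side counterpart of shield.ssl.hostname_verification: false (testing only; production clients should verify a proper certificate):</p>

```python
import ssl

def self_signed_context():
    """SSL context that tolerates a self-signed server certificate.
    check_hostname must be disabled before verify_mode is relaxed."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # skip hostname verification
    ctx.verify_mode = ssl.CERT_NONE  # accept the self-signed certificate
    return ctx

ctx = self_signed_context()
print(ctx.check_hostname, ctx.verify_mode)
```

<p>Passing this context to urlopen (or any other client) plays the same role as clicking through the browser warning.</p>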
<h4>SSL &#8211; Active Directory</h4>
<p>Our current method of connecting to Active Directory is unencrypted &#8211; we need to enable SSL for the AD connections.</p>
<p>1. Fetch the certificate from your Active Directory server (replace ldap.example.com with the LOGONSERVER from above; 636 is the standard LDAPS port)</p><pre class="crayon-plain-tag">echo | openssl s_client -connect ldap.example.com:636 2&gt;/dev/null | openssl x509 &gt; ldap.crt</pre><p>2. Import the certificate into your keystore (located at /etc/elasticsearch/)</p><pre class="crayon-plain-tag">keytool -import -keystore self-signed.jks -file ldap.crt</pre><p>&nbsp;</p>
<p>3. Modify AD url in elasticsearch.yml<br />
change the line</p><pre class="crayon-plain-tag">url: ldap://ad.superdomain.com</pre><p>to</p><pre class="crayon-plain-tag">url: ldaps://ad.superdomain.com</pre><p>Restart elasticsearch and check logs for failures</p>
<h4>Kibana authentication with esusers</h4>
<p>With Elasticsearch locked down by Shield, no services can search or post data either, including Kibana and Logstash.</p>
<p>Active Directory is great, but I&#8217;m not sure I want to use it for letting the Kibana server talk to Elasticsearch. We can use Shield&#8217;s built-in user management system, esusers. Elasticsearch comes with a set of predefined roles, including roles for Logstash, the Kibana4 server and Kibana4 users (defined in /etc/elasticsearch/shield/roles.yml on the vagrant-elk box if you&#8217;re still on that one).</p>
<p>Add a new kibana4_server user, granting it the role kibana4_server, using this command:</p><pre class="crayon-plain-tag">cd /usr/share/elasticsearch/bin/shield  
./esusers useradd kibana4_server -p secret -r kibana4_server</pre>
<h4>Adding esusers realm</h4>
<p>The esusers realm is the default one, and does not need to be configured if it&#8217;s the only realm you use. But since we added the Active Directory realm, we must add another section to the elasticsearch.yml file from above.</p>
<p>It should end up looking like this</p><pre class="crayon-plain-tag">shield.authc.realms:
  esusers:
    type: esusers
    order: 0
  active_directory:
    order: 1
    type: active_directory
    domain_name: superdomain.com
    unmapped_groups_as_roles: true
    url: ldap://ad.superdomain.com</pre><p>The order parameter defines in what order Elasticsearch should try the various authentication realms.</p>
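<p>To make the effect of order concrete, here is a toy illustration in Python: realms are tried in ascending order, and the first one that recognises the credentials wins. The realms here are plain functions standing in for esusers and the AD bind, not the real Shield internals:</p>

```python
def authenticate(realms, username, password):
    """Try each (order, name, check) realm in ascending order; return
    the name of the first realm that accepts the credentials."""
    for order, name, check in sorted(realms, key=lambda r: r[0]):
        if check(username, password):
            return name
    return None  # no realm accepted the user

ESUSERS = {"kibana4_server": "secret"}

def esusers_check(user, password):
    return ESUSERS.get(user) == password

def active_directory_check(user, password):
    # Stand-in for an LDAP bind against AD; it only knows one user here.
    return (user, password) == ("jane", "adpassword")

realms = [(0, "esusers", esusers_check),
          (1, "active_directory", active_directory_check)]
print(authenticate(realms, "kibana4_server", "secret"))
print(authenticate(realms, "jane", "adpassword"))
```

<p>With order 0, esusers is consulted first, so the Kibana server user never causes a round trip to AD.</p>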
<h4>Allowing Kibana to access Elasticsearch</h4>
<p>Kibana must be informed of the new user we just created. You will find the Kibana configuration file at /opt/kibana/config/kibana.yml.</p>
<p>Add in the username and password you just created. You also need to change the Elasticsearch address to use https.</p><pre class="crayon-plain-tag"># The Elasticsearch instance to use for all your queries.
elasticsearch_url: "https://localhost:9200"

# If your Elasticsearch is protected with basic auth, this is the user credentials
# used by the Kibana server to perform maintence on the kibana_index at statup. Your Kibana
# users will still need to authenticate with Elasticsearch (which is proxied thorugh
# the Kibana server)
kibana_elasticsearch_username: kibana4_server
kibana_elasticsearch_password: secret</pre><p>Restart Kibana and Elasticsearch, and watch the logs for any errors. Try opening Kibana at http://localhost:5601 and type in your login and password. Provided you&#8217;re in the group you granted access earlier, you should be able to log in.</p>
<h4>Creating SSL for Kibana</h4>
<p>Once you have enabled authorization for Elasticsearch, you really need to set SSL certificates for Kibana as well. This is also configured in kibana.yml</p><pre class="crayon-plain-tag">verify_ssl: false
# SSL for outgoing requests from the Kibana Server (PEM formatted)
ssl_key_file: "kibana_ssl_key_file"
ssl_cert_file: "kibana_ssl_cert_file"</pre><p>You can create a self-signed key and cert file for kibana using the following command:</p><pre class="crayon-plain-tag">openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes</pre><p>&nbsp;</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/08/kibana-auth.png"><img class="alignright size-medium wp-image-3920" src="http://blog.comperiosearch.com/wp-content/uploads/2015/08/kibana-auth-300x200.png" alt="kibana auth" width="300" height="200" /></a></p>
<h4>Configuring AD groups for Kibana access</h4>
<p>Unfortunately, this part of the post is going to be very sketchy, as we are desperately running out of time. This blog is much too long already.</p>
<p>Elasticsearch already comes with a list of predefined roles, among which you can find the kibana4 role. The kibana4 role allows read/write access to the .kibana index, in addition to search and read access to all indexes. We want to limit access to just one index for each AD group. The fishery group shall only access the fishery index, and the finance group shall only access the finance index. We can create roles that limit access to one index by copying the kibana4 role, giving it an appropriate name and changing the index:&#8217;*&#8217; section to map to only the preferred index.</p>
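<p>The copy-and-restrict step can be sketched as a small transformation. The role structure below is a simplified stand-in for the shape of the kibana4 role, not the literal roles.yml syntax:</p>

```python
import copy

# Simplified shape of the predefined kibana4 role: read/write on the
# .kibana index plus search/read on every index (the '*' pattern).
KIBANA4_ROLE = {
    "indices": [
        {"names": [".kibana"], "privileges": ["read", "write"]},
        {"names": ["*"], "privileges": ["read", "search"]},
    ]
}

def restricted_kibana_role(index_name):
    """Copy the kibana4 role and narrow the '*' pattern down to a
    single index, e.g. 'fishery' for the fishery AD group."""
    role = copy.deepcopy(KIBANA4_ROLE)
    for entry in role["indices"]:
        if entry["names"] == ["*"]:
            entry["names"] = [index_name]
    return role

fishery_role = restricted_kibana_role("fishery")
print(fishery_role["indices"][1]["names"])
```

<p>The deep copy matters: the original kibana4 template stays untouched, so you can derive one restricted role per AD group from it.</p>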
<p>The final step involves mapping the Elasticsearch role into an AD role. This is done in the role_mapping.yml file, as mentioned above.</p>
<p>Only joking of course, that wasn&#8217;t the last step. The last step is restarting Elasticsearch, and checking the logs for failures as you try to log in.</p>
<p>&nbsp;</p>
<h3>Securing Elasticsearch</h3>
<p>Shield brings enterprise authentication to Elasticsearch. You can easily manage access to various parts of Elasticsearch management and data by using Active Directory groups.</p>
<p>This has been a short dive into the possibilities; make sure to contact Comperio if you need help creating a solution with Elasticsearch and Shield.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/08/21/elasticsearch-security-shield/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Voting patterns at the Norwegian parliament</title>
		<link>http://blog.comperiosearch.com/blog/2015/07/30/voting-patterns-at-the-norwegian-parliament/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/07/30/voting-patterns-at-the-norwegian-parliament/#comments</comments>
		<pubDate>Thu, 30 Jul 2015 11:42:37 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[PCA]]></category>
		<category><![CDATA[visualization]]></category>
		<category><![CDATA[word analysis]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3858</guid>
		<description><![CDATA[A couple of weeks ago we saw the blog post visualizing the voting patterns in the Polish parliament. In anticipation of the upcoming election and in the interest of checking up on our elected representatives we thought we would do a similar analysis for the Norwegian parliament. First we will visualize a projection of the voting data [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignnone" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Stortinget%2C_Oslo%2C_Norway_%28cropped%29.jpg/640px-Stortinget%2C_Oslo%2C_Norway_%28cropped%29.jpg" alt="By Stortinget,_Oslo,_Norway.jpg: gcardinal from Norway (Stortinget,_Oslo,_Norway.jpg) [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons" width="640" height="290" /></p>
<p style="color: #000000">A couple of weeks ago we saw the blog <a style="color: #337ab7" href="https://marcinciura.wordpress.com/2015/07/01/the-vector-space-of-the-polish-parliament-in-pictures/" target="_blank">post</a> visualizing the voting patterns in the Polish parliament. In anticipation of the upcoming election and in the interest of checking up on our elected representatives we thought we would do a similar analysis for the Norwegian parliament. First we will visualize a projection of the voting data of the Norwegian parliament into 3D, and then we will try to interpret the axes of the projection in terms of the issues voted on.</p>
<h2 id="The-data" style="color: #000000">The data</h2>
<p style="color: #000000">Stortinget, the Norwegian parliament, has a nice <a style="color: #337ab7" href="http://data.stortinget.no/" target="_blank">web service</a> for retrieving details about the process an issue passes through in the parliament. Proposals, committees, referendums and so on; there is a surprising amount of detail in there. With a bit of work we pulled out the individual votes for the parliamentary referendums we were able to access. On the less bright side the data doesn&#8217;t appear to be complete as far as we can see, and there is also a range of documented caveats and restrictions regarding which votes are actually registered in the system. This reduces the usefulness of any detailed analysis based on this data, but we still think the aggregated picture one can create is both interesting and valid.</p>
<p style="color: #000000">The voting data comes in the form of for, against or abstained votes. Like in the referenced blog post, we would like to visualize the similarity in voting patterns for the representatives. In order to do this we constructed a grid of referendums versus representatives for the current parliamentary period, with a single cell representing a vote encoded as 1 for &#8220;for&#8221;, -1 for &#8220;against&#8221; and 0 for &#8220;abstained&#8221;, which makes abstention a neutral midpoint in the analysis we are going to do. This gives a data point for each representative in the multidimensional &#8220;vote space&#8221;, where we can do things like clustering or similarity measures to find patterns in the votes.</p>
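<p>The encoding step can be sketched in a few lines of Python. The names and toy votes below are made up; the point is just the grid of representatives versus referendums with 1 / -1 / 0 cells:</p>

```python
ENCODING = {"for": 1, "against": -1, "abstained": 0}

def vote_matrix(representatives, referendums, votes):
    """One row per representative, one column per referendum; a missing
    entry in `votes` is treated as an abstention."""
    return [
        [ENCODING[votes.get((rep, ref), "abstained")] for ref in referendums]
        for rep in representatives
    ]

reps = ["Alice", "Bob"]
refs = ["ref-1", "ref-2"]
votes = {("Alice", "ref-1"): "for", ("Alice", "ref-2"): "against",
         ("Bob", "ref-1"): "against"}
print(vote_matrix(reps, refs, votes))
```

<p>Each row of the result is one representative's point in the vote space described above.</p>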
<h2 id="Visualizing-the-vote-space" style="color: #000000">Visualizing the vote space</h2>
<p style="color: #000000">Since we have the records for 58 referendums so far in the 2013-2017 period and 612 records in the 2009-2013 period we cannot directly visualize our representatives in the vote space. There are several techniques for projecting data like this into lower dimensional sub-spaces while still retaining the essential characteristics of the data. In this blog post we will project the voting data into 3 dimensions using Principal Components Analysis. There are several alternatives here and we chose PCA for the following reasons.</p>
<ul style="color: #000000">
<li>Our data is categorical, but non-binary. It is also distributed symmetrically with 0 representing a neutral value. One would expect that categorical data would not necessarily be modeled well by PCA, but the symmetry and the consistent values involved combined with the density of the data suggests that PCA should behave well.</li>
<li>PCA is based on Singular Value Decomposition (SVD) of the co-variance matrix and consequently has properties that are straightforward to interpret compared to methods such as Multidimensional Scaling (MDS) which seeks to preserve the individual distances between data points.</li>
<li>PCA gives us a projection along axes which have the most variance which in our case should highlight the parts of the data where there is less agreement between parties or representatives. This disagreement is the sort of contrast that we seek to visualize.</li>
<li>PCA axes are linear combinations of the separate referendums which means we can create a comprehensible interpretation of the visualization in terms of political issues.</li>
</ul>
<p style="color: #000000">Other methods are generally harder to interpret with regard to their applicability and the resulting projections (like probabilistic methods such as ICA) or don&#8217;t highlight the contrast we want to visualize (like MDS which preserves the distance relationships between the representatives).</p>
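<p>For readers who want to experiment, the core of the method fits in plain Python: centre the vote grid, form the covariance matrix, and extract the leading eigenvector, here with simple power iteration rather than the SVD routine a real analysis would use. The toy grid below has two blocs voting along party lines, so one axis should capture most of the variance:</p>

```python
import math

def principal_component(votes, iters=200):
    """First PCA axis of a representatives-by-referendums grid
    (entries 1 / -1 / 0) via power iteration on the covariance matrix,
    plus the explained variance ratio of that axis."""
    n, m = len(votes), len(votes[0])
    means = [sum(row[j] for row in votes) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in votes]
    cov = [[sum(centred[i][a] * centred[i][b] for i in range(n)) / (n - 1)
            for b in range(m)] for a in range(m)]
    v = [1.0] * m
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient = top eigenvalue; trace(cov) = total variance.
    top = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m)) for a in range(m))
    return v, top / sum(cov[a][a] for a in range(m))

votes = [
    [1, 1, 1, -1], [1, 1, 0, -1], [1, 0, 1, -1],       # bloc A
    [-1, -1, -1, 1], [-1, -1, -1, 0], [-1, 0, -1, 1],  # bloc B
]
axis, explained = principal_component(votes)
print(round(explained, 2))
```

<p>The ratio returned here is the same explained-variance measure discussed for the plots: the share of the total variance captured by the first axis.</p>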
<h2 id="The-plots">The plots</h2>
<p>We&#8217;re just going to present the plots without any political commentary. The first plot is for the current parliamentary period while the second is for the previous period. In the first we have highlighted Miljøpartiet De Grønne, which only has a single representative in the current parliament. It is interesting to note the coherence of certain coalition partners in government and the location of the centrist parties. An important measure for the PCA projection is the explained variance ratio of the axes. This is the amount of variance that is preserved in the projected axes and is a measure of how much of the original information is retained in the projection. For our plots the total explained variance is ca. 62% and 57%, which means there is quite a bit of diversity in the voting patterns that is not shown in the plots, but still enough is retained to consider them informative. The axes are ordered by explained variance, so the first axis explains over 40% of the variance while the remaining two explain around 10% each.</p>
<div id="attachment_3860" style="width: 610px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/no-parliament-votes-2013-2015.png"><img class="wp-image-3860" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/no-parliament-votes-2013-2015-1024x619.png" alt="no-parliament-votes-2013-2015" width="600" height="363" /></a><p class="wp-caption-text">Projection of voting patterns in the 2013-2015 parliamentary periods.</p></div>
<div id="attachment_3859" style="width: 610px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/no-parliament-votes-2009-2013.png"><img class="wp-image-3859" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/no-parliament-votes-2009-2013.png" alt="no-parliament-votes-2009-2013" width="600" height="363" /></a><p class="wp-caption-text">Projection of voting patterns in the 2009-2013 parliamentary periods.</p></div>
<h2 id="Interpreting-the-projection-axes">Interpreting the projection axes</h2>
<p>In and of themselves the colorful dots have a limited story to tell. If we could characterize what the axes in the graphs represent, we could draw more interesting and detailed inferences. What we will do here is mostly for illustrative purposes though, since it really requires someone knowledgeable in Norwegian parliamentary politics and procedure to build an analysis with proper grounding in actual political activity. Here we will just play around with the data. It is tempting to treat the positive and negative directions of the axes as affirmative/negative on an issue, but we would have to look at the text of each referendum to see what &#8220;for&#8221; and &#8220;against&#8221; actually mean with regard to the sentiment on each issue. In fact, the same issue tends to have both strong positive and negative components in a projected axis, which strongly suggests that there isn&#8217;t a single sentiment expressed by the axis direction.</p>
<p>Can we create a credible summary of the axes without closely studying each referendum and each issue? Each referendum is part of an issue, and it turns out each issue has a list of topics associated with it. So we can summarize how much each topic is represented in an axis: each axis is a linear combination of the votes, and we add up the absolute value of the weight of each vote that belongs to an issue concerning the topic. To class up this blog a bit we decided to make a word cloud out of the results.</p>
<div id="attachment_3880" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-3.png"><img class="wp-image-3880 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-3-300x150.png" alt="parliament-wordcloud-3" width="300" height="150" /></a><p class="wp-caption-text">Word cloud for PCA component 1 for the 2013-2015 plot.</p></div>
<div id="attachment_3879" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-2.png"><img class="size-medium wp-image-3879" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-2-300x150.png" alt="Word cloud for PCA component 2 for the 2013-2015 plot." width="300" height="150" /></a><p class="wp-caption-text">Word cloud for PCA component 2 for the 2013-2015 plot.</p></div>
<div id="attachment_3878" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-1.png"><img class="size-medium wp-image-3878" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-1-300x150.png" alt="Word cloud for PCA component 3 for the 2013-2015 plot." width="300" height="150" /></a><p class="wp-caption-text">Word cloud for PCA component 3 for the 2013-2015 plot.</p></div>
<h3 id="The-most-characteristic-topics" style="color: #000000">The most characteristic topics</h3>
<p style="color: #000000">Some topics account for most of the activity in parliament and tend to dominate the data if we look only at the volume of votes. For the present parliamentary period the data is rather sparse, so this isn&#8217;t as pronounced in the word clouds, but for the 2009-2013 period, where there is a lot more data, transportation/communication and health dominate in all the axes. Can we see what topics are characteristic for an axis instead? To highlight topics that have a high overall weight in our projection in comparison to their overall presence in the referendums, we weigh each topic by its &#8220;Inverse Topic Frequency&#8221; &#8211; similar to the Inverse Document Frequency (IDF) weighting common in search relevance and information retrieval. This makes rarer but highly weighted topics stand out. This weighting gives us a clearer picture of how the axes differ from each other, even if they don&#8217;t necessarily show the ratio of influence of these topics on the projection itself.</p>
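<p>A rough sketch of this weighting in Python; the topic tags and loadings below are invented, and the log(N / topic count) discount is one reading of the &#8220;Inverse Topic Frequency&#8221; described above:</p>

```python
import math

def topic_weights(axis_loadings, vote_topics):
    """Sum |loading| per topic over the votes tagged with it, then
    discount by log(N / topic count) so ubiquitous topics do not
    drown out rarer but heavily weighted ones."""
    n = len(axis_loadings)
    raw, count = {}, {}
    for loading, topics in zip(axis_loadings, vote_topics):
        for topic in topics:
            raw[topic] = raw.get(topic, 0.0) + abs(loading)
            count[topic] = count.get(topic, 0) + 1
    return {t: raw[t] * math.log(n / count[t]) for t in raw}

loadings = [0.8, -0.5, 0.1, 0.05]
tags = [["health"], ["health", "transport"], ["transport"], ["transport"]]
weights = topic_weights(loadings, tags)
print(sorted(weights, key=weights.get, reverse=True))
```

<p>Even though &#8220;transport&#8221; is tagged on more votes, &#8220;health&#8221; comes out on top because its votes carry larger loadings and it is rarer in the referendum set.</p>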
<div id="attachment_3881" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-4.png"><img class="size-medium wp-image-3881" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-4-300x150.png" alt="Contrastive word cloud for PCA component 1 for the 2009-2013 plot." width="300" height="150" /></a><p class="wp-caption-text">Contrastive word cloud for PCA component 1 for the 2009-2013 plot.</p></div>
<div id="attachment_3882" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-5.png"><img class="size-medium wp-image-3882" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-5-300x150.png" alt="Contrastive word cloud for PCA component 2 for the 2009-2013 plot." width="300" height="150" /></a><p class="wp-caption-text">Contrastive word cloud for PCA component 2 for the 2009-2013 plot.</p></div>
<div id="attachment_3883" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-6.png"><img class="size-medium wp-image-3883" src="http://blog.comperiosearch.com/wp-content/uploads/2015/07/parliament-wordcloud-6-300x150.png" alt="Contrastive word cloud for PCA component 3 for the 2009-2013 plot." width="300" height="150" /></a><p class="wp-caption-text">Contrastive word cloud for PCA component 3 for the 2009-2013 plot.</p></div>
<h2 id="Wrapping-up">Wrapping up</h2>
<p>While the plots shown here are both interesting and entertaining, there are many possibilities available when creating these visualizations and interpreting them. Consequently, while they can be very helpful during analysis and point out interesting directions one might not have noticed otherwise, a purely quantitative analysis can fall prey to &#8220;researcher degrees of freedom&#8221; and steer conclusions towards predetermined biases, or worse. External validation and domain expertise can help ground models and inferences in reality. With the help of a domain expert we could shape the data into a format more amenable to analysis, for example by weighting issues by importance and normalizing the vote sentiment across referendums. This would provide us with a much firmer foundation for making inferences than the raw data by itself, as we have done here.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/07/30/voting-patterns-at-the-norwegian-parliament/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How Elasticsearch calculates significant terms</title>
		<link>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/#comments</comments>
		<pubDate>Wed, 10 Jun 2015 11:02:28 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[aggregations]]></category>
		<category><![CDATA[lexical analysis]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[significant terms]]></category>
		<category><![CDATA[word analysis]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3785</guid>
		<description><![CDATA[Many of you who use Elasticsearch may have used the significant terms aggregation and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tends to be kept rather vague however and couched in terms like &#8220;magic&#8221; and the commonly uncommon. This is unfortunate since developing informative [...]]]></description>
				<content:encoded><![CDATA[<div id="attachment_3823" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/uncommonlycommon.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/uncommonlycommon-300x187.png" alt="The &quot;unvommonly common&quot;" width="300" height="187" class="size-medium wp-image-3823" /></a><p class="wp-caption-text">The magic of the &#8220;uncommonly common&#8221;.</p></div>
<p>Many of you who use Elasticsearch may have used the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html" title="significant terms">significant terms aggregation</a> and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tend to be kept rather vague, however, and couched in terms like &#8220;magic&#8221; and the &#8220;uncommonly common&#8221;. This is unfortunate, since developing informative analyses based on this aggregation requires some adaptation to the underlying documents, especially in the face of less structured text. Significant terms seems especially susceptible to garbage in &#8211; garbage out effects, and developing a robust analysis requires some understanding of the underlying data. In this blog post we will take a look at the default relevance score used by the significant terms aggregation, the mysteriously named JLH score, as it is implemented in Elasticsearch 1.5. This score was developed specifically for this aggregation, and experience shows that it tends to be the most effective one available in Elasticsearch at this point.</p>
<p>The JLH relevance scoring function is not given in the documentation. A quick dive into the code, however, reveals the following scoring function.</p>
<img src='http://s0.wp.com/latex.php?latex=++JLH+%3D+%5Cleft%5C%7B%5Cbegin%7Bmatrix%7D++%28p_%7Bfore%7D+-+p_%7Bback%7D%29%5Cfrac%7Bp_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+%26+p_%7Bfore%7D+-+p_%7Bback%7D+%3E+0+%5C%5C++0++%26+elsewhere++%5Cend%7Bmatrix%7D%5Cright.++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  JLH = \left\{\begin{matrix}  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} &amp; p_{fore} - p_{back} &gt; 0 \\  0  &amp; elsewhere  \end{matrix}\right.  ' title='  JLH = \left\{\begin{matrix}  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} &amp; p_{fore} - p_{back} &gt; 0 \\  0  &amp; elsewhere  \end{matrix}\right.  ' class='latex' />
<p>Here the <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> is the frequency of the term in the foreground (or query) document set, while <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is the term frequency in the background document set which by default is the whole index.</p>
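<p>The piecewise definition translates directly into a few lines of Python. This is our own sketch of the formula above, not the actual Elasticsearch implementation; the variable names are ours.</p>

```python
def jlh(p_fore, p_back):
    """JLH score for a term: (p_fore - p_back) * p_fore / p_back,
    defined as zero when the foreground frequency does not exceed
    the background frequency."""
    if p_fore - p_back <= 0:
        return 0.0
    return (p_fore - p_back) * p_fore / p_back

# A term ten times more frequent in the foreground than in the
# background scores far higher than one that is common everywhere.
print(jlh(0.10, 0.01))  # roughly 0.9
print(jlh(0.10, 0.09))  # barely above background, close to zero
```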
<p>Expanding the formula gives us the following which is quadratic in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />.</p>
<img src='http://s0.wp.com/latex.php?latex=++%28p_%7Bfore%7D+-+p_%7Bback%7D%29%5Cfrac%7Bp_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+%3D+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+-+p_%7Bfore%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}  ' title='  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}  ' class='latex' />
<p>By keeping <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> fixed, and keeping in mind that both it and <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> are positive, we get the following function plot. Note that <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is unnaturally large for illustration purposes.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-pb-fixed.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-pb-fixed-300x206.png" alt="JLH-pb-fixed" width="300" height="206" class="alignnone size-medium wp-image-3792"></a></p>
<p>On the face of it, this looks bad for a scoring function. The sign change can be undesirable, but more troublesome is the fact that the function is not monotonically increasing.</p>
<p>The gradient of the function:</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cnabla+JLH%28p_%7Bfore%7D%2C+p_%7Bback%7D%29+%3D+%5Cleft%28%5Cfrac%7B2+p_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+-+1+%2C+-%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%5E2%7D%5Cright%29++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back}} - 1 , -\frac{p_{fore}^2}{p_{back}^2}\right)  ' title='  \nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back}} - 1 , -\frac{p_{fore}^2}{p_{back}^2}\right)  ' class='latex' />
<p>Setting the gradient to zero, we see by looking at the second coordinate that JLH does not attain a minimum; the gradient only vanishes in the limit as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> approach zero, where the function is undefined. Since the second coordinate is always negative, it is the first coordinate that shows us where the function is not increasing.</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++%5Cfrac%7B2+p_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D++-+1+%26+%3C+0+%5C%5C++p_%7Bfore%7D+%26+%3C+%5Cfrac%7B1%7D%7B2%7Dp_%7Bback%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  \frac{2 p_{fore}}{p_{back}}  - 1 &amp; &lt; 0 \\  p_{fore} &amp; &lt; \frac{1}{2}p_{back}  \end{aligned}  ' title='  \begin{aligned}  \frac{2 p_{fore}}{p_{back}}  - 1 &amp; &lt; 0 \\  p_{fore} &amp; &lt; \frac{1}{2}p_{back}  \end{aligned}  ' class='latex' />
<p>Fortunately, the decreasing part of the function lies in a region where <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D+-+p_%7Bback%7D+%3C+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore} - p_{back} &lt; 0' title='p_{fore} - p_{back} &lt; 0' class='latex' /> and the JLH score is explicitly defined as zero. By the symmetry of the quadratic around the zero of the gradient&#8217;s first coordinate at <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7Dp_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\frac{1}{2}p_{back}' title='\frac{1}{2}p_{back}' class='latex' />, we also see that the entire region where the score would be below zero lies in this area.</p>
<p>With this in mind, it seems sensible to drop the linear term of the JLH score and use only the quadratic part. This results in the same ranking, with a slightly less steep increase in score as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> increases.</p>
<img src='http://s0.wp.com/latex.php?latex=++JLH_%7Bmod%7D+%3D+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  JLH_{mod} = \frac{p_{fore}^2}{p_{back}}  ' title='  JLH_{mod} = \frac{p_{fore}^2}{p_{back}}  ' class='latex' />
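<p>A quick numeric check (our own sketch, with made-up frequencies) shows that, against a fixed background frequency, the simplified score orders terms exactly as the original:</p>

```python
def jlh(p_fore, p_back):
    # Original JLH: zero unless p_fore exceeds p_back.
    if p_fore - p_back <= 0:
        return 0.0
    return (p_fore - p_back) * p_fore / p_back

def jlh_mod(p_fore, p_back):
    # Simplified score: only the quadratic term is kept.
    if p_fore - p_back <= 0:
        return 0.0
    return p_fore ** 2 / p_back

# Hypothetical foreground frequencies for four terms against a
# common background frequency.
p_back = 0.01
p_fores = {"alpha": 0.02, "beta": 0.15, "gamma": 0.08, "delta": 0.30}

rank = sorted(p_fores, key=lambda t: jlh(p_fores[t], p_back), reverse=True)
rank_mod = sorted(p_fores, key=lambda t: jlh_mod(p_fores[t], p_back), reverse=True)
assert rank == rank_mod  # identical ordering, different absolute scores
```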
<p>Looking at the level sets of the JLH score, there is a quadratic relationship between <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />. Solving for a fixed level <img src='http://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> we get:</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++JLH+%3D+%26+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+-+p_%7Bfore%7D+%3D+k+%5C%5C+++%26+p_%7Bfore%7D%5E2+-+p_%7Bback%7D%5Ccdot+p_%7Bfore%7D+-+k%5Ccdot+p_%7Bback%7D++%3D+0+%5C%5C+++%26+p_%7Bfore%7D+%3D+%5Cfrac%7Bp_%7Bback%7D%7D%7B2%7D+%5Cpm+%5Cfrac%7B%5Csqrt%7Bp_%7Bback%7D%5E2+%2B+4+%5Ccdot+k+%5Ccdot+p_%7Bback%7D%7D%7D%7B2%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\   &amp; p_{fore}^2 - p_{back}\cdot p_{fore} - k\cdot p_{back}  = 0 \\   &amp; p_{fore} = \frac{p_{back}}{2} \pm \frac{\sqrt{p_{back}^2 + 4 \cdot k \cdot p_{back}}}{2}  \end{aligned}  ' title='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\   &amp; p_{fore}^2 - p_{back}\cdot p_{fore} - k\cdot p_{back}  = 0 \\   &amp; p_{fore} = \frac{p_{back}}{2} \pm \frac{\sqrt{p_{back}^2 + 4 \cdot k \cdot p_{back}}}{2}  \end{aligned}  ' class='latex' />
<p>Here the negative branch lies outside the domain of the function.<br />
This is far easier to see in the simplified formula.</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++JLH+%3D+%26+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+%3D+k+%5C%5C+++%26+p_%7Bfore%7D+%3D+%5Csqrt%7Bk+%5Ccdot+p_%7Bback%7D%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} = k \\   &amp; p_{fore} = \sqrt{k \cdot p_{back}}  \end{aligned}  ' title='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} = k \\   &amp; p_{fore} = \sqrt{k \cdot p_{back}}  \end{aligned}  ' class='latex' />
<p>An increase in <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> must be offset by a square-root increase in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> to retain the same score.</p>
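<p>We can verify this trade-off numerically (a small sketch using the simplified score, with made-up frequencies): quadrupling the background frequency is offset by merely doubling the foreground frequency.</p>

```python
import math

def jlh_mod(p_fore, p_back):
    # Simplified JLH score: only the quadratic term.
    return p_fore ** 2 / p_back

k = jlh_mod(0.1, 0.01)  # a fixed level, here k = 1.0
# Quadruple p_back; doubling p_fore keeps us on the same level set.
assert math.isclose(jlh_mod(0.2, 0.04), k)
# Equivalently, from the level-set formula p_fore = sqrt(k * p_back):
assert math.isclose(math.sqrt(k * 0.04), 0.2)
```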
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-contour.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-contour-300x209.png" alt="JLH-contour" width="300" height="209" class="alignnone size-medium wp-image-3791"></a></p>
<p>As we see, the score increases sharply as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> grows, quadratically relative to <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />. As <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> becomes small compared to <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />, the growth in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> goes from linear to quadratic.</p>
<p>Finally a 3D plot of the score function.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-3d.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-3d-300x203.png" alt="JLH-3d" width="300" height="203" class="alignnone size-medium wp-image-3790"></a></p>
<p>So what can we take away from all this? I think the main practical consideration is the squared relationship between <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />, which means that once there is a significant difference between the two, <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> will dominate the score ranking. The <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> factor primarily makes the score sensitive when it is small, and for reasonably similar <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> it is <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> that decides the ranking. There are some obvious consequences of this which would be interesting to explore in real data. First, you would like to have a large background document set if you want more fine-grained sensitivity to background frequency. Second, foreground frequencies can dominate the score to such an extent that peculiarities of the implementation may show up in the significant terms ranking, which we will look at in more detail as we try to apply the significant terms aggregation to single documents.</p>
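<p>To experiment with this yourself, the significance heuristic can be named explicitly in the aggregation request. Below is a sketch of such a request body built in Python; the field name &#8220;body&#8221; and the query term are hypothetical, and &#8220;jlh&#8221; is in fact already the default heuristic for significant_terms, stated here only for clarity.</p>

```python
import json

# Hypothetical field name "body"; "jlh" is the default significance
# heuristic for significant_terms, named explicitly here for clarity.
request_body = {
    "query": {"match": {"body": "elasticsearch"}},
    "aggregations": {
        "keywords": {
            "significant_terms": {
                "field": "body",
                "size": 10,
                "jlh": {},
            }
        }
    },
}

print(json.dumps(request_body, indent=2))
```

<p>Swapping &#8220;jlh&#8221; for one of the alternative heuristics documented for the aggregation, such as &#8220;mutual_information&#8221; or &#8220;chi_square&#8221;, changes the ranking function accordingly.</p>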
<p>The results and visualizations in this blog post are also available as an <a href="https://github.com/andrely/ipython-notebooks/blob/master/JLH%20score%20characteristics.ipynb" title="JLH score characteristics">iPython notebook</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
