Search Nuggets » English

Experimenting with Open Source Web Crawlers

Mridu Agarwal — Fri, 29 Apr 2016 11:03:42 +0000

Whether you want to do market research, gather financial risk information, or just get news about your favorite footballer from various news sites, web scraping has many uses.

In my quest to learn more about web crawling and scraping, I decided to test a couple of open source web crawlers that are not only easily available but quite powerful as well. In this article I mostly cover their basic features and how easy they are to get started with.

If you are one of those people who like to get started quickly when learning something, I suggest you try OpenWebSpider first.

It is a simple, browser-based open source crawler and search engine that is easy to install and use, and it is very good for those who are trying to get acquainted with web crawling. It stores web pages in MySQL or MongoDB; I used MySQL for my testing. You can follow the steps here to install it. It’s pretty simple and basic.

So, once you have installed everything, you just need to open a web browser at http://127.0.0.1:9999/ and you are ready to crawl and search. Just check your database settings, type the URL of the site you want to crawl, and within a couple of minutes you have all the data you need. You can even search it by going to the search tab and typing in your query. Whoa! That was quick and compact, and needless to say, you don’t need any programming skills to use it.

If you are trying to create an offline copy of your data or your very own mini Wikipedia, I think this is the easiest way to do it.

Following are some screen shots:

You can also see this search engine demo here before actually getting started.

OK, after getting my hands dirty with web crawling, I was curious to do more sophisticated stuff, like extracting topics from a web site that has no RSS feed or API. Extracting this structured data can be quite important in many business scenarios, such as following a competitor’s product news or gathering data for business intelligence. I decided to use Scrapy for this experiment.

The good thing about Scrapy is that it is not only fast and simple, but very extensible as well. While installing it in my Windows environment, I had a few hiccups, mainly because of Python version compatibility, but in the end, once you get it, it’s very simple. (Isn’t that how you feel anyway, once things work? Anyway, forget it! :D) Follow these links if you are having trouble installing Scrapy like I did:

https://github.com/scrapy/scrapy/wiki/How-to-Install-Scrapy-0.14-in-a-64-bit-Windows-7-Environment

http://doc.scrapy.org/en/latest/intro/install.html#intro-install

After installing, you need to create a Scrapy project. Since we are doing more customized work than just crawling the entire website, this requires more effort and some programming knowledge, and sometimes browser tools to understand the HTML DOM. You can follow this link to get started with your first Scrapy project.

Once you have crawled the data you need, it would be interesting to feed it into a search engine. I have also been looking for open source web crawlers for Elasticsearch, and this looked like the perfect opportunity. Scrapy can be integrated with Elasticsearch with very little work, which is awesome. You just need to install the Elasticsearch module for Scrapy (of course Elasticsearch should be running somewhere) and configure the Item Pipeline for Scrapy. Follow this link for the step-by-step guide. Once done, you have a fully integrated crawler and search system!
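To give a feel for the custom extraction work mentioned above, here is a minimal sketch that pulls headings out of a page using only Python's standard library. It is not Scrapy code and not what was used in the experiment; the choice of `<h2>` as the element holding a topic and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect the text of every <h2> element in a page."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the open heading.
        if self.in_heading:
            self.headings[-1] += data


def extract_headings(html):
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


# Made-up sample page; a real crawl would download the HTML first.
sample = "<html><body><h2>Flu symptoms</h2><p>...</p><h2> Back pain </h2></body></html>"
print(extract_headings(sample))
```

In a Scrapy spider the same idea is usually expressed with CSS or XPath selectors, which is where browser developer tools help in finding the right elements.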

I crawled http://primehealthchannel.com and indexed the results in Elasticsearch as “healthitems” in the “scrapy” index.

To search the Elasticsearch index, I am using the Chrome extension Sense to send queries to Elasticsearch, and this is how it looks:

GET /scrapy/healthitems/_search
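The same search can be issued from a script instead of Sense. The sketch below only builds the HTTP request; the host `http://localhost:9200`, the field name `title` and the search text are assumptions, since the post does not say which fields the crawled items contain.

```python
import json
import urllib.request


def build_search_request(base_url, index, doc_type, field, text):
    """Build a _search request with a simple match query."""
    url = "{}/{}/{}/_search".format(base_url.rstrip("/"), index, doc_type)
    body = json.dumps({"query": {"match": {field: text}}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "title" and "diabetes" are hypothetical; adjust to the fields you indexed.
req = build_search_request("http://localhost:9200", "scrapy", "healthitems", "title", "diabetes")
print(req.get_method(), req.full_url)
# With Elasticsearch running: print(urllib.request.urlopen(req).read())
```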

I hope you had fun reading this and now want to try some of your own cool ideas. Do let us know how you used it and which crawler you like the most!

Content Enrichment Web Service SharePoint 2013 – Advantages and Challenges

Mridu Agarwal — Tue, 26 Apr 2016 11:23:22 +0000

If you have worked with search solutions before, you will know that very often there is a need to process data before it can be displayed in search results. This processing might be required to address some of (but not limited to) these common issues:

Missing metadata issues
Inconsistent metadata issues
Cleansing of content
Integration of semantic layers/Automatic tagging
Integration with 3rd party service
Merging data from other sources

Content Enrichment Web Service in SharePoint 2013 is a SOAP-based service within the content processing component that can be used to achieve this. The figure below shows a part of the process that takes place in the content processing component of SharePoint search.

The Content Enrichment Web Service (CEWS) in SharePoint 2013 combines the goodness of both FAST for SharePoint Search and SharePoint Search to offer a whole new set of possibilities, but it has its own challenges. To see an implementation example, check the MSDN link, which pretty much sums up the basic steps. In this post we are going to look at some of the advantages and challenges of CEWS, coming from a FAST 2010 background:

1. CEWS is a service and you DON’T have to deploy it in your SharePoint environment: Perhaps this is the biggest architectural change from the content processing perspective. What this means is that your code no longer runs in a sandbox environment within SharePoint Server. The web service can be hosted anywhere outside your SharePoint server, thus reducing deployment headaches and the huge number of approvals required to deploy executable files. I can see the operations/infrastructure teams and administrators smiling.

2. The web service processes and returns managed properties, not crawled properties: Managed properties correspond to what actually gets indexed and displayed in search results. This removes some of the confusion about why you can’t see the updated results (perhaps you forgot to map your crawled property to a managed property, and, wait, you will have to index it AGAIN!). Nightmare!

3. You can define a trigger to limit the set of items that are processed by the web service: In FAST 2010, each item had to pass through the pipeline whether you wanted to process it or not, and the check had to be done in code. A trigger in 2013 lets us define this check outside the code, so that the web service is called only for selected content. This improves overall performance and crawl time if you only want to process a subset of the content.

So far, so good! But there are certain challenges we need to look at and see how we can overcome them. In fact, this is the most important part when you are architecting your CEWS solution:

1. The content enrichment callout step can only be configured with a single web service endpoint: Now this sounds very limiting. I have multiple search applications, and earlier I maintained the logic in different solutions. Do I need to combine them all into a single service? What about maintenance and change requests? Well, there are several possible technologies one could consider to solve this, but what I did in my project was to create a WCF routing service and let it dispatch to my multiple web services based on filters. You could also use it to implement load balancing and fault tolerance. In the following example, I have two content sources, “xmlfile” and “EpiFileShare”, and I want two different services, “xmlsvc” and “episvc”, to process them. This is how I will configure the endpoints in my WCF Routing Service:

2. Only one condition can be configured for the trigger, while different search applications will require different triggers: This can again be solved by using WCF routers and filters and configuring separate endpoints for separate triggers. Here I am using the default managed property “ContentSource” as a trigger/filter to determine my service endpoint.

To summarize, I have shown some of the advantages and challenges of the new CEWS architecture in SharePoint 2013 search and how you can overcome them. I hope you now want to try this soon and share your experience with us.
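The filter-based routing described under the first challenge can be pictured as a lookup from the “ContentSource” managed property to a service endpoint. The sketch below is plain Python, not WCF configuration, and the endpoint addresses are hypothetical; only the content source and service names come from the example above.

```python
# Hypothetical endpoint addresses; only the names xmlsvc and episvc
# and the two content sources come from the example in the post.
ROUTES = {
    "xmlfile": "http://cews.example.local/xmlsvc",
    "EpiFileShare": "http://cews.example.local/episvc",
}


def route_item(managed_properties):
    """Return the endpoint that should enrich an item, or None to skip it."""
    source = managed_properties.get("ContentSource")
    return ROUTES.get(source)


print(route_item({"ContentSource": "xmlfile"}))
print(route_item({"ContentSource": "EpiFileShare"}))
print(route_item({"ContentSource": "SomethingElse"}))
```

An item whose content source has no entry gets no endpoint, which plays the role of the trigger: unmatched content is simply not sent for enrichment.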

ELK stack deployment with Ansible

Christoffer Vig — Thu, 26 Nov 2015 09:59:38 +0000

As human beings, we like to believe that each and every one of us is a special individual, and not easily replaceable. That may be fine, but please, don’t fall into the habit of treating your computer the same way.

Ansible is a free software platform for configuring and managing computers, and I’ve been using it a lot lately to manage the ELK stack: Elasticsearch, Logstash and Kibana.

I can define a list of servers I want to manage in an INI-style config file, the so-called inventory:

[elasticsearch-master]
es-master1.mydomain.com
es-master2.mydomain.com
es-master3.mydomain.com

[elasticsearch-data]
elk-data1.mydomain.com
elk-data2.mydomain.com
elk-data3.mydomain.com

[kibana]
kibana.mydomain.com

And define the roles for the servers in a YAML config file, the so-called playbook:

- hosts: elasticsearch-master
  roles:
    - ansible-elasticsearch

- hosts: elasticsearch-data
  roles:
    - ansible-elasticsearch

- hosts: logstash
  roles:
    - ansible-logstash

- hosts: kibana
  roles:
    - ansible-kibana
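How the inventory and the playbook fit together can be sketched in a few lines of Python: each play names a group, and every host in that group receives the play's roles. This is only an illustration of the idea with the data copied by hand into dictionaries; real Ansible host patterns and role resolution are considerably richer.

```python
# Groups and hosts copied from the inventory above.
inventory = {
    "elasticsearch-master": [
        "es-master1.mydomain.com",
        "es-master2.mydomain.com",
        "es-master3.mydomain.com",
    ],
    "elasticsearch-data": [
        "elk-data1.mydomain.com",
        "elk-data2.mydomain.com",
        "elk-data3.mydomain.com",
    ],
    "kibana": ["kibana.mydomain.com"],
}

# Plays copied from the playbook above.
playbook = [
    {"hosts": "elasticsearch-master", "roles": ["ansible-elasticsearch"]},
    {"hosts": "elasticsearch-data", "roles": ["ansible-elasticsearch"]},
    {"hosts": "logstash", "roles": ["ansible-logstash"]},
    {"hosts": "kibana", "roles": ["ansible-kibana"]},
]


def roles_per_host(inventory, playbook):
    """Map each host to the roles of every play whose group contains it."""
    assignments = {}
    for play in playbook:
        for host in inventory.get(play["hosts"], []):
            assignments.setdefault(host, []).extend(play["roles"])
    return assignments


assignments = roles_per_host(inventory, playbook)
for host in sorted(assignments):
    print(host, assignments[host])
```

Note that the inventory as shown has no `logstash` group, so in this sketch the Logstash play matches no hosts.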

Each group of servers may have their own files containing configuration variables.

elasticsearch_version: 2.1.0
elasticsearch_node_master: false
elasticsearch_heap_size: 1000G

Ansible is used for configuring the ELK stack Vagrant box at https://github.com/comperiosearch/vagrant-elk-box-ansible, which was recently upgraded with Elasticsearch 2.1, Kibana 4.3 and Logstash 2.1.

The same set of Ansible roles can be applied when the configuration needs to move into production, by applying another set of variable files with modified host names, certificates and such. There are several ways to do this.

How does it work?

Ansible is agentless. This means you do not install anything (an agent) on the machines you control. Ansible only needs to be installed on the controlling machine (Linux/OS X) and connects to the managed machines (there is even some support for Windows) using SSH. The only requirement on the managed machines is Python.

Happy ansibling!

Elasticsearch: Shield protected Kibana with Active Directory

Christoffer Vig — Fri, 21 Aug 2015 14:26:45 +0000

Elasticsearch easily stores terabytes of data, but how can you make sure users only see the data they should? This post will explore how to use Shield, a plugin for Elasticsearch, to authenticate users with Active Directory.

Elasticsearch will by default allow anyone access to all data. The Shield plugin allows locking down Elasticsearch using authentication from the internal esusers realm, Active Directory (AD) or LDAP. Using AD, you can map groups defined in your Windows domain to roles in Elasticsearch. For instance, you can allow people in the Fishery department access only to fish indexes, and give complete control to anyone in the IT department.

To use Shield in production, you have to buy an Elasticsearch subscription; however, you get a 30-day trial when installing the license manager. So let’s hurry up and see how this works out in Kibana.

In this post, we will install Shield and connect to Active Directory (AD) for authentication. After having made sure we can authenticate with AD, we will add SSL encryption everywhere possible. We will add authentication for the Kibana server using the built in authentication realm esusers, and if time allows at the end, we will create two user groups, each with access to its own index, and check how it all looks when accessed in Kibana 4.

Prerequisites

You will need a previously installed Elasticsearch and Kibana. The most recent versions should work; I have used Elasticsearch 1.7 and Kibana 4.1.1. If you need a machine to test on, I can personally recommend the vagrant-elk-box. The following guide assumes the file locations of the vagrant-elk-box; if you install differently, you will probably know where to look. Ask an adult for help.

For Active Directory, you need to be on a domain that uses Active Directory. That would probably mean some kind of Windows work environment.

Installing Shield

If you’re on the vagrant box you should begin the lesson by entering the vagrant box using the commands

vagrant up
vagrant ssh

Install the license manager

 sudo /usr/share/elasticsearch/bin/plugin -i elasticsearch/license/latest

Install Shield

 sudo /usr/share/elasticsearch/bin/plugin -i elasticsearch/shield/latest

Restart elasticsearch. (service elasticsearch restart)

Check out the logs, you should find some information regarding when your Shield license will expire (logfile location: /var/log/elasticsearch/vagrant-es.log)

Integrating Active Directory

The next step involves figuring out a thing or two about your Active Directory configuration. First of all, you need to know the address. For this you need to be on your Windows machine: open cmd.exe and type

set LOGONSERVER

The name of your AD server should be printed. Add a section similar to the following to the elasticsearch.yml file (at /etc/elasticsearch/elasticsearch.yml)

shield.authc.realms:
  active_directory:
    type: active_directory
    domain_name: superdomain.com
    unmapped_groups_as_roles: true
    url: ldap://ad.superdomain.com

Type in the address to your AD in the url: field (where it says url: ldap://ad.superdomain.com). If your logonserver is ad.cnn.com, you should type in url: ldap://ad.cnn.com

Also, you need to figure out your domain name and type it in correctly.

NB: Be careful with the indenting! Elasticsearch cares a lot about correct indenting, and may even refuse to start without telling you why if you make a mistake.

Finding the Correct name for the Active Directory group

Next step involves figuring out the name for the Group you wish to grant access to. You may have called your group “Fishermen”, but that is probably not exactly what it’s called in AD.

Microsoft has a very simple and nice tool called Active Directory Explorer. Open the tool and enter the address you just found from LOGONSERVER (remember? it’s only a few lines above).

You may have to click and explore a little to find the groups you want. Once you find it, you need the value for the “distinguishedName” attribute. You can double click on it and copy out from the “Object”.

This is an example from my AD

CN=Rolle IT,OU=Groups,OU=Oslo,OU=Comperiosearch,DC=comperiosearch,DC=com

Now this value represents a group which we want to map to a role in elasticsearch.

Open the file /etc/elasticsearch/shield/role_mapping.yml. It should look similar to this

# Role mapping configuration file which has elasticsearch roles as keys
# that map to one or more user or group distinguished names

#roleA:   this is an elasticsearch role
#  - groupA-DN  this is a group distinguished name
#  - groupB-DN
#  - user1-DN   this is the full user distinguished name
power_user:
  - "CN=Rolle IT,OU=Groups,OU=Oslo,OU=Comperiosearch,DC=comperiosearch,DC=com"
#user:
# - "cn=admins,dc=example,dc=com" 
# - "cn=John Doe,cn=other users,dc=example,dc=com"

I have uncommented the line with “power_user:” and added a line below containing the distinguishedName from above.

By restarting elasticsearch, anyone in the “Rolle IT” group should now be able to log in (and nobody else (yet)).
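What the mapping file expresses can be sketched as a small lookup: a user receives every role that lists one of the user's group distinguished names. This is an illustration only, not Shield's implementation; in particular the DN normalization here (lower-casing and trimming around commas) is a simplifying assumption and does not handle escaped commas.

```python
# Mapping copied from the role mapping file above.
ROLE_MAPPING = {
    "power_user": [
        "CN=Rolle IT,OU=Groups,OU=Oslo,OU=Comperiosearch,DC=comperiosearch,DC=com",
    ],
}


def normalize_dn(dn):
    """Simplified DN normalization: trim each component and lower-case it."""
    return ",".join(part.strip().lower() for part in dn.split(","))


def roles_for(group_dns, mapping=ROLE_MAPPING):
    """Return the sorted roles whose mapped DNs include any of the user's groups."""
    wanted = {normalize_dn(dn) for dn in group_dns}
    return sorted(
        role
        for role, dns in mapping.items()
        if wanted & {normalize_dn(dn) for dn in dns}
    )


it_group = "cn=rolle it, OU=Groups,OU=Oslo,OU=Comperiosearch,DC=comperiosearch,DC=com"
print(roles_for([it_group]))
print(roles_for(["CN=Fishermen,OU=Groups,DC=comperiosearch,DC=com"]))
```

The second group DN is made up to show the case where no role matches.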

To test it out, open http://localhost:9200 in your browser. You should be presented with a login box where you can type in your username/password. In case of failure, check out the elasticsearch logs (at /var/log/elasticsearch/vagrant-es.log).

If you were able to log in, that means Active Directory authentication works. Congratulations! You deserve a refreshment. Some strong coffee will go down well with the next sections, where we add encrypted communications everywhere we can.

SSL - Elasticsearch

Authentication and encrypted communication go hand in hand. Without SSL, the username and password are transferred in plaintext on the wire. For this demo we will use self-signed certificates. Keytool comes with Java, and is used to handle certificates for Elasticsearch. The following command will generate a self-signed certificate and put it in a JKS file named self-signed.jks (swap out $password with your preferred password).

keytool -genkey -keyalg RSA -alias selfsigned -keystore self-signed.jks -keypass $password -storepass $password -validity 360 -keysize 2048 -dname "CN=localhost, OU=orgUnit, O=org, L=city, S=state, C=NO"

Copy the certificate into /etc/elasticsearch/

Modify /etc/elasticsearch/elasticsearch.yml by adding the following lines:

shield.ssl.keystore.path: /etc/elasticsearch/self-signed.jks
shield.ssl.keystore.password: $password
shield.ssl.hostname_verification: false
shield.transport.ssl: true
shield.http.ssl: true

(use the same password as you used when creating the self-signed certificate )

Restart Elasticsearch again, and watch the logs for failures.

Try to open https://localhost:9200 in your browser (NB: httpS not http)

https://localhost:9200

You should see a screen warning you that something is wrong with the connection. This is a good sign! It means your certificate is actually working! For production use you could use your own CA or buy a proper certificate, either of which will avoid the ugly warning screen.
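To check the HTTPS endpoint from a script rather than a browser, something like the following sketch could be used. It deliberately skips certificate verification because the certificate is self-signed, which is acceptable for this demo only. The username and password are placeholders, and the request is built but not sent.

```python
import base64
import ssl
import urllib.request


def build_insecure_request(url, username, password):
    """Build a Basic-auth request and an SSL context that skips verification."""
    # Demo only: a self-signed certificate cannot be verified against a CA.
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE

    credentials = "{}:{}".format(username, password).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, headers={"Authorization": "Basic " + token})
    return request, context


# "someuser" and "secret" are placeholders for your own AD credentials.
request, context = build_insecure_request("https://localhost:9200", "someuser", "secret")
print(request.full_url)
# With Elasticsearch running:
# print(urllib.request.urlopen(request, context=context).read())
```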

SSL – Active directory

Our current method of connecting to Active Directory is unencrypted – we need to enable SSL for the AD connections.

1. Fetch the certificate from your Active Directory server (replace ldap.example.com with the LOGONSERVER from above)

echo | openssl s_client -connect ldap.example.com:636 2>/dev/null | openssl x509 > ldap.crt

2. Import the certificate into your keystore (located at /etc/elasticsearch/)

keytool -import -keystore self-signed.jks -file ldap.crt

3. Modify the AD url in elasticsearch.yml: change the line

url: ldap://ad.superdomain.com

to

url: ldaps://ad.superdomain.com

Restart elasticsearch and check logs for failures

Kibana authentication with esusers

With Elasticsearch locked down by Shield, no services can search or post data either, including Kibana and Logstash.

Active Directory is great, but I’m not sure I want to use it for letting the Kibana server talk to Elasticsearch. We can use the Shield built-in user management system, esusers. Elasticsearch comes with a set of predefined roles, including roles for Logstash, the Kibana 4 server and Kibana 4 users (defined in /etc/elasticsearch/shield/roles.yml on the vagrant-elk box, if you’re still on that one).

Add a new kibana4_server user, granting it the role kibana4_server, using this command:

cd /usr/share/elasticsearch/bin/shield  
./esusers useradd kibana4_server -p secret -r kibana4_server

Adding esusers realm

The esusers realm is the default one, and does not need to be configured if that’s the only realm you use. Now since we added the Active Directory realm we must add another section to the elasticsearch.yml file from above.

It should end up looking like this

shield.authc.realms:
  esusers:
    type: esusers
    order: 0
  active_directory:
    order: 1
    type: active_directory
    domain_name: superdomain.com
    unmapped_groups_as_roles: true
    url: ldap://ad.superdomain.com

The order parameter defines in what order elasticsearch should try the various authentication mechanisms.
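The effect of the order parameter can be sketched as trying each realm in ascending order until one accepts the credentials. This is an illustration of the idea rather than Shield's code; the credential checks are stand-ins and the user `alice` is hypothetical.

```python
def authenticate(realms, username, password):
    """Try each realm in ascending order; return the name of the first that accepts."""
    for realm in sorted(realms, key=lambda r: r["order"]):
        if realm["check"](username, password):
            return realm["name"]
    return None


# Stand-in credential checks; real realms consult the esusers files or AD.
realms = [
    {"name": "active_directory", "order": 1,
     "check": lambda u, p: (u, p) == ("alice", "ad-password")},
    {"name": "esusers", "order": 0,
     "check": lambda u, p: (u, p) == ("kibana4_server", "secret")},
]

print(authenticate(realms, "kibana4_server", "secret"))
print(authenticate(realms, "alice", "ad-password"))
print(authenticate(realms, "mallory", "guess"))
```

With esusers at order 0, the Kibana server account is resolved locally before Active Directory is consulted at all.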

Allowing Kibana to access Elasticsearch

Kibana must be informed of the new user we just created. You will find the kibana configuration file at /opt/kibana/config/kibana.yml.

Add in the username and password you just created. You also need to change the Elasticsearch address to use https.

# The Elasticsearch instance to use for all your queries.
elasticsearch_url: "https://localhost:9200"

# If your Elasticsearch is protected with basic auth, this is the user credentials
# used by the Kibana server to perform maintenance on the kibana_index at startup. Your Kibana
# users will still need to authenticate with Elasticsearch (which is proxied through
# the Kibana server)
kibana_elasticsearch_username: kibana4_server
kibana_elasticsearch_password: secret

Restart Kibana and Elasticsearch, and watch the logs for any errors. Try opening Kibana at http://localhost:5601 and type in your username and password. Provided you’re in the group you gave access to earlier, you should be able to log in.

Creating SSL for Kibana

Once you have enabled authorization for Elasticsearch, you really need to set SSL certificates for Kibana as well. This is also configured in kibana.yml

verify_ssl: false
# SSL for outgoing requests from the Kibana Server (PEM formatted)
ssl_key_file: "kibana_ssl_key_file"
ssl_cert_file: "kibana_ssl_cert_file"

You can create a self-signed key and cert file for kibana using the following command:

openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes

Configuring AD groups for Kibana access

Unfortunately, this part of the post is going to be very sketchy, as we are desperately running out of time. This blog is much too long already.

Elasticsearch already comes with a list of predefined roles, among which you can find the kibana4 role. The kibana4 role allows read/write access to the .kibana index, in addition to search and read access to all indexes. We want to limit access to just one index for each AD group. The fishery group shall only access the fishery index, and the finance group shall only access the finance index. We can create roles that limit access to one index by copying the kibana4 role, giving it an appropriate name and changing the index:’*’ section to map to only the preferred index.
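The intent of the restricted roles can be sketched as a pattern check per role. This models only which index names a role can reach, not Shield's role file syntax or its privilege levels, and the role names `fishery_kibana` and `finance_kibana` are hypothetical.

```python
from fnmatch import fnmatchcase

# Index patterns per role. "kibana4" mirrors the predefined role's '*' access;
# the two restricted role names are made up for illustration.
ROLES = {
    "kibana4": ["*"],
    "fishery_kibana": [".kibana", "fishery"],
    "finance_kibana": [".kibana", "finance"],
}


def can_access(role, index):
    """Return True if any of the role's index patterns matches the index name."""
    return any(fnmatchcase(index, pattern) for pattern in ROLES.get(role, []))


print(can_access("fishery_kibana", "fishery"))
print(can_access("fishery_kibana", "finance"))
print(can_access("kibana4", "finance"))
```

Both restricted roles keep access to the .kibana index, since Kibana stores its own saved objects there.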

The final step involves mapping the AD group to the Elasticsearch role. This is done in the role_mapping.yml file, as mentioned above.

Only joking of course, that wasn’t the last step. The last step is restarting Elasticsearch, and checking the logs for failures as you try to log in.

Securing Elasticsearch

Shield brings enterprise authentication to Elasticsearch. You can easily manage access to various parts of Elasticsearch management and data by using Active Directory groups.

This has been a short dive into the possibilities. Make sure to contact Comperio if you need help creating a solution with Elasticsearch and Shield.

How Elasticsearch calculates significant terms

André Lynum — Wed, 10 Jun 2015 11:02:28 +0000

The magic of the “uncommonly common”.

Many of you who use Elasticsearch may have used the significant terms aggregation and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tend to be kept rather vague, however, and couched in terms like “magic” and the “uncommonly common”. This is unfortunate, since developing informative analyses based on this aggregation requires some adaptation to the underlying documents, especially in the face of less structured text. Significant terms seems especially susceptible to garbage in, garbage out effects, and developing a robust analysis requires some understanding of the underlying data. In this blog post we will take a look at the default relevance score used by the significant terms aggregation, the mysteriously named JLH score, as it is implemented in Elasticsearch 1.5. This score was developed especially for this aggregation, and experience shows that it tends to be the most effective one available in Elasticsearch at this point.

The JLH relevance scoring function is not given in the documentation. A quick dive into the code, however, reveals the following scoring function.

$$
JLH = \left\{\begin{matrix} (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} & p_{fore} - p_{back} > 0 \\ 0 & elsewhere \end{matrix}\right.
$$

Here $p_{fore}$ is the frequency of the term in the foreground (or query) document set, while $p_{back}$ is the term frequency in the background document set, which by default is the whole index.
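As a quick sanity check, the scoring function can be written out directly. This is a minimal sketch of the formula as stated above, with p_fore and p_back taken to be the foreground and background term frequencies; it is not the Elasticsearch source, and how the real implementation treats edge cases such as a zero background frequency is not modelled here (this sketch simply rejects it).

```python
def jlh_score(p_fore, p_back):
    """JLH score: (p_fore - p_back) * p_fore / p_back when positive, else 0."""
    if p_back <= 0:
        # Assumption of this sketch: a non-positive background frequency is an error.
        raise ValueError("p_back must be positive")
    diff = p_fore - p_back
    if diff <= 0:
        return 0.0
    return diff * (p_fore / p_back)


print(jlh_score(0.5, 0.25))  # (0.5 - 0.25) * (0.5 / 0.25) = 0.5
print(jlh_score(0.1, 0.2))   # foreground rarer than background: 0.0
```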

Expanding the formula gives us the following, which is quadratic in $p_{fore}$.

$$
JLH = \frac{p_{fore}^2}{p_{back}} - p_{fore}
$$

By keeping $p_{back}$ fixed and keeping in mind that both it and $p_{fore}$ are positive, we get the following function plot. Note that $p_{back}$ is unnaturally large for illustration purposes.

On the face of it this looks bad for a scoring function. It can be undesirable that it changes sign, but more troublesome is the fact that this function is not monotonically increasing.

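The gradient of the function:

$$
\nabla JLH = \left( \frac{\partial JLH}{\partial p_{fore}}, \frac{\partial JLH}{\partial p_{back}} \right) = \left( \frac{2 p_{fore}}{p_{back}} - 1, \; -\frac{p_{fore}^2}{p_{back}^2} \right)
$$

Setting the gradient to zero, we see by looking at the second coordinate that the JLH does not have a minimum, but approaches it when $p_{fore}$ and $p_{back}$ approach zero, where the function is undefined. While the second coordinate never changes sign (it is negative for all positive frequencies, so the score always grows as $p_{back}$ shrinks), the first coordinate shows us where the function is not increasing.

Fortunately the decreasing part of the function, $p_{fore} < p_{back}/2$, lies in the area where $p_{fore} - p_{back} < 0$ and the JLH score is explicitly defined as zero. By symmetry of the quadratic around its minimum at $p_{fore} = p_{back}/2$, where the first coordinate of the gradient is zero, we also see that the entire area where the score is below zero, $0 < p_{fore} < p_{back}$, is in this region.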

With this it seems sensible to just drop the linear term of the JLH score and just use the quadratic part, $p_{fore}^2 / p_{back}$. This results in nearly the same ranking when $p_{fore}$ is well above $p_{back}$, with a slightly different rate of increase in score as $p_{fore}$ increases.

Looking at the level sets for the JLH score, there is a quadratic relationship between $p_{fore}$ and $p_{back}$. Solving for a fixed level $k$ we get:

$$
p_{fore} = \frac{p_{back} \pm \sqrt{p_{back}^2 + 4 k p_{back}}}{2}
$$

where the negative root is outside the area where the function is defined.
This is far easier to see in the simplified formula:

$$
p_{fore} = \sqrt{k \, p_{back}}
$$

An increase in $p_{back}$ must be offset by approximately a square root increase in $p_{fore}$ to retain the same score.
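The level-set relationship can be checked numerically. The sketch below assumes the expanded score p_fore²/p_back − p_fore and solves it for p_fore at a fixed level k (taking the positive root), alongside the square-root form obtained when the linear term is dropped. The function names are made up for this illustration.

```python
import math


def expanded_score(p_fore, p_back):
    """Expanded JLH form, valid where p_fore > p_back."""
    return p_fore * p_fore / p_back - p_fore


def p_fore_for_level(k, p_back):
    """Positive root of p_fore^2 / p_back - p_fore = k."""
    return (p_back + math.sqrt(p_back * p_back + 4.0 * k * p_back)) / 2.0


def p_fore_for_level_simplified(k, p_back):
    """Solution of p_fore^2 / p_back = k, i.e. with the linear term dropped."""
    return math.sqrt(k * p_back)


k = 0.5
for p_back in (0.01, 0.04, 0.25):
    exact = p_fore_for_level(k, p_back)
    approx = p_fore_for_level_simplified(k, p_back)
    # The recovered p_fore should reproduce the level k in the expanded score.
    print(p_back, round(exact, 4), round(approx, 4), round(expanded_score(exact, p_back), 4))
```

Quadrupling p_back from 0.01 to 0.04 doubles the simplified p_fore, which is the square root relationship described above.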

As we see, the score increases sharply as $p_{fore}$ increases, in a quadratic manner against $p_{back}$. As $p_{back}$ becomes small compared to $p_{fore}$, the growth goes from linear in $p_{fore}$ to squared.

Finally a 3D plot of the score function.

So what can we take away from all this? I think the main practical consideration is the squared relationship between $p_{fore}$ and $p_{back}$, which means that once there is a significant difference between the two, $p_{fore}$ will dominate the score ranking. The $p_{back}$ factor primarily makes the score sensitive when this factor is small, and for reasonably similar $p_{back}$ it is $p_{fore}$ that decides the ranking. There are some obvious consequences of this which would be interesting to explore in real data. First, you would like to have a large background document set if you want more fine-grained sensitivity to background frequency. Second, foreground frequencies can dominate the score to such an extent that peculiarities of the implementation may show up in the significant terms ranking, which we will look at in more detail as we try to apply the significant terms aggregation to single documents.

The results and visualizations in this blog post are also available as an IPython notebook.

Impressions from Berlin Buzzwords 2015

Christoffer Vig — Mon, 08 Jun 2015 13:34:53 +0000

May 31 – June 3 2015

Stream processing, Internet of things, Real time analytics, Big data, Recommendations, Machine learning. Berlin Buzzwords undoubtedly lives up to its name by presenting the frontlines of data technology trends.

The conference is focused on three core concepts: search, data and scale. It brings together a diverse range of people, with presentations touching the perimeter of the buzzword range.
Berlin Buzzwords kicked off on Sunday evening with a Barcamp, Monday and Tuesday contained full day conferences, while Wednesday was filled with hackathons and workshops.

Comperio

Comperio was one of the many companies sponsoring the conference, and came to Berlin bringing two speakers. André Lynum talked about “Beyond Significant terms”, a deep dive into how to utilize Elasticsearch’s built-in indexes and APIs for improved lexical analysis, topic management and trend information. André’s talk went far beyond what the well-known Elasticsearch significant terms aggregation provides. Christoffer Vig captured a spot on the informal Open Stage, giving a funny and off-kilter presentation and demo of the analytics and visualization capabilities of Kibana 4 based on a beer product catalogue.

The talks

Many people attended the comparison of Solr and Elasticsearch Performance & Scalability with Radu Gheorghe & Rafał Kuć from Sematext. This was a fast-paced run-through of how they were able to create tests reproducing the same conditions on both search engines. Elasticsearch outperformed Solr on text search using Wikipedia data, while, surprisingly, Solr outperformed Elasticsearch on aggregations. Solr has recently started catching up with Elasticsearch on providing nested aggregations, and perhaps the improved performance comes as a result of a slimmed-down implementation? It will be very interesting to follow the developments of both platforms, and as consumers of the products we see competition as a good thing that drives innovation and performance.

Two other interesting technical talks were Adrien Grand explaining some of the algorithms behind Elasticsearch’s aggregations and Ted Dunning’s presentation of the t-digest algorithm. Both were a window into how approximations can yield fast algorithms for complex statistics with provable bounds, and both speakers managed to keep it approachable to the casual listener.

SQL?

Another theme threatening to return from the basement was how to properly support SQL style joins into search engines. Real life use cases sometimes demand objects with relations. The stock answer from the NoSQL world is to denormalize your data before inserting it, but Lucene/Elasticsearch/Solr did get limited Join support a while ago. Taking this further Mikhail Khludnev showed how the new Global Ordinal Join aims to provide a Join with improved performance.

Talking the talk

As search consultants, one of our main challenges at Comperio is communicating about technical topics with customers who need to connect those topics to their own competence and background. Ellen Friedman from MapR explained how such communication can be beneficial to almost any team or team member and shared some experiences and ideas regarding how you can try this at home. At its core it boils down to understanding and describing your technical work across several layers and showing respect for the perspective and background of your conversation partner.
She also shared a very funny parrot joke. Not going to reveal that one here; watch the video if you’d like a good laugh.

Hackathon

Comperio also attended the Apache Flink workshop hosted at Google’s offices in Berlin by the talented developers at data Artisans. Apache Flink is in some ways similar to Apache Spark and other recent distributed computing frameworks, and is an alternative to Hadoop’s MapReduce component. It represents a novel approach to data processing, modelling all data as streams and exposing both batch and stream APIs. Apache Flink has a built-in optimizer that optimizes memory, network traffic and processing power. This leaves the developer to implement core functionality in Java, Scala or Python.

The buzz

Berlin Buzzwords is a great opportunity to surf the crest of the big data wave with the most interesting people in the field. The city of Berlin, with its sense of being on the edge of new developments, provides the perfect backdrop for a conference on the latest “Buzzwords”. Comperio will certainly be back next year.

Videos from most talks are available at youtube.com

Beyond significant terms

Algorithms and data-structures that power Lucene and Elasticsearch

Practical t-digest Applications

Talk the Talk: How to Communicate with the Non-Coder

Side by Side with Elasticsearch & Solr part 2

Analyzing web server logs with Elasticsearch in the cloud

Christoffer Vig — Tue, 26 May 2015 21:12:34 +0000

Using Logstash and Kibana on Found by Elastic, Part 1

This is part one of a two-post blog series that aims to demonstrate how to feed logs from IIS into Elasticsearch and Kibana via Logstash, using the hosted services provided by Found by Elastic. This post will deal with setting up the basic functionality and securing connections. Part 2 will show how to configure Logstash to read from IIS log files, and how to use Kibana 4 to visualize web traffic. Originally published on the Elastic Blog.

Getting the Bits

For this demo I will be running Logstash and Kibana from my Windows laptop.
If you want to follow along, download and extract Logstash 1.5.RC4 or later, and Kibana 4.0.2 or later from https://www.elastic.co/downloads.

Creating an Elasticsearch Cluster

Creating a new trial cluster in Found is just a matter of logging in and pressing a button. It takes a few seconds until the cluster is ready, and a screen with some basic information on how to connect pops up. We need the address for the HTTPS endpoint, so copy that out.

Configuring Logstash

Now, with the brand new SSL connection option in Logstash, connecting to Found is as simple as this Logstash configuration:

input { stdin{} }

output {
  elasticsearch {
    protocol => http
    host => REPLACE_WITH_FOUND_CLUSTER_HOSTNAME
    port => "9243" # Check the port also
    ssl => true
  }

  stdout { codec => rubydebug }
}

Save the file as found.conf

Start up Logstash using

bin\logstash.bat agent --verbose -f found.conf

You should see a message similar to

Create client to elasticsearch server on `https://....foundcluster.com:9243`: {:level=>:info}

Once you see “Logstash startup completed”, type in your favorite test term on the terminal. Mine is “fisk”, so I type that.
You should see output on your screen showing what Logstash intends to pass on to Elasticsearch.

We want to make sure this actually hits the cloud, so open a browser window and paste the HTTPS link from before, append /_search to the URL and hit enter.
You should now see the search results from your newly created Elasticsearch cluster, containing the favorite term you just typed in. We have a functioning connection from Logstash on our machine to Elasticsearch in the cloud! Congratulations!
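The same check can be scripted. The sketch below is only an illustration and not part of the original setup: the function names are made up for this example, and it assumes the cluster still accepts unauthenticated requests at this point. It builds the /_search URL from the HTTPS endpoint and uses Elasticsearch's q query-string parameter to look for the test term.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(endpoint, term=None):
    """Append /_search to the cluster endpoint, optionally with a q= query."""
    url = endpoint.rstrip("/") + "/_search"
    if term:
        url += "?" + urlencode({"q": term})
    return url


def count_hits(endpoint, term):
    """Query the cluster and return the number of hits on the first result page."""
    with urlopen(build_search_url(endpoint, term)) as response:
        body = json.load(response)
    return len(body["hits"]["hits"])
```

Call count_hits with your own HTTPS endpoint and test term; a result of at least 1 suggests that the event typed into Logstash has reached the cluster.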

Configuring Kibana 4

Kibana 4 comes with a built-in webserver. The configuration is done in a kibana.yml file in the config directory. Connecting to Elasticsearch in the cloud comes down to inserting the address of the Elasticsearch instance.

# The Elasticsearch instance to use for all your queries.
elasticsearch_url: "https://....foundcluster.com:9243"

Of course, we need to verify that this really works, so we open up Kibana at http://localhost:5601, select the Logstash index template with the @timestamp data field as suggested, and open up the Discover panel. If less than 15 minutes have passed since you entered your favorite test term in Logstash (previous step), you should see it already. Otherwise, change the date range by clicking on the selector in the top right corner.

Locking it down

Found by Elastic has worked hard to make the previous steps easy. We created an Elasticsearch cluster, fed data into it and displayed it in Kibana in less than 5 minutes. We must have forgotten something!? And yes, of course: security. We made sure to use secure connections with SSL, and the address generated for our cluster contains a randomly generated string of 32 characters, which is pretty hard to guess. Should the address slip out of our hands, however, hackers could easily delete our entire cluster. And we don’t want that to happen. So let’s see how we can make everything work when we add some basic security measures.

Access Control Lists

Found by Elastic supports access control lists, where you can set up lists of usernames and passwords, with lists of rules that deny or allow access to various paths within Elasticsearch. This makes it easy to create a “read only” user, for instance, by creating a user with a rule that only allows access to the /_search path. Found by Elastic has a sample configuration with the users searchonly and readwrite. We will use these as a starting point, but first we need to figure out what Kibana needs.

Kibana 4 Security

Kibana 4 stores its configuration in a special index, by default named “.kibana”. The Kibana webserver needs write access to this index. In addition, all Kibana users need write access to this index, for storing dashboards, visualizations and searches, and read access to all the indices that they will query. More details about the access demands of Kibana 4 can be found on the Elastic blog.

For this demo, we will simply copy the “readwrite” user from the sample twice, naming one kibanaserver, the other kibanauser.

Setting Access Control in Found:
# Allow everything for the readwrite-user, kibanauser and kibanaserver
- paths: ['.*']
  conditions:
    - basic_auth:
        users:
          - readwrite
          - kibanauser
          - kibanaserver
    - ssl:
        require: true
  action: allow

Press save and the changes take effect immediately. Try reloading Kibana at http://localhost:5601; you should be denied access.
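To check the new rules from a script rather than the browser, something like the following sketch could be used. It is an illustration under assumptions: the helper names are made up for this example, and the password is whatever you configured for the user in Found. It builds a standard HTTP Basic Authorization header and reports the HTTP status code of a request, so that an anonymous request can be compared with an authenticated one.

```python
import base64
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def basic_auth_header(username, password):
    """Build the Authorization header for HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


def status_code(url, username=None, password=None):
    """Request the URL, with credentials if given, and return the HTTP status."""
    headers = basic_auth_header(username, password) if username else {}
    try:
        with urlopen(Request(url, headers=headers)) as response:
            return response.status
    except HTTPError as error:
        # Rejected requests raise HTTPError; the status code is still useful.
        return error.code
```

With the rule above in place, an anonymous request to the cluster's /_search path should come back with an error status, while the same request made with the readwrite credentials should succeed.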

Open up the kibana.yml file from before and modify it:

# If your Elasticsearch is protected with basic auth, these are the user credentials
# used by the Kibana server to perform maintenance on the kibana_index at startup. Your Kibana
# users will still need to authenticate with Elasticsearch (which is proxied through
# the Kibana server)
kibana_elasticsearch_username: kibanaserver
kibana_elasticsearch_password: "KIBANASERVER_USER_PASSWORD"

Restart Kibana for the settings to take effect.
Now when Kibana starts up, you will be presented with a login box for HTTP authentication.
Type in kibanauser as the username, together with the password you set for that user. You should now again be presented with the Discover screen, showing the previously entered favorite test term. Again, you may have to expand the time range to see your entry.

Logstash Security

Logstash will also need to supply credentials when connecting to Found by Elastic. We reuse the permissions of the readwrite user once again, this time naming the user “logstash”.
It is simply a matter of supplying the username and password in the configuration file.

output {
  elasticsearch {
    ….
    user => "logstash"
    password => "LOGSTASH_USER_PASSWORD"
  }
}

Wrapping it up

This has been a short dive into Logstash and Kibana with Found by Elastic. The recent changes made to support the Shield plugin in Elasticsearch, Logstash and Kibana make it very easy to use the security features of Found by Elastic. In the next post we will look into feeding logs from IIS into Elasticsearch via Logstash, and visualizing the most used query terms in Kibana.

Enhancing Web Analytics with Search Analytics

Mridu Agarwal — Wed, 20 May 2015 12:43:27 +0000

Web Analytics is the process of measuring and analyzing web data to assess and improve the effectiveness of a website. Tracking and improving search (search analytics) is an important part of web analytics that many site owners forget. Website search analytics should not be underestimated, as it can provide valuable insights into what users are looking for and what they are not able to find on the site. Recently, I read about an organization which increased its conversion rates just by increasing the size of its search box and working on the searches with zero results. Measuring and analyzing search can therefore be a very important part of improving a website’s effectiveness.

In this post, I will show you 5 quick steps to get started with Search Analytics:

1. Get Search Logs: There are various tools you can use to track and analyze your search. You may choose any one of them to get started, depending on your business domain and organizational policies. I am using Google Analytics for this post because it is not only very powerful but also very easy to get started with. You can just create a Google account and get started for “free”. No infrastructure setup is required. Place the JavaScript code in your website pages and you are ready to start measuring.

Please note that there are many other tools in the market for more specific purposes and with higher complexities. It really depends on what your needs are. Here, we will continue with Google Analytics.

2. Understand your Site Search Usage: Many people underestimate the use of search on their website. So, the first step is to measure how many people are using search. Even if 5-10% of your visitors are using search, it is not a bad figure (depending on your business domain and site setup). This could very well be the most used navigation on your website.

3. Analyze your Search Terms: Get a list of searched terms to start with. Most often, you will see a pattern: a few terms are searched far more often than the others. It is important to analyze the top 10 searched terms individually. You might want to group the other terms (the long tail) and analyze them separately.

Try conducting searches for these searched terms yourself and see if you are satisfied with the results. Did you get what you were expecting as the number one result? If not, you might want to make changes to your site to improve your search results.

Some additional things to ponder:

Looking at the searched terms, do you see what you were expecting? Are there unknown terms? For example, if you have a product support site, you might expect users to search more for some newly launched products. Do you see the product names or numbers in your searched terms report? Are people looking for product comparisons?

If you have some unexpected terms in your top 10 searched terms, then you might want to consider adding additional content related to those terms.
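The ranking described in step 3 can be sketched in a few lines of Python. This is a minimal illustration that assumes you have exported the searched terms as a plain list of strings; the function name and the normalisation (trimming and lower-casing) are choices made for this example, not something Google Analytics prescribes.

```python
from collections import Counter


def split_top_terms(search_terms, top_n=10):
    """Return (top, long_tail): the top_n most searched terms and the rest."""
    counts = Counter(
        term.strip().lower() for term in search_terms if term.strip()
    )
    ranked = counts.most_common()  # sorted by count, highest first
    return ranked[:top_n], ranked[top_n:]
```

The first list holds the terms to review one by one, while the second holds the long tail that can be grouped and analyzed separately.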

4. Evaluate User Experience: Are users happy or frustrated by the time they leave the site? Can they find what they are looking for, or are they leaving immediately after performing a search? This is the toughest part, because you are not sitting with the user and you can only infer as much as the reports tell you. But the good news is that there are some metrics to watch out for which can provide valuable insights into user experience. A couple of these metrics are shown in the picture below.

Result Pageviews/Search tells you how many result pages users viewed, on average, per search for the term. If the number is too high, you know that it is taking users too long to find what they are looking for, and they will not be happy about it.

% Search Exits is equivalent to bounce rate in web analytics. This number tells you the percentage of people who left immediately after performing a search for that term, without clicking on any of the search results. We want this number to be as low as possible.

It is also important to evaluate search terms that produced zero results. Users will not be happy to find nothing for their searches. There is no out-of-the-box metric in Google Analytics to track this, but there are various ways to get around it using Events or Custom variables.
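To make the metrics in step 4 concrete, here is a small sketch that computes them from raw search records. The record layout (the keys term, result_pageviews, exited and result_count) is hypothetical and chosen only for illustration; Google Analytics calculates these figures for you, so this is merely a way of seeing what the numbers mean.

```python
from collections import defaultdict


def search_metrics(searches):
    """Aggregate per-term metrics from an iterable of search records.

    Each record is a dict with the hypothetical keys:
      term              the search term
      result_pageviews  how many result pages the user viewed
      exited            True if the user left right after searching
      result_count      how many results the search returned
    """
    grouped = defaultdict(list)
    for record in searches:
        grouped[record["term"]].append(record)

    metrics = {}
    for term, records in grouped.items():
        total = len(records)
        pageviews = sum(r["result_pageviews"] for r in records)
        exits = sum(1 for r in records if r["exited"])
        metrics[term] = {
            "searches": total,
            "result_pageviews_per_search": pageviews / total,
            "pct_search_exits": 100.0 * exits / total,
            "zero_result_searches": sum(
                1 for r in records if r["result_count"] == 0
            ),
        }
    return metrics
```

Terms with a high pct_search_exits or a non-zero zero_result_searches value are the natural candidates for the improvements discussed in the next step.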

5. Improve Search Experience: What would you do if you found a 35% Search Exit rate for your top keyword, or 20 Result Pageviews/Search? In some cases this might be because the term is not spelled correctly, or simply because the user is using a term which is not an exact match with your content. For example, in the intranet of an organization, an employee searches for “vacation list” but gets no hits because you have “holiday list” in your content. Here, you might consider adding synonyms or best bets for these frequently searched terms. Adding spelling correction like “Did you mean”, or providing “related searches” for results, could further improve the user experience and keep visitors engaged with your website. If there are zero results for a search term, you might want to consider adding content as well.
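The synonym idea from step 5 can be illustrated with a toy query expansion. Most search engines let you configure synonyms directly, so treat this only as a sketch of the principle; the function name and the shape of the synonym map are assumptions made for this example.

```python
def expand_query(query, synonyms):
    """Return the query followed by variants with known synonyms substituted.

    synonyms maps a lower-case word to a list of alternative words.
    """
    variants = [query]
    tokens = query.lower().split()
    for word, alternatives in synonyms.items():
        if word not in tokens:
            continue
        for alternative in alternatives:
            # Swap every occurrence of the word for the alternative.
            swapped = [alternative if token == word else token for token in tokens]
            variants.append(" ".join(swapped))
    return variants
```

With the map {"vacation": ["holiday"]}, a search for “vacation list” would also be tried as “holiday list”, which is the phrase used in the content.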

To conclude, I would say that there are a lot of ways in which you can improve your site search, but the important part is to get started. It is not as tough as it may sound, and it is worth the effort considering the amount of valuable information you get and the direct insight into the user’s mind.

Search: better user experience with one line of JavaScript

Espen Klem — Mon, 18 May 2015 13:59:01 +0000

What’s the cheapest trick you can do to get a better user experience on your search solution, and make your users do better search queries?

Add a small line of JavaScript in your template’s document ready function:

$("#MySearchBox").focus();

This will do two things for the user:

It’ll be easier to see the search box.
The user can start typing without having to click inside the search box.

The next issue is that most intranets and websites are more than just a search solution. Maybe you don’t want that much attention on the search box on your homepage. The solution is then to do this on your search result page.

This will make it easier for your users to enhance their search query when they’re not happy with the search result at hand.

Do you have any other examples of quick fixes that could make for an even better user experience in your search solution?

Using iPython notebooks and Pycharm together

André Lynum — Mon, 11 May 2015 11:12:24 +0000

IPython notebooks have become an indispensable tool for many Python developers. They are a reasonably good environment for interactive computing, can contain inline data visualisations and can be hosted remotely for sharing results or working together with other developers. In many academic environments and increasingly in industry IPython notebooks are used for data visualisation work and exploratory programming, depending on the IPython interactive environment for fast prototyping of ideas.

As nice an environment as IPython is, I often wish for the features of a full-fledged IDE. Here at Comperio we use PyCharm a lot, which has excellent code editing, semantic completion, a graphical debugger and efficient code navigation capabilities. In this blog post I’m going to show how you can simultaneously work on code in both the IDE and the IPython notebook or interactive shell, while keeping the running notebook and the IDE project in sync.

Hey, PyCharm already has IPython notebook integration. What about that? Personally I find that the IPython notebook integration in the latest PyCharm (version 4.0.6) still isn’t adequate for serious work. You get the completion and code navigation from PyCharm, but editing and navigation are reduced to half a dozen buttons. Furthermore, some functionality such as debugging appears to be plainly non-functional. Regardless, there are other very nice IDEs for Python such as Wing or Eclipse, and the approach here will work equally well with them.

This cunning recipe consists of two spicy ingredients. Both are neat tricks on their own, but together they form a smooth workflow bridging exploratory programming and more structured software engineering. We are going to:

Install our code as an editable Pip package.
Use the IPython autoreload extension to dynamically reload code.

So let’s get cooking!

Editable Pip packages

We are going to organise our code in a Python package and install it with Pip using the -e or --editable option. This installs the package as a link pointing to our project directory, so that we are always importing the code that we are editing. We could also accomplish this with some hacking on sys.path or PYTHONPATH, but having our code available as a package is a lot more seamless. It makes sense to use virtualenv (or EPD/Anaconda environments) to isolate your system Python from your development packages.

First we create a Python project in PyCharm and add a source folder with a setup.py defining a basic Python package.

Then we create a stub file with the following code in our Python module, and a folder for our notebooks.

def get_page():
    print("Don't know how to do this yet.")

And we activate our virtualenv/Conda environment and run pip install -e .
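A quick way to confirm that the editable install worked is to check where Python actually loads the module from. The helper below is a hedged sketch: its name is made up for this example, and you would pass in your own package name and project directory.

```python
import importlib
from pathlib import Path


def is_imported_from(module_name, project_dir):
    """Return True if the named module is loaded from inside project_dir."""
    module = importlib.import_module(module_name)
    module_file = getattr(module, "__file__", None)
    if module_file is None:  # built-in modules have no source file
        return False
    module_path = Path(module_file).resolve()
    return Path(project_dir).resolve() in module_path.parents
```

If this returns False for your package, the import is probably resolving to a copy in site-packages rather than to the code you are editing.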

Pip and Git: If you install your package with -e from a Git repository it may think that you want to install from Git even if you’re giving it a file path. This is usually not what you want when you’re developing since you would have to commit your code for the package to update itself. An ugly but practical way to avoid this behavior is to move the .git folder out of the way when installing the package.

%autoreloading code

Now to the important part which is dynamically loading the IDE project into our IPython notebook. Let’s first fire up the notebook.

Start IPython and create a notebook.

You have probably used reload(module) to update the Python environment at runtime. This hardly ever works for more than five minutes and results in an inscrutable mess of old and new stuff in your modules and classes. There is, however, a bag of neat tricks that takes care of at least the majority of the problems around reloading Python code or modules (see http://pyunit.sourceforge.net/notes/reloading.html), and the IPython developers have collected these into their autoreload extension. Let’s look at it in action.
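To see why a manual reload gets messy, consider this small self-contained sketch. It is an illustration only: the module name stubmod is hypothetical, the module is written to a temporary directory, and the code uses importlib.reload, which is where reload lives in Python 3. A name imported with from ... import keeps pointing at the old function after the module has been reloaded.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_stale_reference():
    """Show that reload() updates the module but not names imported from it."""
    previous_flag = sys.dont_write_bytecode
    sys.dont_write_bytecode = True  # avoid a cached .pyc masking the edit
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "stubmod.py"
        source.write_text("def get_page():\n    return 'old'\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            import stubmod
            from stubmod import get_page  # a direct reference to the function

            # Simulate editing the file in the IDE, then reloading by hand.
            source.write_text("def get_page():\n    return 'updated'\n")
            importlib.reload(stubmod)
            return get_page(), stubmod.get_page()
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("stubmod", None)
            sys.dont_write_bytecode = previous_flag
```

The first value comes from the stale reference and the second from the reloaded module, which is exactly the mix of old and new code that autoreload is designed to clean up.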

Here we set up the autoreload module and import our stub function in the first two cells. In the third we run our function. We then change the function definition in the IDE and save the file.

def get_page():
    print("Hey I'm updated.")

And when we run the function again in cell four the updated code is run.
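For reference, the setup in the first notebook cell can be sketched as follows. In a notebook you would normally just type the magics %load_ext autoreload and %autoreload 2; the helper below does the same thing through IPython's get_ipython and run_line_magic calls so that it is also valid plain Python. The function name is made up for this example, and the choice of mode 2 (reload all modules before running code) is an assumption about the original setup.

```python
def enable_autoreload(mode="2"):
    """Turn on IPython's autoreload extension; return False outside IPython."""
    try:
        from IPython import get_ipython
    except ImportError:
        return False
    shell = get_ipython()
    if shell is None:  # plain Python interpreter, no interactive shell
        return False
    # Same as typing `%load_ext autoreload` and `%autoreload 2` in a cell.
    shell.run_line_magic("load_ext", "autoreload")
    shell.run_line_magic("autoreload", mode)
    return True
```

After this has run, importing get_page from your own package in the next cell and calling it again after saving a change in the IDE should pick up the edited code.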

From notebook to project and back

Combining editable Pip packages and the autoreload module, we have a way to seamlessly load our project code in the notebook. When we are ready to move our exploratory programming back to the project, we can move our code over, import any new definitions and refine our implementation while using it in our notebook. In this way we can quickly move from noodling around in the notebook to developing and testing in the IDE, and then move back to the notebook to use our project code in further unstructured meanderings.

In the next post we will demonstrate this in more detail.