<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; Murhaf Fares</title>
	<atom:link href="http://blog.comperiosearch.com/blog/author/murhaf/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Bitbucket to Elasticsearch Connector</title>
		<link>http://blog.comperiosearch.com/blog/2014/09/18/bitbucket-elasticsearch-connector/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/09/18/bitbucket-elasticsearch-connector/#comments</comments>
		<pubDate>Thu, 18 Sep 2014 11:46:16 +0000</pubDate>
		<dc:creator><![CDATA[Murhaf Fares]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[bitbucket]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2989</guid>
		<description><![CDATA[&#8220;Ability to search source code? (BB-39)&#8221; is an issue created in July 2011 on Bitbucket, and its status is still new. If you have used Bitbucket before, you have certainly noticed that there is no way to search a repository&#8217;s source code. Now what if you had more than 200 repositories (as is [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright wp-image-3002 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2014/09/bitbucket-logo-a3719e03.png" alt="bitbucket-logo-a3719e03" width="248" height="248" /><br />
<em>&#8220;Ability to search source code? (BB-39)&#8221;</em> is an <a href="https://bitbucket.org/site/master/issue/2874/ability-to-search-source-code-bb-39" target="_blank">issue created in July 2011</a> on Bitbucket, and its status is still new. If you have used Bitbucket before, you have certainly noticed that there is no way to search a repository&#8217;s source code. Now what if you had more than 200 repositories (as is the case for Comperio) and wanted to find examples of how a particular function is used? There are two options: either clone all the repos to your local machine and do some &#8216;grep&#8217; magic, or use our connector to index Bitbucket content in elasticsearch and then search happily ever after.</p>
<p>In this blog post, we introduce a <a href="https://github.com/comperiosearch/bitbucket-elasticsearch-connector" target="_blank">free, open-source connector</a> that indexes content from Bitbucket in elasticsearch. The connector is written in Python and has two main modes: <em>index</em>, which indexes everything from your Bitbucket account in elasticsearch, and <em>update</em>, which updates your elasticsearch index based on the commits made since the last time you ran the connector (there are three types of git update: add, change and delete).<br />
The connector creates an elasticsearch index (based on the configuration provided in <a href="https://github.com/comperiosearch/bitbucket-elasticsearch-connector/blob/master/elasticsearch.conf" target="_blank">elasticsearch.conf</a>) which in turn has two types of documents, namely &#8216;file&#8217; and &#8216;repo&#8217;. We only provide a <a href="https://github.com/comperiosearch/bitbucket-elasticsearch-connector/blob/master/file_mapping.json" target="_blank">mapping file</a> for the &#8216;file&#8217;-typed documents; you can create one for repos as well. For more information on the connector and how to use it, please see the <a href="https://github.com/comperiosearch/bitbucket-elasticsearch-connector" target="_blank">project&#8217;s page</a> on GitHub.</p>
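<p>To give a feel for what such a mapping contains, here is a minimal sketch in elasticsearch 1.x-era syntax. All field names except <code>collapse_id</code> (introduced in the field-collapsing section below) are hypothetical illustrations, not taken from the actual file_mapping.json:</p>

```python
import json

# Sketch of a mapping for 'file'-typed documents (elasticsearch 1.x era).
# Field names other than 'collapse_id' are hypothetical; the connector
# ships its real mapping as file_mapping.json.
file_mapping = {
    "file": {
        "properties": {
            "repository": {"type": "string", "index": "not_analyzed"},
            "branch": {"type": "string", "index": "not_analyzed"},
            "path": {"type": "string"},
            "content": {"type": "string"},
            # ID shared by the same file across branches (used for collapsing)
            "collapse_id": {"type": "string", "index": "not_analyzed"},
        }
    }
}

print(json.dumps(file_mapping, indent=2))
```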
<p><strong>Bitbucket REST APIs</strong><br />
If you check the source code of the connector, you will see that we are using two versions of the Bitbucket REST API (<a href="https://confluence.atlassian.com/display/BITBUCKET/Version+1" target="_blank">version 1.0</a> and <a href="https://confluence.atlassian.com/display/BITBUCKET/Version+2" target="_blank">version 2.0</a>). We do so because not everything supported by version 1.0 is supported by version 2.0 and vice versa; for example, branches are retrievable via API v1.0 but not v2.0.</p>
<p><strong>Field collapsing for duplicates from different branches</strong><br />
If a repo has more than one branch, the connector indexes the files in all branches as separate documents. This means that whenever you search for something, the same matching file appears as a separate hit for each branch. To avoid this, we created an ID called <em>collapse_id</em> which allows us to collapse hits of the same file from different branches, using queries similar to the following:<br />
<script src="https://gist.github.com/d26cc5a1c0b570de85b8.js?file=collapsing-with-top-hits-agg.json"></script><br />
See another example of field collapsing using the top hits aggregation <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html#_field_collapse_example" target="_blank">on elasticsearch.org</a>.</p>
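<p>In the same spirit as the gist above, such a collapsing query can be sketched with a terms aggregation on <code>collapse_id</code> combined with <code>top_hits</code>. This is an illustrative reconstruction, not the connector&#8217;s exact query, and the <code>content</code> field name is an assumption:</p>

```python
import json

# Sketch of field collapsing with the top_hits aggregation (elasticsearch 1.x era).
# Grouping on collapse_id keeps one representative hit per file across branches.
# The 'content' field name is hypothetical.
query = {
    "query": {"match": {"content": "connect"}},
    "size": 0,  # we only care about the aggregation buckets
    "aggs": {
        "collapsed_files": {
            "terms": {"field": "collapse_id", "size": 10},
            "aggs": {
                "best_hit": {"top_hits": {"size": 1}}
            }
        }
    }
}

print(json.dumps(query, indent=2))
```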
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/09/18/bitbucket-elasticsearch-connector/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ELK in one (Vagrant) box</title>
		<link>http://blog.comperiosearch.com/blog/2014/08/14/elk-one-vagrant-box/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/08/14/elk-one-vagrant-box/#comments</comments>
		<pubDate>Thu, 14 Aug 2014 14:06:18 +0000</pubDate>
		<dc:creator><![CDATA[Murhaf Fares]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[elk]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[puppet]]></category>
		<category><![CDATA[vagrant]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2813</guid>
		<description><![CDATA[In this blog post we introduce a Vagrant box to easily create configurable and reproducible development environments for ELK (Elasticsearch, Logstash and Kibana). At Comperio, we mainly use this box for query log analysis using the ELK stack. In case you don’t know, Vagrant is free and open-source software that combines VirtualBox (a virtualization [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright size-medium wp-image-2828" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/elk_vagrant_chilling1-300x300.png" alt="elk_vagrant_chilling" width="300" height="300" /></p>
<p>In this blog post we introduce a Vagrant box to easily create configurable and reproducible development environments for ELK (Elasticsearch, Logstash and Kibana). At Comperio, we mainly use this box for query log analysis using the ELK stack.<br />
In case you don’t know, <a href="http://www.vagrantup.com/">Vagrant</a> is free and open-source software that combines VirtualBox (a virtualization tool) with configuration management tools such as Puppet and Chef.</p>
<p><strong>ELK stack up and running in two commands</strong></p>
<blockquote><p>$ git clone https://github.com/comperiosearch/vagrant-elk-box.git<br />
$ vagrant up</p></blockquote>
<p>By cloning this <a href="https://github.com/comperiosearch/vagrant-elk-box">github repo</a> and then typing “vagrant up”, you install elasticsearch, logstash, kibana and nginx (the latter used to serve kibana).</p>
<p>Elasticsearch will be running on port 9200, as usual, which is forwarded to the host machine. As for Kibana, it will be served on port 5601 (also accessible from the host OS).</p>
<p><strong>How does it work?</strong><br />
As mentioned above, Vagrant is a wrapper around VirtualBox and some configuration management software. In our box, we use pure shell scripting and Puppet to configure the ELK stack.<br />
There are two essential configuration files in this box: <a href="https://github.com/comperiosearch/vagrant-elk-box/blob/master/Vagrantfile">Vagrantfile</a> and the Puppet manifest <a href="https://github.com/comperiosearch/vagrant-elk-box/blob/master/manifests/default.pp">default.pp</a>.<br />
Vagrantfile includes the settings of the virtual machine, such as the operating system, memory size, number of CPUs and forwarded ports.<br />
<script src="https://gist.github.com/0ba6fa7ecece4fdac1ff.js?file=Vagrantfile"></script></p>
<p>Vagrantfile also includes a shell script that installs, among other things, the official Puppet modules for <a href="https://github.com/elasticsearch/puppet-elasticsearch">elasticsearch</a> and <a href="https://github.com/elasticsearch/puppet-logstash">logstash</a>. By using that shell script we stay away from git submodules which were used in <a href="https://github.com/comperiosearch/vagrant-elasticsearch-box">another Vagrant image</a> we made earlier for elasticsearch.<br />
<script src="https://gist.github.com/fb50e0cfcdee2e14898a.js?file=Vagrantfile"></script></p>
<p>In the Puppet manifest, default.pp, we define what version of elasticsearch to install and make sure that it is running as a service.<br />
<script src="https://gist.github.com/3abbe1b3aee8ecbe1b9e.js?file=default.pp"></script></p>
<p>We do the same for logstash and additionally link the default logstash configuration file to <a href="https://github.com/comperiosearch/vagrant-elk-box/blob/master/confs/logstash/logstash.conf">this file</a> under /Vagrant/confs/logstash, which is shared with the host OS. Finally, we install nginx and Kibana, and configure Kibana to run on port 5601 (again by linking the nginx conf file to <a href="https://github.com/comperiosearch/vagrant-elk-box/blob/master/confs/nginx/default">this file</a> in the Vagrant directory).</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/08/14/elk-one-vagrant-box/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Bygger søkesystemer for næringslivet</title>
		<link>http://blog.comperiosearch.com/blog/2014/08/14/bygger-sokesystemer-naeringslivet/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/08/14/bygger-sokesystemer-naeringslivet/#comments</comments>
		<pubDate>Thu, 14 Aug 2014 11:35:07 +0000</pubDate>
		<dc:creator><![CDATA[Murhaf Fares]]></dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Comperio]]></category>
		<category><![CDATA[search technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2795</guid>
		<description><![CDATA[If you are a new graduate, or about to graduate, and wondering what it is like to work with search technology at Comperio, read this interview with UiO (University of Oslo): For example, I can help a company build better search for finding information within the organization's internal data systems. The first thing I [...]]]></description>
				<content:encoded><![CDATA[<p>If you are a new graduate, or about to graduate, and wondering what it is like to work with search technology at Comperio, read <a href="http://www.uio.no/studier/program/inf-sprok/karriereintervjuer/murhaf_fares.html" target="_blank">this interview with UiO</a> (University of Oslo):</p>
<blockquote><p>For example, I can help a company build better search for finding information within the organization&#8217;s internal data systems. The first thing I do is sort the files. I &#8220;tag&#8221; the files with relevant information, such as who wrote them or which project they belong to. The problem is often that one word is used to describe a file, while a synonym is used when searching for it. Such information is not always in the files, so they must be processed before they can be found. On the other side, you have the user who wants to find the files.</p>
<p><img class="alignright size-medium wp-image-2799" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/murhaf-uio-blog-300x270.png" alt="murhaf-uio-blog" width="300" height="270" /><br />
Right now I am working on making search better on the user side. Sometimes the user does not know the best keyword or search term for what he or she is looking for. Then it can be useful, for example, to find out who knows something about the topic. That is, we can link search terms to experts within the company. Essentially, we make searching easier by making information available.</p>
<p><a href="http://www.uio.no/studier/program/inf-sprok/karriereintervjuer/murhaf_fares.html"><em>Read more</em></a></p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/08/14/bygger-sokesystemer-naeringslivet/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Instant search with AngularJS and elasticsearch</title>
		<link>http://blog.comperiosearch.com/blog/2013/10/24/instant-search-with-angularjs-and-elasticsearch/</link>
		<comments>http://blog.comperiosearch.com/blog/2013/10/24/instant-search-with-angularjs-and-elasticsearch/#comments</comments>
		<pubDate>Thu, 24 Oct 2013 09:22:09 +0000</pubDate>
		<dc:creator><![CDATA[Murhaf Fares]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[AngularJS]]></category>
		<category><![CDATA[elastic.js]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Javascript]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=1826</guid>
		<description><![CDATA[Join our breakfast seminar in Oslo November 28th In this blog post we try to explain how to build a single-page application (SPA) to search in elasticsearch using AngularJS. For simplicity, we will use a twitter index which you can easily create yourself (see here and here). Angular is an open-source JavaScript framework maintained by [...]]]></description>
				<content:encoded><![CDATA[<address><a href="http://www.comperio.no/frokost131128" target="_blank">Join our breakfast seminar in Oslo November 28th</a></address>
<p><img class="alignright size-medium wp-image-1751" src="http://blog.comperiosearch.com/wp-content/uploads/2013/10/es-ang-300x191.png" alt="" width="300" height="191" />In this blog post we try to explain how to build a single-page application (SPA) to search in elasticsearch using <a title="AngularJS " href="http://angularjs.org/" target="_blank">AngularJS</a>. For simplicity, we will use a twitter index which you can easily create yourself (see <a href="https://github.com/s1monw/hammertime" target="_blank">here</a> and <a href="https://github.com/comperiosearch/elasticsearch-uio-demo/blob/master/examples/advanced/setup-twitter-river.md" target="_blank">here</a>).<br />
Angular is an open-source JavaScript framework maintained by Google. According to the official Angular website, &#8220;Angular teaches the browser new syntax through a construct we call directives&#8221;&#8212;in other words, Angular allows you to extend the HTML syntax to make your application &#8216;more dynamic&#8217;. With Angular, we can also build applications following the Model-View-Controller (MVC) design pattern.</p>
<p>To communicate with elasticsearch, we rely on <a href="https://github.com/fullscale/elastic.js" target="_blank">elastic.js</a>, a JavaScript client for elasticsearch.</p>
<p>In this post, we assume some basic understanding of search concepts and Angular. If you are not familiar with Angular, see the further readings section.</p>
<h2>Sketching the structure out</h2>
<p>Angular applications, in general, consist of three types of components: models, views and controllers, and our application is no exception.<br />
The model in our application is, more or less, included within elastic.js, which sends requests to elasticsearch and returns the requested information as JSON objects. The views are <em>almost </em>plain HTML pages which allow the user to enter search queries and display the results of those queries. The controllers are mediators between elastic.js and the views. We have two views and two corresponding controllers for Twitter and Stackoverflow, but we will not be implementing the Stackoverflow part; we added it only for the sake of clarity.</p>
<p>The full code is on <a href="https://github.com/Murhaf/angular-elasticsearch-example">github</a>; you can start by looking at the general structure of the application and then come back here to understand how its different parts work together.</p>
<h2>The single page in the single-page application</h2>
<p>Single-page applications contain, as the name suggests, one page whose content changes dynamically. This page is called <code>index.html</code> in our search application. The following code snippet shows the main parts of index.html.</p>
<script src="https://gist.github.com/7038944.js?file=index.html"></script>
<p>There are two important points to note about the code above. First, the Angular directive <code>ng-app</code> within the <code>&lt;html&gt;</code> tag, which defines the boundary of the Angular application. Second, the <code>ng-view</code> directive, which specifies where the different views will be rendered within the layout of index.html. Said differently, <code>ng-view</code> is &#8216;replaced&#8217; by the dynamic content in the main page.</p>
<h3>searchApp.js</h3>
<p>Every Angular application has at least one Angular module. &#8220;searchApp&#8221; is the sole and main module in our application, as indicated in <code>ng-app</code> above. The following piece of code defines this module with its dependencies.</p>
<script src="https://gist.github.com/7038944.js?file=searchApp.js"></script>
<p>The module searchApp depends on three other modules, namely: (1) elasticjs.service, the JavaScript client for elasticsearch, elastic.js, (2) <a href="http://docs.angularjs.org/api/ngSanitize" target="_blank">ngSanitize</a>, an Angular module to sanitize HTML, and (3) <a href="http://binarymuse.github.io/ngInfiniteScroll" target="_blank">infinite-scroll</a>, a directive to implement infinite scrolling in Angular.</p>
<p>We also need to configure the so-called routes: URLs linking to views and controllers. In concrete terms, we need to link the items in the navigation bar to HTML partials. To do so, we use the route service, <code>$routeProvider</code>. Observe that each route has a view, templateUrl, and a controller. Recall that we used the directive <code>ng-view</code> to define the dynamic part within the main page; <code>$routeProvider</code> and <code>ng-view</code> work together to achieve this dynamic behavior.</p>
<h2>Twitter view</h2>
<p>The first thing we need in a search application is a search box, right? The following code snippet defines an instant search box that calls a search function, <code>search()</code>, every time the input text changes. This <code>search()</code> function will be defined later in the Twitter controller.</p>
<script src="https://gist.github.com/7040018.js?file=searchbox.twitter.html"></script>
<p>Observe that we bind the input text to a property named <code>queryTerm</code> using the Angular directive <code>ng-model</code>. This binding makes the input value available throughout the Angular application via an object called scope (<code>$scope</code>). It is noteworthy that binding in Angular is two-way, meaning that if the input text (queryTerm) is changed somewhere in the application, this change is directly reflected in the user interface and vice versa.<br />
Now let&#8217;s dive into how the search function works, and then we can come back to displaying the search results.</p>
<h2>Simple queries</h2>
<p>All the search functionalities are defined within the Twitter controller, TwitterController.js. We start by implementing a simple search function and then gradually enhance it. First, we define a controller named Twitter, which depends on <code>$scope</code> and <code>esjResource</code> (the elastic.js client).</p>
<script src="https://gist.github.com/7039581.js?file=TwitterController.js"></script>
<p>Within the twitter controller, we instantiate the elastic.js client, passing the URL of the local elasticsearch instance. All search requests to elasticsearch go through a <a href="http://docs.fullscale.co/elasticjs/ejs.Request.html" target="_blank">Request</a> object, <code>statusRequest</code>, which queries the specified index and type.<br />
Observe that we are using the binding <code>$scope.queryTerm</code> to access the search query. We use the <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html" target="_blank">match query</a> family in elasticsearch to search in the <code>_all</code> field for <code>$scope.queryTerm</code>. In addition, we define the fields to be returned by elasticsearch instead of returning full documents. Finally, the search results are appended to <code>$scope.resultArr</code> (again defined within the <code>$scope</code> to make them accessible to the views).</p>
<p>With the code above, however, we only get the top ten results matching the user&#8217;s query; usually we want to show more than that, and this is where pagination comes into play.</p>
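<p>To make this concrete, here is a rough sketch of the kind of raw request body such a search corresponds to: a match query on <code>_all</code> plus a list of fields to return. The exact field list here is an assumption for illustration, not the demo&#8217;s:</p>

```python
import json

# Sketch of the request body behind the simple search: a match query on
# the _all field, returning selected fields instead of full documents.
# The field names below are illustrative for a tweet index.
query_term = "elasticsearch"
request_body = {
    "query": {"match": {"_all": query_term}},
    "fields": ["text", "user.screen_name", "created_at"],
}

print(json.dumps(request_body, indent=2))
```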
<h3>Pagination with infinite scroll</h3>
<p>Each search request by elastic.js returns ten results by default, but the user&#8217;s information need might not be satisfied by the top 10 results; hence we want to allow the user to load more results if needed. The process of dividing the search results into different pages is called pagination. However, in this example, we will not implement the classic pagination; instead we will use scrolling: whenever a user scrolls down and reaches the bottom of the page, a new search will be performed with an offset of <code>page_number*results_per_page</code> and, consequently, a new set of results will be rendered.</p>
<script src="https://gist.github.com/7039581.js?file=pagination.js"></script>
<p>The function <code>show_more</code> will be called whenever the user scrolls down and hits the end of the page. Every time <code>show_more</code> is called, it increases the page number and then calls another search function, <code>searchMore()</code>. The only difference between <code>searchMore()</code> and <code>search()</code> is that the former defines an offset, <code>.from(offset)</code>.</p>
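<p>The offset arithmetic behind this scroll-triggered pagination can be sketched as follows (zero-based page numbering is an assumption of this sketch):</p>

```python
# Sketch of scroll-triggered pagination: each scroll-to-bottom bumps the
# page number, and the next request starts from page_number * results_per_page.
results_per_page = 10  # elasticsearch's default page size


def offset_for(page_number):
    """Offset passed to .from() for the given (zero-based) page."""
    return page_number * results_per_page


print([offset_for(p) for p in range(4)])  # → [0, 10, 20, 30]
```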
<h3>Highlighting</h3>
<p>Elasticsearch implements search result highlighting on one or more fields. In the following we define highlighting on the text field, and specify that for each hit the highlighted fragments should be <strong>in bold</strong>.</p>
<script src="https://gist.github.com/7039581.js?file=highlight.js"></script>
<p>The code above defines a <a href="http://docs.fullscale.co/elasticjs/ejs.Highlight.html" target="_blank"><code>Highlight</code></a> object, then passes it to the search request. No changes are required in the search function.</p>
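<p>In raw request-body terms, the highlight definition boils down to something like the following sketch; the tag choice mirrors the bold requirement above, and is an assumption rather than the demo&#8217;s exact setting:</p>

```python
import json

# Sketch of the highlight section added to the search request body:
# highlight matches in the 'text' field and wrap fragments in bold tags.
highlight = {
    "fields": {"text": {}},
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"],
}

print(json.dumps(highlight, indent=2))
```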
<h2>Viewing, again</h2>
<p>Now that we have defined all the functions needed to send queries to elasticsearch, we can extend our view to display search results.<br />
<script src="https://gist.github.com/7040018.js?file=twitter.html"></script></p>
<p>The directive <code>infinite-scroll</code> specifies that the function <code>show_more()</code> should be called on scroll-down. The directive <a href="http://docs.angularjs.org/api/ng.directive:ngRepeat" target="_blank"><code>ng-repeat</code></a> is like an &#8216;iterative loop&#8217; that repeats the HTML template for each element in the specified collection, e.g. the directive <code>ng-repeat="doc in results.hits.hits"</code> iterates over all the hits returned by elastic.js from elasticsearch. Note that the array <code>resultArr</code> is accessible from the Twitter view because it was defined on the $scope object in TwitterController.js. <code>renderResult</code> and <code>renderResultMetadata</code> are JavaScript functions defined in TwitterController.js; both functions return HTML, hence we need to sanitize their returned values using the directive <code>ng-bind-html</code>. We didn&#8217;t explain <code>renderResult</code> and <code>renderResultMetadata</code> as they are straightforward.</p>
<p>In future posts, we will go through faceting with elasticsearch and Angular. If you are curious, though, the code on github already contains faceting.</p>
<h2>Further reading</h2>
<p><a href="http://www.fullscale.co/elasticjs/" target="_blank">elastic.js user guide</a><br />
<a href="http://www.fullscale.co/blog/2013/02/28/getting_started_with_elasticsearch_and_AngularJS_searching.html" target="_blank">Getting Started with elasticsearch and AngularJS: Part1 &#8211; Searching</a><br />
<a href="http://docs.angularjs.org/tutorial/" target="_blank">Angular tutorial &#8212; official website</a><br />
<a href="https://github.com/IgorMinar/foodme" target="_blank">Foodme: Angular example application</a></p>
<h2><a href="http://www.comperio.no/frokost131128" target="_blank">Join our breakfast seminar in Oslo November 28th</a></h2>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2013/10/24/instant-search-with-angularjs-and-elasticsearch/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Elasticsearch smashtime</title>
		<link>http://blog.comperiosearch.com/blog/2013/09/23/elasticsearch-smashtime/</link>
		<comments>http://blog.comperiosearch.com/blog/2013/09/23/elasticsearch-smashtime/#comments</comments>
		<pubDate>Mon, 23 Sep 2013 10:08:47 +0000</pubDate>
		<dc:creator><![CDATA[Murhaf Fares]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=1716</guid>
		<description><![CDATA[Last week, Comperio went to the University of Oslo to give the informatics students a brief introduction to elasticsearch and Kibana, so we tweeted and indexed our tweets in addition to many other tweets. Our tutorial, “smashtime”, is based on Simon Willnauer&#8217;s tutorial “hammertime” (hence the name); however, we introduced some changes to make it [...]]]></description>
				<content:encoded><![CDATA[<p><strong>Last week, Comperio went to the University of Oslo to give the informatics students a brief introduction to elasticsearch and Kibana, so we tweeted and indexed our tweets in addition to many other tweets.</strong></p>
<p><img class="alignright size-medium wp-image-1751" src="http://blog.comperiosearch.com/wp-content/uploads/2013/09/Screen-Shot-2013-09-23-at-2.34.13-PM-300x229.png" alt="Comperio Tutorial Smashtime" width="300" height="229" />Our tutorial, “<a href="https://github.com/comperiosearch/elasticsearch-uio-demo" target="_blank">smashtime</a>”, is based on Simon Willnauer&#8217;s tutorial “<a title="Hammertime" href="https://github.com/s1monw/hammertime" target="_blank">hammertime</a>” (hence the name); however, we introduced some changes to make it meet our needs. For example, hammertime uses <a href="https://github.com/elasticsearch/stream2es" target="_blank">stream2es</a> to index tweets, but we used the <a href="https://github.com/elasticsearch/elasticsearch-river-twitter" target="_blank">elasticsearch Twitter River</a> because the latter can filter which tweets get indexed based on their hashtags, location, etc.<br />
Hammertime ‘automates’ everything, meaning that two bash scripts will do the work for you: run elasticsearch, index tweets, query, and start Kibana. In smashtime, however, we wanted the students to learn how to run elasticsearch and Kibana, so our bash script will only download and configure elasticsearch together with a couple of elasticsearch plugins.</p>
<p>Smashtime went fairly well, most students were able to join our cluster and query the twitter index.<br />
We started the tutorial by indexing some ‘fake’ tweets using cURL requests. Then we moved on to querying real tweets using match, phrase, and Boolean queries. After that, we showed how fast filters can be and how one can use facets to find out who is tweeting about Justin Bieber, bah. Lastly, but perhaps most interestingly, we made sense of our tweets using Kibana, again thanks to Simon’s hammertime dashboard.</p>
<p>In addition to introducing elasticsearch and Kibana, we used <a href="https://github.com/bleskes/sense" target="_blank">Sense</a>, a very useful, JSON-aware tool* for sending queries to elasticsearch. Among other things, Sense has a great autocomplete feature; if you haven’t tried it before, check it out!</p>
<p>We are looking forward to more workshops and tutorials on elasticsearch. We know, for Search.</p>
<p>*Sense is actually a Chrome extension, but if you clone the repo you should be able to run it in any browser, and you can also install it as an elasticsearch plugin.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2013/09/23/elasticsearch-smashtime/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
