Search Nuggets » lucene

Impressions from Berlin Buzzwords 2015

Christoffer Vig — Mon, 08 Jun 2015 13:34:53 +0000

May 31 – June 3 2015

Stream processing, Internet of things, Real time analytics, Big data, Recommendations, Machine learning. Berlin Buzzwords undoubtedly lives up to its name by presenting the frontlines of data technology trends.

The conference focuses on three core concepts: search, data and scale. It brings together a diverse range of people, with presentations touching the perimeter of the buzzword range.
Berlin Buzzwords kicked off on Sunday evening with a Barcamp. Monday and Tuesday were full conference days, while Wednesday was filled with hackathons and workshops.

Comperio

Comperio was one of the many companies sponsoring the conference, and came to Berlin with two speakers. André Lynum talked about “Beyond Significant terms”, a deep dive into how to utilize Elasticsearch's built-in indexes and APIs for improved lexical analysis, topic management and trend information. André’s talk went far beyond what the well-known Elasticsearch significant terms aggregation provides. Christoffer Vig captured a spot on the informal Open Stage, giving a funny and off-kilter presentation and demo of the analytics and visualization capabilities of Kibana 4 based on a beer product catalogue.

The talks

Many people attended the comparison of Solr and Elasticsearch Performance & Scalability with Radu Gheorghe & Rafał Kuć from Sematext. This was a fast-paced run-through of how they were able to create tests reproducing the same conditions on both search engines. Elasticsearch outperformed Solr on text search using Wikipedia data, while, surprisingly, Solr outperformed Elasticsearch on aggregations. Solr has recently started catching up with Elasticsearch on providing nested aggregations, and perhaps the improved performance comes as a result of a slimmed-down implementation? It will be very interesting to follow the developments of both platforms, and as consumers of the products we see competition as a good thing that drives innovation and performance.

Two other interesting technical talks were Adrien Grand's explanation of some of the algorithms behind Elasticsearch's aggregations and Ted Dunning's presentation of the t-digest algorithm. Both were a window into how approximations can yield fast algorithms for complex statistics with provable bounds, and both speakers managed to keep the material approachable to the casual listener.

SQL?

Another theme threatening to return from the basement was how to properly support SQL-style joins in search engines. Real-life use cases sometimes demand objects with relations. The stock answer from the NoSQL world is to denormalize your data before inserting it, but Lucene/Elasticsearch/Solr did get limited join support a while ago. Taking this further, Mikhail Khludnev showed how the new Global Ordinal Join aims to provide a join with improved performance.
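
To make the denormalization approach concrete, here is a minimal Python sketch. The `authors` and `books` records and the `denormalize` helper are hypothetical names invented for this illustration; they are not part of Lucene, Elasticsearch or Solr.

```python
# Hypothetical relational data: each book references an author by id.
authors = {
    1: {"name": "Ada", "country": "UK"},
    2: {"name": "Linus", "country": "FI"},
}
books = [
    {"title": "Notes", "author_id": 1},
    {"title": "Kernel", "author_id": 2},
]


def denormalize(books, authors):
    """Copy each author's fields into the book document.

    The resulting flat documents can be indexed as-is, so no join is
    needed at query time.
    """
    docs = []
    for book in books:
        author = authors[book["author_id"]]
        doc = {"title": book["title"]}
        # Flatten the related record into prefixed fields on the same document.
        for key, value in author.items():
            doc["author_" + key] = value
        docs.append(doc)
    return docs


docs = denormalize(books, authors)
```

The trade-off is that a change to one author means reindexing every book that embeds that author, which is one reason query-time joins keep coming back as a topic.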

Talking the talk

As search consultants, one of our main challenges at Comperio is communicating about technical topics with customers who need to connect those topics to their own competence and background. Ellen Friedman from MapR explained how such communication can be beneficial to almost any team or team member and shared some experiences and ideas regarding how you can try this at home. At its core it boils down to understanding and describing your technical work across several layers and showing respect for the perspective and background of your conversation partner.
She also shared a very funny parrot joke. Not going to reveal that one here; watch the video if you’d like a good laugh.

Hackathon

Comperio also attended the Apache Flink workshop hosted at Google’s offices in Berlin by the talented developers at data Artisans. Apache Flink is in some ways similar to Apache Spark and other recent distributed computing frameworks, and is an alternative to Hadoop’s MapReduce component. It represents a novel approach to data processing, modelling all data as streams and exposing both batch and stream APIs. Apache Flink has a built-in optimizer that optimizes memory, network traffic and processing power. This leaves the developer free to implement core functionality in Java, Scala or Python.

The buzz

Berlin Buzzwords is a great opportunity to surf the crest of the big data wave with the most interesting people in the field. The city of Berlin, with its sense of being on the edge of new developments, provides the perfect backdrop for a conference on the latest “Buzzwords”. Comperio will certainly be back next year.

Videos from most talks are available at youtube.com

Beyond significant terms

Algorithms and data-structures that power Lucene and Elasticsearch

Practical t-digest Applications

Talk the Talk: How to Communicate with the Non-Coder

Side by Side with Elasticsearch & Solr part 2

Elastic{ON}15: Day two

Christoffer Vig — Thu, 19 Mar 2015 20:59:41 +0000

March 11, 2015

Keynote

Fighting the crowds to find a seat for the keynote on Day 2 of elastic{ON}15, we were stopped by a USB stick with the curious caption Microsoft (heart) Linux. Things have certainly changed.

Microsoft

The keynote, led by Elastic SVP of sales Aaron Katz, included Pablo Castro of Microsoft, who was keen to explain that this probably isn’t so far from the truth. Elasticsearch is used internally in several Microsoft products alongside Linux and other open source software, a huge change from the Microsoft we knew around five years ago. Pablo revealed some details about how Elasticsearch is used as a data storage and search platform in MSN, Microsoft Dynamics and Azure Search. Microsoft truly has gone through some fundamental changes lately, embracing open source both internally and externally. We see this as a demonstration of the power of open source and the huge value Elastic(search) brings to many organizations. As Jordan Sissel said in the keynote yesterday, “If a user has a problem, it is a bug”. This is a philosophical stance that treats software as an enabler of creativity and growth, in contrast to viewing software as a fixed product packaged for sale.

Goldman Sachs

Microsoft's contribution was the middle part of the keynote. The first part was a discussion with Don Duet, managing director of Goldman Sachs. Goldman Sachs provides financial services on a global scale, and has been at the forefront of technology since its inception in 1869. They were an early adopter of Elasticsearch, since it was an easy-to-use search and analytics tool for big data. Goldman Sachs is now using Elasticsearch extensively as a key part of their technology stack.

NASA

The most mind-blowing part of the keynote was the last one, held by two chaps from the Jet Propulsion Laboratory team at NASA, Ricky Ma and Don Isla. They first showed their awesome internal search with previews and built-in rank tuning. Then they talked about the Mars Curiosity rover, a robot planted on Mars which runs around taking samples and selfies. It constantly sends data back to Earth, where the JPL team analyzes the operations of the rover. Elasticsearch is naturally at the center of this interplanetary operation, nothing less.

Mars Curiosity Rover Selfie

The remainder of the day contained sessions across the same three tracks as the first day. In addition, five tracks of birds of a feather or “lounge” sessions were held, where people gathered in smaller groups to discuss various topics. Needless to say, the breadth of the program meant we were stretched thin. We chose to focus on three topics that are of particular importance to our customers: aggregations, security & Shield, and resiliency.

More aggregations

Adrien Grand & Colin Goodheart-Smithe did a deep dive into the details of aggregations and how they are computed, in particular how to tune them and what that means for execution complexity. A key point is that some aggregations are computed with approximations, which trade accuracy for speed. Aggregations are a very powerful feature, but they require some planning to be feasible and efficient.
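
As a rough illustration of the speed-versus-accuracy trade-off, the Python sketch below estimates a distinct count using linear counting over a fixed-size bitmap. This is a swapped-in textbook technique chosen for brevity; it is not claimed to be the algorithm presented in the talk or used by Elasticsearch, and `approx_distinct` and `num_bits` are hypothetical names.

```python
import hashlib
import math


def approx_distinct(values, num_bits=4096):
    """Estimate the number of distinct values with linear counting.

    Memory use is fixed at num_bits flags regardless of input size;
    the price is that the answer is an estimate, not an exact count.
    """
    bitmap = [False] * num_bits
    for value in values:
        # Hash each value to one position in the bitmap.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % num_bits
        bitmap[index] = True
    zero_fraction = bitmap.count(False) / num_bits
    if zero_fraction == 0:
        raise ValueError("bitmap is full; use a larger num_bits")
    # Linear counting estimator: n is approximately -m * ln(fraction of unset bits).
    return -num_bits * math.log(zero_fraction)


# 10,000 values, of which only 1,000 are distinct.
values = [i % 1000 for i in range(10000)]
estimate = approx_distinct(values)
```

Raising `num_bits` tightens the estimate at the cost of memory, which is the same kind of knob the talk described for tuning aggregations.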

Security/Shield

Uri Boness talked about Shield and the current state of authentication & authorization. He provided some pointers to what is on the roadmap for the coming releases. Unfortunately, there do not appear to be any concrete plans for providing built-in document-level security. This is a sought-after feature that would certainly make the product more interesting in many enterprise settings. Then again, there are companies who provide connector frameworks that include security solutions for Elasticsearch. We had a chat with some of them at the conference, including Enonic, SearchBlox and Search Technologies.

Facebook

Peter Vulgaris from Facebook explained how they are using Elasticsearch. To me, the story resembled Microsoft’s. Facebook has heaps of data, and lots of use cases for it. Once they started to use Elasticsearch it was widely adopted in the company, and the amount of data indexed grew ever larger, which forced them to think more closely about how they manage their clusters.

Resiliency

Elasticsearch is a distributed system, and as such shares the same potential issues as other distributed systems. Boaz Leskes & Igor Motov explained the measures that have been undertaken in order to avoid problems such as “split-brain” syndrome, where a cluster is confused about which node should be considered the master. Data safety and security are important features of Elasticsearch and there is a continuous effort in place in these areas.
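
One common safeguard against split-brain is to require a majority (quorum) of master-eligible nodes before a master can be elected. The Python sketch below shows only that arithmetic; `quorum` and `can_elect_master` are hypothetical helper names, and this is not a description of how Elasticsearch implements election or of the specific measures covered in the talk.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_nodes: int, master_eligible_nodes: int) -> bool:
    """A partition may elect a master only if it can see a majority."""
    return visible_nodes >= quorum(master_eligible_nodes)


# A 3-node cluster splits into partitions of 2 nodes and 1 node.
majority_side = can_elect_master(2, 3)
minority_side = can_elect_master(1, 3)
```

Because two disjoint partitions cannot both hold a strict majority, at most one side of a network split can elect a master under this rule.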

Lucene

Stepping back to day 1 and the Lucene session featuring the mighty Robert Muir, we learned that Lucene version 5 includes a lot of improvements, especially performance-wise: better compression at both index and query time enables faster execution and reduces resource consumption. Efforts have also been made in the Lucene core to merge query and filter, treating them as two sides of the same coin. After all, a query is just a filter with a relevance score. On another note, Lucene will now handle caching of queries by itself.

Wrapping it up

Elastic{ON}15 stands as a confirmation of the attitudes that were essential in the creation of the Elasticsearch project. The visions that guided the early development are still valid today, except the scale is larger. The recent emphasis on stability, security and resiliency will welcome a new wave of users and developers.

At the same time, there is continuous exploration and development in big data analytics, with the speed and agility we have come to expect from Elasticsearch.

Thanks for this year, looking forward to the next!

Norch - a search engine for Node.js

Fergus McDowall — Fri, 05 Jul 2013 13:24:02 +0000

*****
UPDATE 10th Sept 2013: Norch is now known as Forage - read about this change here
*****

Norch is a search engine written for Node.js. Norch uses the Node search-index module, which is in turn written using the super fast LevelDB library that Google open-sourced in 2011.

The aim of Norch is to make a simple, fast search server that requires minimal configuration to set up. Norch sacrifices complex functionality for a limited, robust feature set that can be used to set up a free-text search engine for most enterprise scenarios.

Currently Norch features:

Full text search
Stopword removal
Faceting
Filtering
Relevance weighting (tf-idf)
Field weighting
Paging (offset and resultset length)
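
For readers unfamiliar with tf-idf relevance weighting from the list above, here is a minimal textbook sketch in Python. It is an illustration only and not the code used by Norch or the search-index module (which are written in JavaScript); `tf_idf_scores` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(query, docs):
    """Score each document against the query with plain tf-idf.

    tf  = occurrences of the term in the document / document length
    idf = log(number of documents / documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    total_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for other in tokenized if term in other)
            if doc_freq == 0 or not tokens:
                continue  # term appears nowhere, or the document is empty
            tf = counts[term] / len(tokens)
            idf = math.log(total_docs / doc_freq)
            score += tf * idf
        scores.append(score)
    return scores


docs = ["node search engine", "fast search server", "beer catalogue"]
scores = tf_idf_scores("node search", docs)
```

The first document matches both query terms and scores highest; the rarer term "node" contributes more than "search", which appears in two of the three documents.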

Norch can index any data that is marked up in the appropriate JSON format.

Download the first release of Norch (0.2.1) here