Search Nuggets » Big Data

Impressions from Berlin Buzzwords 2015

Christoffer Vig — Mon, 08 Jun 2015 13:34:53 +0000

May 31 – June 3 2015

Stream processing, Internet of things, Real time analytics, Big data, Recommendations, Machine learning. Berlin Buzzwords undoubtedly lives up to its name by presenting the frontlines of data technology trends.

The conference focuses on three core concepts: search, data and scale. It brings together a diverse range of people, with presentations touching the perimeter of the buzzword range.
Berlin Buzzwords kicked off on Sunday evening with a Barcamp. Monday and Tuesday were full conference days, while Wednesday was filled with hackathons and workshops.

Comperio

Comperio was one of the many companies sponsoring the conference, and came to Berlin with two speakers. André Lynum talked about “Beyond Significant terms”, a deep dive into how to use Elasticsearch’s built-in indexes and APIs for improved lexical analysis, topic management and trend information. André’s talk went far beyond what the well-known Elasticsearch significant terms aggregation provides. Christoffer Vig captured a spot on the informal Open Stage, giving a funny and off-kilter presentation and demo of the analytics and visualization capabilities of Kibana 4, based on a beer product catalogue.

The talks

Many people attended the comparison of Solr and Elasticsearch Performance & Scalability with Radu Gheorghe & Rafał Kuć from Sematext. This was a fast-paced run-through of how they created tests reproducing the same conditions on both search engines. Elasticsearch outperformed Solr on text search using Wikipedia data while, surprisingly, Solr outperformed Elasticsearch on aggregations. Solr has recently started catching up with Elasticsearch in providing nested aggregations, and perhaps the improved performance comes as a result of a slimmed-down implementation? It will be very interesting to follow the development of both platforms, and as consumers of the products we see competition as a good thing that drives innovation and performance.

Two other interesting technical talks were Adrien Grand’s explanation of some of the algorithms behind Elasticsearch’s aggregations and Ted Dunning’s presentation of the t-digest algorithm. Both offered a window into how approximations can yield fast algorithms for complex statistics with provable bounds, and both speakers managed to keep the material approachable to the casual listener.
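
To make the idea of a bounded approximation concrete, here is a minimal Python sketch. It is not t-digest and not what Elasticsearch does internally; it is a much simpler fixed-width histogram, and the class name HistogramSketch and the sample data are made up for illustration. Because the returned bucket always contains the true nearest-rank quantile, the estimate is off by at most half a bucket width.

```python
from collections import Counter


class HistogramSketch:
    """Fixed-width histogram (illustrative only, NOT t-digest)."""

    def __init__(self, bucket_width):
        self.bucket_width = bucket_width
        self.counts = Counter()  # bucket index -> number of values seen
        self.total = 0

    def add(self, value):
        # Only the bucket counter is stored; the value itself is discarded.
        self.counts[int(value // self.bucket_width)] += 1
        self.total += 1

    def quantile(self, q):
        """Midpoint of the bucket that holds the q-quantile, for 0 < q <= 1."""
        if self.total == 0:
            raise ValueError("sketch is empty")
        target = q * self.total
        seen = 0
        bucket = None
        for bucket in sorted(self.counts):
            seen += self.counts[bucket]
            if seen >= target:
                break
        return (bucket + 0.5) * self.bucket_width


# Hypothetical data: 1000 response times of 0..999 ms, bucket width 10 ms.
sketch = HistogramSketch(bucket_width=10)
for millis in range(1000):
    sketch.add(millis)

median = sketch.quantile(0.5)  # the exact nearest-rank median is 499
```

Memory here depends on the number of buckets rather than the number of values. t-digest goes further by adapting cluster sizes so that the tails of the distribution get finer resolution.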

SQL?

Another theme threatening to return from the basement was how to properly support SQL-style joins in search engines. Real-life use cases sometimes demand objects with relations. The stock answer from the NoSQL world is to denormalize your data before inserting it, but Lucene/Elasticsearch/Solr did get limited join support a while ago. Taking this further, Mikhail Khludnev showed how the new Global Ordinal Join aims to provide a join with improved performance.
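
As a rough illustration of the denormalization approach mentioned above, the following Python sketch embeds the related record into every document before it would be sent to a search engine. The beer and brewery records and the function name denormalize are hypothetical, and no search engine API is called.

```python
# Hypothetical relational data: beers reference breweries by brewery_id.
breweries = {
    1: {"name": "Brewery A", "country": "Norway"},
    2: {"name": "Brewery B", "country": "Germany"},
}
beers = [
    {"id": 10, "name": "Pale Ale", "brewery_id": 1},
    {"id": 11, "name": "Pilsner", "brewery_id": 2},
    {"id": 12, "name": "Stout", "brewery_id": 1},
]


def denormalize(beers, breweries):
    """Embed each beer's brewery so every document is self-contained."""
    docs = []
    for beer in beers:
        doc = {key: value for key, value in beer.items() if key != "brewery_id"}
        doc["brewery"] = dict(breweries[beer["brewery_id"]])  # copy, not a reference
        docs.append(doc)
    return docs


docs = denormalize(beers, breweries)
```

The price of this approach is duplication: if a brewery changes, every beer document that embeds it has to be reindexed, which is one reason query-time joins remain attractive.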

Talking the talk

As search consultants, one of our main challenges at Comperio is communicating about technical topics with customers who need to connect those topics to their own competence and background. Ellen Friedman from MapR explained how such communication can be beneficial to almost any team or team member and shared some experiences and ideas regarding how you can try this at home. At its core it boils down to understanding and describing your technical work across several layers and showing respect for the perspective and background of your conversation partner.
She also shared a very funny parrot joke. We are not going to reveal that one here; watch the video if you’d like a good laugh.

Hackathon

Comperio also attended the Apache Flink workshop hosted at Google’s offices in Berlin by the talented developers at data Artisans. Apache Flink is in some ways similar to Apache Spark and other recent distributed computing frameworks, and is an alternative to Hadoop’s MapReduce component. It represents a novel approach to data processing, modelling all data as streams and exposing both batch and stream APIs. Apache Flink has a built-in optimizer that optimizes memory, network traffic and processing power. This leaves the developer free to implement core functionality in Java, Scala or Python.
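
The idea of treating a batch as a stream that happens to end can be sketched in plain Python. This is not the Flink API; the function names and sample inputs are invented for illustration, and the point is only that the same operator serves both bounded and unbounded input.

```python
from collections import Counter
from itertools import count, islice


def running_word_counts(lines):
    """Yield the word counts seen so far after each incoming line."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
        yield dict(counts)


# Batch: a finite list is just a stream that ends; keep the final result.
batch = ["big data", "data streams"]
batch_result = list(running_word_counts(batch))[-1]


# Stream: an endless generator; results are available before the input ends.
def endless_lines():
    for n in count():
        yield "tick" if n % 2 == 0 else "tock"


first_three = list(islice(running_word_counts(endless_lines()), 3))
```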

The buzz

Berlin Buzzwords is a great opportunity to surf the crest of the big data wave with the most interesting people in the field. The city of Berlin, with its sense of being on the edge of new developments, provides the perfect backdrop for a conference on the latest “Buzzwords”. Comperio will certainly be back next year.

Videos from most talks are available at youtube.com

Beyond significant terms

Algorithms and data-structures that power Lucene and Elasticsearch

Practical t-digest Applications

Talk the Talk: How to Communicate with the Non-Coder

Side by Side with Elasticsearch & Solr part 2

3 steps to Big Data

Christoffer Vig — Tue, 28 Apr 2015 13:00:09 +0000

Big data is the third-hottest buzzword of our time, but not everyone knows what it is, where to find it, or what to do with it. Big Data is growing up beneath the feet of most of us. The digital universe doubles every two years. The internet, mobile devices and, not least, the Internet of Things generate ever more information.

To succeed in business today, you depend on knowing how your users move and on being able to adapt your solution accordingly. You can choose to trust higher powers, like Snåsamannen or Märtha, or you can take power into your own hands and harvest the insight buried in the logs of your business and your users.

3 steps

We assume that you have a website and that you can get hold of its logs. You also need a computer and a technically skilled person, preferably one with developer experience.

Here is how to get started:

Identify 3 measurable KPIs.
Suggestions: page views per day, most used query terms, response time per page
Feed the logs into ELK.
Find the log data and a developer. The developer will easily figure this out.
Visualize the KPIs.
Hold on to the developer while you look at the data in Kibana together and find a suitable graphical representation.
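
Before any ELK setup, the three suggested KPIs can be computed from raw logs with a few lines of Python. This is a minimal sketch under stated assumptions: the log format (timestamp, URL, response time in milliseconds, separated by spaces), the sample lines and the function name compute_kpis are all hypothetical, and real web server logs will need their own parsing.

```python
from collections import Counter, defaultdict
from urllib.parse import parse_qs, urlsplit

# Hypothetical log lines: "<ISO timestamp> <url> <response time in ms>"
LOG_LINES = [
    "2015-04-27T10:15:00 /search?q=big+data 120",
    "2015-04-27T11:02:10 /search?q=elk 80",
    "2015-04-28T09:30:45 /about 40",
    "2015-04-28T09:31:00 /search?q=big+data 100",
]


def compute_kpis(lines):
    """Return (page views per day, query term counts, mean response time per page)."""
    views_per_day = Counter()
    query_terms = Counter()
    response_times = defaultdict(list)
    for line in lines:
        timestamp, url, millis = line.split()
        parts = urlsplit(url)
        views_per_day[timestamp[:10]] += 1  # the first 10 characters are the date
        for query in parse_qs(parts.query).get("q", []):
            query_terms[query] += 1
        response_times[parts.path].append(int(millis))
    avg_response = {
        path: sum(values) / len(values) for path, values in response_times.items()
    }
    return views_per_day, query_terms, avg_response


views_per_day, query_terms, avg_response = compute_kpis(LOG_LINES)
```

A script like this is fine for a first look; the ELK stack becomes worthwhile once the logs grow or the questions start changing faster than the script.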

Example of a Kibana dashboard

KPI

The suggested KPIs are standard metrics for websites. These are numbers that every web analytics tool, such as Google Analytics, can give you today. The difference is that now you are the one assembling the graphs and developing the tools, the data belongs to you, and how you choose to combine the information to create insight is entirely up to you. Again: the purpose here is to demonstrate a technique and show off a tool, not to tell you which KPIs you should care about.

ELK

ELK, as mentioned above, or the so-called “ELK stack”, offers a complete Big Data storage, search and analysis tool. ELK stands for Elasticsearch, Logstash and Kibana, a collection of open source products developed by the technology company Elastic. The search engine Elasticsearch is the core of the stack, with a focus on developer friendliness and scalability. Logstash feeds data into Elasticsearch, while Kibana offers ad hoc data analysis and beautiful visualizations and graphs.
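
Once the logs are in Elasticsearch, the three KPIs map onto aggregations. The request body below is a sketch built as a Python dictionary; the field names (@timestamp, query.keyword, path.keyword, response_ms) are assumptions about how the logs were indexed, and calendar_interval is the parameter name in recent Elasticsearch versions (older versions used interval).

```python
import json

# Hypothetical search request body; all field names are assumptions.
kpi_query = {
    "size": 0,  # return only aggregation results, no individual hits
    "aggs": {
        "page_views_per_day": {
            "date_histogram": {"field": "@timestamp", "calendar_interval": "day"}
        },
        "top_query_terms": {
            "terms": {"field": "query.keyword", "size": 10}
        },
        "response_time_per_page": {
            "terms": {"field": "path.keyword", "size": 10},
            # Nested aggregation: mean response time within each page bucket.
            "aggs": {"avg_response_ms": {"avg": {"field": "response_ms"}}},
        },
    },
}

print(json.dumps(kpi_query, indent=2))
```

Kibana builds requests of this kind behind its visualizations, so writing them by hand is mostly useful for understanding what a dashboard is actually asking.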

Netflix, GitHub and Microsoft are examples of giant companies that use Elasticsearch at the core of their business.

The platform’s popularity stems from being easy to get started with while delivering unsurpassed search and analysis capabilities. The ELK stack is often mentioned in the same breath as Big Data, since it handles larger data volumes.

A start

The logs from your website probably do not quite qualify as Big Data. The point is that with the toolbox introduced here, you are equipped for bigger tasks.

You can start taking control of your company’s data logs without needing large resources. The plan can be laid along the way, while simple access to raw data alone can create new insight as well as new questions and needs.

Typical uses are search and analysis of large data volumes, e.g. transaction logs, network traffic, firewall logs, and large-scale internet activity such as Twitter, IRC and websites.

The Norwegian search technology company Comperio is an Elastic partner and has many developers who can help you through these three steps. Comperio has worked with search since 2004 and is one of the world’s leading companies in search technology.

Don’t leave the Big Data ship to fend for itself; take your place at the helm and set course for your own Big Data horizon now!

Read about Comperio’s breakfast meeting on how to understand your customers better.

Big Data and Enterprise Search

John Thompson — Wed, 27 Feb 2013 13:32:02 +0000

There have been a number of reports and papers issued recently on Big Data, including:

Forrester reviewing the Big Data solutions of 2013
The Economist discussing what Big Data is and how it can be used
The Wall Street Journal discussing what is next for Big Data
The Sunday Times reviewing how Big Data has helped various companies
Martin White discussing whether there is a need for enterprise search whilst Big Data lives
Stephen Arnold discussing his Big Data trends for 2013
Mike Walsh reviewing Big Data strategies for 2013

Software vendors are also pushing their appetite for Big Data, either via their websites (e.g. Microsoft and its Big Data Week) or by taking out adverts in the UK national press (e.g. IBM running a number of full-page adverts in The Times in the week commencing Monday 18th February 2013).

From reading these reports and papers, what actually is Big Data? Is Big Data just hype? Is it only relevant for a small number of very large organisations where the volume, rate of change, variety and worth of this data is highly relevant? It is very hard to answer these questions: what may be Big Data to you may not be Big Data to me.

Big Data to me is being able to capture data, whether it is structured, semi-structured or totally unstructured, store it, interpret it, and leverage it to provide insights in order to help the business.

So how do Big Data and enterprise search co-exist? Can traditional search tools work as the “gateway” to explore Big Data by, for instance, preparing the data to help in creating the predictive model?

Based upon the Forrester report mentioned above, SAS and IBM are the leaders in the Big Data space, with a large number of tools available to process and analyse the data. As the first part of any analysis is preparing the data, and with a large proportion of the data being unstructured, could enterprise search use its distinctive capabilities of pre-processing large amounts of both structured and unstructured content? I think not. Currently, enterprise search tools do not have the capability to traverse such a large amount of data in a timely manner in order to produce a smaller, relevant content set without using a very large amount of hardware. The structured data that exists may already be well organised, but unstructured content is another matter, and trying to interpret meaning between structured and unstructured content can be very complex.

So, if enterprise search cannot help directly with preparing the data for analysis, where can search help? Frost and Sullivan forecast the global enterprise search market to be US$4.68bn by 2019. Search has the capability to bring a wide variety of different content sources together and produce meaning. Search queries, from simple to complex, can then be run against the search index, returning, hopefully, the most relevant content based upon the search query terms entered. But, for the most part, this won’t answer the Big Data questions, e.g. how to make the best use of the data available.

However, Search Based Applications (SBAs), which as defined by Wikipedia “use semantic technologies to aggregate, normalize and classify unstructured, semi-structured and/or structured content across multiple repositories, and employ natural language technologies for accessing the aggregated information”, can be built to slice and dice the information in the search index on the fly. Isn’t this close to what the Big Data engines are trying to achieve? There are a number of search-related companies building SBAs which look to build insight from the data that organisations amass.

There are obviously limitations to what search engines can do in terms of the size of the data sets; that’s why it’s called Big Data. However, there must be a reason why IBM purchased Vivisimo in 2012 and Oracle bought Endeca in 2011, with both companies looking to capitalize on the capabilities Vivisimo and Endeca offered in terms of unlocking the structured and unstructured content within organisations.

Quote from Oracle on the Endeca acquisition: “Oracle with Endeca plans to create a comprehensive technology platform to process, store, manage, search and analyze structured and unstructured information together enabling businesses to make stronger and more profitable decisions”. Search and Big Data complementing each other? I think so.

Finally, the latest information from Gartner indicates “Big Data is forecast to drive $34 billion of IT spending in 2013 and create 4.4 million IT jobs by 2015, but it is currently still a solution looking for a problem”.