<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; Solr</title>
	<atom:link href="http://blog.comperiosearch.com/blog/tag/solr/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Sitevision – improve search with Nutch</title>
		<link>http://blog.comperiosearch.com/blog/2016/06/08/sitevision-forbattra-soket-med-nutch/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/06/08/sitevision-forbattra-soket-med-nutch/#comments</comments>
		<pubDate>Wed, 08 Jun 2016 14:20:07 +0000</pubDate>
		<dc:creator><![CDATA[Jack Thorén]]></dc:creator>
				<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Sitevision]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[CRM]]></category>
		<category><![CDATA[web crawler]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4096</guid>
		<description><![CDATA[One of Sweden&#8217;s most popular CMS tools is Sitevision, used perhaps primarily by large government agencies and municipalities. These agencies and municipalities likely choose Sitevision because it is very easy for editors and page owners to use and to maintain the information on their pages. This in an environment where [...]]]></description>
				<content:encoded><![CDATA[<p>One of Sweden&#8217;s most popular CMS tools is Sitevision, used perhaps primarily by large government agencies and municipalities. These agencies and municipalities likely choose Sitevision because it is very easy for editors and page owners to use and to maintain the information on their pages. This in an environment where the web-technical expertise is perhaps not at the same level as at a larger technology company.</p>
<p>But while we applaud the simple user interface, we wish it were possible to build better search functionality. You can of course search within a website, and you can even search across other websites if you have set up several websites within the same system. But if you want to search a website or database that lives somewhere else, that is not possible. This is about to change, however. Sitevision will soon introduce the web crawler Nutch, a highly advanced crawler built on Hadoop, which in turn is part of a framework for handling very large amounts of data. Nutch together with Solr will lift Sitevision&#8217;s search to new heights.</p>
<p><strong>Below is a diagram of what a site indexing run could look like:</strong></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/06/Jack_blog011.png"><img class="alignnone  wp-image-4098" src="http://blog.comperiosearch.com/wp-content/uploads/2016/06/Jack_blog011.png" alt="Jack_blog01" width="605" height="287" /></a></p>
<p>1. The &#8220;Injector&#8221; takes all the URLs in the nutch.txt file and adds them to the &#8220;CrawlDB&#8221;, a central component of Nutch. The CrawlDB holds information about every known URL (fetch schedule, fetch status, metadata, &#8230;).</p>
<p>2. Based on data from the CrawlDB, the &#8220;Generator&#8221; creates a list of what should be fetched and places it in a newly created segment directory.</p>
<p>3. In the next step, the &#8220;Fetcher&#8221; takes the addresses to be fetched from the list and writes the results back to the segment directory. This step is usually the most time-consuming part.</p>
<p>4. The &#8220;Parser&#8221; can now process the content of each web page, for example stripping out all HTML tags. If this crawl is an update or an extension of an earlier one (e.g. depth 3), the &#8220;Updater&#8221; would add the new data to the CrawlDB as a next step.</p>
<p>5. Before indexing, all links must be inverted by the &#8220;Link Inverter&#8221;, which accounts for the fact that it is not the number of outgoing links on a web page that is of interest, but rather the number of incoming links. This is quite similar to how Google PageRank works and is important for the scoring function. The inverted links are stored in the &#8220;LinkDB&#8221;.</p>
<p>6-7. Using data from all available sources (CrawlDB, LinkDB and segments), the Indexer creates an index and stores it in the Solr directory. Indexing uses the popular Lucene library. The user can now search for information about the crawled web pages via Solr.</p>
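<p>For orientation, the steps above map onto the Nutch 1.x command line roughly as follows. This is a sketch: the paths, seed directory and Solr URL are illustrative, and exact command names vary slightly between Nutch versions.</p><pre class="crayon-plain-tag"># 1. Inject seed URLs into the CrawlDB
bin/nutch inject crawl/crawldb urls
# 2. Generate a fetch list in a new segment
bin/nutch generate crawl/crawldb crawl/segments
SEGMENT=`ls -d crawl/segments/* | tail -1`
# 3. Fetch the pages on the list
bin/nutch fetch $SEGMENT
# 4. Parse the fetched content and update the CrawlDB
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
# 5. Invert the links into the LinkDB
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
# 6-7. Index everything into Solr
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*</pre>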
<p><strong>Functionality that ships with Nutch:</strong></p>
<ul>
<li>Indexing of external sources</li>
<li>Automatic categorisation</li>
<li>Metadata</li>
<li>Text analysis</li>
<li>Easy extension of functionality via plugins</li>
</ul>
<p><strong>Useful links:</strong></p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Apache_Nutch">https://en.wikipedia.org/wiki/Apache_Nutch</a></li>
<li><a href="http://wiki.apache.org/nutch/">http://wiki.apache.org/nutch/</a></li>
<li><a href="http://nutch.apache.org/">http://nutch.apache.org/</a></li>
<li><a href="http://www.sitevision.se/vara-produkter/sitevision.html">http://www.sitevision.se/vara-produkter/sitevision.html</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/06/08/sitevision-forbattra-soket-med-nutch/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analysing Solr logs with Logstash</title>
		<link>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/#comments</comments>
		<pubDate>Sun, 20 Sep 2015 22:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[grok]]></category>
		<category><![CDATA[logs]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3934</guid>
		<description><![CDATA[Analysing Solr logs with Logstash Although I usually write about and work with Apache Solr, I also use the ELK stack on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my previous posts. If you need some more background info on the ELK [...]]]></description>
				<content:encoded><![CDATA[<h1>Analysing Solr logs with Logstash</h1>
<p>Although I usually write about and work with <a href="http://lucene.apache.org/solr/">Apache Solr</a>, I also use the <a href="https://www.elastic.co/downloads">ELK stack</a> on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my <a href="http://blog.comperiosearch.com/blog/author/sebm/">previous posts</a>. If you need some more background info on the ELK stack, both <a href="http://blog.comperiosearch.com/blog/author/cvig/">Christoffer</a> and <a href="http://blog.comperiosearch.com/blog/author/alynum/">André</a> have written many great posts on various ELK subjects. The most common use for the stack is data analysis. In our case, Solr search log analysis.</p>
<p>As a little side note for the truly devoted Solr users, an ELK stack alternative exists with <a href="http://lucidworks.com/fusion/silk/">SiLK</a>. I highly recommend checking out Lucidworks&#8217; various blog posts on <a href="http://lucidworks.com/blog/">Solr and search in general</a>.</p>
<h2>Some background</h2>
<p>On an existing search project I use the ELK stack to ingest, analyse and visualise logs from Comperio&#8217;s search middleware application.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/09/search_logs.png"><img class="aligncenter size-medium wp-image-3942" src="http://blog.comperiosearch.com/wp-content/uploads/2015/09/search_logs-300x157.png" alt="Search Logs Dashboard" width="300" height="157" /></a><br />
Although this gave us a great view of user query behaviour, Solr logs far more detailed information. I wanted to log indexing events, errors and searches with all their parameters, not just the query string.</p>
<h2>Let&#8217;s get started</h2>
<p>I&#8217;m going to assume you already have a running Solr installation. You will, however, need to download <a href="https://www.elastic.co/products/elasticsearch">Elasticsearch</a> and <a href="https://www.elastic.co/products/logstash">Logstash</a> and unpack them. Before we start Elasticsearch, I recommend installing these plugins:</p>
<ul>
<li><a href="http://mobz.github.io/elasticsearch-head/">Head</a></li>
<li><a href="https://www.elastic.co/guide/en/marvel/current/_installation.html">Marvel</a></li>
</ul>
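<p>If you haven&#8217;t installed Elasticsearch plugins before, both can be installed from the Elasticsearch home directory. The commands below match the 1.x releases current at the time of writing; later versions changed the plugin tool, so treat them as a sketch:</p><pre class="crayon-plain-tag"># Head: cluster monitoring UI
bin/plugin -install mobz/elasticsearch-head
# Marvel, which bundles the Sense console
bin/plugin -i elasticsearch/marvel/latest</pre>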
<p>Head is a cluster health monitoring tool. We&#8217;ll only need Marvel for its bundled developer console, Sense. To disable Marvel&#8217;s other capabilities, add this line to ~/elasticsearch/config/elasticsearch.yml</p><pre class="crayon-plain-tag">marvel.agent.enabled: false</pre><p>Start elasticsearch with this command:</p><pre class="crayon-plain-tag">~/elasticsearch-[version]/bin/elasticsearch</pre><p>Navigate to <a href="http://localhost:9200/">http://localhost:9200/</a> to confirm that Elasticsearch is running. Check <a href="http://localhost:9200/_plugin/head">http://localhost:9200/_plugin/head</a> and <a href="http://localhost:9200/_plugin/marvel/sense/index.html">http://localhost:9200/_plugin/marvel/sense/index.html</a> to verify the plugins installed correctly.</p>
<h2>The anatomy of a Logstash config</h2>
<hr />
<h3>Update 21/09/15</h3>
<p>I have since greatly simplified the multiline portions of the Logstash configs. Instead, use this filter section: <script src="https://gist.github.com/41ca2c34c50d0d9d8e82.js?file=solr-filter.conf"></script>The rest of the original article is unchanged for comparison&#8217;s sake.</p>
<hr />
<p>All Logstash configs share three main building blocks. It starts with the Input stage, which defines what the data source is and how to access it. Next is the Filter stage, which carries out data processing and extraction. Finally, the Output stage tells Logstash where to send the processed data. Let&#8217;s start with the basics, the input and output stages:</p><pre class="crayon-plain-tag">input {
  file {
    path =&gt; "~/solr.log"
  }
}

filter {}

output {
  # Send directly to local Elasticsearch
  elasticsearch_http {
    host =&gt; "localhost"
    template =&gt; "~/logstash/bin/logstash_solr_template.json"
    index =&gt; "solr-%{+YYYY.MM.dd}"
    template_overwrite =&gt; true
  }
}</pre><p>This is one of the simpler input/output configs. We read a file at a given location and stream its raw contents to an Elasticsearch instance. Take a look at the <a href="https://www.elastic.co/guide/en/logstash/current/input-plugins.html">input</a> and <a href="https://www.elastic.co/guide/en/logstash/current/output-plugins.html">output</a> plugins&#8217; documentation for more details and default values. The index setting causes Logstash to create a new index every day with a name generated from the provided pattern. The template option tells Logstash what kind of field mapping and settings to use when creating the Elasticsearch indices. You can find the template file I used <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-template-json">here</a>.</p>
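<p>With the config saved to a file, say solr-logstash.conf (the name here is just for illustration), Logstash is started much like Elasticsearch above:</p><pre class="crayon-plain-tag">~/logstash-[version]/bin/logstash -f solr-logstash.conf</pre>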
<p>To process the Solr logs, we&#8217;ll use the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html">grok</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html">mutate</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-multiline.html">multiline</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-drop.html">drop</a> and <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html">kv</a> filter plugins.</p>
<ul>
<li>Grok is a regexp-based parsing stage primarily used to match strings and extract parts. There are a number of default patterns described on the grok documentation page. While building your grok expressions, the <a href="https://grokdebug.herokuapp.com/">grok debugger app</a> is particularly helpful. Be mindful, though, that the escaping syntax in the app isn&#8217;t always the same as what the Logstash config expects.</li>
<li>We need the multiline plugin to link stacktraces to their initial error message.</li>
<li>The kv, aka key value, plugin will help us extract the parameters from Solr indexing and search events.</li>
<li>We use mutate to add and remove tags along the way.</li>
<li>And finally, drop to drop any events we don&#8217;t want to keep.</li>
</ul>
<p>&nbsp;</p>
<h2>The <del>hard</del> fun part</h2>
<p>Let&#8217;s dive into the filter stage now. Take a look at the <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-logstash-conf">config file</a> I&#8217;m using. The Grok patterns may appear a bit daunting, especially if you&#8217;re not very familiar with regexp and the default Grok patterns, but don&#8217;t worry! Let&#8217;s break it down.</p>
<p>The first section extracts the log event&#8217;s severity and timestamp into their own fields, &#8216;level&#8217; and &#8216;LogTime&#8217;:</p><pre class="crayon-plain-tag">grok {
    match =&gt; { "message" =&gt; "%{WORD:level}.+?- %{DATA:LogTime};" }
      tag_on_failure =&gt; []
  }</pre><p>So, given this line from my <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-log">example log file</a></p><pre class="crayon-plain-tag">INFO  - 2015-09-07 15:40:34.535; org.apache.solr.update.processor.LogUpdateProcessor; [sintef_main] webapp=/ path=/update/extract params={literal.source=epifile&amp;literal.epi_file_title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.id=epifile_211278&amp;literal.epifileid_s=211278&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;literal.filesource_s=SiteFile} {} 0 65</pre><p>We&#8217;d extract</p><pre class="crayon-plain-tag">{ "level": "INFO", "LogTime":"2015-09-07 15:40:34.535"}</pre><p>In the template file I linked earlier, you&#8217;ll notice configuration for the LogTime field. Here we define a valid DateTime format for Elasticsearch. We need to do this so that Kibana recognises the field as one we can use for temporal analyses. Otherwise the only timestamp field we&#8217;d have would contain the time at which the logs were processed and stored in Elasticsearch. Although not a problem in a realtime log analysis system, if you have old logs you want to parse you will need to define this separate timestamp field. As an additional side note, you&#8217;ll notice I use</p><pre class="crayon-plain-tag">tag_on_failure =&gt; []</pre><p>in most of my Grok stages. The default value is &#8220;_grokparsefailure&#8221;, which I don&#8217;t need in a production system. Custom failure and success tags are very helpful to debug your Logstash configs.</p>
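<p>Returning to the template for a moment, the LogTime mapping in such a template looks roughly like this (an illustrative excerpt, not a verbatim copy of the linked file):</p><pre class="crayon-plain-tag">"LogTime": {
  "type": "date",
  "format": "yyyy-MM-dd HH:mm:ss.SSS"
}</pre>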
<p>The next little section combines commit messages into a single line. The first event in the example log file is an example of such commit messages split over three lines.</p><pre class="crayon-plain-tag"># Combine commit events into single message
  multiline {
      pattern =&gt; "^\t(commit\{)"
      what =&gt; "previous"
    }</pre><p>Now we come to a major section for handling general INFO level messages.</p><pre class="crayon-plain-tag"># INFO level events treated differently than ERROR
  if "INFO" in [level] {
    grok {
      match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\}.+?\{%{WORD:action}=\[%{DATA:docId}"
        }
        tag_on_failure =&gt; []  
    }
    if [params] {
      kv {
        field_split =&gt; "&amp;"
        source =&gt; "params"
      }
    } else {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?commits"  
        }
        tag_on_failure =&gt; [ "drop" ]
        add_field =&gt; {
          "action" =&gt; "commit"
        }
      }
      if "drop" in [tags] {
        drop {}
      }
    }
  }</pre><p>This filter will only run on INFO level messages, due to the conditional at its beginning. The first Grok stage matches log events similar to the one above. The key fields we extract are the Solr collection/core, the endpoint we hit, e.g. update/extract, the parameters supplied by the HTTP request, the action, e.g. add or delete, and finally the document ID. If the Grok succeeded in extracting a params field, we run the key value stage, splitting on ampersands to extract each HTTP parameter. This is what a resulting document&#8217;s extracted contents look like when stored in Elasticsearch:</p><pre class="crayon-plain-tag">{
  "level": "INFO",
  "LogTime": "2015-09-07 15:40:18.938",
  "collection": "sintef_main",
  "endpoint": "/update/extract",
  "params":     "literal.source=epifile&amp;literal.epi_file_title=A05100_Tass5+Trondheim.pdf&amp;literal.title=A05100_Tass5+Trondheim.pdf&amp;literal.id=epifile_211027&amp;literal.epifileid_s=211027&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;literal.filesource_s=SiteFile",
  "action": "add",
  "docId": "epifile_211027",
  "version": "1511661994131849216",
  "literal.source": "epifile",
  "literal.epi_file_title": "A05100_Tass5+Trondheim.pdf",
  "literal.title": "A05100_Tass5+Trondheim.pdf",
  "literal.id": "epifile_211027",
  "literal.epifileid_s": "211027",
  "literal.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "stream.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "literal.filesource_s": "SiteFile"
}</pre><p>If the Grok did not extract a params field, I want to identify possible commit messages with the following Grok. If this one fails we tag messages with &#8220;drop&#8221;. Finally, any messages tagged with &#8220;drop&#8221; are dropped from the pipeline. I specifically created these Grok patterns to match indexing and commit messages as I already track queries at the middleware layer in our stack. If you want to track queries at the Solr level, simply use this pattern:</p><pre class="crayon-plain-tag">.+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\} hits=%{INT:hits} status=%{INT:status} QTime=%{INT:queryTime}</pre><p>The next section handles ERROR level messages:</p><pre class="crayon-plain-tag"># Error event implies a stack trace, which requires multiline parsing
  if "ERROR" in [level] {
    multiline {
      pattern =&gt; "^\s"
      what =&gt; "previous"
      add_tag =&gt; [ "multiline_pre" ]
    }
    multiline {
        pattern =&gt; "^Caused by"
        what =&gt; "previous"
        add_tag =&gt; [ "multiline_post" ]
    }
    if "multiline_post" in [tags] {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+%{DATA:reason}(\n\t)((.+?Caused by: ((([a-zA-Z]+(\.|;|:))+) )+)%{DATA:reason}(\n\t))+"
        }
        tag_on_failure =&gt; []
      }
    }
  }</pre><p>Given a stack trace (there are a few in the example log file), this stage first combines all the lines of the stack trace into a single message. It then extracts the first and the last causes, the assumption being that the first message is the high-level failure message and the last one the actual underlying cause.</p>
<p>Finally, I drop any empty lines and clean up temporary tags:</p><pre class="crayon-plain-tag"># Remove intermediate tags, including the multiline tag the multiline stage sometimes adds
  mutate {
      remove_tag =&gt; [ "multiline_pre", "multiline_post", "multiline" ]
  }
  # Drop empty lines
  if [message] =~ /^\s*$/ {
    drop {}
  }</pre><p>To check you have successfully processed your Solr logs, open up the Sense plugin and run this query:</p><pre class="crayon-plain-tag"># aggregate on level
GET solr-*/_search
{
  "query": {
    "match_all": {}
  },
  "size": 10,
  "aggs": {
    "action": {
      "terms": {
        "field": "level",
        "size": 10
      }
    }
  }
}</pre><p>You should get back all your processed log events along with an aggregation on event severity.</p>
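<p>If everything worked, the response will contain an aggregations section shaped something like this (the counts are invented for illustration; whether the keys come back upper- or lowercased depends on how the level field is analysed in your template):</p><pre class="crayon-plain-tag">"aggregations": {
  "action": {
    "buckets": [
      { "key": "INFO", "doc_count": 42 },
      { "key": "ERROR", "doc_count": 3 }
    ]
  }
}</pre>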
<h2>Conclusion</h2>
<p>Solr logs contain a great deal of useful information. With the ELK stack you can extract, store, analyse and visualise this data. I hope I&#8217;ve given you some helpful tips on how to start doing so! If you run into any problems, please get in touch in the comments below.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Impressions from Berlin Buzzwords 2015</title>
		<link>http://blog.comperiosearch.com/blog/2015/06/08/impressions-from-berlin-buzzwords-2015/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/06/08/impressions-from-berlin-buzzwords-2015/#comments</comments>
		<pubDate>Mon, 08 Jun 2015 13:34:53 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Apache Flink]]></category>
		<category><![CDATA[bbuzz]]></category>
		<category><![CDATA[berlin buzzwords]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Kafka]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3720</guid>
		<description><![CDATA[May 31 &#8211; June 3 2015 Stream processing, Internet of things, Real time analytics, Big data, Recommendations, Machine learning. Berlin Buzzwords undoubtedly lives up to its name by presenting the frontlines of data technology trends. The conference is focused on three core concepts &#8211; search, data and scale, bringing together a diverse range of people [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/andre-bbuzz-beyond-significant-terms.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/andre-bbuzz-beyond-significant-terms-300x194.png" alt="andre-bbuzz-beyond-significant-terms" width="300" height="194" class="alignright size-medium wp-image-3741" /></a>May 31 &#8211; June 3 2015</p>
<p>Stream processing, Internet of things, Real time analytics, Big data, Recommendations, Machine learning. <a href="http://berlinbuzzwords.de/">Berlin Buzzwords</a> undoubtedly lives up to its name by presenting the frontlines of data technology trends.<br />
<span id="more-3720"></span><br />
The conference is focused on three core concepts &#8211; search, data and scale, bringing together a diverse range of people and with presentations touching the perimeter of the buzzword range.<br />
Berlin Buzzwords kicked off on Sunday evening with a Barcamp, Monday and Tuesday contained full day conferences, while Wednesday was filled with hackathons and workshops.</p>
<h3>Comperio</h3>
<p>Comperio was one of the many companies sponsoring the conference, and came to Berlin bringing two speakers. André Lynum talked about “Beyond Significant terms” &#8211; a deep dive into how to utilize Elasticsearch&#8217;s built-in indexes and APIs for improved lexical analysis, topic management and trend information. André’s talk went far beyond what the well-known Elasticsearch significant terms aggregation provides. Christoffer Vig captured a spot on the informal Open Stage, giving a funny and off-kilter presentation and demo of the analytics and visualization capabilities of Kibana 4 based on a beer product catalogue.</p>
<h3>The talks</h3>
<p>Many people attended the comparison of Solr and Elasticsearch Performance &#038; Scalability with Radu Gheorghe &#038; Rafał Kuć from Sematext. This was a fast-paced run-through of how they were able to create tests reproducing the same conditions on both search engines. Elasticsearch outperformed Solr on text search using wikipedia data, while, surprisingly, Solr outperformed Elasticsearch on aggregations. Solr has recently started catching up with Elasticsearch on providing nested aggregations, and perhaps the improved performance comes as a result of a slimmed-down implementation? It will be very interesting to follow the developments of both platforms into the future, and as consumers of the products we see competition as a good thing, driving innovation and performance.</p>
<p>Two other interesting technical talks were Adrien Grand&#8217;s explanation of some of the algorithms behind Elasticsearch&#8217;s aggregations and Ted Dunning&#8217;s presentation of the t-digest algorithm. Both were a window into how approximations can yield fast algorithms for complex statistics with provable bounds, which both speakers managed to keep approachable to the casual listener.</p>
<h3>SQL?</h3>
<p>Another theme threatening to return from the basement was how to properly support SQL-style joins in search engines. Real-life use cases sometimes demand objects with relations. The stock answer from the NoSQL world is to denormalize your data before inserting it, but Lucene/Elasticsearch/Solr did get limited join support a while ago. Taking this further, Mikhail Khludnev showed how the new Global Ordinal Join aims to provide a join with improved performance.</p>
<h3>Talking the talk</h3>
<p>As search consultants, one of our main challenges at Comperio is communicating about technical topics with customers who need to connect those topics to their own competence and background. Ellen Friedman from MapR explained how such communication can be beneficial to almost any team or team member and shared some experiences and ideas regarding how you can try this at home. At its core it boils down to understanding and describing your technical work across several layers and showing respect for the perspective and background of your conversation partner.<br />
She also shared a very funny parrot joke. Not going to reveal that one here; watch the video if you&#8217;d like a good laugh.</p>
<h3>Hackathon</h3>
<p>Comperio also attended the Apache Flink workshop hosted at Google&#8217;s offices in Berlin by the talented developers at data Artisans. Apache Flink is in some ways similar to Apache Spark and other recent distributed computing frameworks, and is an alternative to Hadoop&#8217;s MapReduce component. It represents a novel approach to data processing, modelling all data as streams and exposing both batch and streaming APIs. Apache Flink has a built-in optimizer that optimizes memory, network traffic and processing power. This leaves the developer free to implement core functionality in Java, Scala or Python.</p>
<h3>The buzz</h3>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/berlinbuzzwordsLogo.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/berlinbuzzwordsLogo-300x176.png" alt="berlinbuzzwordsLogo" width="300" height="176" class="alignright size-small wp-image-3726" /></a><br />
Berlin Buzzwords is a great opportunity to surf the crest of the big data wave with the most interesting people in the field. The city of Berlin, with its sense of being on the edge of new developments, provides the perfect backdrop for a conference on the latest “Buzzwords”. Comperio will certainly be back next year.</p>
<p>Videos from most talks are available at <a href="https://www.youtube.com/playlist?list=PLq-odUc2x7i-_qWWixXHZ6w-MxyLxEC7s">youtube.com</a></p>
<p><b>Beyond significant terms</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/yYFFlyHPGlg?feature=oembed" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></p>
<p><b>Algorithms and data-structures that power Lucene and Elasticsearch</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/eQ-rXP-D80U?feature=oembed" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></p>
<p><b>Practical t-digest Applications</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/CR4-aVvjE6A?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p><b>Talk the Talk: How to Communicate with the Non-Coder</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/Je-X850t_L8?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p><b>Side by Side with Elasticsearch &#038; Solr part 2</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/01mXpZ0F-_o?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/06/08/impressions-from-berlin-buzzwords-2015/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Solr: Indexing SQL databases made easier! &#8211; Part 2</title>
		<link>http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/#comments</comments>
		<pubDate>Tue, 14 Apr 2015 12:56:21 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[solr5]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3477</guid>
		<description><![CDATA[Last summer I wrote a blog post about indexing a MySQL database into Apache Solr. I would like to now revisit the post to update it for use with Solr 5 and start diving into how to implement some basic search features such as Facets Spellcheck Phonetic search Query Completion Setting up our environment The [...]]]></description>
				<content:encoded><![CDATA[<p>Last summer I wrote a <a href="http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/">blog post</a> about indexing a MySQL database into <a href="http://lucene.apache.org/solr/">Apache Solr</a>. I would like to now revisit the post to update it for use with Solr 5 and start diving into how to implement some basic search features such as</p>
<ul>
<li>Facets</li>
<li>Spellcheck</li>
<li>Phonetic search</li>
<li>Query Completion</li>
</ul>
<h2>Setting up our environment</h2>
<p>The requirements remain the same as with the original blogpost:</p>
<ol>
<li>Java 1.7 or greater</li>
<li>A <a href="http://dev.mysql.com/downloads/mysql/">MySQL</a> database</li>
<li>A copy of the <a href="https://launchpad.net/test-db/employees-db-1/1.0.6/+download/employees_db-full-1.0.6.tar.bz2">sample employees database</a></li>
<li>The MySQL <a href="http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.32.tar.gz">jdbc driver</a></li>
</ol>
<p>We&#8217;ll now be using Solr 5, which runs a little differently from previous incarnations of Solr. Download <a href="http://www.apache.org/dyn/closer.cgi/lucene/solr/5.0.0">Solr</a> and extract it to a directory of your choice. Open a terminal and navigate to your Solr directory.<br />
Start Solr with the command <pre class="crayon-plain-tag">bin/solr start</pre> .<img class="alignright wp-image-3497 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-04-11-at-20.30.03-300x114.png" alt="Solr Status" width="300" height="114" /></p>
<p>To confirm Solr successfully started up, run <pre class="crayon-plain-tag">bin/solr status</pre></p>
<p>Unlike previously, we now need to create a Solr core for our employee data. To do so run this command <pre class="crayon-plain-tag">bin/solr create_core -c employees -d basic_configs</pre> . This will create a core named employees using Solr&#8217;s minimal configuration options. Try <pre class="crayon-plain-tag">bin/solr create_core -help</pre>  to see what else is possible.</p>
<ol>
<li>Open server/solr/employees/conf/solrconfig.xml in a text editor and add the following within the config tags:
<div id="file-dataimporthandler2-LC1" class="line">
<pre class="crayon-plain-tag">&lt;lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" /&gt;
 
&lt;requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"&gt;
&lt;lst name="defaults"&gt;
&lt;str name="config"&gt;db-data-config.xml&lt;/str&gt;
&lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
</div>
</li>
<li>In the same directory, open schema.xml and add this line:<br />
<pre class="crayon-plain-tag">&lt;dynamicField name="*_name" type="text_general" multiValued="false" indexed="true" stored="true" /&gt;</pre>
</li>
<li>Create a lib subdir in server/solr/employees and extract the MySQL jdbc driver jar into it.</li>
<li>Finally, restart the Solr server with the command <pre class="crayon-plain-tag">bin/solr restart</pre></li>
</ol>
<p>When started this way, Solr runs by default on port 8983. Use <pre class="crayon-plain-tag">bin/solr start -p portnumber</pre>  and replace portnumber with your preferred choice to start it on that one.</p>
<p>Navigate to <a href="http://localhost:8983/solr">http://localhost:8983/solr</a> and you should see the Solr admin GUI splash page. From here, use the Core Selector dropdown button to select our employee core and then click on the Dataimport option. Expanding the Configuration section should show an XML response with a stacktrace with a message along the lines of <pre class="crayon-plain-tag">Can't find resource 'db-data-config.xml' in classpath</pre> . This is normal as we haven&#8217;t actually created this file yet, which stores the configs for connecting to our target database.</p>
<p>We&#8217;ll come back to that file later but let&#8217;s make our demo database now. If you haven&#8217;t already downloaded the sample employees database and installed MySQL, now would be a good time!</p>
<h2>Setting up our database</h2>
<p>Please refer to the instructions in the same section in the <a href="http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/">original blog post</a>. The steps are still the same.</p>
<h2>Indexing our database</h2>
<p>Again, please refer to the instructions in the same section in the original blog post. The only difference is the Postman collection should be imported from <a href="https://www.getpostman.com/collections/f7634c89cd9851dd2c13"> this url</a> instead. The commands you can use alternatively have also changed and are now</p><pre class="crayon-plain-tag">Clear index: http://localhost:8983/solr/employees/update?stream.body=&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&amp;commit=true
Retrieve all: http://localhost:8983/solr/employees/select?q=*:*&amp;omitHeader=true
Index db: http://localhost:8983/solr/employees/dataimport?command=full-import
Reload core: http://localhost:8983/solr/admin/cores?action=RELOAD&amp;core=employees
Georgi query: http://localhost:8983/solr/employees/select?q=georgi&amp;wt=json&amp;qf=first_name%20last_name&amp;defType=edismax
Facet query: http://localhost:8983/solr/employees/select?q=*:*&amp;wt=json&amp;facet=true&amp;facet.field=dept_s&amp;facet.field=title_s&amp;facet.mincount=1&amp;rows=0
Gorgi spellcheck: http://localhost:8983/solr/employees/select?q=gorgi&amp;wt=json&amp;qf=first_name&amp;defType=edismax
Georgi Phonetic: http://localhost:8983/solr/employees/select?q=georgi&amp;wt=json&amp;qf=first_name%20last_name%20phonetic&amp;defType=edismax</pre><p></p>
<h2>The next step</h2>
<p>We should now be back where we ended with the original blog post. So far we have successfully</p>
<ul>
<li>Setup a database with content</li>
<li>Indexed the database into our Solr index</li>
<li>Setup basic scheduled delta reindexing</li>
</ul>
<p>Let&#8217;s get started with the more interesting stuff!</p>
<h2>Facets</h2>
<p>Facets, also known as filters or navigators, allow a search user to refine and drill down through search results. Before we get started with them, we need to update our data import configuration. Replace the contents of our existing db-data-config.xml with:</p>
<div id="file-db-data-config2-LC1" class="line">
<pre class="crayon-plain-tag">select e.emp_no as 'id', e.birth_date,
(
select t.title
order by t.`from_date` desc
limit 1
) as 'title_s', e.first_name, e.last_name, e.gender as 'gender_s', d.`dept_name` as 'dept_s'
from employees e
join dept_emp de on de.`emp_no` = e.`emp_no`
join departments d on d.`dept_no` = de.`dept_no`
join titles t on t.`emp_no` = e.`emp_no`
group by e.`emp_no`
limit 1000;</pre><br />
To be able to facet, we need appropriate fields upon which to actually facet. Our new SQL retrieves additional fields, such as employee titles and departments, that are perfect for use as facets.</p>
</div>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-09-23-at-10.27.47.png"><img class="aligncenter size-medium wp-image-3520" src="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-09-23-at-10.27.47.png" alt="Updated Employee SQL" width="300" height="166" /></a><br />
You&#8217;ll notice we map title, gender and dept_name to title_s, gender_s and dept_s respectively. This allows us to take advantage of an existing dynamic field mapping in Solr&#8217;s default basic config, *_s. A dynamic field allows us to assign all fields with a certain pre/suffix the same field type. In this case, given the field type <pre class="crayon-plain-tag">&lt;dynamicField name="*_s" type="string" indexed="true" stored="true" /&gt;</pre> , any fields ending with _s will be indexed and stored as basic strings. Solr will not tokenise them or modify their contents. This allows us to safely use them for faceting without worrying about department titles being split on white space, for example.</p>
<ol>
<li>Clear the index and restart Solr.<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-04-13-at-17.06.22.png"><img class="alignright wp-image-3533 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-04-13-at-17.06.22-196x300.png" alt="Facet Query" width="196" height="300" /></a></li>
<li>Once Solr has restarted, reindex the database with our new SQL. Don&#8217;t be alarmed if this takes a bit longer than previously. It&#8217;s a bit more heavyweight and not very well optimised!</li>
<li>Once it&#8217;s done indexing, we can confirm it was successful by running the facet query via Postman or directly in our browser.</li>
<li>We should see two hits for the query &#8220;georgi&#8221; along with facets for their respective titles and department.</li>
</ol>
<p>&nbsp;</p>
<h2>The anatomy of a facet query</h2>
<p>Let&#8217;s take a closer look at the relevant request parameters of our facet query: <pre class="crayon-plain-tag">http://localhost:8983/solr/employees/select?q=georgi&amp;wt=json&amp;qf=first_name%20last_name&amp;defType=edismax&amp;omitHeader=true&amp;facet=true&amp;facet.field=dept_s&amp;facet.field=title_s&amp;facet.mincount=1</pre></p>
<ul>
<li>facet &#8211; Tells Solr to enable or disable faceting. Accepted values include yes, on and true to enable; no, off and false to disable</li>
<li>facet.field &#8211; Which field we want to facet on, can be defined multiple times</li>
<li>facet.mincount &#8211; The minimum number of matching documents a facet value must have for it to be included in the facet result object. Can be defined per facet field with this syntax: f.fieldName.facet.mincount=1</li>
</ul>
<p>There are many other facet parameters. I recommend taking a look at the Solr wiki pages on <a href="https://wiki.apache.org/solr/SolrFacetingOverview">faceting</a> and other <a href="https://wiki.apache.org/solr/SimpleFacetParameters">possible parameters</a>.</p>
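<p>As an example of the per-field syntax mentioned above, the following variation of our facet query requires a department to match at least two employees before it shows up, while leaving the title facet at one (the thresholds are arbitrary, chosen for illustration):</p><pre class="crayon-plain-tag">http://localhost:8983/solr/employees/select?q=*:*&amp;wt=json&amp;facet=true&amp;facet.field=dept_s&amp;facet.field=title_s&amp;f.dept_s.facet.mincount=2&amp;f.title_s.facet.mincount=1&amp;rows=0</pre>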
<h2>Spellcheck</h2>
<p>Analysing query logs and focusing on those queries that gave zero hits is a quick and easy way to see what can and should be done to improve your search solution. More often than not you will come across a great deal of spelling errors. Adding spellcheck to a search solution gives such great value for a tiny bit of effort. This fruit is so low hanging it should hit you in the face!</p>
<p>To enable spellcheck, we need to make some configuration changes.</p>
<ol>
<li>In our schema.xml, add these two lines after the *_name dynamic field type we added earlier:
<div class="line">
<pre class="crayon-plain-tag">&lt;copyField source="*_name" dest="spellcheck" /&gt;
&lt;field name="spellcheck" type="text_general" indexed="true" stored="true" multiValued="true" /&gt;</pre>
</div>
<p>A copyField checks for fields whose names match the pattern defined in source and copies their contents to the dest field. In our case, we will copy content from first_name and last_name to spellcheck. We then define the spellcheck field as multiValued to handle its multiple sources.</p></li>
<li>Add the following to our solrconfig.xml:
<div id="file-spellcheck-LC1" class="line">
</p><pre class="crayon-plain-tag">&lt;searchComponent name="spellcheck" class="solr.SpellCheckComponent"&gt;
&lt;str name="queryAnalyzerFieldType"&gt;text_general&lt;/str&gt;
&lt;!-- a spellchecker built from a field of the main index --&gt;
&lt;lst name="spellchecker"&gt;
&lt;str name="name"&gt;default&lt;/str&gt;
&lt;str name="field"&gt;spellcheck&lt;/str&gt;
&lt;str name="classname"&gt;solr.DirectSolrSpellChecker&lt;/str&gt;
&lt;!-- the spellcheck distance measure used, the default is the internal levenshtein --&gt;
&lt;str name="distanceMeasure"&gt;internal&lt;/str&gt;
&lt;!-- minimum accuracy needed to be considered a valid spellcheck suggestion --&gt;
&lt;float name="accuracy"&gt;0.5&lt;/float&gt;
&lt;!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 --&gt;
&lt;int name="maxEdits"&gt;2&lt;/int&gt;
&lt;!-- the minimum shared prefix when enumerating terms --&gt;
&lt;int name="minPrefix"&gt;1&lt;/int&gt;
&lt;!-- maximum number of inspections per result. --&gt;
&lt;int name="maxInspections"&gt;5&lt;/int&gt;
&lt;!-- minimum length of a query term to be considered for correction --&gt;
&lt;int name="minQueryLength"&gt;4&lt;/int&gt;
&lt;!-- maximum threshold of documents a query term can appear to be considered for correction --&gt;
&lt;float name="maxQueryFrequency"&gt;0.01&lt;/float&gt;
&lt;!-- uncomment this to require suggestions to occur in 1% of the documents
&lt;float name="thresholdTokenFrequency"&gt;.01&lt;/float&gt;
--&gt;
&lt;/lst&gt;
&lt;/searchComponent&gt;</pre><p>
</div>
<p>This will create a spellchecker component that uses the spellcheck field as its dictionary source. The spellcheck field contains content copied from both first and last name fields.</li>
<li>In the same file, look for the select requestHandler and update it to include the spellcheck component:
<div id="file-select-LC1" class="line">
</p><pre class="crayon-plain-tag">&lt;requestHandler name="/select" class="solr.SearchHandler"&gt;
&lt;!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
--&gt;
&lt;lst name="defaults"&gt;
&lt;str name="echoParams"&gt;explicit&lt;/str&gt;
&lt;int name="rows"&gt;10&lt;/int&gt;
&lt;str name="spellcheck"&gt;on&lt;/str&gt;
&lt;str name="spellcheck.dictionary"&gt;default&lt;/str&gt;
&lt;/lst&gt;
&lt;!-- Add this to enable spellcheck --&gt;
&lt;arr name="last-components"&gt;
&lt;str&gt;spellcheck&lt;/str&gt;
&lt;/arr&gt;
&lt;/requestHandler&gt;</pre><p>
</div>
</li>
</ol>
<p>The defaults list in a requestHandler defines which default parameters to add to each request made using the chosen request handler. You could, for example, define which fields to query. In this case we&#8217;re enabling spellcheck and using the default dictionary as defined in our solrconfig.xml. All values in the defaults list can be overwritten per request. To include request parameters that cannot be overwritten, we would need to use an invariants list instead:</p><pre class="crayon-plain-tag">&lt;lst name="invariants"&gt;
&lt;str name="defType"&gt;edismax&lt;/str&gt;
&lt;/lst&gt;</pre><p>Both lists can be used simultaneously. When duplicate keys are present, the values in the invariants list take precedence.</p>
<p>Once we&#8217;ve made all our configuration changes, let&#8217;s restart Solr and reindex. To verify the changes worked, do a basic retrieve all query and check the resulting documents for the spellcheck field. Its contents should be the same as the document&#8217;s first_name and last_name fields.</p>
<p>Because we have enabled spellcheck by default,<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-04-14-at-15.20.45.png"><img class="alignright size-medium wp-image-3575" src="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-04-14-at-15.20.45-174x300.png" alt="Gorgi" width="174" height="300" /></a> queries with possible suggestions will include contents in the spellcheck response object.</p>
<p>Try the Gorgi spellcheck query and experiment with different queries. To query the last_name field as well, change the qf parameter to <pre class="crayon-plain-tag">qf=first_name last_name</pre>.</p>
<p>The qf parameter defines which fields to use as the search domain.</p>
<p>When the spellcheck response object has content, you can easily use it to implement a basic &#8220;did you mean&#8221; feature. This will vastly improve your zero hit page.</p>
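<p>For reference, the spellcheck response object has roughly this shape, here for the kind of suggestion our &#8220;gorgi&#8221; query produces (the offsets and counts will vary with your data):</p><pre class="crayon-plain-tag">"spellcheck": {
  "suggestions": [
    "gorgi", {
      "numFound": 1,
      "startOffset": 0,
      "endOffset": 5,
      "suggestion": ["georgi"]
    }
  ]
}</pre>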
<h2>Phonetic Search</h2>
<p>Now that we have a basic spellcheck component in place, the next best feature that easily creates value in a people search system is phonetics. Solr ships with some <a href="https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory">basic</a> <a href="https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.DoubleMetaphoneFilterFactor">phonetic</a> <a href="https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.BeiderMorseFilterFactory">tokenisers</a>. The most commonly used out-of-the-box phonetic tokeniser is the DoubleMetaphoneFilterFactory. It will suffice for most use cases. It does, however, have some weaknesses, which we will go into briefly in the next section.</p>
<p>We need to once again modify our schema.xml to take advantage of Solr&#8217;s phonetic capabilities. Add the following:</p><pre class="crayon-plain-tag">&lt;fieldType name="phonetic" class="solr.TextField" &gt;
 &lt;analyzer&gt;
 &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
 &lt;filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="4"/&gt;
 &lt;/analyzer&gt;
 &lt;/fieldType&gt;

 &lt;copyField source="*_name" dest="phonetic" /&gt;
 &lt;field name="phonetic" type="phonetic" indexed="true" stored="false" multiValued="true" /&gt;</pre><p>Similar to spellcheck, we copy contents from the name fields into a phonetic field. Here we define a phonetic field, whose values will not be stored as we don&#8217;t need to return them in search results. It is, however, indexed so we can actually include it in the search domain. Finally, like spellcheck, it is multivalued to handle multiple potential sources. The reason we create an additional search field is so we can apply different weightings to exact matches and phonetic matches.</p>
<p>Restart Solr, clear the index and reindex.</p>
<p>Running the Georgi Phonetic search request should now return hits based on exact and phonetic matches. To ensure that exact matches are ranked higher, we can add a query time boost to our query fields: <pre class="crayon-plain-tag">&amp;qf=first_name last_name phonetic^0.5</pre></p>
<p>Rather than apply boosts to fields we want to rank higher, it&#8217;s usually simpler to apply a punitive boost to fields we wish to rank lower. Replace the qf parameter in the Georgi Phonetic request and see how the first few results all have an exact match for georgi in the first_name field.</p>
<h2>Query Analysis</h2>
<p>As we look further down the result set, you will notice some strange matches. One employee, called Kirk Kalsbeek, is apparently a match for &#8220;georgi&#8221;. To understand why this is a match, we can use Solr&#8217;s analysis tool.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-14-at-17.09.01.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-14-at-17.09.01-300x247.png" alt="Solr Analysis" width="300" height="247" class="aligncenter size-medium wp-image-3585" /></a><br />
It allows us to define an indexed value, a query value and the field type to use, and then demonstrates how each value is tokenised and whether or not the query would result in a match.</p>
<p>With the values Kirk Kalsbeek, georgi and phonetic respectively, the analysis tool shows us that Kirk gets tokenised to KRK by our phonetic field type. Georgi is also tokenised to KRK, which results in a match.</p>
<p>To create a better phonetic search solution, we would have to implement a custom phonetic tokeniser. I came across <a href="https://github.com/kvalle/norphoname"> an example</a>, which has helped me enormously in improving phonetic search for Norwegian names on a project.</p>
<h2>Conclusion</h2>
<p>We should now be able to </p>
<ul>
<li>Implement index field based spellcheck</li>
<li>Use basic faceting</li>
<li>Implement Solr&#8217;s out of the box phonetic capabilities</li>
</ul>
<p>Query completion I will leave for the next time. I promise you won&#8217;t have to wait as long between posts as last time :)</p>
<p>Let me know how you get on in the comments below!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Solr As A Document Processing Pipeline</title>
		<link>http://blog.comperiosearch.com/blog/2015/01/16/custom-solr-update-request-processors/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/01/16/custom-solr-update-request-processors/#comments</comments>
		<pubDate>Fri, 16 Jan 2015 10:40:48 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[content enrichment]]></category>
		<category><![CDATA[Document Processing]]></category>
		<category><![CDATA[pipeline]]></category>
		<category><![CDATA[update request processor]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3050</guid>
		<description><![CDATA[Recently on a project I got an interesting request. Content owners wanted to enrich new documents submitted to the search index with content from documents already present in the index. We use Solr as the search backend for this particular customer so I started thinking about how to achieve this with Solr. A bit of [...]]]></description>
				<content:encoded><![CDATA[<p>Recently on a project I got an interesting request. Content owners wanted to enrich new documents submitted to the search index with content from documents already present in the index. We use Solr as the search backend for this particular customer so I started thinking about how to achieve this with Solr.</p>
<h2>A bit of Solr background</h2>
<p>Solr ships with all the tools and features necessary for an advanced search solution. These include the oft-overlooked update request processors. They operate at the document level, i.e. prior to individual field tokenisation, and allow you to clean, modify and/or enrich incoming documents. Processing options include language identification, duplicate detection and HTML markup handling. Create a chain of them and you have a true document processing pipeline.</p>
<p>The Solr wiki includes a <a title="Update Request Processors" href="https://wiki.apache.org/solr/UpdateRequestProcessor#Full_list_of_UpdateRequestProcessor_Factories">brief entry </a> on the topic with an example of a custom processor that conditionally adds the field &#8220;cat&#8221; with value &#8220;popular&#8221;. The full list of UpdateRequestProcessor factories is available via the <a href="http://www.solr-start.com/info/update-request-processors/">Solr Start project</a>.</p>
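<p>For a sense of what wiring up such a chain looks like, here is a minimal sketch of a solrconfig.xml snippet using stock Solr factories (the chain name and field names are illustrative):</p><pre class="crayon-plain-tag">&lt;updateRequestProcessorChain name="enrich"&gt;
  &lt;!-- detect the document language and store it in a field --&gt;
  &lt;processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory"&gt;
    &lt;lst name="defaults"&gt;
      &lt;str name="langid.fl"&gt;title,content&lt;/str&gt;
      &lt;str name="langid.langField"&gt;language_s&lt;/str&gt;
    &lt;/lst&gt;
  &lt;/processor&gt;
  &lt;processor class="solr.LogUpdateProcessorFactory"/&gt;
  &lt;!-- RunUpdateProcessorFactory actually indexes the document --&gt;
  &lt;processor class="solr.RunUpdateProcessorFactory"/&gt;
&lt;/updateRequestProcessorChain&gt;

&lt;requestHandler name="/update" class="solr.UpdateRequestHandler"&gt;
  &lt;lst name="defaults"&gt;
    &lt;str name="update.chain"&gt;enrich&lt;/str&gt;
  &lt;/lst&gt;
&lt;/requestHandler&gt;</pre>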
<h2>Back to the initial request</h2>
<p>Certain incoming documents would contain a field, topicRef for example, with a reference to one or more documents already present in the index. The referenced documents could either contain a subsequent reference or content that we wanted to add to the incoming document. <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/10/docProcess.png"><img class="size-medium wp-image-3054 alignright" src="http://blog.comperiosearch.com/wp-content/uploads/2014/10/docProcess-220x300.png" alt="document pipeline" width="220" height="300" /></a></p>
<p>I needed a mechanism to retrieve any referenced documents, traverse a tree of subsequently referenced documents if necessary, and then map the eventual leaf documents&#8217; specified content fields to additional new fields in the incoming document.</p>
<p>I created a recursive document enrichment processor to do just that!</p>
<p>Its settings allow for multiple potential field retrievals and mappings, local and foreign key field definitions and the option to retrieve content from a remote Solr index.</p>
<script src="https://gist.github.com/fcd5b45cd42a40b97daa.js?file=RecursiveMergeExistingDocFactory"></script>
<p>A minor drawback of the current iteration of the processor is its heavy reliance on the existence of referenced documents, i.e. if the referenced documents are not already present in the index, the processor will simply skip over them. To ensure documents are fully enriched, especially when the referenced documents arrive in the same indexing batch, incoming documents must be reindexed unless the document indexing order is explicitly defined.</p>
<p>In addition, when a referenced document is updated, content owners expect this to have an impact on the content of the parent document and therefore a user&#8217;s search experience. This is currently not the case as parent documents are unaware of their child documents beyond the indexing process.</p>
<p>I&#8217;m now thoroughly enjoying tackling these issues and working on the next iteration of this RecursiveMergeExistingDoc processor!</p>
<h2>Update &#8211; 06/02/15</h2>
<p>The source code is now available on <a href="https://github.com/sebnmuller/SolrDocumentEnricher">github</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/01/16/custom-solr-update-request-processors/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Idea: Your life searchable through Norch &#8211; NOde seaRCH, IFTTT and Google Drive</title>
		<link>http://blog.comperiosearch.com/blog/2014/11/26/idea-your-life-searchable-norch-node-search-ifttt-google-drive/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/11/26/idea-your-life-searchable-norch-node-search-ifttt-google-drive/#comments</comments>
		<pubDate>Wed, 26 Nov 2014 14:33:08 +0000</pubDate>
		<dc:creator><![CDATA[Espen Klem]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[User Experience]]></category>
		<category><![CDATA[crawl]]></category>
		<category><![CDATA[Document Processing]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Google Drive]]></category>
		<category><![CDATA[IFTTT]]></category>
		<category><![CDATA[Index]]></category>
		<category><![CDATA[Json]]></category>
		<category><![CDATA[Life Index]]></category>
		<category><![CDATA[Lifeindex]]></category>
		<category><![CDATA[node]]></category>
		<category><![CDATA[Node Search]]></category>
		<category><![CDATA[node.js]]></category>
		<category><![CDATA[nodejs]]></category>
		<category><![CDATA[norch]]></category>
		<category><![CDATA[Personal Search Engine]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[search engine]]></category>
		<category><![CDATA[search-index]]></category>
		<category><![CDATA[sharepoint]]></category>
		<category><![CDATA[Small Data]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3069</guid>
		<description><![CDATA[First some disclaimers: This has been posted earlier on lab.klemespen.com. Even though some of these ideas are not what you&#8217;d normally implement in a business environment, some of the concepts can obviously be transferred over to businesses trying to provide an efficient workplace for its employees. Norch is developed by Fergus McDowall, an employee of [...]]]></description>
				<content:encoded><![CDATA[<p><strong>First some disclaimers</strong>:</p>
<ul>
<li>This has been posted earlier on <a href="http://lab.klemespen.com/2014/11/25/idea-your-life-searchable-with-norch-node-search-ifttt-and-google-drive-spreadsheets/">lab.klemespen.com</a>.</li>
<li>Even though some of these ideas are not what you&#8217;d normally implement in a business environment, some of the concepts can obviously be transferred to businesses trying to provide an efficient workplace for their employees.</li>
<li><a href="https://github.com/fergiemcdowall/norch">Norch</a> is developed by <a href="http://blog.comperiosearch.com/blog/author/fmcdowall/">Fergus McDowall</a>, an employee of Comerio.</li>
</ul>
<p>What if you could index your whole life and make this lifeindex available through search? What would that look like, and how could it help you? Refinding information is obviously one of the use cases for this type of search. I&#8217;m guessing there are many more, and I&#8217;m curious to figure them out.</p>
<h2>Actions and reactions instead of web pages</h2>
<p>I had the lifeindex idea for a little while now. Originally the idea was to index everything I browsed. From what I know and where <a href="https://github.com/fergiemcdowall/norch">Norch</a> is, it would take a while before I was anywhere close to achieving that goal. <a href="http://codepen.io/nickmoreton/blog/using-ifttt-and-google-drive-to-create-a-json-api">Then I thought of IFTTT</a>, and saw it as a &#8216;next best thing&#8217;. But then it hit me that now I&#8217;m indexing actions, and that&#8217;s way better than pages. But what I&#8217;m missing from most sources now are the reactions to my actions. If I have a question, I also want to crawl and index the answer. If I have a statement, I want to get the critique indexed.<span id="more-3069"></span></p>
<p>IFTTT and similar services (like Zapier) are quite limited in their choice of triggers. I&#8217;m not sure whether this is due to choices made by those services or to limitations of the sites they crawl/pull information from.</p>
<p>A quick fix for this, and a generally good idea for search engines, would be to switch from a preview of your content to the actual content in the form of an embed view. Here&#8217;s an example:</p>
<blockquote class="twitter-tweet" data-width="500"><p lang="en" dir="ltr">Will embed-view of your content replace the preview-pane in modern <a href="https://twitter.com/hashtag/search?src=hash&amp;ref_src=twsrc%5Etfw">#search</a>  <a href="https://twitter.com/hashtag/engine?src=hash&amp;ref_src=twsrc%5Etfw">#engine</a> solutions? Why preview when you can have the real deal?</p>
<p>&mdash; Espen Klem (@eklem) <a href="https://twitter.com/eklem/status/536866049078333440?ref_src=twsrc%5Etfw">November 24, 2014</a></p></blockquote>
<p><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></p>
<h2>Technology: Hello IFTTT, Google SpreadSheet and Norch</h2>
<p>IFTTT is triggered by my actions, and stores some data to a series of spreadsheets on Google Drive. <a href="http://jsonformatter.curiousconcept.com/#https://spreadsheets.google.com/feeds/list/1B-OFzKIMVNk_3xMX_jBToGGyxSKv6FoyFYTHpGEy5O0/od6/public/values?alt=json">These spreadsheets can deliver JSON</a>. After a little document processing these JSON-files can be fed to the <a href="https://github.com/fergiemcdowall/norch#norch-indexer">Norch-indexer</a>.</p>
<h2>Why hasn&#8217;t this idea popped up earlier?</h2>
<p>Search engines used to be hardware-guzzling technology. With Norch, the &#8220;NOde seaRCH&#8221; engine, that has changed. Elasticsearch and Solr are easy and small compared to, say, SharePoint Search, but they still need a lot of hardware. Norch can run on a Raspberry Pi, and soon it will be able to run in your browser. Maybe data sets closer to <a href="http://en.wikipedia.org/wiki/Small_data">small data</a> are more interesting than <a href="http://en.wikipedia.org/wiki/Big_data">big data</a>?</p>
<p><a href="http://youtu.be/ijLtk5TgvZg"><img src="http://blog.comperiosearch.com/wp-content/uploads/2014/11/Screen-Shot-2014-11-26-at-16.42.27-300x180.png" alt="Video: Norch running on a Raspberry Pi" width="300" height="180" class="alignnone size-medium wp-image-3075" />Norch running on a Raspberry Pi</a></p>
<h2>Why use a search engine?</h2>
<p>It&#8217;s cheap and quick. I&#8217;m not a developer, and I&#8217;ll still be able to glue all these sources together. Search engines are often a good choice when you have multiple sources. IFTTT and Google SpreadSheet make it even easier, normalising the input and delivering it as JSON.</p>
<h2>How far in the process have I come?</h2>
<p><a href="https://testlab3.files.wordpress.com/2014/11/15140752323_1f69685449_o.png"><img class="alignnone size-full wp-image-118" src="https://testlab3.files.wordpress.com/2014/11/15140752323_1f69685449_o.png" alt="Illustration: Setting up sources in IFTTT." width="660" height="469" /></a></p>
<p>So far, I&#8217;ve set up a lot of triggers/sources at IFTTT.com:</p>
<ul>
<li>Instagram: When posting or liking both photos and videos.</li>
<li>Flickr: When posting an image, creating a set or liking a photo.</li>
<li>Google Calendar: When adding something to one of my calendars.</li>
<li>Facebook: When I post a link, am tagged, or post a status message.</li>
<li>Twitter: When I tweet, retweet, reply or if somebody mentions me.</li>
<li>Youtube: When I post or like a video.</li>
<li>GitHub: When I create an issue, get assigned to an issue, or any issue I take part in is closed.</li>
<li>WordPress: When there are new posts or comments on posts.</li>
<li>Android location tracking: When I enter and exit certain areas.</li>
<li>Android phone log: Placed, received and missed calls.</li>
<li>Gmail: Starred emails.</li>
</ul>
<p><a href="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-27-57.png"><img class="alignnone size-full wp-image-127" src="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-27-57.png" alt="Screen Shot 2014-11-24 at 13.27.57" width="660" height="572" /></a></p>
<p><a href="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-31-46.png"><img class="alignnone size-full wp-image-128" src="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-31-46.png" alt="Screen Shot 2014-11-24 at 13.31.46" width="660" height="194" /></a></p>
<p>And I&#8217;ve gotten a good chunk of data. Indexing my SMSes felt a bit creepy, so I stopped doing that. Storing all my email sounded too excessive, but I think starred emails suit the purpose of the project.</p>
<p>Those Google Drive documents are giving me JSON &#8211; though not JSON that I can feed directly to the Norch-indexer; it needs a little trimming.</p>
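<p>To give an idea of the trimming involved: the public spreadsheet feed wraps every cell in <code>gsx$</code> objects, which have to be flattened into plain documents before indexing. A hedged sketch (the field names depend on the spreadsheet&#8217;s title row):</p>
<pre><code>// One entry as it arrives from the Google Spreadsheet JSON feed (simplified):
{ "gsx$date": { "$t": "November 24, 2014" }, "gsx$tweet": { "$t": "Will embed-view ..." } }

// The flat document the indexer wants instead:
{ "date": "November 24, 2014", "tweet": "Will embed-view ..." }</code></pre>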
<h2>Issues discovered so far</h2>
<h3>Manual work</h3>
<p>This search solution needs a lot of manual setup. Every trigger needs to be set up manually. Every time a new trigger fires for the first time, I get a new spreadsheet that needs a title row added; otherwise the JSON variables will look funny, since the first row is used for variable names.</p>
<p>The spreadsheets only accept 2000 rows; after that, a new file is created. Then I either need to delete content, rename the file or reconfigure some stuff.</p>
<h3>Level of maturity</h3>
<p><a href="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-41-34.png"><img class="alignnone size-full wp-image-129" src="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-41-34.png" alt="Screen Shot 2014-11-24 at 13.41.34" width="660" height="664" /></a></p>
<p>IFTTT is a really nice service, and they treat their users well. But, for now, it&#8217;s not something you can trust fully.</p>
<h3>Cleaning up duplicates and obsolete stuff</h3>
<p>I have no way of removing stuff from the index automatically at this point. If I delete something I&#8217;ve added/written/created, it will not be reflected in the index.</p>
<h3>Missing sources</h3>
<p>Books I buy, music I listen to, movies and TV series I watch &#8211; in other words Amazon, Spotify, Netflix and HBO. In addition, there are no Norwegian services available through IFTTT.</p>
<h3>History</h3>
<p>The crawling is triggered by my actions, which leaves me without history. For example, indexing new contacts on LinkedIn is of little value when I don&#8217;t get to index the existing ones.</p>
<h2>Next steps</h2>
<h3>JSON clean-up</h3>
<p>I need to make a document processing step. <a href="https://github.com/fergiemcdowall/norch-document-processor">Norch-document-processor</a> would be nice if it handled JSON in addition to HTML. <a href="https://github.com/fergiemcdowall/norch-document-processor/issues/6">Not yet, but maybe in the future</a>? Anyway, there&#8217;s just a small amount of JSON clean-up needed before I get my data indexed.</p>
<p>When this step is done, a first version can be demoed.</p>
<h3>UX and front-end code</h3>
<p>To show the full potential, I need some interaction design sketches of the idea. For now they&#8217;re all in my head, and they need to be converted to HTML, CSS and an Angular view.</p>
<h3>Embed codes</h3>
<p>Figure out how to embed Instagram, Flickr, Facebook and LinkedIn-posts, Google Maps, federated phonebook search etc.</p>
<h3>OAUTH configuration</h3>
<p>Set up the <a href="https://github.com/ciaranj/node-oauth">OAuth NPM package</a> to access non-public spreadsheets on Google Drive. Then I can add some of the less open information I have stored.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/11/26/idea-your-life-searchable-norch-node-search-ifttt-google-drive/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Solr: Indexing SQL databases made easier!</title>
		<link>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/#comments</comments>
		<pubDate>Thu, 28 Aug 2014 12:05:17 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[jdbc]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[people search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2848</guid>
		<description><![CDATA[Update Part two is now available here! At the beginning of this year Christopher Vig wrote a great post about indexing an SQL database to the internet&#8217;s current search engine du jour, Elasticsearch. This first post in a two part series will show that Apache Solr is a robust and versatile alternative that makes indexing [...]]]></description>
				<content:encoded><![CDATA[<h3>Update</h3>
<p>Part two is now available <a href="http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/">here!</a></p>
<hr />
<p>At the beginning of this year <a href="http://blog.comperiosearch.com/blog/author/cvig/">Christopher Vig</a> wrote a <a href="http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/">great post</a> about indexing an SQL database to the internet&#8217;s current search engine du jour, <a href="http://www.elasticsearch.org/">Elasticsearch</a>. This first post in a two-part series will show that <a href="http://lucene.apache.org/solr/">Apache Solr</a> is a robust and versatile alternative that makes indexing an SQL database just as easy. The second will go deeper into how to leverage Solr&#8217;s features to create a great backend for a people search solution.</p>
<p>Solr ships with a configuration driven contrib called the <a href="http://wiki.apache.org/solr/DataImportHandler">DataImportHandler.</a> It provides a way to index structured data into Solr in both full and incremental delta imports. We will cover a simple use case of the tool i.e. indexing a database containing personnel data to form the basis of a people search solution. You can also easily extend the DataImportHandler tool via various <a href="http://wiki.apache.org/solr/DataImportHandler#Extending_the_tool_with_APIs">APIs</a> to pre-process data and handle more complex use cases.</p>
<p>For now, let&#8217;s stick with basic indexing of an SQL database.</p>
<h2>Setting up our environment</h2>
<p>Before we get started, there are a few requirements:</p>
<ol>
<li>Java 1.7 or greater</li>
<li>For this demo we&#8217;ll be using a <a href="http://dev.mysql.com/downloads/mysql/">MySQL</a> database</li>
<li>A copy of the <a href="https://launchpad.net/test-db/employees-db-1/1.0.6/+download/employees_db-full-1.0.6.tar.bz2">sample employees database</a></li>
<li>The MySQL <a href="http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.32.tar.gz">jdbc driver</a></li>
</ol>
<p>With that out of the way, let&#8217;s get Solr up and running and ready for database indexing:</p>
<ol>
<li>Download <a href="https://lucene.apache.org/solr/downloads.html">Solr</a> and extract it to a directory of your choice.</li>
<li>Open solr-4.9.0/example/solr/collection1/conf/solrconfig.xml in a text editor and add the following within the config tags:  <script src="https://gist.github.com/dd7cef212fd7f6a415b5.js?file=DataImportHandler"></script></li>
<li>In the same directory, open schema.xml and add this line (a hedged sketch of both additions follows this list):   <script src="https://gist.github.com/5bbc8c6e1a5b617b5d16.js?file=names"></script></li>
<li>Create a lib subdir in solr-4.9.0/example/solr/collection1/ and extract the MySQL jdbc driver jar into it. It&#8217;s the file called mysql-connector-java-{version}-bin.jar</li>
<li>To start Solr, open a terminal and navigate to the example subdir in your extracted Solr directory and run <code>java -jar start.jar</code></li>
</ol>
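<p>If the embedded gists don&#8217;t render for you, the two additions usually look something like this (a hedged sketch, not the exact files from the gists; the field name and type are assumptions):</p>
<pre><code>&lt;!-- solrconfig.xml: make sure the DataImportHandler jars from dist/ are on the classpath --&gt;
&lt;lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" /&gt;

&lt;!-- solrconfig.xml: register the handler and point it at the data config --&gt;
&lt;requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"&gt;
  &lt;lst name="defaults"&gt;
    &lt;str name="config"&gt;db-data-config.xml&lt;/str&gt;
  &lt;/lst&gt;
&lt;/requestHandler&gt;

&lt;!-- schema.xml: a field for the employee names we are about to import --&gt;
&lt;field name="name" type="text_general" indexed="true" stored="true" /&gt;</code></pre>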
<p>When started this way, Solr runs by default on port 8983. If you need to change this, edit solr-4.9.0/example/etc/jetty.xml and restart Solr.</p>
<p>Navigate to <a href="http://localhost:8983/solr">http://localhost:8983/solr</a> and you should see the Solr admin GUI splash page. From here, use the Core Selector dropdown to select the default core and then click on the Dataimport option. Expanding the Configuration section should show an XML response containing a stacktrace with a message along the lines of <code>Can't find resource 'db-data-config.xml' in classpath</code>. This is normal, as we haven&#8217;t actually created this file yet; it stores the configuration for connecting to our target database.</p>
<p>We&#8217;ll come back to that file later but let&#8217;s make our demo database now. If you haven&#8217;t already downloaded the sample employees database and installed MySQL, now would be a good time!</p>
<h2>Setting up our database</h2>
<p>Assuming your MySQL server is installed <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/12/createdatabase.png"><img class="alignright size-full wp-image-2900" src="http://blog.comperiosearch.com/wp-content/uploads/2014/12/createdatabase-300x226.png" alt="Prepare indexing database" width="300" height="226" /></a>and running, access the MySQL terminal and create the empty employees database: <code>create database employees;</code></p>
<p>Exit the MySQL terminal and import the employees.sql into your empty database, ensuring that you carry out the following command from the same directory as the employees.sql file itself: <code>mysql -u root -p employees &lt; employees.sql</code></p>
<p>You can test this was successful by logging <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/testdatabase.png"><img class="alignright size-medium wp-image-2900" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/testdatabase-276x300.png" alt="Verify indexing database" width="276" height="300" /></a> into the MySQL server and querying the database, as shown here on the right.</p>
<p>Having successfully created and populated your employee database, we can now create that missing db-data-config.xml file.</p>
<h2>Indexing our database</h2>
<p>In your Solr conf directory, which contains the schema.xml and solrconfig.xml we previously modified, create a new file called db-data-config.xml.</p>
<p>Its contents should look like the example below. Make sure to replace the user and password values with yours, and feel free to modify or remove the limit parameter. There are approximately 300,000 entries in the employees table in total. <script src="https://gist.github.com/03935f1384e150504363.js?file=db-data-config"></script></p>
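<p>If the gist doesn&#8217;t render, a minimal db-data-config.xml might look roughly like this (a hedged sketch; the query and field mappings are illustrative, not the exact file from the gist):</p>
<pre><code>&lt;dataConfig&gt;
  &lt;dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/employees"
              user="root" password="secret" /&gt;
  &lt;document&gt;
    &lt;entity name="employee"
            query="SELECT emp_no, first_name, last_name FROM employees LIMIT 10000"&gt;
      &lt;field column="emp_no" name="id" /&gt;
      &lt;field column="first_name" name="first_name" /&gt;
      &lt;field column="last_name" name="last_name" /&gt;
    &lt;/entity&gt;
  &lt;/document&gt;
&lt;/dataConfig&gt;</code></pre>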
<p>We&#8217;re now going to make use of Solr&#8217;s REST-like HTTP API with a couple of commands worth saving. I prefer to use the <a href="https://chrome.google.com/webstore/detail/postman-rest-client/fdmmgilgnpjigdojojpjoooidkmcomcm">Postman app</a> on Chrome and have created a public collection of HTTP requests, which you can import into Postman&#8217;s Collections view using this url: <a href="https://www.getpostman.com/collections/9e95b8130556209ed643">https://www.getpostman.com/collections/9e95b8130556209ed643</a></p>
<p>For those of you not using Chrome, here are the commands you will need:<script src="https://gist.github.com/05a2a1dd01a6c5a4517b.js?file=solr-http"></script> First let&#8217;s reload the core so that Solr is <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/reloadcore.png"><img class="alignright size-medium wp-image-2921" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/reloadcore-300x181.png" alt="Reload Solr core" width="300" height="181" /></a><br />
aware of the new db-data-config.xml file we have created.<br />
Next, we index our database with the <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/indexdb.png"><img class="alignright size-medium wp-image-2923" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/indexdb-300x181.png" alt="Index database to Solr" width="300" height="181" /></a>HTTP request or from within the Solr Admin GUI on the DataImport page.</p>
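<p>In plain curl, the two requests look roughly like this (assuming the default collection1 core on the default port):</p>
<pre><code># Reload the core so Solr picks up the new db-data-config.xml
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&amp;core=collection1"

# Kick off a full import
curl "http://localhost:8983/solr/collection1/dataimport?command=full-import"</code></pre>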
<p>Here we have carried out a full index of our database using the full-import command parameter. To only retrieve changes since the last import, we would use delta-import instead.</p>
<p>We can confirm that our database import was successful by querying our index with the &#8220;Retrieve all&#8221; and &#8220;Georgi query&#8221; requests.</p>
<p>Finally, to schedule reindexing you can use a simple cronjob. This one, for example, will run every day at 23:00 and retrieve all changes since the previous indexing operation:<script src="https://gist.github.com/47f6df5a306e4cd51617.js?file=delta"></script></p>
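<p>A hedged sketch of such a crontab entry (the core name and parameters may differ in your setup):</p>
<pre><code>0 23 * * * curl -s "http://localhost:8983/solr/collection1/dataimport?command=delta-import&amp;commit=true" > /dev/null</code></pre>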
<h2>Conclusion</h2>
<p>So far we have successfully</p>
<ul>
<li>Set up a database with content</li>
<li>Indexed the database into our Solr index</li>
<li>Set up basic scheduled delta reindexing</li>
</ul>
<p>In the next part of this two-part series we will look at how to process our indexed data, specifically with a view to building a good people search solution. We will implement several features such as phonetic search, spellcheck and basic query completion. In the meantime, let&#8217;s carry on the conversation in the comments below!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Search technology: Picking the right horse</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/06/elasticsearch-or-solr-picking-right-horse/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/06/elasticsearch-or-solr-picking-right-horse/#comments</comments>
		<pubDate>Fri, 06 Jun 2014 10:25:54 +0000</pubDate>
		<dc:creator><![CDATA[Ole-Kristian Villabø]]></dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[search platform]]></category>
		<category><![CDATA[search technology]]></category>
		<category><![CDATA[search trends]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2483</guid>
		<description><![CDATA[For many years, Solr was the only realistic choice for most customers wanting to do an enterprise search project based on open source. Things changed around 2010/2011 when Elasticsearch started to gain traction. The last few years, the community around Elasticsearch has been growing rapidly and the software is regularly downloaded approximately half a million [...]]]></description>
				<content:encoded><![CDATA[<p>For many years, Solr was the only realistic choice for most customers wanting to do an enterprise search project based on open source. Things changed around 2010/2011 when Elasticsearch started to gain traction. The last few years, the community around Elasticsearch has been growing rapidly and the software is regularly downloaded approximately half a million times each month.</p>
<p>While both platforms are based on Lucene, Elasticsearch has been built for scaling from day one, while it is often said Solr added this as an afterthought. Developers also generally find it very easy to interact with since it uses a JSON-based API model. Much could be said about the differences between the two platforms, but without going into a long list of technical details, Elasticsearch felt like a breath of fresh air when we first laid eyes on it a few years ago.</p>
<div style="width: 510px" class="wp-caption alignnone"><img src="https://farm4.staticflickr.com/3907/14171541129_54df6c5a1a.jpg" alt="" width="500" height="194" /><p class="wp-caption-text">Elasticsearch or Solr? <a href="http://www.google.com/trends/explore#q=solr%2C%20elasticsearch&amp;cmpt=q">Google trends will give you a nice input to the dilemma</a>.</p></div>
<p>Today the company behind the product <a href="http://techcrunch.com/2014/06/05/elasticsearch-scores-70m-in-series-c-to-fund-growth-spurt/">announced that they have raised a whopping 70M USD in funding</a>, bringing the total up to over 100M USD in the last 18 months!</p>
<p>For everyone loving the product, and everyone considering what platform to go with for their enterprise search needs, this is excellent news! Even though it is an open source product, and everyone in theory can submit suggested changes, bugfixes etc – there’s definite value in having a group of full time, dedicated developers building the product.</p>
<p>This means the product and the ecosystem around it will continue to evolve, improve and increase for a long time with considerable force.</p>
<p>So – for all those wondering what horse to put their bets on for future open source development;  <em>Picking the fastest running horse is often better than picking the horse that’s been around the longest.</em></p>
<p>Comperio is an official partner with Elasticsearch.</p>
<p><a href="http://www.comperio.no/frokost110614/">Please join us at our breakfast seminar in Oslo in a few days (June 11)</a>. Elasticsearch will be there to present alongside Microsoft and Google.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/06/elasticsearch-or-solr-picking-right-horse/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Dynamic search ranking using Elasticsearch, Neo4j and Piwik</title>
		<link>http://blog.comperiosearch.com/blog/2014/02/05/dynamic-search-ranking-using-elasticsearch-neo4j-and-piwik/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/02/05/dynamic-search-ranking-using-elasticsearch-neo4j-and-piwik/#comments</comments>
		<pubDate>Wed, 05 Feb 2014 14:49:52 +0000</pubDate>
		<dc:creator><![CDATA[Christian Rieck]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[dynamic rank tuning]]></category>
		<category><![CDATA[dynamic search ranking]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[fast]]></category>
		<category><![CDATA[neo4j]]></category>
		<category><![CDATA[Piwik]]></category>
		<category><![CDATA[rank tuning]]></category>
		<category><![CDATA[ranking]]></category>
		<category><![CDATA[search ranking]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=1957</guid>
		<description><![CDATA[Getting the correct result at the top of your search results isn’t easy. Anyone working within search quickly realizes this. Tuning the underlying ranking model is a job that just doesn’t end. There is an entire profession about search engine optimization, making sure your site gets as high as possible on Google (and Bing, I [...]]]></description>
				<content:encoded><![CDATA[<div>
<p>Getting the correct result at the top of your search results isn’t easy. Anyone working within search quickly realizes this. Tuning the underlying ranking model is a job that just doesn’t end. There is an entire profession about search engine optimization, making sure your site gets as high as possible on Google (and Bing, I guess). If it is not the top result on Google, it is somehow your fault and not Google&#8217;s.<span id="more-1957"></span></p>
</div>
<div>
<p><strong>Nobody optimizes for an internal enterprise search solution</strong></p>
</div>
<div>
<p>If your document is not the top result in the internal search solution it is somehow the search engine&#8217;s fault, not yours. There is no link cardinality on a file system. All the metadata is wrong and the document your user is trying to find doesn’t even contain the words the user remembers it to contain; the end result being that the target document is not found. As a result of this, trust in the enterprise search diminishes and soon you are left without users. Let’s see how we can use <a title="Piwik" href="http://piwik.org">Piwik</a>, <a title="neo4j" href="http://www.neo4j.org">neo4j </a>and <a title="Elasticsearch" href="http://www.elasticsearch.org">Elasticsearch </a>to remedy this. (Yes, you can use <a title="Solr" href="http://lucene.apache.org/solr/">Solr</a> if you want).</p>
</div>
<div>
<p>This post is made up of three parts. First I&#8217;ll talk about gathering the necessary data. Then we&#8217;ll tackle getting the &#8216;right&#8217; documents to the top of your search, and lastly we&#8217;ll see if we can expand documents with words your users recall them by, but which are not part of the documents themselves. The journey is based on work performed on Comperio&#8217;s internal search, at the moment implemented on an old FAST ESP installation.</p>
</div>
<div>
<p><strong>Gathering data</strong></p>
</div>
<div>
<p>First you need to know what your users are searching for and what they end up clicking on. We use Piwik, an open source web analytics platform, for this: seeing the searches, the modifications to the searches, and whether users ended up clicking on anything they found interesting. For a while we only used this for statistics, since Piwik offered better insight than the built-in query statistics in FAST ESP. Here is an example of one search session:</p>
</div>
<div></div>
<div>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/piwik.png"><img class="alignnone size-full wp-image-1968" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/piwik.png" alt="" width="411" height="179" /></a></p>
<p>We see a user entering the site, querying ‘rank order words’ and clicking on a document. Then the same search is executed again. It is reasonable to conclude the clicked document did not contain the wanted information. Lastly ‘boost position term’ is searched. Sadly the session does not end with a click so I guess our search couldn’t deliver. :( [1]</p>
</div>
<div>
<p>In their current form, the statistics aren&#8217;t very useful. But what would happen if we took these chains of activities and created a graph? We used neo4j for this. A small Java program was written to download the Piwik history as an XML file and insert it into a newly created neo4j database.</p>
</div>
<div>
<p>The nodes are either the start of a session, a search or a document. They are linked by relationships such as CLICKED, SEARCHED, RETURNED_FROM.  Since a neo4j database isn’t very screen shot friendly, here is a part of the graph as rendered by <a title="Gephi" href="https://gephi.org">Gephi</a>:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/chinese.png"><img class="alignnone size-full wp-image-1964" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/chinese.png" alt="" width="411" height="141" /></a></p>
</div>
<div></div>
<div>
<p>We see someone looking for help with Chinese query suggestions. S361 marks the beginning of this session, and the first search term was &#8216;chinese&#8217;. They then clicked a link for an internal mail archive before refining their search to &#8216;chinese als&#8217; and so forth. Relationships showing where a user backtracked are not drawn. That was an isolated little island; the more central documents and search terms at your company will create bigger webs.</p>
</div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/web.png"><img class="alignnone size-full wp-image-1971" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/web.png" alt="" width="605" height="368" /></a></p>
</div>
<div></div>
<div>
<p>Seeing your search history organized like this should give you an urge to dive in and explore. It is really interesting, fun and recommended!</p>
</div>
<div>
<p><strong>Finding popular documents</strong></p>
</div>
<div>
<p>The simplest way of finding the popular documents is to track search term -&gt; clicks directly. It is also the most common way of doing it. But that wouldn&#8217;t utilize our fancy new graph, would it? Since we can run queries against the database, let&#8217;s get all search sessions of 8 or fewer actions that resulted in a click on document X:</p>
<p>&nbsp;</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/query.png"><img class="alignnone size-full wp-image-1969" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/query.png" alt="" width="605" height="40" /></a></p>
</div>
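<div>
<p>In Cypher, that query might look roughly like this (a hedged sketch with assumed labels and properties; the actual query is the one in the screenshot above):</p>
<pre><code>MATCH p = (s:Session)-[*..8]->(d:Document { url: 'X' })
RETURN p</code></pre>
</div>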
<div></div>
<div>
<p>(A small disclaimer: as my neo4j skills are very rudimentary, there might be more efficient ways of doing this.)</p>
</div>
<div>
<p>Now we iterate over all sessions and give a score to each search term: the closer it is to the clicked document, the higher the score. Summing the scores across all sessions gives a measure of how &#8216;close&#8217; a search term is to any given document. This is example data for the single-word search term &#8216;vpn&#8217;:</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
</div>
<div><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/json.png"><img class="alignnone size-full wp-image-1966" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/json.png" alt="" width="605" height="66" /></a></div>
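<div>
<p>A hedged sketch of that scoring step (the data structures and the exact weighting function are assumptions; the real implementation may differ):</p>
<pre><code>import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hedged sketch: accumulate a proximity score per search term across sessions. */
public class TermScorer {
  // Each inner list is one session's search terms in chronological order,
  // all of which ended in a click on the same document.
  public static Map&lt;String, Double&gt; score(List&lt;List&lt;String&gt;&gt; sessions) {
    Map&lt;String, Double&gt; scores = new HashMap&lt;String, Double&gt;();
    for (List&lt;String&gt; terms : sessions) {
      int n = terms.size();
      for (int i = 0; i &lt; n; i++) {
        // The closer the term is to the click, the higher its weight
        double weight = (i + 1) / (double) n;
        Double current = scores.get(terms.get(i));
        scores.put(terms.get(i), (current == null ? 0.0 : current) + weight);
      }
    }
    return scores;
  }
}</code></pre>
</div>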
<div>
<p>&nbsp;</p>
<p>When the score passes a threshold we add the search term&#8211;document pair to an Elasticsearch index. For every search executed against our solution, we first check Elasticsearch to see if the term is boosted. For &#8216;vpn&#8217; the search logs state:</p>
</div>
<div></div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/log.png"><img class="alignnone size-full wp-image-1967" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/log.png" alt="" width="605" height="50" /></a></p>
</div>
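<div>
<p>For reference, the lookup that produces log entries like this can be a plain term query against the boost index; a hedged sketch (index, type and field names are assumptions):</p>
<pre><code>POST /boosts/boost/_search
{
  "query": { "term": { "searchterm": "vpn" } }
}</code></pre>
</div>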
<div>
<p>&nbsp;</p>
<p>We can see how three documents are boosted for &#8216;vpn&#8217; (by choice, we only boost the top three). Using FAST ESP we wrap the original query with boosts for those specific documents.</p>
</div>
<div></div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/fql.png"><img class="alignnone size-full wp-image-1965" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/fql.png" alt="" width="546" height="179" /></a></p>
</div>
<div>
<p>&nbsp;</p>
<p>In FAST ESP, as well as in SharePoint Search 2013, the beloved XRANK operator is your friend. In a Lucene-based search application, use boost queries for this.</p>
</div>
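<div>
<p>In Solr, for instance, the equivalent of the XRANK wrapping above can be expressed as an edismax boost query; a hedged sketch (the document ids and boost values are made up):</p>
<pre><code>q=vpn&amp;defType=edismax&amp;bq=id:(doc1^16 doc2^8 doc3^4)</code></pre>
</div>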
<div>
<p>The search result returns the popular hits (only one shown here) at the top</p>
</div>
<div></div>
<div>
<p> <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/topdoc.png"><img class="alignnone size-full wp-image-1970" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/topdoc.png" alt="" width="605" height="105" /></a></p>
</div>
<div>
<p>&nbsp;</p>
<p>The ugly star and cheesy feedback are my way of telling users rather bluntly that things happened behind the scenes and that their actions will affect future searches. Currently there is no way of giving negative feedback to say &#8216;no, this is actually not a good hit&#8217;. Oh well.</p>
</div>
<div>
<p>As a bonus, all terms that result in boosted documents are, as far as we know, smart things to search for and free of spelling errors. Therefore all such terms are added to a second Elasticsearch index on which we base our query completion. (As a side note &#8211; if misspelled terms appear often enough to overcome the threshold, they could be part of your organization&#8217;s tribal language. If users choose to spell the term &#8220;definately&#8221; so often that it &#8220;makes the cut&#8221;, then the system should adapt to that.)</p>
</div>
<div>
<p><strong>Expanding documents to increase recall</strong></p>
</div>
<div>
<p>Often a user thinks of one document and searches for what, to them, identifies that document. That term might or might not be present in the document itself. If it isn&#8217;t, the document is not returned and the user becomes sad. Hopefully they alter their search and continue to look. Should they end up at their document, we have the tools needed to remedy the situation. Here is a concrete example:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/arch.png"><img class="alignnone size-full wp-image-1963" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/arch.png" alt="" width="594" height="217" /></a></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/02/arch.png"><img class="alignnone size-full wp-image-1963" src="http://blog.comperiosearch.com/wp-content/uploads/2014/02/arch.png" alt="" width="594" height="217" /></a></p>
</div>
<div></div>
<div>
<p>Here we can see that the node marked 1 might be tagged with &#8216;sort order refiner entries&#8217; or at least &#8216;refiner&#8217;, a term used twice when trying to find this document. (As an interesting side note, if you observe a lot of &#8216;sort X&#8217; followed by &#8216;sort Y&#8217;, you might consider adding a synonym between X and Y.) If a term or phrase is used often enough across different sessions, we save it to an Elasticsearch index. Each time a document is indexed, we look up the document in this index and add any popular search terms to a low-ranking field. This guarantees recall of the document, but it will not automatically top the results for those queries. This is a two-step process; if your search engine supports partial updates of documents, go with that instead.</p>
</div>
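<div>
<p>With Elasticsearch, the partial update variant is a single call; a hedged sketch (index, type, id and the low-ranking field name are assumptions):</p>
<pre><code>POST /intranet/document/42/_update
{
  "doc": { "recall_terms": "sort order refiner entries" }
}</code></pre>
</div>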
<div>
<p>Before adding this last step, we noticed that for some searches we boosted documents that didn&#8217;t get recalled and thus were never displayed to the user, even though we knew they were good hits!</p>
</div>
<div>
<p><strong>Closing words</strong></p>
</div>
<div>
<p>As a first step towards dynamic ranking this has shown good results. As long as your search engine supports query time boosting you can implement this.</p>
</div>
<div>
<p><strong>By the way</strong></p>
</div>
<div>
<p>It should be noted that SharePoint will actually do some of this for you. It comes with an interface meant for end users (as opposed to most search engines I&#8217;ve seen), and the UI contains event listeners on all links, tracking what you do. This is fed into a database and the data does affect ranking. As far as I know, only the last search term before a click is associated with the clicked link.</p>
</div>
<div>
<p>[1] One scenario that Piwik and click tracking does not pick up is if the sought information is found in the returned teasers. Search sessions that don’t end in a click might in fact have a happy ending.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/02/05/dynamic-search-ranking-using-elasticsearch-neo4j-and-piwik/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Search driven websites</title>
		<link>http://blog.comperiosearch.com/blog/2013/08/09/search-driven-websites/</link>
		<comments>http://blog.comperiosearch.com/blog/2013/08/09/search-driven-websites/#comments</comments>
		<pubDate>Fri, 09 Aug 2013 11:14:46 +0000</pubDate>
		<dc:creator><![CDATA[Håvard Eidheim, Victoria Værnø, Øyvind Wedøe]]></dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[handlebars]]></category>
		<category><![CDATA[Javascript]]></category>
		<category><![CDATA[prototype]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[search driven]]></category>
		<category><![CDATA[search driven website]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[students]]></category>
		<category><![CDATA[summer]]></category>
		<category><![CDATA[summer internship]]></category>
		<category><![CDATA[summer job]]></category>
		<category><![CDATA[summer project]]></category>
		<category><![CDATA[wordpress]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=1626</guid>
		<description><![CDATA[Search driven sites lets the reader find what they need on their own premises, not the architect&#8217;s! This year&#8217;s summer internship program has been centered around the concept “Search driven websites”. Over the course of the summer we&#8217;ve gone from a concept and a vision to a prototype website, and we&#8217;d like to share the potential [...]]]></description>
				<content:encoded><![CDATA[<h4>Search driven sites lets the reader find what they need on their own premises, not the architect&#8217;s!</h4>
<p>This year&#8217;s summer internship program has been centered around the concept “Search driven websites”. Over the course of the summer we&#8217;ve gone from a concept and a vision to a prototype website, and we&#8217;d like to share the potential we&#8217;ve uncovered along the way with you.</p>
<h3 style="text-align: center">What is a search driven website?</h3>
<p style="text-align: left" dir="ltr">On a search driven website, everything is driven by search. More useful and less painfully obvious, a search driven website will get all its content, all its navigation and menus as a result of a search query. This allows a different and more dynamic approach to the structuring and placement of content, in contrast to a typical folder based, editor dependent website where someone has to decide exactly which articles and images will reside on the front page, or any page, at any one time.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2013/08/searchbased2.jpg"><img class="wp-image-1560 aligncenter" src="http://blog.comperiosearch.com/wp-content/uploads/2013/08/searchbased2.jpg" alt="" width="466" height="290" /></a></p>
<p style="text-align: left" dir="ltr">A traditional site requires the user to spend time learning and navigating a cramped and complex menu structure to find the content desired. We believe that navigation can and should be done in a better way, through the use of an ever-ready search bar, dynamic navigators and relevant links. Even when displaying a single page, a search is performed to retrieve that article along with related results.</p>
<p style="text-align: left" dir="ltr">There are immediate advantages to using search as the main logic module of a website. Take the front page: Set the site&#8217;s opening search to &#8220;*&#8221; &#8211; search everything &#8211; and make the search engine order the results by relevance, popularity and date. Display the results as proper, readable articles, and you get a site containing the content you want to display without having to manually specify what goes where. A clear advantage here is that popular results will be displayed automatically on the front page, letting many of your readers find what they want immediately. This will also let your site adapt to your readers&#8217; interests automatically, and not just when you perform a user test or go through a slew of analytics.</p>
<h3 style="text-align: center">A unified experience</h3>
<p dir="ltr">Within this vision of search driven websites is the idea that just as the front page content is retrieved with a search, the reader&#8217;s personal searches should also be displayed like the front page. Let&#8217;s use a specific example. Below is a screen shot from the front page of aftenposten.no, a large Norwegian news site.</p>
<p style="text-align: center"><a href="http://blog.comperiosearch.com/wp-content/uploads/2013/08/front-page.png"><img class=" wp-image-1516 aligncenter" style="border: 1px solid black" src="http://blog.comperiosearch.com/wp-content/uploads/2013/08/front-page.png" alt="" width="599" height="380" /></a></p>
<p dir="ltr">This looks pretty good. Now let&#8217;s search for China and see how it looks.</p>
<p style="text-align: center"><a href="http://blog.comperiosearch.com/wp-content/uploads/2013/08/search.png"><img class="wp-image-1515 aligncenter" style="border: 1px solid black" src="http://blog.comperiosearch.com/wp-content/uploads/2013/08/search.png" alt="" width="570" height="402" /></a></p>
<p dir="ltr">Not looking so good anymore. When we perform a search, it no longer looks like a newspaper. It no longer IS a newspaper, it is a mini Google. What if, instead, the results looked just like the front page?</p>
<p style="text-align: center"><a href="http://blog.comperiosearch.com/wp-content/uploads/2013/08/front-page-search.png"><img class="aligncenter  wp-image-1514" style="border: 1px solid black" src="http://blog.comperiosearch.com/wp-content/uploads/2013/08/front-page-search.png" alt="" width="604" height="461" /></a></p>
<p style="text-align: left" dir="ltr">Now the search becomes an integrated part of the website. Readers no longer have to enter “search mode” to find what they are looking for, they simply type something into the search bar and get a site specifically tailored to their interests. It&#8217;s clean, fresh and feels natural. This part of the search driven vision could easily be implemented in most websites out there, and we believe that unifying the search and the rest of the website would make the Internet a more comfortable place to be.</p>
<h3 style="text-align: center">Reduce maintenance and integration costs</h3>
<p style="text-align: left" dir="ltr">The big advantage of search is that you can simply tell the search engine what you want and it will find the best content for you. We know this might sound obvious, but think about the consequences. Life becomes easier when all you need to do is ask the search to fill your site with content. Remember, both the site administrator and the readers can have a say in what the search engine will look for. To base your website on search can be a good alternative when it is important that relevant and updated content is displayed at all times. If your content is located in many different places and you have different readers with different information needs, a search based solution could be your best bet.</p>
<p style="text-align: left" dir="ltr">With a search based website, the decisions become all about what to search for in which situations. By gathering statistics and usage patterns, the search can be tweaked to give each reader a personalized experience. Whenever you want to add a new page to your site about a particular topic, you&#8217;d simply get the content of your new page through a customized search. The rest is just creating relevant content.</p>
<p style="text-align: left" dir="ltr">When businesses get  new software systems, merge with or acquire other businesses, you often get several sources for your content. This is an organizational pain and comes with high software integration costs. In the purest version of the search driven vision, all you would need to do is make sure your search engine can index the new source. There is no need to move your content or cross/double post articles, as long as the content is in the search engine it&#8217;ll be available through your search driven site.</p>
<p style="text-align: center"><a href="http://blog.comperiosearch.com/wp-content/uploads/2013/08/search-based-sources1.jpg"><img class=" wp-image-1601 aligncenter" src="http://blog.comperiosearch.com/wp-content/uploads/2013/08/search-based-sources1.jpg" alt="" width="486" height="324" /></a></p>
<p style="text-align: left" dir="ltr">Readers often want an overview of what exists on your site as they look for content. Search based navigation could be the answer. Up until recent years, people have been used to navigating through mutually exclusive folders, and this is the way many websites are built today. An alternative approach is to let the search engine figure out which categories exist, for example by observing where pages have been posted or what they are tagged with. Let clicks in the menu navigation be searches for the menu keywords, or a search for elements with a specific category tag. The real locations and structure of your content will no longer be important. This reduces the integration work on site navigation by not having to consider the actual locations of the content, just make sure the new content is categorized or tagged properly.</p>
<h3 style="text-align: center"> Our prototype</h3>
<p>This summer we have created a prototype for this concept, giving the rest of Comperio an example implementation and a platform for further discussions. You can see our prototype at the address below. To check out the category navigation, click &#8220;show categories&#8221; in the bottom right corner.</p>
<p style="text-align: center"><a title="beta.comperio.no" href="http://beta.comperio.no">beta.comperio.no</a></p>
<p>The site is composed of four parts: a search engine, an admin interface, a search API and a client. The client is created using Javascript and handlebars.js, the search engine is an Apache Solr instance, and the admin interface and search API are both contained in a WordPress plugin. The client requests content from the search API, which uses the search engine as a back end to find relevant content before it is returned to the client. Everything in the prototype is completely search driven, including the front page, single page view, related items and category navigation. The admin interface lets you control indexing as well as different search parameters.</p>
<h3 style="text-align: center">A last note</h3>
<p style="text-align: left">Over the summer we have built a website completely based on search. We believe that using search as the main content delivery method has great potential. Businesses today have vast amounts of information, and there is an ever increasing need for sharing, accessibility and flexibility. These are challenges that can&#8217;t be ignored and we have enjoyed uncovering, discussing, and hopefully getting closer to a solution to these problems in light of the concept of search driven web sites.</p>
<p style="text-align: center"><a href="http://blog.comperiosearch.com/wp-content/uploads/2013/08/sommerstudenter1.jpg"><img class=" wp-image-1644 aligncenter" src="http://blog.comperiosearch.com/wp-content/uploads/2013/08/sommerstudenter1.jpg" alt="" width="512" height="341" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2013/08/09/search-driven-websites/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
