<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; database</title>
	<atom:link href="http://blog.comperiosearch.com/blog/tag/database/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Solr: Indexing SQL databases made easier! &#8211; Part 2</title>
		<link>http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/#comments</comments>
		<pubDate>Tue, 14 Apr 2015 12:56:21 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[solr5]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3477</guid>
		<description><![CDATA[Last summer I wrote a blog post about indexing a MySQL database into Apache Solr. I would like to now revisit the post to update it for use with Solr 5 and start diving into how to implement some basic search features such as Facets Spellcheck Phonetic search Query Completion Setting up our environment The [...]]]></description>
				<content:encoded><![CDATA[<p>Last summer I wrote a <a href="http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/">blog post</a> about indexing a MySQL database into <a href="http://lucene.apache.org/solr/">Apache Solr</a>. I would like to now revisit the post to update it for use with Solr 5 and start diving into how to implement some basic search features such as</p>
<ul>
<li>Facets</li>
<li>Spellcheck</li>
<li>Phonetic search</li>
<li>Query Completion</li>
</ul>
<h2>Setting up our environment</h2>
<p>The requirements remain the same as with the original blogpost:</p>
<ol>
<li>Java 1.7 or greater</li>
<li>A <a href="http://dev.mysql.com/downloads/mysql/">MySQL</a> database</li>
<li>A copy of the <a href="https://launchpad.net/test-db/employees-db-1/1.0.6/+download/employees_db-full-1.0.6.tar.bz2">sample employees database</a></li>
<li>The MySQL <a href="http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.32.tar.gz">jdbc driver</a></li>
</ol>
<p>We&#8217;ll now be using Solr 5, which runs a little differently from previous incarnations of Solr. Download <a href="http://www.apache.org/dyn/closer.cgi/lucene/solr/5.0.0">Solr</a> and extract it to a directory of your choice. Open a terminal and navigate to your Solr directory.<br />
Start Solr with the command <pre class="crayon-plain-tag">bin/solr start</pre>.<img class="alignright wp-image-3497 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-11-at-20.30.03-300x114.png" alt="Solr Status" width="300" height="114" /></p>
<p>To confirm Solr successfully started up, run <pre class="crayon-plain-tag">bin/solr status</pre></p>
<p>Unlike previously, we now need to create a Solr core for our employee data. To do so, run the command <pre class="crayon-plain-tag">bin/solr create_core -c employees -d basic_configs</pre>. This will create a core named employees using Solr&#8217;s minimal configuration options. Try <pre class="crayon-plain-tag">bin/solr create_core -help</pre> to see what else is possible.</p>
<ol>
<li>Open server/solr/employees/conf/solrconfig.xml in a text editor and add the following within the config tags:
<div id="file-dataimporthandler2-LC1" class="line">
<pre class="crayon-plain-tag">&lt;lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" /&gt;
 
&lt;requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"&gt;
&lt;lst name="defaults"&gt;
&lt;str name="config"&gt;db-data-config.xml&lt;/str&gt;
&lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
</div>
</li>
<li>In the same directory, open schema.xml and add this line:<br />
<pre class="crayon-plain-tag">&lt;dynamicField name="*_name" type="text_general" multiValued="false" indexed="true" stored="true" /&gt;</pre>
</li>
<li>Create a lib subdir in server/solr/employees and extract the MySQL jdbc driver jar into it.</li>
<li>Finally, restart the Solr server with the command <pre class="crayon-plain-tag">bin/solr restart</pre></li>
</ol>
<p>When started this way, Solr runs by default on port 8983. To run it on a different port, use <pre class="crayon-plain-tag">bin/solr start -p portnumber</pre>, replacing portnumber with your preferred port.</p>
<p>Navigate to <a href="http://localhost:8983/solr">http://localhost:8983/solr</a> and you should see the Solr admin GUI splash page. From here, use the Core Selector dropdown button to select our employee core and then click on the Dataimport option. Expanding the Configuration section should show an XML response with a stacktrace with a message along the lines of <pre class="crayon-plain-tag">Can't find resource 'db-data-config.xml' in classpath</pre> . This is normal as we haven&#8217;t actually created this file yet, which stores the configs for connecting to our target database.</p>
<p>We&#8217;ll come back to that file later but let&#8217;s make our demo database now. If you haven&#8217;t already downloaded the sample employees database and installed MySQL, now would be a good time!</p>
<h2>Setting up our database</h2>
<p>Please refer to the instructions in the same section in the <a href="http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/">original blog post</a>. The steps are still the same.</p>
<h2>Indexing our database</h2>
<p>Again, please refer to the instructions in the same section in the original blog post. The only difference is the Postman collection should be imported from <a href="https://www.getpostman.com/collections/f7634c89cd9851dd2c13"> this url</a> instead. The commands you can use alternatively have also changed and are now</p><pre class="crayon-plain-tag">Clear index: http://localhost:8983/solr/employees/update?stream.body=&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&amp;commit=true
Retrieve all: http://localhost:8983/solr/employees/select?q=*:*&amp;omitHeader=true
Index db: http://localhost:8983/solr/employees/dataimport?command=full-import
Reload core: http://localhost:8983/solr/admin/cores?action=RELOAD&amp;core=employees
Georgi query: http://localhost:8983/solr/employees/select?q=georgi&amp;wt=json&amp;qf=first_name%20last_name&amp;defType=edismax
Facet query: http://localhost:8983/solr/employees/select?q=*:*&amp;wt=json&amp;facet=true&amp;facet.field=dept_s&amp;facet.field=title_s&amp;facet.mincount=1&amp;rows=0
Gorgi spellcheck: http://localhost:8983/solr/employees/select?q=gorgi&amp;wt=json&amp;qf=first_name&amp;defType=edismax
Georgi Phonetic: http://localhost:8983/solr/employees/select?q=georgi&amp;wt=json&amp;qf=first_name%20last_name%20phonetic&amp;defType=edismax</pre>
<h2>The next step</h2>
<p>We should now be back where we ended with the original blog post. So far we have successfully</p>
<ul>
<li>Set up a database with content</li>
<li>Indexed the database into our Solr index</li>
<li>Set up basic scheduled delta reindexing</li>
</ul>
<p>Let&#8217;s get started with the more interesting stuff!</p>
<h2>Facets</h2>
<p>Facets, also known as filters or navigators, allow a search user to refine and drill down through search results. Before we get started with them, we need to update our data import configuration. Replace the contents of our existing db-data-config.xml with:</p>
<div id="file-db-data-config2-LC1" class="line">
<pre class="crayon-plain-tag">select e.emp_no as 'id', e.birth_date,
(
select t.title
order by t.`from_date` desc
limit 1
) as 'title_s', e.first_name, e.last_name, e.gender as 'gender_s', d.`dept_name` as 'dept_s'
from employees e
join dept_emp de on de.`emp_no` = e.`emp_no`
join departments d on d.`dept_no` = de.`dept_no`
join titles t on t.`emp_no` = e.`emp_no`
group by e.`emp_no`
limit 1000;</pre><br />
To be able to facet, we need appropriate fields upon which to actually facet. Our new SQL retrieves additional fields such as employee titles and departments. Fields perfect for use as facets.</p>
</div>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-09-23-at-10.27.47.png"><img class="aligncenter size-medium wp-image-3520" src="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-09-23-at-10.27.47.png" alt="Updated Employee SQL" width="300" height="166" /></a><br />
You&#8217;ll notice we map title, gender and dept_name to title_s, gender_s and dept_s respectively. This allows us to take advantage of an existing dynamic field mapping in Solr&#8217;s default basic config, *_s. A dynamic field allows us to assign all fields with a certain prefix or suffix the same field type. In this case, given the field type <pre class="crayon-plain-tag">&lt;dynamicField name="*_s" type="string" indexed="true" stored="true" /&gt;</pre>, any fields ending with _s will be indexed and stored as basic strings. Solr will not tokenise them or modify their contents. This allows us to safely use them for faceting without worrying about department titles being split on white spaces, for example.</p>
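<p>For completeness, here is a minimal sketch of what the full db-data-config.xml can look like with the statement above plugged into the entity&#8217;s query attribute. The driver, connection URL, user and password are placeholders to replace with your own:</p>
<pre class="crayon-plain-tag">&lt;dataConfig&gt;
&lt;dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/employees" user="root" password="yourpassword" /&gt;
&lt;document&gt;
&lt;!-- the query attribute holds the SQL statement shown above --&gt;
&lt;entity name="employee" query="select e.emp_no as 'id', ... limit 1000"&gt;&lt;/entity&gt;
&lt;/document&gt;
&lt;/dataConfig&gt;</pre>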
<ol>
<li>Clear the index and restart Solr.<a href="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-13-at-17.06.22.png"><img class="alignright wp-image-3533 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-13-at-17.06.22-196x300.png" alt="Facet Query" width="196" height="300" /></a></li>
<li>Once Solr has restarted, reindex the database with our new SQL. Don&#8217;t be alarmed if this takes a bit longer than previously. It&#8217;s a bit more heavyweight and not very well optimised!</li>
<li>Once it&#8217;s done indexing, we can confirm it was successful by running the facet query via Postman or directly in our browser.</li>
<li>We should see two hits for the query &#8220;georgi&#8221; along with facets for their respective titles and department.</li>
</ol>
<h2>The anatomy of a facet query</h2>
<p>Let&#8217;s take a closer look at the relevant request parameters of our facet query: <pre class="crayon-plain-tag">http://localhost:8983/solr/employees/select?q=georgi&amp;wt=json&amp;qf=first_name%20last_name&amp;defType=edismax&amp;omitHeader=true&amp;facet=true&amp;facet.field=dept_s&amp;facet.field=title_s&amp;facet.mincount=1</pre></p>
<ul>
<li>facet &#8211; Tells Solr to enable or disable faceting. Accepted values include yes, on and true to enable; no, off and false to disable</li>
<li>facet.field &#8211; Which field we want to facet on, can be defined multiple times</li>
<li>facet.mincount &#8211; The minimum number of times a facet value must occur in the query results for it to be included in the facet response. Can be defined per facet field with the syntax f.fieldName.facet.mincount=1</li>
</ul>
<p>There are many other facet parameters. I recommend taking a look at the Solr wiki pages on <a href="https://wiki.apache.org/solr/SolrFacetingOverview">faceting</a> and other <a href="https://wiki.apache.org/solr/SimpleFacetParameters">possible parameters</a>.</p>
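<p>Put together, the facet portion of a JSON response will look something like the following. The values here are purely illustrative and depend on your data:</p>
<pre class="crayon-plain-tag">"facet_counts": {
  "facet_queries": {},
  "facet_fields": {
    "dept_s": ["Development", 1, "Production", 1],
    "title_s": ["Senior Engineer", 2]
  }
}</pre>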
<h2>Spellcheck</h2>
<p>Analysing query logs and focusing on those queries that gave zero hits is a quick and easy way to see what can and should be done to improve your search solution. More often than not you will come across a great deal of spelling errors. Adding spellcheck to a search solution gives such great value for a tiny bit of effort. This fruit is so low hanging it should hit you in the face!</p>
<p>To enable spellcheck, we need to make some configuration changes.</p>
<ol>
<li>In our schema.xml, add these two lines after the *_name dynamic field we added earlier:
<div class="line">
<pre class="crayon-plain-tag">&lt;copyField source="*_name" dest="spellcheck" /&gt;
&lt;field name="spellcheck" type="text_general" indexed="true" stored="true" multiValued="true" /&gt;</pre>
</div>
<p>A copyField checks for fields whose names match the pattern defined in source and copies their contents to the dest field. In our case, we will copy content from first_name and last_name to spellcheck. We then define the spellcheck field as multiValued to handle its multiple sources.</p></li>
<li>Add the following to our solrconfig.xml:
<div id="file-spellcheck-LC1" class="line">
</p><pre class="crayon-plain-tag">&lt;searchComponent name="spellcheck" class="solr.SpellCheckComponent"&gt;
&lt;str name="queryAnalyzerFieldType"&gt;text_general&lt;/str&gt;
&lt;!-- a spellchecker built from a field of the main index --&gt;
&lt;lst name="spellchecker"&gt;
&lt;str name="name"&gt;default&lt;/str&gt;
&lt;str name="field"&gt;spellcheck&lt;/str&gt;
&lt;str name="classname"&gt;solr.DirectSolrSpellChecker&lt;/str&gt;
&lt;!-- the spellcheck distance measure used, the default is the internal levenshtein --&gt;
&lt;str name="distanceMeasure"&gt;internal&lt;/str&gt;
&lt;!-- minimum accuracy needed to be considered a valid spellcheck suggestion --&gt;
&lt;float name="accuracy"&gt;0.5&lt;/float&gt;
&lt;!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 --&gt;
&lt;int name="maxEdits"&gt;2&lt;/int&gt;
&lt;!-- the minimum shared prefix when enumerating terms --&gt;
&lt;int name="minPrefix"&gt;1&lt;/int&gt;
&lt;!-- maximum number of inspections per result. --&gt;
&lt;int name="maxInspections"&gt;5&lt;/int&gt;
&lt;!-- minimum length of a query term to be considered for correction --&gt;
&lt;int name="minQueryLength"&gt;4&lt;/int&gt;
&lt;!-- the maximum fraction of documents a query term may appear in to be considered for correction --&gt;
&lt;float name="maxQueryFrequency"&gt;0.01&lt;/float&gt;
&lt;!-- uncomment this to require suggestions to occur in 1% of the documents
&lt;float name="thresholdTokenFrequency"&gt;.01&lt;/float&gt;
--&gt;
&lt;/lst&gt;
&lt;/searchComponent&gt;</pre>
</div>
<p>This will create a spellchecker component that uses the spellcheck field as its dictionary source. The spellcheck field contains content copied from both the first and last name fields.</p></li>
<li>In the same file, look for the select requestHandler and update it to include the spellcheck component:
<div id="file-select-LC1" class="line">
</p><pre class="crayon-plain-tag">&lt;requestHandler name="/select" class="solr.SearchHandler"&gt;
&lt;!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
--&gt;
&lt;lst name="defaults"&gt;
&lt;str name="echoParams"&gt;explicit&lt;/str&gt;
&lt;int name="rows"&gt;10&lt;/int&gt;
&lt;str name="spellcheck"&gt;on&lt;/str&gt;
&lt;str name="spellcheck.dictionary"&gt;default&lt;/str&gt;
&lt;/lst&gt;
&lt;!-- Add this to enable spellcheck --&gt;
&lt;arr name="last-components"&gt;
&lt;str&gt;spellcheck&lt;/str&gt;
&lt;/arr&gt;
&lt;/requestHandler&gt;</pre>
</div>
</li>
</ol>
<p>The defaults list in a requestHandler defines which default parameters to add to each request made using the chosen request handler. You could, for example, define which fields to query. In this case we&#8217;re enabling spellcheck and using the default dictionary as defined in our solrconfig.xml. All values in the defaults list can be overwritten per request. To include request parameters that cannot be overwritten, we would need to use an invariants list instead:</p><pre class="crayon-plain-tag">&lt;lst name="invariants"&gt;
&lt;str name="defType"&gt;edismax&lt;/str&gt;
&lt;/lst&gt;</pre><p>Both lists can be used simultaneously. When duplicate keys are present, the values in the invariants list take precedence.</p>
<p>Once we&#8217;ve made all our configuration changes, let&#8217;s restart Solr and reindex. To verify the changes worked, do a basic retrieve all query and check the resulting documents for the spellcheck field. Its contents should be the same as the document&#8217;s first_name and last_name fields.</p>
<p>Because we have enabled spellcheck by default,<a href="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-14-at-15.20.45.png"><img class="alignright size-medium wp-image-3575" src="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-14-at-15.20.45-174x300.png" alt="Gorgi" width="174" height="300" /></a> queries with possible suggestions will include contents in the spellcheck response object.</p>
<p>Try the Gorgi spellcheck query and experiment with different queries. To query the last_name field as well, change the qf parameter to <pre class="crayon-plain-tag">qf=first_name last_name</pre>.</p>
<p>The qf parameter defines which fields to use as the search domain.</p>
<p>When the spellcheck response object has content, you can easily use it to implement a basic &#8220;did you mean&#8221; feature. This will vastly improve your zero hit page.</p>
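<p>As a reference point, an abridged spellcheck object for the query gorgi might look something like this; the exact offsets, counts and suggestions depend on your index:</p>
<pre class="crayon-plain-tag">"spellcheck": {
  "suggestions": [
    "gorgi", {
      "numFound": 1,
      "startOffset": 0,
      "endOffset": 5,
      "suggestion": ["georgi"]
    }
  ]
}</pre>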
<h2>Phonetic Search</h2>
<p>Now that we have a basic spellcheck component in place, the next best feature that easily creates value in a people search system is phonetics. Solr ships with some <a href="https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory">basic</a> <a href="https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.DoubleMetaphoneFilterFactor">phonetic</a> <a href="https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.BeiderMorseFilterFactory">tokenisers</a>. The most commonly used out of the box phonetic tokeniser is the DoubleMetaphoneFilterFactory. It will suffice for most use cases. It does, however, have some weaknesses, which we will go into briefly in the next section.</p>
<p>We need to once again modify our schema.xml to take advantage of Solr&#8217;s phonetic capabilities. Add the following:</p><pre class="crayon-plain-tag">&lt;fieldType name="phonetic" class="solr.TextField" &gt;
 &lt;analyzer&gt;
 &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
 &lt;filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="4"/&gt;
 &lt;/analyzer&gt;
 &lt;/fieldType&gt;

 &lt;copyField source="*_name" dest="phonetic" /&gt;
 &lt;field name="phonetic" type="phonetic" indexed="true" stored="false" multiValued="true" /&gt;</pre><p>Similar to spellcheck, we copy contents from the name fields into a phonetic field. Here we define a phonetic field, whose values will not be stored as we don&#8217;t need to return them in search results. It is, however, indexed so we can actually include it in the search domain. Finally, like spellcheck, it is multivalued to handle multiple potential sources. The reason we create an additional search field is so we can apply different weightings to exact matches and phonetic matches.</p>
<p>Restart Solr, clear the index and reindex.</p>
<p>Running the Georgi Phonetic search request should now return hits based on exact and phonetic matches. To ensure that exact matches are ranked higher, we can add a query time boost to our query fields: <pre class="crayon-plain-tag">&amp;qf=first_name last_name phonetic^0.5</pre></p>
<p>Rather than apply boosts to fields we want to rank higher, it&#8217;s usually simpler to apply a punitive boost to fields we wish to rank lower. Replace the qf parameter in the Georgi Phonetic request and see how the first few results all have an exact match for georgi in the first_name field.</p>
<h2>Query Analysis</h2>
<p>As we look further down the result set, you will notice some strange matches. One employee, called Kirk Kalsbeek, is apparently a match for &#8220;georgi&#8221;. To understand why this is a match, we can use Solr&#8217;s analysis tool.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-14-at-17.09.01.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-14-at-17.09.01-300x247.png" alt="Solr Analysis" width="300" height="247" class="aligncenter size-medium wp-image-3585" /></a><br />
It allows us to define an indexed value, a query value and the field type to use, and then shows how each value is tokenised and whether or not the query would result in a match.</p>
<p>With the values Kirk Kalsbeek, georgi and phonetic respectively, the analysis tool shows us that Kirk gets tokenised to KRK by our phonetic field type. Georgi is also tokenised to KRK, which results in a match.</p>
<p>To create a better phonetic search solution, we would have to implement a custom phonetic tokeniser. I came across <a href="https://github.com/kvalle/norphoname"> an example</a>, which has helped me enormously in improving phonetic search for Norwegian names on a project.</p>
<h2>Conclusion</h2>
<p>We should now be able to </p>
<ul>
<li>Implement index field based spellcheck</li>
<li>Use basic faceting</li>
<li>Implement Solr&#8217;s out of the box phonetic capabilities</li>
</ul>
<p>Query completion I will leave for next time. I promise you won&#8217;t have to wait as long between posts as last time :)</p>
<p>Let me know how you get on in the comments below!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Replacing FAST ESP with Elasticsearch at Posten</title>
		<link>http://blog.comperiosearch.com/blog/2015/03/20/elasticsearch-at-posten/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/03/20/elasticsearch-at-posten/#comments</comments>
		<pubDate>Fri, 20 Mar 2015 10:00:52 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Comperio]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[elastic]]></category>
		<category><![CDATA[fast]]></category>
		<category><![CDATA[geosearch]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>
		<category><![CDATA[posten]]></category>
		<category><![CDATA[tilbudssok]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3364</guid>
		<description><![CDATA[First, some background A few years ago Comperio launched a nifty service for Posten Norge, Norway&#8217;s postal service. Through the service, retail companies can upload their catalogues and seasonal flyers to make the products listed within searchable. Although the catalogue handling and processing is also very interesting, we&#8217;re going to focus on the search side [...]]]></description>
				<content:encoded><![CDATA[<h2>First, some background</h2>
<p>A few years ago Comperio launched a nifty service for <a title="Posten Norge" href="http://www.posten.no/">Posten Norge</a>, Norway&#8217;s postal service. Through the service, retail companies can upload their catalogues and seasonal flyers to make the products listed within searchable. Although the catalogue handling and processing is also very interesting, we&#8217;re going to focus on the search side of things in this post. As Comperio has a long relationship and a great deal of experience with <a title="FAST ESP" href="http://blog.comperiosearch.com/blog/2012/07/30/comperio-still-likes-fast-esp/">FAST ESP</a>, this first iteration of Posten&#8217;s <a title="Tilbudssok" href="http://tilbudssok.posten.no/">Tilbudssok</a> used it as the search backend. It also incorporated Comperio Front, our search middleware product, which recently <a title="Comperio Front" href="http://blog.comperiosearch.com/blog/2015/02/16/front-5-released/">had a big release</a>.</p>
<h2>Newer is better</h2>
<p>Unfortunately, FAST ESP is getting on a bit and as a result Tilbudssok has been limited by what we can coax out of it. To ensure we provide the best possible search solution we decided it was time to upgrade and chose <a title="Elasticsearch" href="https://www.elastic.co/products">Elasticsearch</a> as the best candidate. If you are unfamiliar with Elasticsearch, take a moment to browse our other <a title="Elasticsearch blog posts" href="http://blog.comperiosearch.com/blog/tag/elasticsearch/">blog posts</a> on the subject. The resulting project had three main requirements:</p>
<ul>
<li>Replace FAST ESP with Elasticsearch while otherwise maintaining as much of the existing architecture as possible</li>
<li>Add geodata to products such that a user could find the nearest store where they were available</li>
<li>Setup sexy log analysis with <a title="Logstash" href="https://www.elastic.co/products/logstash">Logstash</a> and <a title="Kibana" href="https://www.elastic.co/products/kibana">Kibana</a></li>
</ul>
<h2>Data Sources, Ingestion and Processing</h2>
<p>The data source for the search system is a MySQL database populated with catalogue and product data. A separate Comperio system generates this data when Posten&#8217;s customers upload PDFs of their brochures, i.e. we fully own the entire data generation process.</p>
<p>The FAST ESP based solution made use of FAST&#8217;s JDBC connector to feed data directly to the search index. Inspired by <a title="Elasticsearch: Indexing SQL databases. The easy way." href="http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/">Christopher&#8217;s blog post</a>, we made use of the <a title="Elasticsearch JDBC River Plugin" href="https://github.com/jprante/elasticsearch-river-jdbc">JDBC plugin for Elasticsearch</a>. This allowed us to use the same SQL statements to feed Elasticsearch. It took us no more than a couple of hours, including some time wrestling with field mappings, to populate our Elasticsearch index with the same data as the FAST one.</p>
<p>We then needed to add store geodata to the index. As mentioned earlier, we completely own the data flow, so we simply extended our existing catalogue/product uploader system to include a store uploader service. Google&#8217;s <a title="Google Geocoder" href="https://code.google.com/p/geocoder-java/">geocoder</a> handled converting addresses to coordinates for use with Elasticsearch&#8217;s geo distance sorting. We now had store data in our database. An extra JDBC river and another round of mapping wrestling got that same data into the Elasticsearch index.</p>
<h2>Our approach</h2>
<p>Before the conversion to Elasticsearch, the Posten system architecture was typical of most Comperio projects. Users interact with a Java based frontend web application. This in turn sends queries to Comperio&#8217;s search abstraction layer, <a title="Comperio Front" href="http://blog.comperiosearch.com/blog/2015/02/16/front-5-released/">Comperio Front</a>. This formats requests such that the system&#8217;s search engine, in our case FAST ESP, can understand them. Upon receiving a response from the search engine, Front then formats it into a frontend friendly format i.e. JSON or XML depending on developer preference.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/tilbudssok_architecture.png"><img class="size-medium wp-image-3422 aligncenter" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/tilbudssok_architecture-300x145.png" alt="Generic Search Architecture" width="300" height="145" /></a></p>
<p>Unfortunately, when we started the project, Front&#8217;s Elasticsearch adapter was still a bit immature. It also felt a bit overkill to include it when Elasticsearch already has such a <a href="http://www.elastic.co/guide/en/elasticsearch/client/java-api/current/">robust Java API</a>. I saw an opportunity to reduce the system&#8217;s complexity and learn more about interacting with Elasticsearch&#8217;s Java API, and took it. With what I learnt, we could later beef up Front&#8217;s Elasticsearch adapter for future projects.</p>
<p>As a side note, we briefly flirted with the idea of replacing the entire frontend with a <a href="http://blog.comperiosearch.com/blog/2013/10/24/instant-search-with-angularjs-and-elasticsearch/">hipstery Javascript/Node.js ecosystem</a>. It was trivial to throw together a working system very quickly, but in the interest of maintaining the existing architecture and keeping the project run time down, we opted to stick with the existing Java based MVC framework.</p>
<p>After a few rounds of Googling, struggling with documentation and finally simply diving into the code, I was able to piece together the bits of the Elasticsearch Java API puzzle. It is a joy to work with! There are builder classes for pretty much everything. All of our queries start with a basic SearchRequestBuilder. Depending on the scenario, we can then modify this SRB with various flavours of QueryBuilders, FilterBuilders, SortBuilders and AggregationBuilders to handle every potential use case. Here is a greatly simplified example of a filtered search with aggregates:</p>
<script src="https://gist.github.com/92772945f5281df54c3b.js?file=SRBExample"></script>
<h2>Logstash and Kibana</h2>
<p>With our Elasticsearch based system up and ready to roll, the next step was to fulfil our sexy query logging project requirement. This raised an interesting question: where are the query logs? As it turns out (please contact us if we&#8217;re wrong), the only query logging available is something called <a title="Slow Log" href="http://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-slowlog.html">slow logging</a>. It is a shard level log where you can set thresholds for the query or fetch phase of the execution. We found this log severely lacking in basic details such as hit count and actual query parameters. It seemed like we could only track query time and the query string.</p>
<p>Rather than fight with this slow log, we implemented our own custom logger in our web app to log salient parts of the search request and response. To make our lives easier everything is logged as JSON. This makes hooking up with <a title="Logstash" href="http://logstash.net/">Logstash</a> trivial, as our logstash config reveals:</p>
<script src="https://gist.github.com/43e3603bd75fd549a582.js?file=logstashconf"></script>
<p><a title="Kibana 4" href="http://blog.comperiosearch.com/blog/2015/02/09/kibana-4-beer-analytics-engine/">Kibana 4</a>, the latest version of Elastic&#8217;s log visualisation suite, was released in February, around the same time as we were wrapping up our logging logic. We had been planning on using Kibana 3, but this was a perfect opportunity to learn how to use version 4 and create some awesome dashboards for our customer:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_query.png"><img class="aligncenter size-medium wp-image-3444" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_query-300x169.png" alt="kibana_query" width="300" height="169" /></a></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_ams.png"><img class="aligncenter size-medium wp-image-3443" src="http://blog.comperiosearch.com/wp-content/uploads/2015/03/kibana_ams-300x135.png" alt="kibana_ams" width="300" height="135" /></a></p>
<p>Kibana 4 is wonderful to work with and will generate so much extra value for Posten and their customers.</p>
<h2>Conclusion</h2>
<ul>
<li>Although the Elasticsearch Java API itself is well rounded and complete, its documentation can be a bit frustrating. But this is why we write blog posts: to share our experiences!</li>
<li>Once we got past the initial learning curve, we were able to create an awesome Elasticsearch Java API toolbox</li>
<li>We were severely disappointed with the built-in query logging. I hope to extract our custom logger and make it more generic so everyone else can use it too.</li>
<li>The Google Maps API is fun and super easy to work with</li>
</ul>
<p>Rivers as a data ingestion tool have long been marked for deprecation. When we next want to upgrade our Elasticsearch version we will need to replace them entirely with some other tool. Although Logstash is touted as Elasticsearch&#8217;s main equivalent of a connector framework, it currently lacks classic Enterprise search data source connectors. <a title="Apache Manifold" href="http://manifoldcf.apache.org/">Apache Manifold</a> is a mature open source connector framework that would cover our needs. The latest release has not been tested with the latest version of Elasticsearch, but it supports versions 1.1-3.</p>
<p>Once the solution goes live, during April, Kibana will really come into its own as we get more and more data.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/03/20/elasticsearch-at-posten/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Solr: Indexing SQL databases made easier!</title>
		<link>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/#comments</comments>
		<pubDate>Thu, 28 Aug 2014 12:05:17 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[jdbc]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[people search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2848</guid>
		<description><![CDATA[Update Part two is now available here! At the beginning of this year Christopher Vig wrote a great post about indexing an SQL database to the internet&#8217;s current search engine du jour, Elasticsearch. This first post in a two part series will show that Apache Solr is a robust and versatile alternative that makes indexing [...]]]></description>
				<content:encoded><![CDATA[<h3>Update</h3>
<p>Part two is now available <a href="http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/">here!</a></p>
<hr />
<p>At the beginning of this year <a href="http://blog.comperiosearch.com/blog/author/cvig/">Christopher Vig</a> wrote a <a href="http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/">great post</a> about indexing an SQL database to the internet&#8217;s current search engine du jour, <a href="http://www.elasticsearch.org/">Elasticsearch</a>. This first post in a two part series will show that <a href="http://lucene.apache.org/solr/">Apache Solr</a> is a robust and versatile alternative that makes indexing an SQL database just as easy. The second will go deeper into how to leverage Solr&#8217;s features to create a great backend for a people search solution.</p>
<p>Solr ships with a configuration driven contrib called the <a href="http://wiki.apache.org/solr/DataImportHandler">DataImportHandler.</a> It provides a way to index structured data into Solr in both full and incremental delta imports. We will cover a simple use case of the tool i.e. indexing a database containing personnel data to form the basis of a people search solution. You can also easily extend the DataImportHandler tool via various <a href="http://wiki.apache.org/solr/DataImportHandler#Extending_the_tool_with_APIs">APIs</a> to pre-process data and handle more complex use cases.</p>
<p>For now, let&#8217;s stick with basic indexing of an SQL database.</p>
<h2>Setting up our environment</h2>
<p>Before we get started, there are a few requirements:</p>
<ol>
<li>Java 1.7 or greater</li>
<li>For this demo we&#8217;ll be using a <a href="http://dev.mysql.com/downloads/mysql/">MySQL</a> database</li>
<li>A copy of the <a href="https://launchpad.net/test-db/employees-db-1/1.0.6/+download/employees_db-full-1.0.6.tar.bz2">sample employees database</a></li>
<li>The MySQL <a href="http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.32.tar.gz">jdbc driver</a></li>
</ol>
<p>With that out of the way, let&#8217;s get Solr up and running and ready for database indexing:</p>
<ol>
<li>Download <a href="https://lucene.apache.org/solr/downloads.html">Solr</a> and extract it to a directory of your choice.</li>
<li>Open solr-4.9.0/example/solr/collection1/conf/solrconfig.xml in a text editor and add the following within the config tags:  <script src="https://gist.github.com/dd7cef212fd7f6a415b5.js?file=DataImportHandler"></script></li>
<li>In the same directory, open schema.xml and add this line: <script src="https://gist.github.com/5bbc8c6e1a5b617b5d16.js?file=names"></script></li>
<li>Create a lib subdir in solr-4.9.0/example/solr/collection1/ and extract the MySQL jdbc driver jar into it. It&#8217;s the file called mysql-connector-java-{version}-bin.jar</li>
<li>To start Solr, open a terminal and navigate to the example subdir in your extracted Solr directory and run <code>java -jar start.jar</code></li>
</ol>
<p>When started this way, Solr runs by default on port 8983. If you need to change this, edit solr-4.9.0/example/etc/jetty.xml and restart Solr.</p>
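<p>For reference, the port is defined along these lines in jetty.xml, so you can either change the default there or pass it as a system property, e.g. <code>java -Djetty.port=8984 -jar start.jar</code>:</p>
<pre class="crayon-plain-tag">&lt;Set name="port"&gt;&lt;SystemProperty name="jetty.port" default="8983"/&gt;&lt;/Set&gt;</pre>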
<p>Navigate to <a href="http://localhost:8983/solr">http://localhost:8983/solr</a> and you should see the Solr admin GUI splash page. From here, use the Core Selector dropdown button to select the default core and then click on the Dataimport option. Expanding the Configuration section should show an XML response with a stacktrace with a message along the lines of <code>Can't find resource 'db-data-config.xml' in classpath</code>. This is normal as we haven&#8217;t actually created this file yet, which stores the configs for connecting to our target database.</p>
<p>We&#8217;ll come back to that file later but let&#8217;s make our demo database now. If you haven&#8217;t already downloaded the sample employees database and installed MySQL, now would be a good time!</p>
<h2>Setting up our database</h2>
<p>Assuming your MySQL server is installed <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/12/createdatabase.png"><img class="alignright size-full wp-image-2900" src="http://blog.comperiosearch.com/wp-content/uploads/2014/12/createdatabase-300x226.png" alt="Prepare indexing database" width="300" height="226" /></a>and running, access the MySQL terminal and create the empty employees database: <code>create database employees;</code></p>
<p>Exit the MySQL terminal and import the employees.sql into your empty database, ensuring that you carry out the following command from the same directory as the employees.sql file itself: <code>mysql -u root -p employees &lt; employees.sql</code></p>
<p>You can test this was successful by logging <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/testdatabase.png"><img class="alignright size-medium wp-image-2900" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/testdatabase-276x300.png" alt="Verify indexing database" width="276" height="300" /></a>into the MySql server and querying the database, as shown here on the right.</p>
<p>Having successfully created and populated your employee database, we can now create that missing db-data-config.xml file.</p>
<h2>Indexing our database</h2>
<p>In your Solr conf directory, which contains the schema.xml and solrconfig.xml we previously modified, create a new file called db-data-config.xml.</p>
<p>Its contents should look like the example below. Make sure to replace the user and password values with yours and feel free to modify or remove the limit parameter. There are approximately 300,000 entries in the employees table in total. <script src="https://gist.github.com/03935f1384e150504363.js?file=db-data-config"></script></p>
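<p>If the embedded gist doesn&#8217;t render for you, a minimal db-data-config.xml covering the full import case looks roughly like this; the credentials are placeholders and the field list is trimmed to the fields our schema knows about:</p>
<pre class="crayon-plain-tag">&lt;dataConfig&gt;
&lt;dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost:3306/employees" user="root" password="yourpassword" /&gt;
&lt;document&gt;
&lt;entity name="employee" query="select emp_no as 'id', first_name, last_name from employees limit 1000"&gt;&lt;/entity&gt;
&lt;/document&gt;
&lt;/dataConfig&gt;</pre>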
<p>We&#8217;re now going to make use of Solr&#8217;s REST-like HTTP API with a couple of commands worth saving. I prefer to use the <a href="https://chrome.google.com/webstore/detail/postman-rest-client/fdmmgilgnpjigdojojpjoooidkmcomcm">Postman app</a> on Chrome and have created a public collection of HTTP requests, which you can import into Postman&#8217;s Collections view using this url: <a href="https://www.getpostman.com/collections/9e95b8130556209ed643">https://www.getpostman.com/collections/9e95b8130556209ed643</a></p>
<p>For those of you not using Chrome, here are the commands you will need:<script src="https://gist.github.com/05a2a1dd01a6c5a4517b.js?file=solr-http"></script> First let&#8217;s reload the core so that Solr is <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/reloadcore.png"><img class="alignright size-medium wp-image-2921" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/reloadcore-300x181.png" alt="Reload Solr core" width="300" height="181" /></a> aware of the new db-data-config.xml file we have created. Next, we index our database with the <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/indexdb.png"><img class="alignright size-medium wp-image-2923" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/indexdb-300x181.png" alt="Index database to Solr" width="300" height="181" /></a>HTTP request or from within the Solr Admin GUI on the DataImport page.</p>
<p>Here we have carried out a full index of our database using the full-import command parameter. To only retrieve changes since the last import, we would use delta-import instead.</p>
<p>We can confirm that our database import was successful by querying our index with the &#8220;Retrieve all&#8221; and &#8220;Georgi query&#8221; requests.</p>
<p>Finally, to schedule reindexing you can use a simple cronjob. This one, for example, will run every day at 23:00 and retrieve all changes since the previous indexing operation:<script src="https://gist.github.com/47f6df5a306e4cd51617.js?file=delta"></script></p>
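<p>One way to express such a cronjob, assuming the default collection1 core and curl available on the host:</p>
<pre class="crayon-plain-tag">0 23 * * * curl -s "http://localhost:8983/solr/collection1/dataimport?command=delta-import" &gt; /dev/null</pre>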
<h2>Conclusion</h2>
<p>So far we have successfully</p>
<ul>
<li>Set up a database with content</li>
<li>Indexed the database into our Solr index</li>
<li>Set up basic scheduled delta reindexing</li>
</ul>
<p>In the next part of this two part series we will look at how to process our indexed data, specifically with a view to making a good people search solution. We will implement several features such as phonetic search, spellcheck and basic query completion. In the meantime, let&#8217;s carry on the conversation in the comments below!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>
