<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; jdbc</title>
	<atom:link href="http://blog.comperiosearch.com/blog/tag/jdbc/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Solr: Indexing SQL databases made easier!</title>
		<link>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/#comments</comments>
		<pubDate>Thu, 28 Aug 2014 12:05:17 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[jdbc]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[people search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2848</guid>
		<description><![CDATA[Update Part two is now available here! At the beginning of this year Christopher Vig wrote a great post about indexing an SQL database to the internet&#8217;s current search engine du jour, Elasticsearch. This first post in a two part series will show that Apache Solr is a robust and versatile alternative that makes indexing [...]]]></description>
				<content:encoded><![CDATA[<h3>Update</h3>
<p>Part two is now available <a href="http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/">here!</a></p>
<hr />
<p>At the beginning of this year <a href="http://blog.comperiosearch.com/blog/author/cvig/">Christopher Vig</a> wrote a <a href="http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/">great post</a> about indexing an SQL database to the internet&#8217;s current search engine du jour, <a href="http://www.elasticsearch.org/">Elasticsearch</a>. This first post in a two-part series will show that <a href="http://lucene.apache.org/solr/">Apache Solr</a> is a robust and versatile alternative that makes indexing an SQL database just as easy. The second will go deeper into how to leverage Solr&#8217;s features to create a great backend for a people search solution.</p>
<p>Solr ships with a configuration-driven contrib called the <a href="http://wiki.apache.org/solr/DataImportHandler">DataImportHandler</a>. It provides a way to index structured data into Solr via both full and incremental delta imports. We will cover a simple use case of the tool, i.e. indexing a database containing personnel data to form the basis of a people search solution. You can also easily extend the DataImportHandler tool via various <a href="http://wiki.apache.org/solr/DataImportHandler#Extending_the_tool_with_APIs">APIs</a> to pre-process data and handle more complex use cases.</p>
<p>For now, let&#8217;s stick with basic indexing of an SQL database.</p>
<h2>Setting up our environment</h2>
<p>Before we get started, there are a few requirements:</p>
<ol>
<li>Java 1.7 or greater</li>
<li>For this demo we&#8217;ll be using a <a href="http://dev.mysql.com/downloads/mysql/">MySQL</a> database</li>
<li>A copy of the <a href="https://launchpad.net/test-db/employees-db-1/1.0.6/+download/employees_db-full-1.0.6.tar.bz2">sample employees database</a></li>
<li>The MySQL <a href="http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.32.tar.gz">jdbc driver</a></li>
</ol>
<p>With that out of the way, let&#8217;s get Solr up and running and ready for database indexing:</p>
<ol>
<li>Download <a href="https://lucene.apache.org/solr/downloads.html">Solr</a> and extract it to a directory of your choice.</li>
<li>Open solr-4.9.0/example/solr/collection1/conf/solrconfig.xml in a text editor and add the following within the config tags:  <script src="https://gist.github.com/dd7cef212fd7f6a415b5.js?file=DataImportHandler"></script></li>
<li>In the same directory, open schema.xml and add this line:   <script src="https://gist.github.com/5bbc8c6e1a5b617b5d16.js?file=names"></script></li>
<li>Create a lib subdir in solr-4.9.0/example/solr/collection1/ and extract the MySQL jdbc driver jar into it. It&#8217;s the file called mysql-connector-java-{version}-bin.jar</li>
<li>To start Solr, open a terminal and navigate to the example subdir in your extracted Solr directory and run <code>java -jar start.jar</code></li>
</ol>
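<p>For reference, the two gist snippets above boil down to additions along the following lines. The exact lib paths and the field definition here are illustrative sketches of a typical Solr 4.9 setup, not necessarily identical to the gists:</p>

```xml
<!-- solrconfig.xml: load the DIH and JDBC driver jars, then register the handler -->
<lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
<lib dir="../lib/" regex="mysql-connector-java-.*\.jar" />

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">db-data-config.xml</str>
  </lst>
</requestHandler>

<!-- schema.xml: an example field for imported employee names -->
<field name="first_name" type="text_general" indexed="true" stored="true" />
```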
<p>When started this way, Solr runs by default on port 8983. If you need to change this, edit solr-4.9.0/example/etc/jetty.xml and restart Solr.</p>
<p>Navigate to <a href="http://localhost:8983/solr">http://localhost:8983/solr</a> and you should see the Solr admin GUI splash page. From here, use the Core Selector dropdown to select the default core and then click on the Dataimport option. Expanding the Configuration section should show an XML response containing a stacktrace with a message along the lines of <code>Can't find resource 'db-data-config.xml' in classpath</code>. This is expected: we haven&#8217;t yet created this file, which stores the configuration for connecting to our target database.</p>
<p>We&#8217;ll come back to that file later but let&#8217;s make our demo database now. If you haven&#8217;t already downloaded the sample employees database and installed MySQL, now would be a good time!</p>
<h2>Setting up our database</h2>
<p>Assuming your MySQL server is installed <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/12/createdatabase.png"><img class="alignright size-full wp-image-2900" src="http://blog.comperiosearch.com/wp-content/uploads/2014/12/createdatabase-300x226.png" alt="Prepare indexing database" width="300" height="226" /></a>and running, access the MySQL terminal and create the empty employees database: <code>create database employees;</code></p>
<p>Exit the MySQL terminal and import the employees.sql into your empty database, ensuring that you carry out the following command from the same directory as the employees.sql file itself: <code>mysql -u root -p employees &lt; employees.sql</code></p>
<p>You can test this was successful by logging <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/testdatabase.png"><img class="alignright size-medium wp-image-2900" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/testdatabase-276x300.png" alt="Verify indexing database" width="276" height="300" /></a>into the MySQL server and querying the database, as shown here on the right.</p>
<p>Having successfully created and populated your employee database, we can now create that missing db-data-config.xml file.</p>
<h2>Indexing our database</h2>
<p>In your Solr conf directory, which contains the schema.xml and solrconfig.xml we previously modified, create a new file called db-data-config.xml.</p>
<p>Its contents should look like the example below. Make sure to replace the user and password values with yours, and feel free to modify or remove the limit parameter. There are approximately 300,000 entries in the employees table in total. <script src="https://gist.github.com/03935f1384e150504363.js?file=db-data-config"></script></p>
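<p>If the gist is unavailable, a db-data-config.xml along these lines should work; the connection URL, credentials and limit below are placeholder assumptions to adapt to your setup:</p>

```xml
<dataConfig>
  <!-- JDBC connection to the local MySQL employees database -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/employees"
              user="root"
              password="password" />
  <document>
    <!-- Each row returned by the query becomes one Solr document -->
    <entity name="employee"
            query="SELECT emp_no, first_name, last_name, gender, birth_date
                   FROM employees LIMIT 1000" />
  </document>
</dataConfig>
```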
<p>We&#8217;re now going to make use of Solr&#8217;s REST-like HTTP API with a couple of commands worth saving. I prefer to use the <a href="https://chrome.google.com/webstore/detail/postman-rest-client/fdmmgilgnpjigdojojpjoooidkmcomcm">Postman app</a> on Chrome and have created a public collection of HTTP requests, which you can import into Postman&#8217;s Collections view using this url: <a href="https://www.getpostman.com/collections/9e95b8130556209ed643">https://www.getpostman.com/collections/9e95b8130556209ed643</a></p>
<p>For those of you not using Chrome, here are the commands you will need:<script src="https://gist.github.com/05a2a1dd01a6c5a4517b.js?file=solr-http"></script> First let&#8217;s reload the core so that Solr is <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/reloadcore.png"><img class="alignright size-medium wp-image-2921" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/reloadcore-300x181.png" alt="Reload Solr core" width="300" height="181" /></a><br />
aware of the new db-data-config.xml file we have created.<br />
Next, we index our database with the <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/indexdb.png"><img class="alignright size-medium wp-image-2923" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/indexdb-300x181.png" alt="Index database to Solr" width="300" height="181" /></a>HTTP request or from within the Solr Admin GUI on the DataImport page.</p>
<p>Here we have carried out a full index of our database using the full-import command parameter. To only retrieve changes since the last import, we would use delta-import instead.</p>
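<p>If you prefer plain curl over Postman, the requests boil down to the endpoints below, assuming the default port 8983 and core name collection1 used throughout this post:</p>

```shell
SOLR="http://localhost:8983/solr"
CORE="collection1"

# Reload the core so Solr picks up the new db-data-config.xml
RELOAD="${SOLR}/admin/cores?action=RELOAD&core=${CORE}"

# Full rebuild of the index vs. changes since the last import
FULL_IMPORT="${SOLR}/${CORE}/dataimport?command=full-import"
DELTA_IMPORT="${SOLR}/${CORE}/dataimport?command=delta-import"

# Against a running Solr you would execute e.g.: curl "$FULL_IMPORT"
echo "$RELOAD"
echo "$FULL_IMPORT"
echo "$DELTA_IMPORT"
```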
<p>We can confirm that our database import was successful by querying our index with the &#8220;Retrieve all&#8221; and &#8220;Georgi query&#8221; requests.</p>
<p>Finally, to schedule reindexing you can use a simple cronjob. This one, for example, will run every day at 23:00 and retrieve all changes since the previous indexing operation:<script src="https://gist.github.com/47f6df5a306e4cd51617.js?file=delta"></script></p>
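<p>As a sketch, such a crontab entry could look like the line below; the port and core name are the defaults assumed earlier, and clean=false keeps existing documents in the index during the delta run:</p>

```
0 23 * * * curl "http://localhost:8983/solr/collection1/dataimport?command=delta-import&clean=false"
```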
<h2>Conclusion</h2>
<p>So far we have successfully:</p>
<ul>
<li>Set up a database with content</li>
<li>Indexed the database into our Solr index</li>
<li>Set up basic scheduled delta reindexing</li>
</ul>
<p>In the next part of this two-part series we will look at how to process our indexed data, specifically with a view to building a good people search solution. We will implement several features such as phonetic search, spellcheck and basic query completion. In the meantime, let&#8217;s carry on the conversation in the comments below!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Elasticsearch: Indexing SQL databases. The easy way.</title>
		<link>http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/#comments</comments>
		<pubDate>Wed, 29 Jan 2014 23:42:24 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[connector]]></category>
		<category><![CDATA[elastic search]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[etl]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[jdbc]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[search-index]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=1895</guid>
		<description><![CDATA[Elasticsearch is a great search engine, flexible, fast and fun. So how can I get started with it? This post will go through how to get contents from a SQL database into Elasticsearch. Rivers are deprecated since Elasticsearch version 1.5. Read this official statement https://www.elastic.co/blog/deprecating_rivers. However, river-jdbc lives on as elasticsearch JDBC importer. Some day this post [...]]]></description>
				<content:encoded><![CDATA[<p><a title="Elasticsearch" href="http://www.elasticsearch.org">Elasticsearch </a>is a great search engine, flexible, fast and fun. So how can I get started with it? This post will go through how to get contents from a SQL database into Elasticsearch.</p>
<p><span id="more-1895"></span><span style="color: #ff0000;"><strong>Rivers are deprecated since Elasticsearch version 1.5. Read this official statement <a href="https://www.elastic.co/blog/deprecating_rivers"><span style="color: #ff0000;">https://www.elastic.co/blog/deprecating_rivers</span></a>. However, river-jdbc lives on as <a href="https://github.com/jprante/elasticsearch-jdbc">elasticsearch JDBC importer</a>. Some day this post will be updated with instructions for using JDBC importer mode. </strong></span></p>
<p>Elasticsearch has a set of pluggable services called rivers. A river runs inside an Elasticsearch node and imports content into the index. There are rivers for Twitter, Redis, files and, of course, SQL databases. The <a title="river-jdbc" href="https://github.com/jprante/elasticsearch-river-jdbc">river-jdbc plugin</a> connects to SQL databases using JDBC adapters. In this post we will use PostgreSQL, since it is freely available, and populate it with some content that is also freely available.</p>
<p>So let’s get started:</p>
<ol>
<li>Download and install <a title="Elasticsearch download" href="http://www.elasticsearch.org/download/">Elasticsearch</a></li>
<li>Start Elasticsearch by running <em>bin/elasticsearch</em> from the installation folder</li>
<li>Install the river-jdbc plugin for Elasticsearch version 1.00RC<br />
<pre class="crayon-plain-tag">./bin/plugin -install river-jdbc -url http://bit.ly/1dKqNJy</pre>
</li>
<li>Download <a href="http://jdbc.postgresql.org/download.html">the PostgreSQL JDBC jar file</a> and copy into the <em>plugins/river-jdbc</em> folder. You should probably <a title="http://jdbc.postgresql.org/download/postgresql-9.3-1100.jdbc41.jar" href="http://jdbc.postgresql.org/download/postgresql-9.3-1100.jdbc41.jar">get the latest version which is for JDBC 41</a></li>
<li>Install PostgreSQL <a href="http://www.postgresql.org/download/">http://www.postgresql.org/download/</a></li>
<li>Import the booktown database. Download the <a title="http://www.commandprompt.com/ppbook/booktown.sql" href="http://www.commandprompt.com/ppbook/booktown.sql">sql file from booktown database</a></li>
<li>Restart Elasticsearch</li>
<li>Start PostgreSQL</li>
</ol>
<p>By this time you should have Elasticsearch and PostgreSQL running, and river-jdbc ready to use.</p>
<p>Now we need to put some content into the database using psql, the PostgreSQL command line tool.</p><pre class="crayon-plain-tag">psql -U postgres -f booktown.sql</pre><p>To execute commands against Elasticsearch we will use an online service which functions as a mixture of <a href="https://gist.github.com/">Gist</a>, the code-snippet sharing service, and <a href="https://chrome.google.com/webstore/detail/sense/doinijnbnggojdlcjifpdckfokbbfpbo">Sense</a>, a Google Chrome developer console plugin for Elasticsearch. The service is hosted by <a title="http://qbox.io" href="http://qbox.io">http://qbox.io</a>, who provide hosted Elasticsearch services.</p>
<p>Check that everything was correctly installed by opening a browser to <a href="http://sense.qbox.io/gist/8361346733fceefd7f364f0ae1ebe7efa856779e">http://sense.qbox.io/gist/8361346733fceefd7f364f0ae1ebe7efa856779e</a></p>
<p>Select the top most line in the left-hand pane, press CTRL+Enter on your keyboard. You may also click on the little triangle that appears to the right, if you are more of a mouse click kind of person.</p>
<p>You should now see a status message, showing the version of Elasticsearch, node name and such.</p>
<p>Now let’s stop beating around the bush and create a river for our database:</p><pre class="crayon-plain-tag">curl -XPUT &quot;http://localhost:9200/_river/mybooks/_meta&quot; -d'
{
&quot;type&quot;: &quot;jdbc&quot;,
&quot;jdbc&quot;: {
&quot;driver&quot;: &quot;org.postgresql.Driver&quot;,
&quot;url&quot;: &quot;jdbc:postgresql://localhost:5432/booktown&quot;,
&quot;user&quot;: &quot;postgres&quot;,
&quot;password&quot;: &quot;postgres&quot;,
&quot;index&quot;: &quot;booktown&quot;,
&quot;type&quot;: &quot;books&quot;,
&quot;sql&quot;: &quot;select * from authors&quot;
}
}'</pre><p>This will create a “one-shot” river that connects to PostgreSQL on Elasticsearch startup and pulls the contents of the authors table into the booktown index. The index parameter controls which index the data will be put into, and the type parameter decides the type in the Elasticsearch index. To verify the river was correctly uploaded, execute</p><pre class="crayon-plain-tag">GET /_river/mybooks/_meta</pre><p>Restart Elasticsearch, and watch the log for status messages from river-jdbc. Connection problems, SQL errors or other problems should appear in the log. If everything went OK, you should see something like &#8230;SimpleRiverMouth] bulk [1] success [19 items]</p>
<p>Time has come to check out what we got.</p><pre class="crayon-plain-tag">GET /booktown/_search</pre><p>You should now see all the contents from the authors table. The number of items reported under &#8220;hits&#8221; -&gt; &#8220;total&#8221; is the same as what we just saw in the log: 19.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/01/hitsSampleBookTown1.png"><img class="alignleft size-full wp-image-1921" src="http://blog.comperiosearch.com/wp-content/uploads/2014/01/hitsSampleBookTown1.png" alt="" width="371" height="431" /></a><br />
But looking more closely at the data, we can see that the _id field has been auto-assigned with some random values. This means that the next time we run the river, all the contents will be re-added.</p>
<p>Luckily, river-jdbc supports some <a title="Labeled columns" href="https://github.com/jprante/elasticsearch-river-jdbc/wiki/Labeled-columns">specially labeled fields</a> that let us control how the contents should be indexed.</p>
<p>Reading up on the docs, we change the SQL definition in our river to</p><pre class="crayon-plain-tag">select id as _id, first_name,
 last_name from authors</pre><p>We need to start afresh and scrap the index we just created:</p><pre class="crayon-plain-tag">DELETE /booktown</pre><p>Restart Elasticsearch. Now you should see a meaningful id in your data.</p>
<p>At this point we could start toying around with queries, mappings and analyzers. But that&#8217;s not much fun with so little content. We need to join in some tables and get more interesting data. We can join in the books table and get all the books for all authors.</p><pre class="crayon-plain-tag">SELECT authors.id as _id, authors.last_name, authors.first_name,
books.id, books.title, books.subject_id 
FROM public.authors left join public.books on books.author_id = authors.id</pre><p>Delete the index, restart Elasticsearch and examine the data. Now you see that we only get one book per author. Executing the SQL statement in pgadmin returns 22 rows, while in Elasticsearch we get 19. This is on account of the _id field: every attempt to index a record with the same _id as an existing one simply overwrites it.</p>
<p>River-jdbc supports <a href="https://github.com/jprante/elasticsearch-river-jdbc/wiki/Structured-Objects">Structured objects</a>, which allows us to create arbitrarily structured JSON documents simply by using SQL aliases. The _id column is used for identity, structured objects will be appended to existing data. This is perhaps best shown by an example:</p><pre class="crayon-plain-tag">SELECT authors.id as _id, authors.last_name, authors.first_name,&nbsp;
books.id as \&quot;Books.id\&quot;, books.title as \&quot;Books.title\&quot;, 
 books.subject_id as \&quot;Books.subject_id\&quot; 
FROM public.authors left join public.books on books.author_id = authors.id order by authors.id</pre><p>Again, delete the index, restart Elasticsearch, wait a few seconds before you search, and you will find structured data in the search results.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2014/01/hitsSampleBookTownWithTwoBooks.png"><img class="alignleft size-full wp-image-1918" src="http://blog.comperiosearch.com/wp-content/uploads/2014/01/hitsSampleBookTownWithTwoBooks.png" alt="" width="259" height="385" /></a></p>
<p>Now we have seen that it is quite easy to get data into Elasticsearch using river-jdbc. We have also seen how it can handle updates. That gets us quite far. Unfortunately, it doesn&#8217;t handle deletions. If a record is deleted from the database, it will not automatically be deleted from the index. There have been some attempts to create support for it, but in the latest release it has been completely dropped.</p>
<p>This is due to the river plugin system having some serious problems; it will perhaps be deprecated some time after the 1.0 release, or at least no longer actively promoted as &#8220;the way&#8221; (<a href="http://www.linkedin.com/groups/Official-guide-writing-ElasticSearch-rivers-3393294.S.268274223">see the &#8220;semi-official statement&#8221; at the LinkedIn Elasticsearch group</a>). While it is extremely easy to use rivers to get data, there are a lot of problems in having a data integration process running in the same space as Elasticsearch itself. Architecturally, it is perhaps more correct to leave the search engine to itself and build integration systems on the side.</p>
<p>Among the recommended alternatives are:</p>
<ul>
<li>Use an ETL tool like <a href="http://www.talend.com/">Talend</a></li>
<li>Create your own script</li>
<li>Edit the source application to send updates to Elasticsearch</li>
</ul>
<p>Jörg Prante, who is the man behind river-jdbc, recently started creating a replacement called <a href="https://github.com/jprante/elasticsearch-gatherer">Gatherer</a>.<br />
It is a gathering framework plugin for fetching and indexing data to Elasticsearch, with scalable components.</p>
<p>Anyway, we have data in our index! Rivers may have their problems when used on a large scale, but you would be hard pressed to find anything easier to get started with. Getting data into the index easily is essential when exploring ideas and concepts, creating POCs or just fooling around.</p>
<p>This post has run out of space, but perhaps we can look at some interesting queries next time?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
	</channel>
</rss>
