Solr: Indexing SQL databases made easier!

Seb Muller — Thu, 28 Aug 2014 12:05:17 +0000

Update

Part two is now available here!

At the beginning of this year Christopher Vig wrote a great post about indexing an SQL database to the internet’s current search engine du jour, Elasticsearch. This first post in a two part series will show that Apache Solr is a robust and versatile alternative that makes indexing an SQL database just as easy. The second will go deeper into how to make leverage Solr’s features to create a great backend for a people search solution.

Solr ships with a configuration driven contrib called the DataImportHandler. It provides a way to index structured data into Solr in both full and incremental delta imports. We will cover a simple use case of the tool i.e. indexing a database containing personnel data to form the basis of a people search solution. You can also easily extend the DataImportHandler tool via various APIs to pre-process data and handle more complex use cases.

For now, let’s stick with basic indexing of an SQL database.

Setting up our environment

Before we get started, there are a few requirements:

Java 1.7 or greater
For this demo we’ll be using a MySQL database
A copy of the sample employees database
The MySQL jdbc driver

With that out of the way, let’s get Solr up and running and ready for database indexing:

Download Solr and extract it to a directory of your choice.
Open solr-4.9.0/example/solr/collection1/conf/solrconfig.xml in a text editor and add the following within the config tags:
In the same directory, open schema.xml and add this this line
Create a lib subdir in solr-4.9.0/solr/collection1/ and extract the MySQL jdbc driver jar into it. It’s the file called mysql-connector-java-{version}-bin.jar
To start Solr, open a terminal and navigate to the example subdir in your extracted Solr directory and run java -jar start.jar

When started this way, Solr runs by default on port 8983. If you need to change this, edit solr-4.9.0/example/etc/jetty.xml and restart Solr.

Navigate to http://localhost:8983/solr and you should see the Solr admin GUI splash page. From here, use the Core Selector dropdown button to select the default core and then click on the Dataimport option. Expanding the Configuration section should show an XML response with a stacktrace with a message along the lines of Can't find resource 'db-data-config.xml' in classpath. This is normal as we haven’t actually created this file yet, which stores the configs for connecting to our target database.

We’ll come back to that file later but let’s make our demo database now. If you haven’t already downloaded the sample employees database and installed MySQL, now would be a good time!

Setting up our database

Assuming your MySQL server is installed and running, access the MySQL terminal and create the empty employees database: create database employees;

Exit the MySQL terminal and import the employees.sql into your empty database, ensuring that you carry out the following command from the same directory as the employees.sql file itself: mysql -u root -p employees < employees.sql

You can test this was successful by logging into the MySql server and querying the database, as shown here on the right.

Having successfully created and populated your employee database, we can now create that missing db-data-config.xml file.

Indexing our database

In your Solr conf directory, which contains the schema.xml and solrconfig.xml we previously modified, create a new file called db-data-config.xml.

Its contents should look like the example below. Make sure to replace the user and password values with yours and feel free to modify or remove the limit parameter. There’s approximately 30’000 entries in the employees table in total

We’re now going to make use of Solr’s REST-like HTTP API with a couple of commands worth saving. I prefer to use the Postman app on Chrome and have created a public collection of HTTP requests, which you can import into Postman’s Collections view using this url: https://www.getpostman.com/collections/9e95b8130556209ed643

For those of you not using Chrome, here are the commands you will need: First let’s reload the core so that Solr is
aware of the new db-data-config.xml file we have created.
Next, we index our database with the HTTP request or from within the Solr Admin GUI on the DataImport page.

Here we have carried out a full index of our database using the full-import command parameter. To only retrieve changes since the last import, we would use delta-import instead.

We can confirm that our database import was successful by querying our index with the “Retrieve all” and “Georgi query” requests.

Finally, to schedule reindexing you can use a simple cronjob. This one, for example, will run everyday at 23:00 and retrieve all changes since the previous indexing operation:

Conclusion

So far we have successfully

Setup a database with content
Indexed the database into our Solr index
Setup basic scheduled delta reindexing

In the next part of this two part series we will look at how to process our indexed data. Specifically, with a view to making a good people search solution. We will implement several features such as phonetic search, spellcheck and basic query completion. In the meantime, let’s carry on the conversation in the comments below!