<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; Solr</title>
	<atom:link href="http://blog.comperiosearch.com/blog/category/tech/solr/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Sitevision – improving search with Nutch</title>
		<link>http://blog.comperiosearch.com/blog/2016/06/08/sitevision-forbattra-soket-med-nutch/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/06/08/sitevision-forbattra-soket-med-nutch/#comments</comments>
		<pubDate>Wed, 08 Jun 2016 14:20:07 +0000</pubDate>
		<dc:creator><![CDATA[Jack Thorén]]></dc:creator>
				<category><![CDATA[Nutch]]></category>
		<category><![CDATA[Sitevision]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[CRM]]></category>
		<category><![CDATA[web crawler]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4096</guid>
		<description><![CDATA[One of Sweden&#8217;s most popular CMS tools is Sitevision, used perhaps primarily by large government agencies and municipalities. The reason these agencies and municipalities choose Sitevision is probably that it is very easy for editors and page owners to use and to maintain the information on their pages. This in an environment where [...]]]></description>
				<content:encoded><![CDATA[<p>One of Sweden&#8217;s most popular CMS tools is Sitevision, used perhaps primarily by large government agencies and municipalities. The reason these agencies and municipalities choose Sitevision is probably that it is very easy for editors and page owners to use and to maintain the information on their pages. This in an environment where the web-technical expertise is perhaps not at the same level as at a larger technology company.</p>
<p>But while we praise the simple user interface, we wish it were possible to build better search functionality. Sure, you can search within a website, and you can even search across other websites if you have set up several websites within the same system. But if you want to search a website or database that lives somewhere else, that is not possible. This is about to change, however. Sitevision will soon introduce Nutch, a highly advanced web crawler built on Hadoop, which in turn is part of a framework for handling very large amounts of data. Nutch together with Solr will lift Sitevision&#8217;s search to new heights.</p>
<p><strong>Below is a schematic of what a site indexing could look like:</strong></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/06/Jack_blog011.png"><img class="alignnone  wp-image-4098" src="http://blog.comperiosearch.com/wp-content/uploads/2016/06/Jack_blog011.png" alt="Jack_blog01" width="605" height="287" /></a></p>
<p>1. The &#8220;Injector&#8221; takes all the URLs in the nutch.txt file and adds them to the &#8220;CrawlDB&#8221;, which is a central part of Nutch. The CrawlDB contains information about all known URLs (fetch schedule, fetch status, metadata, &#8230;).</p>
<p>2. Based on the data in the CrawlDB, the &#8220;Generator&#8221; creates a list of what is to be fetched and places it in a newly created segment directory.</p>
<p>3. In the next step, the &#8220;Fetcher&#8221; gets the URLs to be fetched from that list and writes the results back to the segment directory. This step is usually the most time-consuming part.</p>
<p>4. Now the &#8220;Parser&#8221; can process the contents of each web page, for example stripping out all HTML tags. If this crawl is an update or an extension of an earlier crawl (e.g. depth 3), the &#8220;Updater&#8221; would add the new data to the CrawlDB as a next step.</p>
<p>5. Before indexing, all links must be inverted by the &#8220;Link Inverter&#8221;, which takes into account that it is not the number of outgoing links on a web page that is of interest, but rather the number of incoming links. This is quite similar to how Google PageRank works and is important for the scoring function. The inverted links are saved in the &#8220;LinkDB&#8221;.</p>
<p>6-7. Using data from all available sources (CrawlDB, LinkDB and the segments), the Indexer creates an index and saves it in the Solr directory. The popular Lucene library is used for the indexing. The user can now search for information about the crawled web pages via Solr.</p>
<p><strong>Functionality that comes with Nutch:</strong></p>
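<p>The crawl cycle in steps 1-7 above can be sketched as a toy simulation. This is purely illustrative: the function names, data structures and example URLs below are invented for the sketch and are not the real Nutch API.</p>

```python
# Toy simulation of the Nutch crawl cycle: inject -> generate -> fetch/parse
# -> update -> invert links. Names and data are illustrative only.

def inject(crawldb, seeds):
    # Injector: add seed URLs to the CrawlDB as "unfetched".
    for url in seeds:
        crawldb.setdefault(url, {"status": "unfetched", "inlinks": 0})

def generate(crawldb):
    # Generator: list the URLs that still need fetching.
    return [url for url, meta in crawldb.items() if meta["status"] == "unfetched"]

def fetch_and_parse(segment, pages):
    # Fetcher + Parser: fetch each URL and reduce it to text and outlinks.
    return {url: pages.get(url, {"text": "", "outlinks": []}) for url in segment}

def update(crawldb, parsed):
    # Updater: mark fetched URLs and register newly discovered outlinks.
    for url, doc in parsed.items():
        crawldb[url]["status"] = "fetched"
        for link in doc["outlinks"]:
            crawldb.setdefault(link, {"status": "unfetched", "inlinks": 0})

def invert_links(crawldb, parsed):
    # Link Inverter: count incoming rather than outgoing links.
    for doc in parsed.values():
        for link in doc["outlinks"]:
            if link in crawldb:
                crawldb[link]["inlinks"] += 1

pages = {
    "http://a.example/": {"text": "start page", "outlinks": ["http://b.example/"]},
    "http://b.example/": {"text": "second page", "outlinks": []},
}
crawldb = {}
inject(crawldb, ["http://a.example/"])
for _ in range(2):  # two crawl rounds, i.e. depth 2
    segment = generate(crawldb)
    parsed = fetch_and_parse(segment, pages)
    update(crawldb, parsed)
    invert_links(crawldb, parsed)

print(sorted(crawldb))  # both URLs are now known to the CrawlDB
```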
<ul>
<li>Indexing of external sources</li>
<li>Automatic categorisation</li>
<li>Metadata</li>
<li>Text analysis</li>
<li>Easily extended functionality via plugins</li>
</ul>
<p><strong>Useful links:</strong></p>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Apache_Nutch">https://en.wikipedia.org/wiki/Apache_Nutch</a></li>
<li><a href="http://wiki.apache.org/nutch/">http://wiki.apache.org/nutch/</a></li>
<li><a href="http://nutch.apache.org/">http://nutch.apache.org/</a></li>
<li><a href="http://www.sitevision.se/vara-produkter/sitevision.html">http://www.sitevision.se/vara-produkter/sitevision.html</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/06/08/sitevision-forbattra-soket-med-nutch/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analysing Solr logs with Logstash</title>
		<link>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/#comments</comments>
		<pubDate>Sun, 20 Sep 2015 22:00:00 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[analysis]]></category>
		<category><![CDATA[grok]]></category>
		<category><![CDATA[logs]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3934</guid>
		<description><![CDATA[Analysing Solr logs with Logstash Although I usually write about and work with Apache Solr, I also use the ELK stack on a daily basis on a number of projects. If you&#8217;re not familiar with Solr, take a look at some of my previous posts. If you need some more background info on the ELK [...]]]></description>
				<content:encoded><![CDATA[<h1>Analysing Solr logs with Logstash</h1>
<p>Although I usually write about and work with <a href="http://lucene.apache.org/solr/">Apache Solr</a>, I also use the <a href="https://www.elastic.co/downloads">ELK stack</a> on a daily basis on a number of <a>projects.</a> If you&#8217;re not familiar with Solr, take a look at some of my <a href="http://blog.comperiosearch.com/blog/author/sebm/">previous posts</a>. If you need some more background info on the ELK stack, both <a href="http://blog.comperiosearch.com/blog/author/cvig/">Christoffer</a> and <a href="http://blog.comperiosearch.com/blog/author/alynum/">André</a> have written many great posts on various ELK subjects. The most common use for the stack is data analysis. In our case, Solr search log analysis.</p>
<p>As a little side note for the truly devoted Solr users, an ELK stack alternative exists with <a href="http://lucidworks.com/fusion/silk/">SiLK</a>. I highly recommend checking out Lucidworks&#8217; various blog posts on <a href="http://lucidworks.com/blog/">Solr and search in general</a>.</p>
<h2>Some background</h2>
<p>On an existing search project I use the ELK stack to ingest, analyse and visualise logs from Comperio&#8217;s search middleware application.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2088/09/search_logs.png"><img class="aligncenter size-medium wp-image-3942" src="http://blog.comperiosearch.com/wp-content/uploads/2088/09/search_logs-300x157.png" alt="Search Logs Dashboard" width="300" height="157" /></a><br />
Although this gave us a great view of user query behaviour, Solr logs a great deal more detailed information. I wanted to log indexing events, errors, and searches with all their parameters, not just the query string.</p>
<h2>Let&#8217;s get started</h2>
<p>I&#8217;m going to assume you already have a running Solr installation. You will, however, need to download <a href="https://www.elastic.co/products/elasticsearch">Elasticsearch</a> and <a href="https://www.elastic.co/products/logstash">Logstash</a> and unpack them. Before we start Elasticsearch, I recommend installing these plugins:</p>
<ul>
<li><a href="http://mobz.github.io/elasticsearch-head/">Head</a></li>
<li><a href="https://www.elastic.co/guide/en/marvel/current/_installation.html">Marvel</a></li>
</ul>
<p>Head is a cluster health monitoring tool. We&#8217;ll only need Marvel for its bundled developer console, Sense. To disable Marvel&#8217;s other capabilities, add this line to ~/elasticsearch/config/elasticsearch.yml</p><pre class="crayon-plain-tag">marvel.agent.enabled: false</pre><p>Start Elasticsearch with this command:</p><pre class="crayon-plain-tag">~/elasticsearch-[version]/bin/elasticsearch</pre><p>Navigate to <a href="http://localhost:9200/">http://localhost:9200/</a> to confirm that Elasticsearch is running. Check <a href="http://localhost:9200/_plugin/head">http://localhost:9200/_plugin/head</a> and <a href="http://localhost:9200/_plugin/marvel/sense/index.html">http://localhost:9200/_plugin/marvel/sense/index.html</a> to verify the plugins installed correctly.</p>
<h2>The anatomy of a Logstash config</h2>
<hr />
<h3>Update 21/09/15</h3>
<p>I have since greatly simplified the multiline portions of the Logstash configs. Use instead this filter section: <script src="https://gist.github.com/41ca2c34c50d0d9d8e82.js?file=solr-filter.conf"></script>The rest of the original article contents are unchanged for comparison&#8217;s sake.</p>
<hr />
<p>All Logstash configs share three main building blocks. A config starts with the Input stage, which defines what the data source is and how to access it. Next is the Filter stage, which carries out data processing and extraction. Finally, the Output stage tells Logstash where to send the processed data. Let&#8217;s start with the basics, the input and output stages:</p><pre class="crayon-plain-tag">input {
  file {
    path =&gt; "~/solr.log"
  }
}

filter {}

output {
  # Send directly to local Elasticsearch
  elasticsearch_http {
    host =&gt; "localhost"
    template =&gt; "~/logstash/bin/logstash_solr_template.json"
    index =&gt; "solr-%{+YYYY.MM.dd}"
    template_overwrite =&gt; true
  }
}</pre><p>This is one of the simpler input/output configs. We read a file at a given location and stream its raw contents to an Elasticsearch instance. Take a look at the <a href="https://www.elastic.co/guide/en/logstash/current/input-plugins.html">input</a> and <a href="https://www.elastic.co/guide/en/logstash/current/output-plugins.html">output</a> plugins&#8217; documentation for more details and default values. The index setting causes Logstash to create a new index every day with a name generated from the provided pattern. The template option tells Logstash what kind of field mapping and settings to use when creating the Elasticsearch indices. You can find the template file I used <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-template-json">here</a>.</p>
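<p>To see what the daily index pattern produces, here is the same naming rule reproduced in plain Python. This is just a sketch to show the resulting names; Logstash itself handles the substitution of %{+YYYY.MM.dd}.</p>

```python
from datetime import date

def solr_index_name(day):
    # Mirrors Logstash's index => "solr-%{+YYYY.MM.dd}": one index per day.
    return day.strftime("solr-%Y.%m.%d")

print(solr_index_name(date(2015, 9, 7)))  # solr-2015.09.07
```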
<p>To process the Solr logs, we&#8217;ll use the <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html">grok</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-mutate.html">mutate</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-multiline.html">multiline</a>, <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-drop.html">drop</a> and <a href="https://www.elastic.co/guide/en/logstash/current/plugins-filters-kv.html">kv</a> filter plugins.</p>
<ul>
<li>Grok is a regexp based parsing stage primarily used to match strings and extract parts. There are a number of default patterns described on the grok documentation page. While building your grok expressions, the <a href="https://grokdebug.herokuapp.com/">grok debugger app</a> is particularly helpful. Be mindful though that some of the escaping syntax isn&#8217;t always the same in the app as what the Logstash config expects.</li>
<li>We need the multiline plugin to link stacktraces to their initial error message.</li>
<li>The kv, aka key value, plugin will help us extract the parameters from Solr indexing and search events</li>
<li>We use mutate to add and remove tags along the way.</li>
<li>And finally, drop to drop any events we don&#8217;t want to keep.</li>
</ul>
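<p>Of these, the kv stage is the easiest to picture: it splits a string of key=value pairs into separate event fields. The sketch below mimics the ampersand split in plain Python; it is illustrative only, as the real kv filter has many more options.</p>

```python
def kv_split(params):
    # Mimics the kv filter with field_split => "&": each "key=value"
    # pair in the Solr params string becomes its own event field.
    fields = {}
    for pair in params.split("&"):
        key, _, value = pair.partition("=")
        fields[key] = value
    return fields

params = ("literal.source=epifile&literal.id=epifile_211027"
          "&literal.title=A05100_Tass5+Trondheim.pdf")
doc = kv_split(params)
print(doc["literal.title"])  # A05100_Tass5+Trondheim.pdf
```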
<h2>The <del>hard</del> fun part</h2>
<p>Let&#8217;s dive into the filter stage now. Take a look at the <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-logstash-conf">config file</a> I&#8217;m using. The Grok patterns may appear a bit daunting, especially if you&#8217;re not very familiar with regexp and the default Grok patterns, but don&#8217;t worry! Let&#8217;s break it down.</p>
<p>The first section extracts the log event&#8217;s severity and timestamp into their own fields, &#8216;level&#8217; and &#8216;LogTime&#8217;:</p><pre class="crayon-plain-tag">grok {
    match =&gt; { "message" =&gt; "%{WORD:level}.+?- %{DATA:LogTime};" }
      tag_on_failure =&gt; []
  }</pre><p>So, given this line from my <a href="https://gist.github.com/sebnmuller/41ca2c34c50d0d9d8e82#file-solr-log">example log file</a></p><pre class="crayon-plain-tag">INFO  - 2015-09-07 15:40:34.535; org.apache.solr.update.processor.LogUpdateProcessor; [sintef_main] webapp=/ path=/update/extract params={literal.source=epifile&amp;literal.epi_file_title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.title=GOFER+L4.0+Demonstratorer+V1.0.pdf&amp;literal.id=epifile_211278&amp;literal.epifileid_s=211278&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/6060/prosjektfiler/gofer/gofer-l4.0-demonstratorer-v1.0.pdf&amp;literal.filesource_s=SiteFile} {} 0 65</pre><p>We&#8217;d extract</p><pre class="crayon-plain-tag">{ "level": "INFO", "LogTime":"2015-09-07 15:40:34.535"}</pre><p>In the template file I linked earlier, you&#8217;ll notice configuration for the LogTime field. Here we define a valid DateTime format for Elasticsearch. We need to do this so that Kibana recognises the field as one we can use for temporal analysis. Otherwise the only timestamp field we&#8217;d have would contain the time at which the logs were processed and stored in Elasticsearch. Although not a problem in a real-time log analysis system, if you have old logs you want to parse you will need to define this separate timestamp field. As an additional side note, you&#8217;ll notice I use</p><pre class="crayon-plain-tag">tag_on_failure =&gt; []</pre><p>in most of my Grok stages. The default value is &#8220;_grokparsefailure&#8221;, which I don&#8217;t need in a production system. Custom failure and success tags are very helpful to debug your Logstash configs.</p>
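<p>The first grok stage can also be checked outside Logstash with Python&#8217;s re module, by translating the grok names to their underlying regexes (WORD is \w+ and DATA is a lazy .*?):</p>

```python
import re

# Python re equivalent of the grok pattern "%{WORD:level}.+?- %{DATA:LogTime};"
pattern = re.compile(r"(?P<level>\w+).+?- (?P<LogTime>.*?);")

line = ("INFO  - 2015-09-07 15:40:34.535; "
        "org.apache.solr.update.processor.LogUpdateProcessor; [sintef_main] ...")

m = pattern.search(line)
print(m.groupdict())  # {'level': 'INFO', 'LogTime': '2015-09-07 15:40:34.535'}
```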
<p>The next little section combines commit messages into a single line. The first event in the example log file is an example of such commit messages split over three lines.</p><pre class="crayon-plain-tag"># Combine commit events into single message
  multiline {
      pattern =&gt; "^\t(commit\{)"
      what =&gt; "previous"
    }</pre><p>Now we come to a major section for handling general INFO level messages.</p><pre class="crayon-plain-tag"># INFO level events treated differently than ERROR
  if "INFO" in [level] {
    grok {
      match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\}.+?\{%{WORD:action}=\[%{DATA:docId}"
        }
        tag_on_failure =&gt; []  
    }
    if [params] {
      kv {
        field_split =&gt; "&amp;"
        source =&gt; "params"
      }
    } else {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+?commits"  
        }
        tag_on_failure =&gt; [ "drop" ]
        add_field =&gt; {
          "action" =&gt; "commit"
        }
      }
      if "drop" in [tags] {
        drop {}
      }
    }
  }</pre><p>This filter will only run on INFO level messages, due to the conditional at its beginning. The first Grok stage matches log events similar to the one above. The key fields we extract are the Solr collection/core, the endpoint we hit, e.g. update/extract, the parameters supplied by the HTTP request, the action taken, e.g. add or delete, and finally the document ID. If the Grok succeeded in extracting a params field, we run the key value stage, splitting on ampersands to extract each HTTP parameter. This is what a resulting document&#8217;s extracted contents look like when stored in Elasticsearch:</p><pre class="crayon-plain-tag">{
  "level": "INFO",
  "LogTime": "2015-09-07 15:40:18.938",
  "collection": "sintef_main",
  "endpoint": "/update/extract",
  "params":     "literal.source=epifile&amp;literal.epi_file_title=A05100_Tass5+Trondheim.pdf&amp;literal.title=A05100_Tass5+Trondheim.pdf&amp;literal.id=epifile_211027&amp;literal.epifileid_s=211027&amp;literal.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;stream.url=http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf&amp;literal.filesource_s=SiteFile",
  "action": "add",
  "docId": "epifile_211027",
  "version": "1511661994131849216",
  "literal.source": "epifile",
  "literal.epi_file_title": "A05100_Tass5+Trondheim.pdf",
  "literal.title": "A05100_Tass5+Trondheim.pdf",
  "literal.id": "epifile_211027",
  "literal.epifileid_s": "211027",
  "literal.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "stream.url": "http://www.sintef.no/globalassets/upload/teknologi_samfunn/5036/a05100_tass5-trondheim.pdf",
  "literal.filesource_s": "SiteFile"
}</pre><p>If the Grok did not extract a params field, I want to identify possible commit messages with the following Grok. If this one fails we tag messages with &#8220;drop&#8221;. Finally, any messages tagged with &#8220;drop&#8221; are dropped from the pipeline. I specifically created these Grok patterns to match indexing and commit messages as I already track queries at the middleware layer in our stack. If you want to track queries at the Solr level, simply use this pattern:</p><pre class="crayon-plain-tag">.+?; ((([a-zA-Z]+(\.|;|:))+) )+?\[%{WORD:collection}\].+?path=%{DATA:endpoint} params=\{%{DATA:params}\} hits=%{INT:hits} status=%{INT:status} QTime=%{INT:queryTime}</pre><p>The next section handles ERROR level messages:</p><pre class="crayon-plain-tag"># Error event implies stack trace, which requires multiline parsing
  if "ERROR" in [level] {
    multiline {
      pattern =&gt; "^\s"
      what =&gt; "previous"
      add_tag =&gt; [ "multiline_pre" ]
    }
    multiline {
        pattern =&gt; "^Caused by"
        what =&gt; "previous"
        add_tag =&gt; [ "multiline_post" ]
    }
    if "multiline_post" in [tags] {
      grok {
        match =&gt; {
          "message" =&gt; ".+?; ((([a-zA-Z]+(\.|;|:))+) )+%{DATA:reason}(\n\t)((.+?Caused by: ((([a-zA-Z]+(\.|;|:))+) )+)%{DATA:reason}(\n\t))+"
        }
        tag_on_failure =&gt; []
      }
    }
  }</pre><p>Given a stack trace (there are a few in the example log file), this stage first combines all the lines of the stack trace into a single message. It then extracts the first and the last causes, the assumption being that the first message is the high-level failure message and the last one the actual underlying cause.</p>
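<p>The first-and-last-cause idea can be sketched in a few lines of Python. The exception text below is made up for illustration, and the string handling here is a simplification of what the grok pattern does on the combined multiline message:</p>

```python
# Given a combined multiline stack trace, take the first line as the
# high-level failure and the last "Caused by" line as the root cause.
trace = """org.apache.solr.common.SolrException: Error handling request
\tat org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
Caused by: java.io.IOException: disk full
\tat java.io.FileOutputStream.write(FileOutputStream.java:326)"""

lines = trace.splitlines()
failure = lines[0]
causes = [l for l in lines if l.startswith("Caused by")]
root_cause = causes[-1] if causes else failure
print(failure)     # the high-level failure message
print(root_cause)  # the underlying cause
```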
<p>Finally, I drop any empty lines and clean up temporary tags:</p><pre class="crayon-plain-tag"># Remove intermediate tags, and multiline added randomly by multiline stage
  mutate {
      remove_tag =&gt; [ "multiline_pre", "multiline_post", "multiline" ]
  }
  # Drop empty lines
  if [message] =~ /^\s*$/ {
    drop {}
  }</pre><p>To check you have successfully processed your Solr logs, open up the Sense plugin and run this query:</p><pre class="crayon-plain-tag"># aggregate on level
GET solr-*/_search
{
  "query": {
    "match_all": {}
  },
  "size": 10,
  "aggs": {
    "action": {
      "terms": {
        "field": "level",
        "size": 10
      }
    }
  }
}</pre><p>You should get back all your processed log events along with an aggregation on event severity.</p>
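<p>Conceptually, the terms aggregation just counts documents per distinct level value. The sketch below shows the idea with a handful of hypothetical processed events (not real query output):</p>

```python
from collections import Counter

# Hypothetical processed log events, standing in for documents
# stored in the solr-* indices.
events = [
    {"level": "INFO", "action": "add"},
    {"level": "INFO", "action": "commit"},
    {"level": "ERROR", "reason": "..."},
    {"level": "INFO", "action": "add"},
]

# A terms aggregation on "level" boils down to counting distinct values.
buckets = Counter(e["level"] for e in events)
print(buckets.most_common())  # [('INFO', 3), ('ERROR', 1)]
```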
<h2>Conclusion</h2>
<p>Solr logs contain a great deal of useful information. With the ELK stack you can extract, store, analyse and visualise this data. I hope I&#8217;ve given you some helpful tips on how to start doing so! If you run into any problems, please get in touch in the comments below.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/09/21/solr-logstash-analysis/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Solr: Indexing SQL databases made easier! &#8211; Part 2</title>
		<link>http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/#comments</comments>
		<pubDate>Tue, 14 Apr 2015 12:56:21 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[solr5]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3477</guid>
		<description><![CDATA[Last summer I wrote a blog post about indexing a MySQL database into Apache Solr. I would like to now revisit the post to update it for use with Solr 5 and start diving into how to implement some basic search features such as Facets Spellcheck Phonetic search Query Completion Setting up our environment The [...]]]></description>
				<content:encoded><![CDATA[<p>Last summer I wrote a <a href="http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/">blog post</a> about indexing a MySQL database into <a href="http://lucene.apache.org/solr/">Apache Solr</a>. I would like to now revisit the post to update it for use with Solr 5 and start diving into how to implement some basic search features such as</p>
<ul>
<li>Facets</li>
<li>Spellcheck</li>
<li>Phonetic search</li>
<li>Query Completion</li>
</ul>
<h2>Setting up our environment</h2>
<p>The requirements remain the same as with the original blogpost:</p>
<ol>
<li>Java 1.7 or greater</li>
<li>A <a href="http://dev.mysql.com/downloads/mysql/">MySQL</a> database</li>
<li>A copy of the <a href="https://launchpad.net/test-db/employees-db-1/1.0.6/+download/employees_db-full-1.0.6.tar.bz2">sample employees database</a></li>
<li>The MySQL <a href="http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.32.tar.gz">jdbc driver</a></li>
</ol>
<p>We&#8217;ll now be using Solr 5, which runs a little differently from previous incarnations of Solr. Download <a href="http://www.apache.org/dyn/closer.cgi/lucene/solr/5.0.0">Solr</a> and extract it to a directory of your choice. Open a terminal and navigate to your Solr directory.<br />
Start Solr with the command <pre class="crayon-plain-tag">bin/solr start</pre> .<img class="alignright wp-image-3497 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-11-at-20.30.03-300x114.png" alt="Solr Status" width="300" height="114" /></p>
<p>To confirm Solr successfully started up, run <pre class="crayon-plain-tag">bin/solr status</pre></p>
<p>Unlike previously, we now need to create a Solr core for our employee data. To do so, run this command <pre class="crayon-plain-tag">bin/solr create_core -c employees -d basic_configs</pre> . This will create a core named employees using Solr&#8217;s minimal configuration options. Try <pre class="crayon-plain-tag">bin/solr create_core -help</pre>  to see what else is possible.</p>
<ol>
<li>Open server/solr/employees/conf/solrconfig.xml in a text editor and add the following within the config tags:
<div id="file-dataimporthandler2-LC1" class="line">
<pre class="crayon-plain-tag">&lt;lib dir="../../../dist/" regex="solr-dataimporthandler-\d.*\.jar" /&gt;
 
&lt;requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"&gt;
&lt;lst name="defaults"&gt;
&lt;str name="config"&gt;db-data-config.xml&lt;/str&gt;
&lt;/lst&gt;
&lt;/requestHandler&gt;</pre>
</div>
</li>
<li>In the same directory, open schema.xml and add this line:<br />
<pre class="crayon-plain-tag">&lt;dynamicField name="*_name" type="text_general" multiValued="false" indexed="true" stored="true" /&gt;</pre>
</li>
<li>Create a lib subdir in server/solr/employees and extract the MySQL jdbc driver jar into it.</li>
<li>Finally, restart the Solr server with the command <pre class="crayon-plain-tag">bin/solr restart</pre></li>
</ol>
<p>When started this way, Solr runs by default on port 8983. To run it on a different port, use <pre class="crayon-plain-tag">bin/solr start -p portnumber</pre> , replacing portnumber with your preferred port.</p>
<p>Navigate to <a href="http://localhost:8983/solr">http://localhost:8983/solr</a> and you should see the Solr admin GUI splash page. From here, use the Core Selector dropdown button to select our employee core and then click on the Dataimport option. Expanding the Configuration section should show an XML response containing a stacktrace with a message along the lines of <pre class="crayon-plain-tag">Can't find resource 'db-data-config.xml' in classpath</pre> . This is normal, as we haven&#8217;t actually created this file yet; it stores the configuration for connecting to our target database.</p>
<p>We&#8217;ll come back to that file later but let&#8217;s make our demo database now. If you haven&#8217;t already downloaded the sample employees database and installed MySQL, now would be a good time!</p>
<h2>Setting up our database</h2>
<p>Please refer to the instructions in the same section in the <a href="http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/">original blog post</a>. The steps are still the same.</p>
<h2>Indexing our database</h2>
<p>Again, please refer to the instructions in the same section in the original blog post. The only difference is the Postman collection should be imported from <a href="https://www.getpostman.com/collections/f7634c89cd9851dd2c13"> this url</a> instead. The commands you can use alternatively have also changed and are now</p><pre class="crayon-plain-tag">Clear index: http://localhost:8983/solr/employees/update?stream.body=&lt;delete&gt;&lt;query&gt;*:*&lt;/query&gt;&lt;/delete&gt;&amp;commit=true
Retrieve all: http://localhost:8983/solr/employees/select?q=*:*&amp;omitHeader=true
Index db: http://localhost:8983/solr/employees/dataimport?command=full-import
Reload core: http://localhost:8983/solr/admin/cores?action=RELOAD&amp;core=employees
Georgi query: http://localhost:8983/solr/employees/select?q=georgi&amp;wt=json&amp;qf=first_name%20last_name&amp;defType=edismax
Facet query: http://localhost:8983/solr/employees/select?q=*:*&amp;wt=json&amp;facet=true&amp;facet.field=dept_s&amp;facet.field=title_s&amp;facet.mincount=1&amp;rows=0
Gorgi spellcheck: http://localhost:8983/solr/employees/select?q=gorgi&amp;wt=json&amp;qf=first_name&amp;defType=edismax
Georgi Phonetic: http://localhost:8983/solr/employees/select?q=georgi&amp;wt=json&amp;qf=first_name%20last_name%20phonetic&amp;defType=edismax</pre><p></p>
<h2>The next step</h2>
<p>We should now be back where we ended with the original blog post. So far we have successfully</p>
<ul>
<li>Setup a database with content</li>
<li>Indexed the database into our Solr index</li>
<li>Setup basic scheduled delta reindexing</li>
</ul>
<p>Let&#8217;s get started with the more interesting stuff!</p>
<h2>Facets</h2>
<p>Facets, also known as filters or navigators, allow a search user to refine and drill down through search results. Before we get started with them, we need to update our data import configuration. Replace the contents of our existing db-data-config.xml with:</p>
<div id="file-db-data-config2-LC1" class="line">
<pre class="crayon-plain-tag">select e.emp_no as 'id', e.birth_date,
(
select t.title
from titles t
where t.`emp_no` = e.`emp_no`
order by t.`from_date` desc
limit 1
) as 'title_s', e.first_name, e.last_name, e.gender as 'gender_s', d.`dept_name` as 'dept_s'
from employees e
join dept_emp de on de.`emp_no` = e.`emp_no`
join departments d on d.`dept_no` = de.`dept_no`
group by e.`emp_no`
limit 1000;</pre><br />
To facet, we need appropriate fields to facet on. Our new SQL retrieves additional fields, such as employee titles and departments, that are perfect for use as facets.</p>
</div>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-09-23-at-10.27.47.png"><img class="aligncenter size-medium wp-image-3520" src="http://blog.comperiosearch.com/wp-content/uploads/2015/04/Screen-Shot-2015-09-23-at-10.27.47.png" alt="Updated Employee SQL" width="300" height="166" /></a><br />
You&#8217;ll notice we map title, gender and dept_name to title_s, gender_s and dept_s respectively. This allows us to take advantage of an existing dynamic field mapping in Solr&#8217;s default basic config, *_s. A dynamic field allows us to assign all fields with a certain pre/suffix the same field type. In this case, given the field type <pre class="crayon-plain-tag">&lt;dynamicField name="*_s" type="string" indexed="true" stored="true" /&gt;</pre> , any fields ending with _s will be indexed and stored as basic strings. Solr will not tokenise them or modify their contents. This allows us to safely use them for faceting without worrying about, for example, department names being split on whitespace.</p>
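<p>Dynamic-field resolution is essentially simple prefix/suffix matching on field names. Here is a sketch of the idea in Python; it is not Solr&#8217;s actual implementation, just an illustration of how names like dept_s pick up a type:</p>

```python
# Dynamic field rules from the schema, checked in order.
# A name pattern is either "*_suffix" or "prefix_*".
DYNAMIC_FIELDS = [
    ("*_name", "text_general"),  # the rule we added for first_name/last_name
    ("*_s", "string"),           # untokenised strings, safe for faceting
]

def field_type(field_name):
    for pattern, ftype in DYNAMIC_FIELDS:
        if pattern.startswith("*") and field_name.endswith(pattern[1:]):
            return ftype
        if pattern.endswith("*") and field_name.startswith(pattern[:-1]):
            return ftype
    return None  # no dynamic rule matched

print(field_type("dept_s"))      # string -> not tokenised, facets cleanly
print(field_type("first_name"))  # text_general -> tokenised for full-text search
```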
<ol>
<li>Clear the index and restart Solr.<a href="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-13-at-17.06.22.png"><img class="alignright wp-image-3533 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-13-at-17.06.22-196x300.png" alt="Facet Query" width="196" height="300" /></a></li>
<li>Once Solr has restarted, reindex the database with our new SQL. Don&#8217;t be alarmed if this takes a bit longer than previously. It&#8217;s a bit more heavyweight and not very well optimised!</li>
<li>Once it&#8217;s done indexing, we can confirm it was successful by running the facet query via Postman or directly in our browser.</li>
<li>We should see two hits for the query &#8220;georgi&#8221; along with facets for their respective titles and department.</li>
</ol>
<h2>The anatomy of a facet query</h2>
<p>Let&#8217;s take a closer look at the relevant request parameters of our facet query: <pre class="crayon-plain-tag">http://localhost:8983/solr/employees/select?q=georgi&amp;wt=json&amp;qf=first_name%20last_name&amp;defType=edismax&amp;omitHeader=true&amp;facet=true&amp;facet.field=dept_s&amp;facet.field=title_s&amp;facet.mincount=1</pre></p>
<ul>
<li>facet &#8211; Enables or disables faceting. Accepted values include yes, on and true to enable; no, off and false to disable</li>
<li>facet.field &#8211; The field we want to facet on; can be specified multiple times to facet on several fields</li>
<li>facet.mincount &#8211; The minimum number of matching documents a facet value must appear in for it to be included in the facet result object. Can be set per field with the syntax f.fieldName.facet.mincount=1</li>
</ul>
<p>There are many other facet parameters. I recommend taking a look at the Solr wiki pages on <a href="https://wiki.apache.org/solr/SolrFacetingOverview">faceting</a> and other <a href="https://wiki.apache.org/solr/SimpleFacetParameters">possible parameters</a>.</p>
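<p>One practical detail when consuming the JSON response: each facet field comes back as a flat list alternating between value and count. A small, hypothetical sketch of turning that into something usable (the sample response below is illustrative, not actual output from the employees index):</p>

```python
# Solr returns each facet field as a flat [value, count, value, count, ...]
# list in the JSON response; this pairs them up into a dict per field.
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "dept_s": ["Development", 2, "Sales", 1],
            "title_s": ["Senior Engineer", 2],
        }
    }
}

def parse_facets(response):
    fields = response["facet_counts"]["facet_fields"]
    return {name: dict(zip(flat[::2], flat[1::2])) for name, flat in fields.items()}

print(parse_facets(sample_response))
```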
<h2>Spellcheck</h2>
<p>Analysing query logs and focusing on those queries that gave zero hits is a quick and easy way to see what can and should be done to improve your search solution. More often than not you will come across a great deal of spelling errors. Adding spellcheck to a search solution gives such great value for a tiny bit of effort. This fruit is so low hanging it should hit you in the face!</p>
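<p>As a sketch of that log analysis (the log format here is hypothetical; adapt it to whatever your logging produces), surfacing the most frequent zero-hit queries is a few lines of work:</p>

```python
from collections import Counter

# Given (query, hit_count) pairs pulled from your query logs, surface the
# most frequent zero-hit queries - usually a goldmine of misspellings.
log = [("georgi", 2), ("gorgi", 0), ("gorgi", 0), ("solr", 5), ("facted", 0)]

def top_zero_hit_queries(entries, n=5):
    counts = Counter(q for q, hits in entries if hits == 0)
    return counts.most_common(n)

print(top_zero_hit_queries(log))  # [('gorgi', 2), ('facted', 1)]
```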
<p>To enable spellcheck, we need to make some configuration changes.</p>
<ol>
<li>In our schema.xml, add these two lines after the *_name dynamic field type we added earlier:
<div class="line">
<pre class="crayon-plain-tag">&lt;copyField source="*_name" dest="spellcheck" /&gt;
&lt;field name="spellcheck" type="text_general" indexed="true" stored="true" multiValued="true" /&gt;</pre>
</div>
<p>A copyField checks for fields whose names match the pattern defined in source and copies their contents to the dest field. In our case, content from first_name and last_name will be copied to spellcheck. We then define the spellcheck field as multiValued so it can handle its multiple sources.</p></li>
<li>Add the following to our solrconfig.xml:
<div id="file-spellcheck-LC1" class="line">
</p><pre class="crayon-plain-tag">&lt;searchComponent name="spellcheck" class="solr.SpellCheckComponent"&gt;
&lt;str name="queryAnalyzerFieldType"&gt;text_general&lt;/str&gt;
&lt;!-- a spellchecker built from a field of the main index --&gt;
&lt;lst name="spellchecker"&gt;
&lt;str name="name"&gt;default&lt;/str&gt;
&lt;str name="field"&gt;spellcheck&lt;/str&gt;
&lt;str name="classname"&gt;solr.DirectSolrSpellChecker&lt;/str&gt;
&lt;!-- the spellcheck distance measure used, the default is the internal levenshtein --&gt;
&lt;str name="distanceMeasure"&gt;internal&lt;/str&gt;
&lt;!-- minimum accuracy needed to be considered a valid spellcheck suggestion --&gt;
&lt;float name="accuracy"&gt;0.5&lt;/float&gt;
&lt;!-- the maximum #edits we consider when enumerating terms: can be 1 or 2 --&gt;
&lt;int name="maxEdits"&gt;2&lt;/int&gt;
&lt;!-- the minimum shared prefix when enumerating terms --&gt;
&lt;int name="minPrefix"&gt;1&lt;/int&gt;
&lt;!-- maximum number of inspections per result. --&gt;
&lt;int name="maxInspections"&gt;5&lt;/int&gt;
&lt;!-- minimum length of a query term to be considered for correction --&gt;
&lt;int name="minQueryLength"&gt;4&lt;/int&gt;
&lt;!-- maximum threshold of documents a query term can appear to be considered for correction --&gt;
&lt;float name="maxQueryFrequency"&gt;0.01&lt;/float&gt;
&lt;!-- uncomment this to require suggestions to occur in 1% of the documents
&lt;float name="thresholdTokenFrequency"&gt;.01&lt;/float&gt;
--&gt;
&lt;/lst&gt;
&lt;/searchComponent&gt;</pre><p>
</div>
<p>This will create a spellcheck component that uses the spellcheck field as its dictionary source. The spellcheck field contains content copied from both the first and last name fields.</p></li>
<li>In the same file, look for the select requestHandler and update it to include the spellcheck component:
<div id="file-select-LC1" class="line">
</p><pre class="crayon-plain-tag">&lt;requestHandler name="/select" class="solr.SearchHandler"&gt;
&lt;!-- default values for query parameters can be specified, these
will be overridden by parameters in the request
--&gt;
&lt;lst name="defaults"&gt;
&lt;str name="echoParams"&gt;explicit&lt;/str&gt;
&lt;int name="rows"&gt;10&lt;/int&gt;
&lt;str name="spellcheck"&gt;on&lt;/str&gt;
&lt;str name="spellcheck.dictionary"&gt;default&lt;/str&gt;
&lt;/lst&gt;
&lt;!-- Add this to enable spellcheck --&gt;
&lt;arr name="last-components"&gt;
&lt;str&gt;spellcheck&lt;/str&gt;
&lt;/arr&gt;
&lt;/requestHandler&gt;</pre><p>
</div>
</li>
</ol>
<p>The defaults list in a requestHandler defines which default parameters to add to each request made using the chosen request handler. You could, for example, define which fields to query. In this case we&#8217;re enabling spellcheck and using the default dictionary as defined in our solrconfig.xml. All values in the defaults list can be overwritten per request. To include request parameters that cannot be overwritten, we would need to use an invariants list instead:</p><pre class="crayon-plain-tag">&lt;lst name="invariants"&gt;
&lt;str name="defType"&gt;edismax&lt;/str&gt;
&lt;/lst&gt;</pre><p>Both lists can be used simultaneously. When duplicate keys are present the values in the invariants list will take precedence.</p>
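<p>The layering behaviour can be sketched in a few lines (a hypothetical simplification of what the request handler does, with illustrative parameter values):</p>

```python
# Solr layers request parameters: handler defaults are overridden by the
# request, and invariants override everything.
defaults = {"rows": "10", "spellcheck": "on"}
invariants = {"defType": "edismax"}

def effective_params(request_params):
    merged = dict(defaults)        # start from handler defaults
    merged.update(request_params)  # request overrides defaults
    merged.update(invariants)      # invariants always win
    return merged

print(effective_params({"rows": "20", "defType": "lucene"}))
```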
<p>Once we&#8217;ve made all our configuration changes, let&#8217;s restart Solr and reindex. To verify the changes worked, do a basic retrieve all query and check the resulting documents for the spellcheck field. Its contents should be the same as the document&#8217;s first_name and last_name fields.</p>
<p>Because we have enabled spellcheck by default,<a href="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-14-at-15.20.45.png"><img class="alignright size-medium wp-image-3575" src="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-14-at-15.20.45-174x300.png" alt="Gorgi" width="174" height="300" /></a> queries with possible suggestions will include contents in the spellcheck response object.</p>
<p>Try the Gorgi spellcheck query and experiment with different queries. To query the last_name field as well, change the qf parameter to <pre class="crayon-plain-tag">qf=first_name last_name</pre>.</p>
<p>The qf parameter defines which fields to use as the search domain.</p>
<p>When the spellcheck response object has content, you can easily use it to implement a basic &#8220;did you mean&#8221; feature. This will vastly improve your zero hit page.</p>
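<p>As a starting point for such a feature, here is a minimal sketch of pulling corrections out of the spellcheck response object. In Solr 4.x the suggestions come back as a flat list alternating between the misspelled term and a suggestion object; the sample response below is illustrative:</p>

```python
# Minimal "did you mean" sketch based on the alternating-list shape of the
# Solr 4.x spellcheck response.
sample = {
    "spellcheck": {
        "suggestions": [
            "gorgi", {"numFound": 1, "suggestion": ["georgi"]},
        ]
    }
}

def did_you_mean(response):
    suggestions = response.get("spellcheck", {}).get("suggestions", [])
    corrections = {}
    # Pair each misspelled term with its suggestion object.
    for term, info in zip(suggestions[::2], suggestions[1::2]):
        if info.get("suggestion"):
            corrections[term] = info["suggestion"][0]
    return corrections

print(did_you_mean(sample))  # {'gorgi': 'georgi'}
```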
<h2>Phonetic Search</h2>
<p>Now that we have a basic spellcheck component in place, the next feature that easily adds value to a people search system is phonetics. Solr ships with some <a href="https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory">basic</a> <a href="https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.DoubleMetaphoneFilterFactory">phonetic</a> <a href="https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.BeiderMorseFilterFactory">filters</a>. The most commonly used out-of-the-box phonetic filter is the DoubleMetaphoneFilterFactory, and it will suffice for most use cases. It does, however, have some weaknesses, which we will touch on briefly in the next section.</p>
<p>We need to once again modify our schema.xml to take advantage of Solr&#8217;s phonetic capabilities. Add the following:</p><pre class="crayon-plain-tag">&lt;fieldType name="phonetic" class="solr.TextField" &gt;
  &lt;analyzer&gt;
    &lt;tokenizer class="solr.StandardTokenizerFactory"/&gt;
    &lt;filter class="solr.DoubleMetaphoneFilterFactory" inject="true" maxCodeLength="4"/&gt;
  &lt;/analyzer&gt;
&lt;/fieldType&gt;

&lt;copyField source="*_name" dest="phonetic" /&gt;
&lt;field name="phonetic" type="phonetic" indexed="true" stored="false" multiValued="true" /&gt;</pre><p>Similar to spellcheck, we copy content from the name fields into a dedicated phonetic field. Its values are not stored, as we don&#8217;t need to return them in search results, but the field is indexed so we can include it in the search domain. Like spellcheck, it is multivalued to handle its multiple sources. The reason we create an additional search field is so that we can apply different weightings to exact and phonetic matches.</p>
<p>Restart Solr, clear the index and reindex.</p>
<p>Running the Georgi Phonetic search request should now return hits based on both exact and phonetic matches. To ensure that exact matches rank higher, we can add a query time boost to our query fields: <pre class="crayon-plain-tag">&amp;qf=first_name last_name phonetic^0.5</pre></p>
<p>Rather than apply boosts to fields we want to rank higher, it&#8217;s usually simpler to apply a punitive boost to fields we wish to rank lower. Replace the qf parameter in the Georgi Phonetic request and see how the first few results all have an exact match for georgi in the first_name field.</p>
<h2>Query Analysis</h2>
<p>As we look further down the result set, you will notice some strange matches. One employee, called Kirk Kalsbeek, is apparently a match for &#8220;georgi&#8221;. To understand why this is a match, we can use Solr&#8217;s analysis tool.<br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-14-at-17.09.01.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2022/04/Screen-Shot-2015-04-14-at-17.09.01-300x247.png" alt="Solr Analysis" width="300" height="247" class="aligncenter size-medium wp-image-3585" /></a><br />
It allows us to define an indexed value, a query value and the field type to use, and then shows how each value is tokenised and whether or not the query would result in a match.</p>
<p>With the values Kirk Kalsbeek, georgi and phonetic respectively, the analysis tool shows us that Kirk gets tokenised to KRK by our phonetic field type. Georgi is also tokenised to KRK, which results in a match.</p>
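<p>To see why such collisions happen, here is a toy phonetic key, emphatically NOT the real Double Metaphone algorithm, just a sketch of the kind of aggressive consonant folding these codecs perform:</p>

```python
# Toy phonetic key: drop vowels, fold G to K (metaphone-style), collapse
# repeated consonants, truncate. Enough to show why "Kirk" and "Georgi"
# can end up with the same key.
def toy_phonetic_key(name, max_len=4):
    folded = []
    for ch in name.upper():
        if ch in "AEIOUY":                  # drop vowels
            continue
        if ch == "G":                       # fold G to K
            ch = "K"
        if not folded or folded[-1] != ch:  # collapse doubled consonants
            folded.append(ch)
    return "".join(folded)[:max_len]

print(toy_phonetic_key("Kirk"))    # KRK
print(toy_phonetic_key("Georgi"))  # KRK - same key, hence the surprise match
```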
<p>To create a better phonetic search solution, we would have to implement a custom phonetic tokeniser. I came across <a href="https://github.com/kvalle/norphoname"> an example</a>, which has helped me enormously in improving phonetic search for Norwegian names on a project.</p>
<h2>Conclusion</h2>
<p>We should now be able to </p>
<ul>
<li>Implement index field based spellcheck</li>
<li>Use basic faceting</li>
<li>Implement Solr&#8217;s out of the box phonetic capabilities</li>
</ul>
<p>Query completion I will leave for next time. I promise you won&#8217;t have to wait as long between posts as last time :)</p>
<p>Let me know how you get on in the comments below!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Solr As A Document Processing Pipeline</title>
		<link>http://blog.comperiosearch.com/blog/2015/01/16/custom-solr-update-request-processors/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/01/16/custom-solr-update-request-processors/#comments</comments>
		<pubDate>Fri, 16 Jan 2015 10:40:48 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[content enrichment]]></category>
		<category><![CDATA[Document Processing]]></category>
		<category><![CDATA[pipeline]]></category>
		<category><![CDATA[update request processor]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3050</guid>
		<description><![CDATA[Recently on a project I got an interesting request. Content owners wanted to enrich new documents submitted to the search index with content from documents already present in the index. We use Solr as the search backend for this particular customer so I started thinking about how to achieve this with Solr. A bit of [...]]]></description>
				<content:encoded><![CDATA[<p>Recently on a project I got an interesting request. Content owners wanted to enrich new documents submitted to the search index with content from documents already present in the index. We use Solr as the search backend for this particular customer so I started thinking about how to achieve this with Solr.</p>
<h2>A bit of Solr background</h2>
<p>Solr ships with all the tools and features necessary for an advanced search solution. These include the oft-overlooked update request processors. They operate at the document level, i.e. before individual field tokenisation, and allow you to clean, modify and enrich incoming documents. Processing options include language identification, duplicate detection and HTML markup handling. Create a chain of them and you have a true document processing pipeline.</p>
<p>The Solr wiki includes a <a title="Update Request Processors" href="https://wiki.apache.org/solr/UpdateRequestProcessor#Full_list_of_UpdateRequestProcessor_Factories">brief entry </a> on the topic with an example of a custom processor that conditionally adds the field &#8220;cat&#8221; with value &#8220;popular&#8221;. The full list of UpdateRequestProcessor factories is available via the <a href="http://www.solr-start.com/info/update-request-processors/">Solr Start project</a>.</p>
<h2>Back to the initial request</h2>
<p>Certain incoming documents would contain a field, topicRef for example, with a reference to one or more documents already present in the index. The referenced documents could either contain a subsequent reference or content that we wanted to add to the incoming document. <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/10/docProcess.png"><img class="size-medium wp-image-3054 alignright" src="http://blog.comperiosearch.com/wp-content/uploads/2014/10/docProcess-220x300.png" alt="document pipeline" width="220" height="300" /></a></p>
<p>I needed a mechanism to retrieve any referenced documents, traverse a tree of subsequently referenced documents if necessary, and then map the eventual leaf documents&#8217; specified content fields to additional new fields in the incoming document.</p>
<p>I created a recursive document enrichment processor to do just that!</p>
<p>Its settings allow for multiple potential field retrievals and mappings, local and foreign key field definitions and the option to retrieve content from a remote Solr index.</p>
<script src="https://gist.github.com/fcd5b45cd42a40b97daa.js?file=RecursiveMergeExistingDocFactory"></script>
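<p>Stripped of all the Solr plumbing, the core traversal idea can be sketched like this (a hypothetical in-memory &#8220;index&#8221; and field names, not the actual processor implementation):</p>

```python
# Follow topicRef links through an in-memory "index" until a leaf document,
# then return the leaf's content field. Guards against cycles and missing refs.
index = {
    "doc1": {"topicRef": "doc2"},
    "doc2": {"topicRef": "doc3"},
    "doc3": {"content": "leaf content"},
}

def resolve_content(doc_id, seen=None):
    seen = seen if seen is not None else set()
    if doc_id in seen or doc_id not in index:  # cycle, or referenced doc absent
        return None
    seen.add(doc_id)
    doc = index[doc_id]
    if "topicRef" in doc:                      # subsequent reference: recurse
        return resolve_content(doc["topicRef"], seen)
    return doc.get("content")                  # leaf: return its content

print(resolve_content("doc1"))  # "leaf content"
```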
<p>A minor drawback of the current iteration of the processor is its reliance on the referenced documents already existing in the index: if they are not yet present, the processor will simply skip over them. To ensure documents are fully enriched, especially when the referenced documents are included in the same indexing batch, incoming documents must be reindexed unless the document indexing order is explicitly defined.</p>
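<p>One way to define that indexing order, sketched here under the assumption that each batch document carries at most one reference (field names and batch structure are hypothetical), is a topological sort so that referenced documents are indexed before the documents that point at them. Requires Python 3.9+ for graphlib:</p>

```python
from graphlib import TopologicalSorter

batch = {            # doc id -> referenced doc id (or None)
    "parent": "child",
    "child": "leaf",
    "leaf": None,
}

def indexing_order(docs):
    # Each doc depends on the doc it references, if that doc is in the batch.
    graph = {doc: {ref} if ref in docs else set() for doc, ref in docs.items()}
    return list(TopologicalSorter(graph).static_order())

order = indexing_order(batch)
print(order)  # referenced docs come before the docs that point at them
```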
<p>In addition, when a referenced document is updated, content owners expect this to have an impact on the content of the parent document and therefore a user&#8217;s search experience. This is currently not the case as parent documents are unaware of their child documents beyond the indexing process.</p>
<p>I&#8217;m now thoroughly enjoying tackling these issues and working on the next iteration of this RecursiveMergeExistingDoc processor!</p>
<h2>Update &#8211; 06/02/15</h2>
<p>The source code is now available on <a href="https://github.com/sebnmuller/SolrDocumentEnricher">github</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/01/16/custom-solr-update-request-processors/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Solr: Indexing SQL databases made easier!</title>
		<link>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/#comments</comments>
		<pubDate>Thu, 28 Aug 2014 12:05:17 +0000</pubDate>
		<dc:creator><![CDATA[Seb Muller]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[MySQL]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[jdbc]]></category>
		<category><![CDATA[mysql]]></category>
		<category><![CDATA[people search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2848</guid>
		<description><![CDATA[Update Part two is now available here! At the beginning of this year Christopher Vig wrote a great post about indexing an SQL database to the internet&#8217;s current search engine du jour, Elasticsearch. This first post in a two part series will show that Apache Solr is a robust and versatile alternative that makes indexing [...]]]></description>
				<content:encoded><![CDATA[<h3>Update</h3>
<p>Part two is now available <a href="http://blog.comperiosearch.com/blog/2015/04/14/solr-indexing-index-sql-databases-made-easier-part-2/">here!</a></p>
<hr />
<p>At the beginning of this year <a href="http://blog.comperiosearch.com/blog/author/cvig/">Christopher Vig</a> wrote a <a href="http://blog.comperiosearch.com/blog/2014/01/30/elasticsearch-indexing-sql-databases-the-easy-way/">great post </a>about indexing an SQL database to the internet&#8217;s current search engine du jour, <a href="http://www.elasticsearch.org/">Elasticsearch.</a> This first post in a two part series will show that <a href="http://lucene.apache.org/solr/">Apache Solr</a> is a robust and versatile alternative that makes indexing an SQL database just as easy. The second will go deeper into how to leverage Solr&#8217;s features to create a great backend for a people search solution.</p>
<p>Solr ships with a configuration driven contrib called the <a href="http://wiki.apache.org/solr/DataImportHandler">DataImportHandler.</a> It provides a way to index structured data into Solr in both full and incremental delta imports. We will cover a simple use case of the tool i.e. indexing a database containing personnel data to form the basis of a people search solution. You can also easily extend the DataImportHandler tool via various <a href="http://wiki.apache.org/solr/DataImportHandler#Extending_the_tool_with_APIs">APIs</a> to pre-process data and handle more complex use cases.</p>
<p>For now, let&#8217;s stick with basic indexing of an SQL database.</p>
<h2>Setting up our environment</h2>
<p>Before we get started, there are a few requirements:</p>
<ol>
<li>Java 1.7 or greater</li>
<li>For this demo we&#8217;ll be using a <a href="http://dev.mysql.com/downloads/mysql/">MySQL</a> database</li>
<li>A copy of the <a href="https://launchpad.net/test-db/employees-db-1/1.0.6/+download/employees_db-full-1.0.6.tar.bz2">sample employees database</a></li>
<li>The MySQL <a href="http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.32.tar.gz">jdbc driver</a></li>
</ol>
<p>With that out of the way, let&#8217;s get Solr up and running and ready for database indexing:</p>
<ol>
<li>Download <a href="https://lucene.apache.org/solr/downloads.html">Solr</a> and extract it to a directory of your choice.</li>
<li>Open solr-4.9.0/example/solr/collection1/conf/solrconfig.xml in a text editor and add the following within the config tags:  <script src="https://gist.github.com/dd7cef212fd7f6a415b5.js?file=DataImportHandler"></script></li>
<li>In the same directory, open schema.xml and add this line <script src="https://gist.github.com/5bbc8c6e1a5b617b5d16.js?file=names"></script></li>
<li>Create a lib subdir in solr-4.9.0/example/solr/collection1/ and extract the MySQL jdbc driver jar into it. It&#8217;s the file called mysql-connector-java-{version}-bin.jar</li>
<li>To start Solr, open a terminal and navigate to the example subdir in your extracted Solr directory and run <code>java -jar start.jar</code></li>
</ol>
<p>When started this way, Solr runs by default on port 8983. If you need to change this, edit solr-4.9.0/example/etc/jetty.xml and restart Solr.</p>
<p>Navigate to <a href="http://localhost:8983/solr">http://localhost:8983/solr</a> and you should see the Solr admin GUI splash page. From here, use the Core Selector dropdown button to select the default core and then click on the Dataimport option. Expanding the Configuration section should show an XML response containing a stacktrace with a message along the lines of <code>Can't find resource 'db-data-config.xml' in classpath</code>. This is normal, as we haven&#8217;t yet created this file, which will hold the configuration for connecting to our target database.</p>
<p>We&#8217;ll come back to that file later but let&#8217;s make our demo database now. If you haven&#8217;t already downloaded the sample employees database and installed MySQL, now would be a good time!</p>
<h2>Setting up our database</h2>
<p>Assuming your MySQL server is installed <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/12/createdatabase.png"><img class="alignright size-full wp-image-2900" src="http://blog.comperiosearch.com/wp-content/uploads/2014/12/createdatabase-300x226.png" alt="Prepare indexing database" width="300" height="226" /></a>and running, access the MySQL terminal and create the empty employees database: <code>create database employees;</code></p>
<p>Exit the MySQL terminal and import the employees.sql into your empty database, ensuring that you carry out the following command from the same directory as the employees.sql file itself: <code>mysql -u root -p employees &lt; employees.sql</code></p>
<p>You can test this was successful by logging <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/testdatabase.png"><img class="alignright size-medium wp-image-2900" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/testdatabase-276x300.png" alt="Verify indexing database" width="276" height="300" /></a>into the MySql server and querying the database, as shown here on the right.</p>
<p>Having successfully created and populated your employee database, we can now create that missing db-data-config.xml file.</p>
<h2>Indexing our database</h2>
<p>In your Solr conf directory, which contains the schema.xml and solrconfig.xml we previously modified, create a new file called db-data-config.xml.</p>
<p>Its contents should look like the example below. Make sure to replace the user and password values with yours, and feel free to modify or remove the limit parameter. There are approximately 300,000 entries in the employees table in total. <script src="https://gist.github.com/03935f1384e150504363.js?file=db-data-config"></script></p>
<p>We&#8217;re now going to make use of Solr&#8217;s REST-like HTTP API with a couple of commands worth saving. I prefer to use the <a href="https://chrome.google.com/webstore/detail/postman-rest-client/fdmmgilgnpjigdojojpjoooidkmcomcm">Postman app</a> on Chrome and have created a public collection of HTTP requests, which you can import into Postman&#8217;s Collections view using this url: <a href="https://www.getpostman.com/collections/9e95b8130556209ed643">https://www.getpostman.com/collections/9e95b8130556209ed643</a></p>
<p>For those of you not using Chrome, here are the commands you will need:<script src="https://gist.github.com/05a2a1dd01a6c5a4517b.js?file=solr-http"></script> First let&#8217;s reload the core so that Solr is <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/reloadcore.png"><img class="alignright size-medium wp-image-2921" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/reloadcore-300x181.png" alt="Reload Solr core" width="300" height="181" /></a><br />
aware of the new db-data-config.xml file we have created.<br />
Next, we index our database with the <a href="http://blog.comperiosearch.com/wp-content/uploads/2014/08/indexdb.png"><img class="alignright size-medium wp-image-2923" src="http://blog.comperiosearch.com/wp-content/uploads/2014/08/indexdb-300x181.png" alt="Index database to Solr" width="300" height="181" /></a>HTTP request or from within the Solr Admin GUI on the DataImport page.</p>
<p>Here we have carried out a full index of our database using the full-import command parameter. To only retrieve changes since the last import, we would use delta-import instead.</p>
<p>We can confirm that our database import was successful by querying our index with the &#8220;Retrieve all&#8221; and &#8220;Georgi query&#8221; requests.</p>
<p>Finally, to schedule reindexing you can use a simple cronjob. This one, for example, will run every day at 23:00 and retrieve all changes since the previous indexing operation:<script src="https://gist.github.com/47f6df5a306e4cd51617.js?file=delta"></script></p>
<h2>Conclusion</h2>
<p>So far we have successfully</p>
<ul>
<li>Setup a database with content</li>
<li>Indexed the database into our Solr index</li>
<li>Setup basic scheduled delta reindexing</li>
</ul>
<p>In the next part of this two part series we will look at how to process our indexed data, specifically with a view to building a good people search solution. We will implement several features such as phonetic search, spellcheck and basic query completion. In the meantime, let&#8217;s carry on the conversation in the comments below!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/08/28/indexing-database-using-solr/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>
