FAST ESP Advanced Linguistics — Part 3/3

This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.

Lemmatization

Note: The following was done for a directory service in Norway and Sweden, hence the names and language resources used in the examples.

How it works (lemmatization by document expansion)

Lemmatization is the process of reducing a word to its base form, the lemma. (It is more advanced than “stemming”, which simply chops off endings.) Lemmatization by document expansion identifies the lemma of each word during document/content processing and expands the field to include all inflected variants of that lemma. When lemmatization is turned on for a query, the query is rewritten to search a lemmatized shadow version of the composite field instead (called “lemcompositename”).
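To make the idea concrete, here is a minimal toy sketch in Python. It is not how FAST ESP implements the feature (ESP uses compiled automata, not a Python dictionary); the word list, the function names and the composite name “compositename” are assumptions made purely for illustration.

```python
# Toy illustration of lemmatization by document expansion.
# LEMMA_FORMS is a hypothetical mini-dictionary; FAST ESP uses compiled automata.
LEMMA_FORMS = {
    "bil": ["bil", "bilen", "biler", "bilene"],  # Norwegian: "car"
}

# Reverse lookup: any inflected form -> its lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in LEMMA_FORMS.items() for form in forms
}


def expand_field(tokens):
    """Return the shadow-field tokens: known words become all forms of their lemma."""
    expanded = []
    for token in tokens:
        lemma = FORM_TO_LEMMA.get(token.lower())
        forms = LEMMA_FORMS[lemma] if lemma else [token.lower()]
        for form in forms:
            if form not in expanded:
                expanded.append(form)
    return expanded


def index_document(text):
    """Build the original field and its lemmatized shadow field."""
    tokens = text.split()
    return {
        "compositename": [t.lower() for t in tokens],  # assumed composite name
        "lemcompositename": expand_field(tokens),
    }


def matches(doc, term, lemmatize=False):
    """Search the shadow field when lemmatization is on, else the original field."""
    field = "lemcompositename" if lemmatize else "compositename"
    return term.lower() in doc[field]
```

With a document containing “Oslo biler”, a search for “bilen” misses the original field but hits the shadow field, because the shadow field was expanded with every form of “bil” at indexing time.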

Expansion during document processing

The content processing pipeline’s Lemmatization stage is automatically configured to process index profile fields marked with the attribute lemmatize="yes", and creates shadow variants of a composite marked with the attribute lemmas="yes". The file LemmatizationConfig.xml specifies for which languages lemmatization should occur and that it is done by document_expansion.

The language is specified with an ISO 639 two-letter code, with other accepted codes listed in the “alt” attribute. Language, parts_of_speech and mode together identify the lemmatization automaton used for this language, which resides in the lemmatization automaton directory:

~/esp/etc/resources/dictionaries/lemmatization/no_NA_exp.aut
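The file name above appears to be composed of the language, the parts_of_speech value and an abbreviation of the mode. The following sketch encodes that reading; the naming pattern and the mapping from document_expansion to “exp” are inferred from this single example and are assumptions, as is the helper name.

```python
from pathlib import PurePosixPath

# Directory taken from the example path above.
AUTOMATON_DIR = PurePosixPath("~/esp/etc/resources/dictionaries/lemmatization")

# Assumed abbreviation: only "exp" is visible in the example file name.
MODE_SUFFIX = {"document_expansion": "exp"}


def automaton_path(language, parts_of_speech, mode):
    """Compose <language>_<parts_of_speech>_<mode suffix>.aut in the automaton directory."""
    suffix = MODE_SUFFIX[mode]
    return AUTOMATON_DIR / f"{language}_{parts_of_speech}_{suffix}.aut"
```

A helper like this can be handy for checking that the automaton file a configuration refers to actually exists before the pipeline is started.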

The pipeline stage Lemmatizer(webcluster) must come after the stage Tokenizer(webcluster).
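If you keep the stage order of a pipeline in a list (for example in a deployment script), the ordering rule can be checked with a few lines of Python. This is only a sketch; the function name is hypothetical and any stage names other than the two mentioned above are made up.

```python
def lemmatizer_follows_tokenizer(stages):
    """Return True if both stages are present and the Lemmatizer comes after the Tokenizer."""
    try:
        tokenizer_pos = stages.index("Tokenizer(webcluster)")
        lemmatizer_pos = stages.index("Lemmatizer(webcluster)")
    except ValueError:
        # One of the two stages is missing from the pipeline.
        return False
    return tokenizer_pos < lemmatizer_pos
```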

Query transformation

To do lemmatized searches by setting a query property, instead of simply prefixing the composite field with “lem”, lemmatization must be enabled in the query pipeline. The query transformation configuration file is qtf_config.xml. The pipeline in use is scopesearch. It should contain the entry

<instance-ref name="lemmatizer"/>

After editing qtf_config.xml, issue this command to activate the changes:

view-admin -m refresh
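Before refreshing, it can be useful to verify that the scopesearch pipeline really contains the lemmatizer reference. The sketch below does this with Python's standard XML parser. The surrounding layout of qtf_config.xml (a pipeline element with a name attribute, wrapped in a configuration element) and the “tokenize” instance are assumptions for illustration; check them against your own file.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified layout of qtf_config.xml; the real file contains more
# elements and may structure its pipelines differently.
SAMPLE_CONFIG = """
<configuration>
  <pipeline name="scopesearch">
    <instance-ref name="tokenize"/>
    <instance-ref name="lemmatizer"/>
  </pipeline>
</configuration>
"""


def pipeline_has_instance(xml_text, pipeline, instance):
    """Return True if the named pipeline contains an instance-ref with the given name."""
    root = ET.fromstring(xml_text)
    for node in root.iter("pipeline"):
        if node.get("name") == pipeline:
            return any(
                ref.get("name") == instance for ref in node.iter("instance-ref")
            )
    return False
```

To check a real installation, read the contents of qtf_config.xml into a string and pass it in place of SAMPLE_CONFIG.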

Usage

To search the lemmatized version automatically, without altering the query, the search must be done with the property qtf_lemmatize=true. Alternatively, the lemmatized version can be searched directly, without this property, by addressing the lemma-enabled composite field with a “lem” prefix.
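The two options can be sketched as follows. Only the qtf_lemmatize property and the “lem” prefix come from the text above; the search URL, the “query” parameter name, the field:term query syntax and the composite name “compositename” are hypothetical placeholders that will differ between installations.

```python
from urllib.parse import urlencode

# Hypothetical search endpoint; replace with the address of your own search front end.
SEARCH_URL = "http://localhost:15100/cgi-bin/search"


def lemmatized_by_property(field, term):
    """Leave the query untouched; the query pipeline redirects it to the shadow field."""
    params = {"query": f"{field}:{term}", "qtf_lemmatize": "true"}
    return SEARCH_URL + "?" + urlencode(params)


def lemmatized_by_prefix(field, term):
    """Address the shadow field directly by prefixing the composite name with 'lem'."""
    params = {"query": f"lem{field}:{term}"}
    return SEARCH_URL + "?" + urlencode(params)
```

The first form keeps client queries unchanged, while the second leaves the choice between the exact and the lemmatized field to whoever writes the query.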

Relevance note:

When searching with qtf_lemmatize=true, there is no way to distinguish the rank contribution of an exact hit from that of a hit in the lemmatized version. Without the property set (or with it set to false), you address the “lem”-prefixed composite explicitly in the query, which lets you give the expanded version a lower weight than the exact one.

Article written by

Hans Terje Bakke
Hans Terje Bakke is one of Comperio's most experienced and knowledgeable senior consultants. Besides his interest in search technology, Hans Terje is a very enthusiastic game developer and has held senior positions throughout the gaming industry. Hans Terje holds an M.Sc. in Engineering/Computer Science from the Norwegian Institute of Technology.

