FAST ESP Advanced Linguistics - Part 3/3 - Search Nuggets

This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.

For the individual parts, see

Part 1: Tokenization and Character Normalization
Part 2: Phonetic Normalization
Part 3: Lemmatization

Lemmatization

Note: The following was done for a directory service in Norway and Sweden, hence the names and language resources used in the examples.

How it works (lemmatization by document expansion)

Lemmatization is the process of reducing a word to its base form, the lemma. (It is more advanced than “stemming”, which simply chops off endings.) Lemmatization by document expansion identifies the lemma of each word during document/content processing and expands the field to include all inflected variants of that lemma. When lemmatization is turned on for a query, the query is rewritten to search a lemmatized shadow version of the composite field instead (named with a “lem” prefix, e.g. “lemcompositename”).
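The idea can be illustrated with a small Python sketch. It is a toy model only: the word forms are hypothetical sample data, and a real lemmatizer looks words up in a compiled dictionary automaton rather than a Python dict.

```python
# Toy illustration of lemmatization by document expansion.
# The forms below are hypothetical sample data (Norwegian "bil" = car),
# not taken from any FAST ESP dictionary.
LEMMA_FORMS = {
    "bil": ["bil", "bilen", "biler", "bilene"],
}

# Reverse lookup: any surface form -> its lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in LEMMA_FORMS.items() for form in forms
}


def expand(tokens):
    """Build the shadow field: every known form of each token's lemma."""
    expanded = []
    for token in tokens:
        lemma = FORM_TO_LEMMA.get(token)
        # Unknown words are kept as-is.
        variants = LEMMA_FORMS[lemma] if lemma else [token]
        for variant in variants:
            if variant not in expanded:
                expanded.append(variant)
    return expanded


original_field = ["biler", "oslo"]
shadow_field = expand(original_field)

# A query for "bilen" misses the original field but hits the shadow field.
print("bilen" in original_field)
print("bilen" in shadow_field)
```

Because the expansion happens at indexing time, the query term itself does not need to be analysed; it only has to be directed at the shadow field.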

Expansion during document processing

The content processing pipeline’s Lemmatization stage is automatically configured to process index profile fields marked with the attribute lemmatize="yes", and creates shadow variants of a composite marked with the attribute lemmas="yes". The file LemmatizationConfig.xml specifies for which languages lemmatization should occur and that it is done by document_expansion.

<standard_lemmatizer language="no" alt="nn,nb" mode="document_expansion" active="yes">
    <lemmas active="yes" parts_of_speech="NA" />
</standard_lemmatizer>

<standard_lemmatizer language="sv" alt="se" mode="document_expansion" active="yes">
    <lemmas active="yes" parts_of_speech="NA" />
</standard_lemmatizer>

The language is specified with an ISO 639 two-letter code, and other accepted codes are listed in the “alt” attribute. The language, parts_of_speech and mode attributes together identify the lemmatization automaton used for this language, which resides in the lemmatization automaton directory:

~/esp/etc/resources/dictionaries/lemmatization/no_NA_exp.aut
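As a rough sketch of how the three attributes map to a file name, the Python snippet below rebuilds the example path. The mode-to-suffix mapping (document_expansion to "exp") is inferred from that single example and is an assumption; the function name is hypothetical, and other modes are deliberately not covered.

```python
from pathlib import PurePosixPath

# Assumption: suffix inferred from the one example path in the text.
MODE_SUFFIX = {"document_expansion": "exp"}

LEMMATIZATION_DIR = PurePosixPath(
    "~/esp/etc/resources/dictionaries/lemmatization"
)


def automaton_path(language, parts_of_speech, mode):
    """Return the expected automaton path for a <standard_lemmatizer> entry."""
    name = f"{language}_{parts_of_speech}_{MODE_SUFFIX[mode]}.aut"
    return str(LEMMATIZATION_DIR / name)


print(automaton_path("no", "NA", "document_expansion"))
```

A helper like this can be used to check that the automaton file exists for every language activated in LemmatizationConfig.xml before feeding content.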

The pipeline stage Lemmatizer(webcluster) must come after the stage Tokenizer(webcluster), since the lemmatizer works on tokenized text.
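The ordering rule is easy to verify if you have the stage names of a pipeline as a list. The sketch below assumes exactly that; how you obtain the list is not shown, and every stage name other than the two mentioned above is a hypothetical placeholder.

```python
def lemmatizer_follows_tokenizer(stages):
    """Return True if the lemmatizer stage comes after the tokenizer stage."""
    tokenizer_pos = stages.index("Tokenizer(webcluster)")
    lemmatizer_pos = stages.index("Lemmatizer(webcluster)")
    return lemmatizer_pos > tokenizer_pos


# "SomeEarlierStage" and "SomeLaterStage" are placeholders, not real stages.
pipeline = [
    "SomeEarlierStage",
    "Tokenizer(webcluster)",
    "Lemmatizer(webcluster)",
    "SomeLaterStage",
]
print(lemmatizer_follows_tokenizer(pipeline))
```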

Query transformation

To run lemmatized searches by setting a query property (instead of prefixing the composite field with “lem” yourself), lemmatization must be enabled in the query pipeline. The query transformation configuration file is qtf_config.xml, and the pipeline in use is scopesearch. It should contain the entry

<instance-ref name="lemmatizer"/>
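A quick way to confirm the entry is present is to parse the file. The sketch below runs against an inline sample; the root element name, the "pipeline" wrapper element and the "tokenize" instance are assumptions made for illustration, so adjust them to match your actual qtf_config.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for qtf_config.xml.
SAMPLE_CONFIG = """
<qtf-config>
  <pipeline name="scopesearch">
    <instance-ref name="tokenize"/>
    <instance-ref name="lemmatizer"/>
  </pipeline>
</qtf-config>
"""


def pipeline_has_instance(xml_text, pipeline_name, instance_name):
    """Check whether the named pipeline references the named instance."""
    root = ET.fromstring(xml_text)
    for pipeline in root.iter("pipeline"):
        if pipeline.get("name") == pipeline_name:
            return any(
                ref.get("name") == instance_name
                for ref in pipeline.iter("instance-ref")
            )
    return False


print(pipeline_has_instance(SAMPLE_CONFIG, "scopesearch", "lemmatizer"))
```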

After editing qtf_config.xml, issue this command to activate the changes:

view-admin -m refresh

Usage

To search the lemmatized version automatically, without altering the query, run the search with the property qtf_lemmatize=true. Alternatively, leave the property out and address the lemma-enabled composite field directly by adding a “lem” prefix to its name.
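The two options can be compared side by side with a small Python sketch that builds the request parameters. The parameter name "query", the field:term syntax and the composite name are assumptions for illustration; only qtf_lemmatize and the “lem” prefix come from the text above.

```python
from urllib.parse import urlencode


def search_params(term, composite="compositename", use_qtf=True):
    """Build URL-encoded parameters for either way of searching lemmas."""
    if use_qtf:
        # Let the query pipeline redirect the search to the shadow field.
        return urlencode(
            {"query": f"{composite}:{term}", "qtf_lemmatize": "true"}
        )
    # Address the lemmatized shadow field directly.
    return urlencode({"query": f"lem{composite}:{term}"})


print(search_params("biler"))
print(search_params("biler", use_qtf=False))
```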

Relevance note:

When searching with qtf_lemmatize=true, you have no way to differentiate the rank contribution of an exact hit from that of a hit in the lemmatized version. Without the property set (or with it set to false), you can query both fields explicitly and reduce the weight of the expanded version like this:

and(
    namecomp:string("mysearchword", weight=100),
    lemnamecomp:string("mysearchword", weight=10)
)
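If such queries are generated by application code, a small helper keeps the two weights in one place. The following is a minimal sketch with a hypothetical function name; it separates the operands with a comma, as FQL operator arguments normally are, and simply rejects terms containing double quotes rather than guessing at escaping rules.

```python
def weighted_lemma_query(term, field="namecomp",
                         exact_weight=100, lemma_weight=10):
    """Build an FQL query that ranks exact hits above lemma hits."""
    if '"' in term:
        # Escaping is out of scope for this sketch.
        raise ValueError("term must not contain double quotes")
    return (
        f'and({field}:string("{term}", weight={exact_weight}), '
        f'lem{field}:string("{term}", weight={lemma_weight}))'
    )


print(weighted_lemma_query("mysearchword"))
```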
