FAST ESP Advanced Linguistics — Part 2/3

This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.

Phonetic Normalization

Note: The following was done for a directory service, hence the names used in the examples.

Key info

index profile: ~/esp/index-profiles/index-profile.xml
query pipeline config: ~/esp/etc/config_data/QRServer/webcluster/etc/qrserver/qtf_config.xml
normalization config: ~/esp/etc/phonetic/my_phonetics.xml
content pipeline stages: myPhoneticNormalizer (PhoneticNormalizer)

How it works

Phonetic transformations are performed during content processing, and the results are written to dedicated fields in the index. In the query transformation pipeline, the query is transformed into its phonetic version and searched against the phonetic index fields.

How character combinations are transformed into phonetic codes is specified in the phonetic normalization config file (my_phonetics.xml).
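To make the idea of mapping character combinations to phonetic codes concrete, here is a minimal Python sketch. The rule table and the function name phonetic_normalize are invented for illustration; they are not the contents or the format of my_phonetics.xml, and FAST ESP's own matching logic may well differ. The sketch simply tries the longest character combination first at each position.

```python
# Hypothetical rules (character combination -> phonetic code).
# These are NOT taken from my_phonetics.xml; they only illustrate the idea.
RULES = {
    "ch": "k",
    "ck": "k",
    "ph": "f",
    "th": "t",
    "c": "k",
    "z": "s",
}


def phonetic_normalize(text, rules=RULES):
    """Replace character combinations with codes, longest match first."""
    text = text.lower()
    max_len = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible combination at this position first.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in rules:
                out.append(rules[chunk])
                i += n
                break
        else:
            # No rule matched: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(phonetic_normalize("Christophersen"))
print(phonetic_normalize("Kristofersen"))
```

With these toy rules, both spellings of the name collapse to the same code, which is what lets a query for one spelling match a document containing the other.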

Index transformation

Phonetic transformations are performed on selected fields from the index profile and stored in specified fields in the search index. Field mapping is specified in the content pipeline stage like this:

default:etc/phonetic/my_phonetics.xml:name:phonname
default:etc/phonetic/my_phonetics.xml:industrynames:phonindustrynames
default:etc/phonetic/my_phonetics.xml:streetname:phonstreetname
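Each mapping line appears to consist of four colon-separated parts. The following Python sketch splits one such line into named parts, assuming exactly that layout. The names PhoneticMapping and parse_mapping are hypothetical helpers for reading the lines above, not part of FAST ESP, and the meaning of the first component ("default") is not stated in this post, so it is kept as an opaque prefix.

```python
from typing import NamedTuple


class PhoneticMapping(NamedTuple):
    prefix: str        # first component ("default" above); meaning not stated in the post
    config_path: str   # phonetic normalization config file
    source_field: str  # field the text is read from
    target_field: str  # index field that receives the phonetic version


def parse_mapping(line):
    """Split one mapping line, assuming four colon-separated parts."""
    parts = line.strip().split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated parts, got {len(parts)}: {line!r}")
    return PhoneticMapping(*parts)


mapping = parse_mapping("default:etc/phonetic/my_phonetics.xml:name:phonname")
print(mapping.source_field, "->", mapping.target_field)
```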

The phonetic fields are included in certain composite fields in the index profile. These fields typically have low context weight.

(Note: The Phonetic Normalizer pipeline stage has to come before the Tokenizer and Lemmatizer, making it impossible to get phonetic transformations of the lemmatized variants for lemmatization by document expansion.)

Query transformation

The query transformation configuration file is qtf_config.xml. The pipeline in use is scopesearch. It should contain the entry

<instance-ref name="phoneticnormalizer"/>

The instance configuration looks like this:

The "languages" are specified as ISO 639 two-letter codes. The language value "no.10" names one transformation variant; here the "10" refers to using context fields of weight 10, but it could be any string (I think!). The "no.10.configuration" value refers to the file with the phonetic normalization configuration. (There could, for example, be another one for a "no.20", with less tolerance.)

"fields" states which query fields are subject to phonetic transformation in the query pipeline (actually only which map name), and the "fieldname.language.strength.map" values state which query scope (field or composite field) should be mapped to which index scope.

Example with the above specification: A query of

namecomp:something

would be rewritten to also search for the phonetic transformation "2o5eti5" in the mapped field, like this:

any(namecomp:something, phonnamecomp:2o5eti5);
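The rewrite step can be sketched in Python as follows. This is only an illustration of the mapping logic, not the QRServer implementation: rewrite_query, FIELD_MAP and PHONETIC_CODES are hypothetical names, and the phonetic transformation is stubbed with a lookup table holding the single code given in the example above.

```python
# Query scope -> phonetic index scope, as in the example above.
FIELD_MAP = {"namecomp": "phonnamecomp"}

# Stub for the phonetic transformation; the only entry is the code from the example.
PHONETIC_CODES = {"something": "2o5eti5"}


def rewrite_query(query, field_map=FIELD_MAP, to_phonetic=PHONETIC_CODES.get):
    """Add a phonetic alternative when the query scope has a mapping."""
    scope, sep, term = query.partition(":")
    target = field_map.get(scope)
    code = to_phonetic(term) if sep else None
    if target is None or code is None:
        # Unmapped scope or no phonetic code: leave the query untouched.
        return query
    return f"any({query}, {target}:{code});"


print(rewrite_query("namecomp:something"))
```

Queries against scopes without a mapping pass through unchanged in this sketch, which matches the idea that only the configured fields are subject to phonetic transformation.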

After editing qtf_config.xml, issue this command to activate the changes:

view-admin -m refresh

Usage

Simply search like you would normally do. The query pipeline rewrites the query for you.

Article written by

Hans Terje Bakke
Hans Terje Bakke is one of Comperio's most experienced and knowledgeable senior consultants. Besides his interest in search technology, Hans Terje is a very enthusiastic game developer and has held senior positions throughout the gaming industry. Hans Terje holds an M.Sc. in Engineering/Computer Science from the Norwegian Institute of Technology.

