FAST ESP Advanced Linguistics — Part 2/3
This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.
For the individual parts, see
Part 2: Phonetic Normalization
Part 3: Lemmatization
Phonetic Normalization
Note: The following was done for a directory service, hence the names in the examples used.
Key info
index profile: ~/esp/index-profiles/index-profile.xml
query pipeline config: ~/esp/etc/config_data/QRServer/webcluster/etc/qrserver/qtf_config.xml
normalization config: ~/esp/etc/phonetic/my_phonetics.xml
content pipeline stages: myPhoneticNormalizer (PhoneticNormalizer)
How it works
Phonetic transformations are performed during content processing and written to dedicated fields in the index. In the query transformation pipeline, the query is transformed into its phonetic version and searched against the phonetic index fields.
How character combinations are transformed into phonetic codes is specified in the phonetic normalization config file (my_phonetics.xml).
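To make the idea of mapping character combinations to phonetic codes concrete, here is a minimal Python sketch of a longest-match rewrite. The rule table is purely hypothetical: it is not the content or format of my_phonetics.xml, and it is not how FAST ESP implements the stage. It only shows why two spellings of a name can end up with the same code.

```python
# Hypothetical rules for illustration only; NOT taken from my_phonetics.xml.
RULES = {
    "ch": "k",
    "ck": "k",
    "ph": "f",
    "ff": "f",
    "th": "t",
    "c": "k",
    "z": "s",
}


def phonetic_code(word, rules=RULES):
    """Rewrite a word left to right, preferring the longest matching rule."""
    word = word.lower()
    longest = max(len(key) for key in rules)
    out = []
    i = 0
    while i < len(word):
        # Try the longest possible character combination first.
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i += size
                break
        else:
            # No rule matched: keep the character unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    for name in ("Christopher", "Kristoffer"):
        print(name, "->", phonetic_code(name))
```

With these made-up rules, both spellings collapse to the same code, which is what lets a query for one spelling match a document containing the other.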
Index transformation
Phonetic transformations are performed on selected fields from the index profile and stored in specified fields in the search index. Field mapping is specified in the content pipeline stage like this:
default:etc/phonetic/my_phonetics.xml:name:phonname
default:etc/phonetic/my_phonetics.xml:industrynames:phonindustrynames
default:etc/phonetic/my_phonetics.xml:streetname:phonstreetname
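Each mapping line above has four colon-separated parts. The following Python sketch splits such a line into named parts, which can help when checking a long mapping list for typos. The names FieldMapping, parse_mapping and selector are my own; in particular, reading the first part ("default") as some kind of selector is an assumption, not something the product documentation is quoted on here.

```python
from typing import NamedTuple


class FieldMapping(NamedTuple):
    selector: str  # "default" in the examples; exact meaning is an assumption
    config: str    # path to the phonetic normalization config file
    source: str    # field from the index profile that is read
    target: str    # index field that receives the phonetic version


def parse_mapping(line):
    """Split one 'selector:config:source:target' line into its parts."""
    parts = line.strip().split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated parts, got {len(parts)}: {line!r}")
    return FieldMapping(*parts)


if __name__ == "__main__":
    mapping = parse_mapping("default:etc/phonetic/my_phonetics.xml:name:phonname")
    print(mapping.source, "->", mapping.target)
```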
The phonetic fields are included in certain composite fields in the index profile. These fields typically have low context weight.
(Note: The Phonetic Normalizer pipeline stage has to come before the Tokenizer and Lemmatizer, making it impossible to get phonetic transformations of the lemmatized variants for lemmatization by document expansion.)
Query transformation
The query transformation configuration file is qtf_config.xml. The pipeline in use is scopesearch. It should contain the entry:
<instance-ref name="phoneticnormalizer"/>
The instance configuration looks like this:
<instance name="phoneticnormalizer" type="external" resource="qt_phonetic">
  <parameter-list name="qt.phonetic">
    <parameter name="enable" value="1"/>
    <parameter name="feedback_on_term_rejection" value="0"/>
    <parameter name="languages" value="no.10,sv.10"/>
    <parameter name="no.10.configuration" value="etc/phonetic/my_phonetics.xml"/>
    <parameter name="sv.10.configuration" value="etc/phonetic/my_phonetics.xml"/>
    <parameter name="fields" value="namecomp,industrycomp,streetcomp"/>
    <parameter name="namecomp.no.10.map" value="namecomp:phonnamecomp"/>
    <parameter name="namecomp.sv.10.map" value="namecomp:phonnamecomp"/>
    <parameter name="industrycomp.no.10.map" value="industrycomp:phonindustrycomp"/>
    <parameter name="industrycomp.sv.10.map" value="industrycomp:phonindustrycomp"/>
    <parameter name="streetcomp.no.10.map" value="streetcomp:phonstreetcomp"/>
    <parameter name="streetcomp.sv.10.map" value="streetcomp:phonstreetcomp"/>
    <parameter name="default" value="no"/>
    <parameter name="defaultweight" value="100"/>
  </parameter-list>
</instance>
The "languages" are specified as ISO 639-1 two-letter codes followed by a suffix. A value such as "no.10" names one transformation variant for that language; here the "10" refers to using context fields of weight 10, but it could be any string (I think!). The "no.10.configuration" parameter refers to the file with the phonetic normalization configuration. (There could, for example, be another one for a "no.20" variant with less tolerance.)
The "fields" parameter states which query fields are subject to phonetic transformation in the query pipeline (actually only which map name), and the "fieldname.language.strength.map" values state which query scope (field or composite field) is mapped to which index scope.
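As a cross-check of the naming scheme, this Python sketch reads an abbreviated copy of the instance configuration and collects the scope mappings per map name and language. It is a sanity-check helper of my own (read_maps is not part of FAST ESP), and it assumes every map parameter is named exactly "mapname.language.map" with a "queryscope:indexscope" value.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the instance configuration, for illustration.
CONFIG = """
<instance name="phoneticnormalizer" type="external" resource="qt_phonetic">
  <parameter-list name="qt.phonetic">
    <parameter name="languages" value="no.10,sv.10"/>
    <parameter name="fields" value="namecomp,industrycomp,streetcomp"/>
    <parameter name="namecomp.no.10.map" value="namecomp:phonnamecomp"/>
    <parameter name="streetcomp.sv.10.map" value="streetcomp:phonstreetcomp"/>
  </parameter-list>
</instance>
"""


def read_maps(xml_text):
    """Return {(map_name, language): (query_scope, index_scope)}."""
    root = ET.fromstring(xml_text)
    params = {p.get("name"): p.get("value") for p in root.iter("parameter")}
    maps = {}
    for map_name in params["fields"].split(","):
        for language in params["languages"].split(","):
            value = params.get(f"{map_name}.{language}.map")
            if value is None:
                # No mapping configured for this combination.
                continue
            query_scope, index_scope = value.split(":")
            maps[(map_name, language)] = (query_scope, index_scope)
    return maps


if __name__ == "__main__":
    for key, scopes in read_maps(CONFIG).items():
        print(key, "->", scopes)
```

Combinations listed in "fields" and "languages" that have no matching map parameter are simply skipped here, which makes missing entries easy to spot by comparing against the expected list.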
Example with the above specification: A query of
namecomp:something
would be rewritten to also search for the phonetic transformation "2o5eti5" in the mapped field, like this:
any(namecomp:something, phonnamecomp:2o5eti5);
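The rewrite can be mimicked in a few lines of Python. This is only a sketch of the observable behaviour described above, not the qt_phonetic implementation: the function name rewrite is made up, the phonetic transformation is passed in as a plain callable, and the demo uses a stub that returns the code quoted in the example instead of applying real rules.

```python
# Query scope -> phonetic index scope, as in the map parameters of the config.
FIELD_MAP = {
    "namecomp": "phonnamecomp",
    "industrycomp": "phonindustrycomp",
    "streetcomp": "phonstreetcomp",
}


def rewrite(query, phonetic, field_map=FIELD_MAP):
    """Add a phonetic alternative to a single 'scope:term' query."""
    scope, sep, term = query.partition(":")
    if not sep or scope not in field_map:
        # Not a mapped scope: leave the query untouched.
        return query
    return f"any({scope}:{term}, {field_map[scope]}:{phonetic(term)});"


def stub_phonetic(term):
    """Stand-in for the real transformation; value copied from the example."""
    return {"something": "2o5eti5"}[term]


if __name__ == "__main__":
    print(rewrite("namecomp:something", stub_phonetic))
```

Queries against scopes that are not listed in the map pass through unchanged, which matches the idea that only the configured fields get a phonetic alternative.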
After editing qtf_config.xml, issue this command to activate the changes:
view-admin -m refresh
Usage
Simply search like you would normally do. The query pipeline rewrites the query for you.