<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; Hans Terje Bakke</title>
	<atom:link href="http://blog.comperiosearch.com/blog/author/hbakke/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>FAST ESP Advanced Linguistics &#8212; Part 0/3</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics/#comments</comments>
		<pubDate>Wed, 25 Jun 2014 15:58:34 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2639</guid>
		<description><![CDATA[This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization. For the individual parts, see Part 1: Tokenization and Character Normalization Part 2: Phonetic Normalization Part 3: Lemmatization]]></description>
				<content:encoded><![CDATA[<p>This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.<br />
<span id="more-2639"></span></p>
<p>For the individual parts, see</p>
<div style="margin-left:30px">
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/">Part 1: Tokenization and Character Normalization</a><br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/">Part 2: Phonetic Normalization</a><br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/">Part 3: Lemmatization</a>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>FAST ESP Advanced Linguistics &#8212; Part 3/3</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/#comments</comments>
		<pubDate>Wed, 25 Jun 2014 15:58:07 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2707</guid>
		<description><![CDATA[This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization. For the individual parts, see Part 1: Tokenization and Character Normalization Part 2: Phonetic Normalization Part 3: Lemmatization Lemmatization Note: The following was done for a directory service in Norway and Sweden, hence the names [...]]]></description>
				<content:encoded><![CDATA[<p>This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.<br />
<span id="more-2707"></span></p>
<p>For the individual parts, see</p>
<div style="margin-left:30px">
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/">Part 1: Tokenization and Character Normalization</a><br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/">Part 2: Phonetic Normalization</a><br />
Part 3: Lemmatization
</div>
<h2>Lemmatization</h2>
<p>Note: The following was done for a directory service in Norway and Sweden, hence the names and language resources used in the examples.</p>
<h3>How it works (lemmatization by document expansion)</h3>
<p>Lemmatization is the process of identifying a word by its base form. (It is more advanced than &#8220;stemming&#8221;, which simply chops off endings.) Lemmatization by document expansion identifies a lemma during document/content processing and expands the field to include all expanded variants of the lemma. When lemmatization is turned on for a query, the query is rewritten to search a lemmatized shadow version of the composite field instead (named &#8220;lem&#8221; + the composite field name).</p>
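<p>As an illustration (the words are hypothetical examples, not taken from an actual automaton): for the Norwegian noun &#8220;bil&#8221; (car), document expansion could make the lemmatized shadow field contain the inflected forms as well:</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">name:    bil
lemname: bil bilen biler bilene</pre>
</div>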
<h3>Expansion during document processing</h3>
<p>The content processing pipeline&#8217;s Lemmatization stage is automatically configured to process index profile fields marked with the attribute <code>lemmatize="yes"</code>, and creates shadow variants of a composite marked with the attribute <code>lemmas="yes"</code>. The file <code>LemmatizationConfig.xml</code> specifies for which languages lemmatization should occur and that it is done by <code>document_expansion</code>.</p>
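<p>A sketch of what this could look like in the index profile (field and composite names are hypothetical, and the exact element syntax may vary between ESP versions):</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">&lt;field name="name" lemmatize="yes" ... /&gt;

&lt;composite-field name="namecomp" lemmas="yes" ...&gt;
    &lt;field-ref name="name" weight="100"/&gt;
&lt;/composite-field&gt;</pre>
</div>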
<p></p><pre class="crayon-plain-tag">&lt;standard_lemmatizer language="no" alt="nn,nb" mode="document_expansion" active="yes"&gt;
    &lt;lemmas active="yes" parts_of_speech="NA" /&gt;
&lt;/standard_lemmatizer&gt;

&lt;standard_lemmatizer language="sv" alt="se" mode="document_expansion" active="yes"&gt;
    &lt;lemmas active="yes" parts_of_speech="NA" /&gt;
&lt;/standard_lemmatizer&gt;</pre><p></p>
<p>The language is specified with an ISO 639 two-letter code, and other accepted codes are listed in the &#8220;alt&#8221; attribute. Language, parts_of_speech and mode together refer to the lemmatization automaton used for this language, residing in the lemmatization automaton directory:</p>
<div style="margin-left:30px">
<code>~/esp/etc/resources/dictionaries/lemmatization/no_NA_exp.aut</code>
</div>
<p>The pipeline stage <code>Lemmatizer(webcluster)</code> must follow after the stage <code>Tokenizer(webcluster)</code>.</p>
<h3>Query transformation</h3>
<p>To be able to do lemmatized searches by setting a query property (instead of simply prefixing the composite field name with &#8220;lem&#8221;), lemmatization must be enabled in the query pipeline. The query transformation configuration file is <code>qtf_config.xml</code>. The pipeline in use is <code>scopesearch</code>. It should contain the entry</p>
<div style="margin-left:30px">
<code>&lt;instance-ref name="lemmatizer"/&gt;</code>
</div>
<p>After editing <code>qtf_config.xml</code>, issue this command to activate the changes:</p>
<div style="margin-left:30px">
<code>view-admin -m refresh</code>
</div>
<h3>Usage</h3>
<p>The search must be done with the property <code>qtf_lemmatize=true</code> in order to search the lemmatized version automatically, without altering the query. Alternatively, the lemma-enabled composite field can be searched directly, without this property, by addressing it with the &#8220;lem&#8221; prefix.</p>
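<p>For example, with a hypothetical composite field &#8220;namecomp&#8221;, the following two searches should address the same lemmatized content:</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">namecomp:string("mysearchword")      with qtf_lemmatize=true
lemnamecomp:string("mysearchword")   without the property</pre>
</div>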
<p><u>Relevance note:</u></p>
<p>When searching with <code>qtf_lemmatize=true</code>, there is no way to differentiate the rank contribution of an exact hit from that of a hit in a lemmatized variant. Without the property set (or with it set to false), you can reduce the weight of the expanded version like this:</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">and(
    namecomp:string("mysearchword", weight=100)
    lemnamecomp:string("mysearchword", weight=10)
)</pre>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>FAST ESP Advanced Linguistics &#8212; Part 2/3</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/#comments</comments>
		<pubDate>Wed, 25 Jun 2014 15:40:13 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2688</guid>
		<description><![CDATA[This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization. For the individual parts, see Part 1: Tokenization and Character Normalization Part 2: Phonetic Normalization Part 3: Lemmatization Phonetic Normalization Note: The following was done for a directory service, hence the names in the examples [...]]]></description>
				<content:encoded><![CDATA[<p>This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.<br />
<span id="more-2688"></span></p>
<p>For the individual parts, see</p>
<div style="margin-left:30px">
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/">Part 1: Tokenization and Character Normalization</a><br />
Part 2: Phonetic Normalization<br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/">Part 3: Lemmatization</a>
</div>
<h2>Phonetic Normalization</h2>
<p>Note: The following was done for a directory service, hence the names in the examples used.</p>
<h3>Key info</h3>
<p><font size="-1"></p>
<table style="margin-left:30px">
<tr>
<td>index profile:</td>
<td><code>~/esp/index-profiles/index-profile.xml</code></td>
</tr>
<tr>
<td>query pipeline config:</td>
<td><code>~/esp/etc/config_data/QRServer/webcluster/etc/qrserver/qtf_config.xml</code></td>
</tr>
<tr>
<td>normalization config:</td>
<td><code>~/esp/etc/phonetic/my_phonetics.xml</code></td>
</tr>
<tr>
<td>content pipeline stages:</td>
<td><code>myPhoneticNormalizer (PhoneticNormalizer)</code></td>
</tr>
</table>
<p></font></p>
<h3>How it works</h3>
<p>Phonetic transformations are performed during content processing and stored in dedicated fields in the index. The query is transformed in the query transformation pipeline into phonetic versions and searched against the phonetic index fields.</p>
<p>How character combinations are transformed into phonetic codes is specified in the phonetic normalization config file (<code>my_phonetics.xml</code>).</p>
<h3>Index transformation</h3>
<p>Phonetic transformations are performed on selected fields from the index profile and stored in specified fields in the search index. Field mapping is specified in the content pipeline stage like this:</p>
<div style="margin-left:45px">
<code>default:etc/phonetic/my_phonetics.xml:name:phonname</code><br />
<code>default:etc/phonetic/my_phonetics.xml:industrynames:phonindustrynames</code><br />
<code>default:etc/phonetic/my_phonetics.xml:streetname:phonstreetname</code>
</div>
<p>The phonetic fields are included in certain composite fields in the index profile. These fields typically have low context weight.</p>
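<p>A sketch of how a phonetic field could be included in a composite with low context weight (hypothetical names and weights; the exact index profile syntax may vary):</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">&lt;composite-field name="namecomp"&gt;
    &lt;field-ref name="name" weight="100"/&gt;
    &lt;field-ref name="phonname" weight="10"/&gt;
&lt;/composite-field&gt;</pre>
</div>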
<p>(Note: The Phonetic Normalizer pipeline stage has to come before the Tokenizer and Lemmatizer, making it impossible to get phonetic transformations of the lemmatized variants for lemmatization by document expansion.)</p>
<h3>Query transformation</h3>
<p>The query transformation configuration file is <code>qtf_config.xml</code>. The pipeline in use is <code>scopesearch</code>. It should contain the entry</p>
<div style="margin-left:30px">
<code>&lt;instance-ref name="phoneticnormalizer"/&gt;</code>
</div>
<p>The instance configuration looks like this:</p>
<p></p><pre class="crayon-plain-tag">&lt;instance name="phoneticnormalizer" type="external" resource="qt_phonetic"&gt;
  &lt;parameter-list name="qt.phonetic"&gt;
    &lt;parameter name="enable" value="1"/&gt;
    &lt;parameter name="feedback_on_term_rejection" value="0"/&gt;
    &lt;parameter name="languages" value="no.10,sv.10"/&gt;
    &lt;parameter name="no.10.configuration" value="etc/phonetic/my_phonetics.xml"/&gt;
    &lt;parameter name="sv.10.configuration" value="etc/phonetic/my_phonetics.xml"/&gt;
    &lt;parameter name="fields" value="namecomp,industrycomp,streetcomp"/&gt;
    &lt;parameter name="namecomp.no.10.map" value="namecomp:phonnamecomp"/&gt;
    &lt;parameter name="namecomp.sv.10.map"	value="namecomp:phonnamecomp"/&gt;
    &lt;parameter name="industrycomp.no.10.map" value="industrycomp:phonindustrycomp"/&gt;
    &lt;parameter name="industrycomp.sv.10.map" value="industrycomp:phonindustrycomp"/&gt;
    &lt;parameter name="streetcomp.no.10.map" value="streetcomp:phonstreetcomp"/&gt;
    &lt;parameter name="streetcomp.sv.10.map" value="streetcomp:phonstreetcomp"/&gt;
    &lt;parameter name="default" value="no"/&gt;
    &lt;parameter name="defaultweight" value="100"/&gt;
  &lt;/parameter-list&gt;
&lt;/instance&gt;</pre><p></p>
<p>The "languages" are specified in iso-639 two-letter codes. Language value "no.10" is one of transformation, here with the "10" referring to using context fields of weight 10, but it could be any string (I think!). The "no.10.configuration" refers to the file with the phonetical normalization configuration. (There could for example be another for a "no.20", with less tolerance.)</p>
<p>"fields" state which query fields that are subject to phonetic transformations in the query pipeline (actually only which map name), and the "fieldname.language.strength.map" values state which query scope (field or composite field) should be mapped to which index scope.</p>
<p>Example with the above specification: A query of</p>
<div style="margin-left:30px">
<code>namecomp:something</code>
</div>
<p>would be rewritten to also search for the phonetic transformation "2o5eti5" in the mapped field, like this:</p>
<div style="margin-left:30px">
<code>any(namecomp:something, phonnamecomp:2o5eti5);</code>
</div>
<p>After editing <code>qtf_config.xml</code>, issue this command to activate the changes:</p>
<div style="margin-left:30px">
<code>view-admin -m refresh</code>
</div>
<h3>Usage</h3>
<p>Simply search like you would normally do. The query pipeline rewrites the query for you.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>FAST ESP Advanced Linguistics &#8212; Part 1/3</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/#comments</comments>
		<pubDate>Wed, 25 Jun 2014 14:57:21 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2643</guid>
		<description><![CDATA[This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization. Part 1: Tokenization and Character Normalization Part 2: Phonetic Normalization Part 3: Lemmatization Tokenization and Character Normalization Key info index profile: ~/esp/index-profiles/index-profile.xml config: ~/esp/etc/tokenizer/tokenization.xml content pipeline stages: Tokenizer(webcluster) (automatically generated) How it works Tokenization is [...]]]></description>
				<content:encoded><![CDATA[<p>This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.<br />
<span id="more-2643"></span></p>
<div style="margin-left:30px">
Part 1: Tokenization and Character Normalization<br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/">Part 2: Phonetic Normalization</a><br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/">Part 3: Lemmatization</a>
</div>
<h2>Tokenization and Character Normalization</h2>
<h3>Key info</h3>
<p><font size="-1"></p>
<table style="margin-left:30px">
<tr>
<td>index profile:</td>
<td><code>~/esp/index-profiles/index-profile.xml</code></td>
</tr>
<tr>
<td>config:</td>
<td><code>~/esp/etc/tokenizer/tokenization.xml</code></td>
</tr>
<tr>
<td>content pipeline stages:</td>
<td><code>Tokenizer(webcluster)</code>      (automatically generated)</td>
</tr>
</table>
<p></font></p>
<h3>How it works</h3>
<p>Tokenization is the process of splitting a string into searchable tokens. This is done both to create the index and to identify which parts of the query to consider a token when searching. (E.g. one tokenization could consider &#8220;13.2&#8221; one token, while another could consider it &#8220;13&#8221; and &#8220;2&#8221;, so that a search for &#8220;13&#8221; alone would produce a hit.)</p>
<p>Character Normalization is a transformation of certain characters into certain codes. (E.g. transforming &#8220;ø&#8221;, &#8220;ö&#8221; and &#8220;oe&#8221; into &#8220;ø&#8221;, so that both &#8220;ørjan&#8221; and &#8220;örjan&#8221; would produce hits for each other.) Character Normalization can be set up using content and query pipeline stages or through the tokenizer. It does not alter the available text in the index field. (I.e. name:&#8220;andré&#8221; still looks like that even though it is normalized and searchable as &#8220;andre&#8221;.)</p>
<h3>Configuration</h3>
<p>The character normalization is set up to do lowercasing, accent removal, and a few safe normalizations such as &#8220;ï&#8221; to &#8220;i&#8221;. Most notably, it normalizes &#8220;aa&#8221; into &#8220;å&#8221;. The earlier translation of Swedish into Norwegian characters has been removed.<br />
If we want this back, we will have to</p>
<div style="margin-left:30px">
<ol type="a">
<li>Set up the normalizations in the skip-set in tokenization.xml.</li>
<li>Alter all lemmatization dictionaries to consider the new normalizations and recompile them.</li>
<li>Alter the file <code>~/esp/etc/character_normalization.xml</code> accordingly. This is used by the normalizing completion matchers and the generation script for their automatons. Then regenerate all those automatons.</li>
</ol>
</div>
<h3>Document processing</h3>
<p>The content pipeline needs the stage <code>Tokenizer(webcluster)</code> in order to do tokenization and character normalization. (Lemmatization should come after this stage.)</p>
<p>Default tokenization is &#8220;delimiters&#8221;, but in order to activate the special tokenization in <code>tokenization.xml</code> (such as &#8220;aa&#8221; =&gt; &#8220;å&#8221;), one needs to set the attribute <code>tokenize="auto"</code> on each field in the index profile. The composite fields should also have <code>query-tokenize="auto"</code> set to activate the tokenizations set on each field.</p>
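<p>A sketch of the index profile attributes (hypothetical field names; the exact syntax may vary):</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">&lt;field name="name" tokenize="auto" ... /&gt;

&lt;composite-field name="namecomp" query-tokenize="auto" ...&gt;</pre>
</div>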
<p>Character normalization is carried out regardless of what kind of tokenization is specified.</p>
<h3>Query tokenization</h3>
<p>The query transformation configuration file is <code>qtf_config.xml</code>. The pipeline in use is <code>scopesearch</code>. It should contain the entry</p>
<div style="margin-left:30px">
<code>&lt;instance-ref name="tokenize" critical="1"/&gt;</code>
</div>
<p>After editing <code>qtf_config.xml</code>, issue this command to activate the changes:</p>
<div style="margin-left:30px">
<code>view-admin -m refresh</code>
</div>
<h3>Usage</h3>
<p>Simply search as normal. Note that punctuation marks are considered delimiters, so unless special tokenization is set up for certain fields (requiring at least non-default content pipeline Tokenizer stage(s)), a search for &#8220;13&#8221; against a string field containing &#8220;13.2&#8221; would require a field with <code>boundary-match="yes"</code> and an FQL query using the equals operator instead of &#8220;string&#8221;. (That is, unless you also want a hit in &#8220;13.1&#8221;.)</p>
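<p>For example, with a boundary-matched field (hypothetical name &#8220;version&#8221;), an exact match could be expressed in FQL as:</p>
<div style="margin-left:30px">
<code>version:equals("13")</code>
</div>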
<h3>Beware!</h3>
<p>Changes to the index profile are reflected in the document processing and query pipelines regarding which fields to apply the tokenization (etc) to. So make sure you do not overwrite these changes from deployment scripts (such as <code>espdeploy</code>) without first applying the changes to the deployment source tree.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Parent/Child Relationships for Document Security in Elasticsearch</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/18/using-parentchild-relationships-document-security-elasticsearch/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/18/using-parentchild-relationships-document-security-elasticsearch/#comments</comments>
		<pubDate>Wed, 18 Jun 2014 09:15:42 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2513</guid>
		<description><![CDATA[Problem description Our documents have a huge content part with text, images, title, author, etc. that rarely or never changes. Other metadata fields can change more often and are only used for filtering, not relevancy scoring. The example meta data I will use here is a &#8216;classification&#8217;. When the classification or other status changes, we [...]]]></description>
				<content:encoded><![CDATA[<h3>Problem description</h3>
<p>Our documents have a huge content part with text, images, title, author, etc. that rarely or never changes.</p>
<p>Other metadata fields can change more often and are only used for filtering, not relevancy scoring. The example meta data I will use here is a &#8216;classification&#8217;.</p>
<p>When the classification or other status changes, we would prefer not to have to refeed and reindex the entire document. That would cause Elasticsearch to invalidate the old version by marking it as deleted and to create a new version, using a lot of disk space and requiring a lot of unchanged data to be fed again.</p>
<h3>Solution</h3>
<p>One way to solve this is by utilising the new parent/child relationship feature that came with Elasticsearch 1.0. We can split a document in two parts: one containing the big static payload, and one containing the metadata we want to use for document security filtering. One will be the parent and one will be the child, and since we only have one of each in this scenario, it doesn&#8217;t matter which is which. I will choose the metadata as the parent and the main content part as the child in this example.</p>
<p>Say we have two confidentiality <em>classification</em> levels: &#8216;public&#8217; and &#8216;secret&#8217;, and our documents contain a <em>title</em> and a <em>body</em>. Let&#8217;s call the different parts of our document <em>meta</em> and <em>content</em>. What could have been one document with three fields</p><pre class="crayon-plain-tag">one_doc:&nbsp;classification, title, body</pre><p>will now be split into</p><pre class="crayon-plain-tag">meta:&nbsp;classification
content:&nbsp;title, body</pre><p>First we need to create an index and a mapping where we set up the parent/child relationship between our two document parts as two individual types. (Here using Marvel/Sense syntax for readability. You can always use cURL if you don&#8217;t have Sense.)</p><pre class="crayon-plain-tag">POST&nbsp;/dsdemo
{
    &quot;settings&quot;: {
        &quot;number_of_shards&quot;: 1,
        &quot;number_of_replicas&quot;: 0
    },
    &quot;mappings&quot;: {
        &quot;meta&quot;: {
            &quot;properties&quot;: {
                &quot;classification&quot;: { &quot;type&quot;: &quot;string&quot; }
            }
        },
        &quot;content&quot;: {
            &quot;_parent&quot;: { &quot;type&quot;: &quot;meta&quot; },
            &quot;properties&quot;: {
                &quot;title&quot;: { &quot;type&quot;: &quot;string&quot; },
                &quot;body&quot; : { &quot;type&quot;: &quot;string&quot; }
            }
        }
    }
}</pre><p>Now let&#8217;s create some of the main content documents. We will use the same ID for the <em>content</em> document part and the <em>meta</em> document part. (This is not a requirement; it simply makes it easier to keep track of the related parts.) The parent ID that we refer to here is the ID of the parent document, which is of the parent document type. These parent documents do not yet exist, and they don&#8217;t have to. So I create the content documents first, just to emphasise that point.</p><pre class="crayon-plain-tag">POST&nbsp;/dsdemo/content/1?parent=1
{
    &quot;title&quot;: &quot;The first document&quot;,
    &quot;body&quot;: &quot;This could be huge #1&quot;
}
POST&nbsp;/dsdemo/content/2?parent=2
{
    &quot;title&quot;: &quot;The second document&quot;,
    &quot;body&quot;: &quot;This could be huge #2&quot;
}
POST&nbsp;/dsdemo/content/3?parent=3
{
    &quot;title&quot;: &quot;The third document&quot;,
    &quot;body&quot;: &quot;This could be huge #3&quot;
}
POST&nbsp;/dsdemo/content/4?parent=4
{
    &quot;title&quot;: &quot;The fourth document&quot;,
    &quot;body&quot;: &quot;This could be huge #4&quot;
}</pre><p>Now let&#8217;s create the parent documents that hold the classification field. Start by setting them all to public:</p><pre class="crayon-plain-tag">POST&nbsp;/dsdemo/meta/1
{
    &quot;classification&quot;: &quot;public&quot;
}
POST&nbsp;/dsdemo/meta/2
{
    &quot;classification&quot;: &quot;public&quot;
}
POST&nbsp;/dsdemo/meta/3
{
    &quot;classification&quot;: &quot;public&quot;
}
POST&nbsp;/dsdemo/meta/4
{
    &quot;classification&quot;: &quot;public&quot;
}</pre><p>Then see what happens when we search for all &#8216;public&#8217; documents containing the word &#8216;huge&#8217;:</p><pre class="crayon-plain-tag">GET&nbsp;/dsdemo/content/_search?q=huge
{
    &quot;filter&quot;: {
        &quot;has_parent&quot;: {
            &quot;parent_type&quot;: &quot;meta&quot;,
            &quot;query&quot;: {
                &quot;term&quot;: {&quot;classification&quot;: &quot;public&quot;}
            }
        }
    }
}</pre><p>This reveals all four documents.</p>
<p>Now let&#8217;s change the classification for the first document:</p><pre class="crayon-plain-tag">POST&nbsp;/dsdemo/meta/1
{
    &quot;classification&quot;: &quot;secret&quot;
}</pre><p>Search again, and there should now be only three public documents.</p>
<h3>Closing notes</h3>
<ul>
<li>Use numeric/enumerated classification levels instead, so we can more easily return all documents at or below a certain level.</li>
<li>Another document security dimension could be an access control list consisting of user names or user groups.</li>
<li>It seems unnecessary that we have to specify &#8220;parent_type&#8221;: &#8220;meta&#8221; in the query, as this is already set up in the mapping. But if you searched not only <em>content</em> type documents but rather any document type, it would be needed.</li>
</ul>
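<p>As a sketch of the first closing note: with a numeric <code>level</code> field on the <em>meta</em> type instead of the string classification (the field name is hypothetical), the security filter could use a range query:</p><pre class="crayon-plain-tag">GET&nbsp;/dsdemo/content/_search?q=huge
{
    &quot;filter&quot;: {
        &quot;has_parent&quot;: {
            &quot;parent_type&quot;: &quot;meta&quot;,
            &quot;query&quot;: {
                &quot;range&quot;: { &quot;level&quot;: { &quot;lte&quot;: 1 } }
            }
        }
    }
}</pre>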
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/18/using-parentchild-relationships-document-security-elasticsearch/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fast ESP 5.3 index profile oddities</title>
		<link>http://blog.comperiosearch.com/blog/2010/11/10/fast-esp-5-3-index-profile-oddities/</link>
		<comments>http://blog.comperiosearch.com/blog/2010/11/10/fast-esp-5-3-index-profile-oddities/#comments</comments>
		<pubDate>Wed, 10 Nov 2010 15:27:42 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[fast esp]]></category>
		<category><![CDATA[relevance tuning]]></category>

		<guid isPermaLink="false">http://nuggets.comperiosearch.com/?p=85</guid>
		<description><![CDATA[The following mentions a few nuisances in the Fast ESP 5.3 index profile. Composite field context weight sums Weights are not relative to their containing element. I.e. the field-weights within the context part of a composite-rank do not sum up to a 100% which most people find intuitive. If you have two fields weighted 100 [...]]]></description>
				<content:encoded><![CDATA[<p>The following mentions a few nuisances in the Fast ESP 5.3 index profile.</p>
<p><strong>Composite field context weight sums<br />
</strong>Weights are not relative to their containing element. I.e. the field-weights within the context part of a composite-rank do not sum to the 100% that most people would find intuitive. If you have two fields both weighted 100 and get a hit in both, the hit in the composite becomes 200, instead of the 100 one could expect from a &#8220;100% hit&#8221;.</p>
<p><strong>The single-field-composite warning<br />
</strong>A search on non-composite fields does not generate dynamic rank. For this you will need to wrap the field in a composite field, which is perfectly OK. However, when you bliss (upload) an index profile, ESP will spew out warnings. These can be ignored. (But remember to add the composite field reference to the rank profile, as usual, or it will have no effect.)</p>
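<p>Such a wrapper could look like this in the index profile (hypothetical names; the exact syntax may vary):</p><pre class="crayon-plain-tag">&lt;composite-field name=&quot;titlecomp&quot;&gt;
    &lt;field-ref name=&quot;title&quot; weight=&quot;100&quot;/&gt;
&lt;/composite-field&gt;</pre>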
<p><strong>The context/occurrence oddity<br />
</strong>An often encountered problem is a composite with fields whose values contain the same tokens (words). A search will then get context hits in multiple fields and rank contributions from each. However, in most cases you would like a rank contribution only from the hit in the highest weighted field. I have tried to turn off whatever I could find in the config files, but have not been able to solve this problem. There are, however, a few ways around it.</p>
<p>One involves ripping out duplicate words from the lower weighted fields during document processing, but that makes those fields useless for other purposes.</p>
<p>Another solution involves splitting the fields into singular composite fields and rewriting the query to contain a lot of parts searching each field with the same words, and joining them with the ANY operator.</p>
<p>A third solution is to join the fields in question in a field-ref-group in the composite field. This will count multiple field hits as a single hit in the group and assign a rank contribution according to the field-ref-group&#8217;s weight. But you will no longer be able to assign individual weights to each field.</p>
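<p>A sketch of the field-ref-group approach (hypothetical names; the exact syntax may vary):</p><pre class="crayon-plain-tag">&lt;composite-field name=&quot;namecomp&quot;&gt;
    &lt;field-ref-group weight=&quot;100&quot;&gt;
        &lt;field-ref name=&quot;name&quot;/&gt;
        &lt;field-ref name=&quot;altname&quot;/&gt;
    &lt;/field-ref-group&gt;
&lt;/composite-field&gt;</pre>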
<p><strong>The default oddity<br />
</strong>One composite field must be tagged with</p><pre class="crayon-plain-tag">default=&quot;yes&quot;</pre><p>If you have no default composite field, ESP can start to act funny, such as swapping int32 and string types in the result(!)</p>
<p><strong>The quality oddity<br />
</strong>This refers to the static boost contribution to the result defined by the &#8220;quality&#8221; element of the rank-profile.</p>
<p>If the quality element is not present, the default weight is 50, and the default quality field is &#8220;hwboost&#8221;. This is a magic field that is hard coded and not defined in the index profile. However, try to specify</p><pre class="crayon-plain-tag">&lt;quality weight=&quot;50&quot; field-ref=&quot;hwboost&quot;/&gt;</pre><p>and you will get an error that the field is not defined. The field can seemingly be safely defined explicitly in the index profile as</p><pre class="crayon-plain-tag">&lt;field name=&quot;hwboost&quot; type=&quot;uint32&quot; index=&quot;yes&quot; sort=&quot;yes&quot;/&gt;</pre><p>The default start value of hwboost is 10000, and this can be added to or subtracted from during document processing.</p>
<p>The quality weight is limited to steps of 50 (0, 50, 100, 150, &#8230;). These values are actually transformed to multipliers 0, 1, 2, &#8230; So a weight value of &#8220;50&#8221; does not mean half or 50%. With a default hwboost, a weight of &#8220;50&#8221; transforms to multiplier 1, i.e. 1*10000=10000.</p>
<p>You can specify your own quality field. It must be of type uint32 (not int32!), with index and sort set to &#8220;yes&#8221;.</p>
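<p>A custom quality field could then be set up like this (hypothetical name &quot;myboost&quot;; the field definition mirrors the hwboost example above):</p><pre class="crayon-plain-tag">&lt;field name=&quot;myboost&quot; type=&quot;uint32&quot; index=&quot;yes&quot; sort=&quot;yes&quot;/&gt;

&lt;quality weight=&quot;50&quot; field-ref=&quot;myboost&quot;/&gt;</pre>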
<p>After changing the values, run</p><pre class="crayon-plain-tag">bliss-core -C index-profile.xml
view-admin -m refresh</pre><p>and then wait a minute or so for the views to refresh.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2010/11/10/fast-esp-5-3-index-profile-oddities/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>
