<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; Hans Terje Bakke</title>
	<atom:link href="http://blog.comperiosearch.com/blog/author/hbakke/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>FAST ESP Advanced Linguistics &#8212; Part 0/3</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics/#comments</comments>
		<pubDate>Wed, 25 Jun 2014 15:58:34 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2639</guid>
		<description><![CDATA[This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization. For the individual parts, see Part 1: Tokenization and Character Normalization Part 2: Phonetic Normalization Part 3: Lemmatization]]></description>
				<content:encoded><![CDATA[<p>This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.<br />
<span id="more-2639"></span></p>
<p>For the individual parts, see</p>
<div style="margin-left:30px">
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/">Part 1: Tokenization and Character Normalization</a><br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/">Part 2: Phonetic Normalization</a><br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/">Part 3: Lemmatization</a>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>FAST ESP Advanced Linguistics &#8212; Part 3/3</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/#comments</comments>
		<pubDate>Wed, 25 Jun 2014 15:58:07 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2707</guid>
		<description><![CDATA[This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization. For the individual parts, see Part 1: Tokenization and Character Normalization Part 2: Phonetic Normalization Part 3: Lemmatization Lemmatization Note: The following was done for a directory service in Norway and Sweden, hence the names [...]]]></description>
				<content:encoded><![CDATA[<p>This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.<br />
<span id="more-2707"></span></p>
<p>For the individual parts, see</p>
<div style="margin-left:30px">
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/">Part 1: Tokenization and Character Normalization</a><br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/">Part 2: Phonetic Normalization</a><br />
Part 3: Lemmatization
</div>
<h2>Lemmatization</h2>
<p>Note: The following was done for a directory service in Norway and Sweden, hence the names and language resources used in the examples.</p>
<h3>How it works (lemmatization by document expansion)</h3>
<p>Lemmatization is the process of identifying a word by its base form. (It is more advanced than &#8220;stemming&#8221;, which simply chops off endings.) Lemmatization by document expansion identifies a lemma during document/content processing and expands the field to include all expanded variants of the lemma. When lemmatization is turned on for a query, the query is rewritten to search a lemmatized shadow version of the composite field instead (named &#8220;lem&#8221; + the composite field name).</p>
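<p>As an illustration (the words are hypothetical examples, not taken from an actual automaton): for the Norwegian noun &#8220;bil&#8221; (car), document expansion could make the lemmatized shadow field contain the inflected forms as well:</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">name:    bil
lemname: bil bilen biler bilene</pre>
</div>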
<h3>Expansion during document processing</h3>
<p>The content processing pipeline&#8217;s Lemmatization stage is automatically configured to process index profile fields marked with the attribute <code>lemmatize="yes"</code>, and creates shadow variants of a composite marked with the attribute <code>lemmas="yes"</code>. The file <code>LemmatizationConfig.xml</code> specifies for which languages lemmatization should occur and that it is done by <code>document_expansion</code>.</p>
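<p>A sketch of what this could look like in the index profile (field and composite names are hypothetical, and the exact element syntax may vary between ESP versions):</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">&lt;field name="name" lemmatize="yes" ... /&gt;

&lt;composite-field name="namecomp" lemmas="yes" ...&gt;
    &lt;field-ref name="name" weight="100"/&gt;
&lt;/composite-field&gt;</pre>
</div>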
<p></p><pre class="crayon-plain-tag">&lt;standard_lemmatizer language="no" alt="nn,nb" mode="document_expansion" active="yes"&gt;
    &lt;lemmas active="yes" parts_of_speech="NA" /&gt;
&lt;/standard_lemmatizer&gt;

&lt;standard_lemmatizer language="sv" alt="se" mode="document_expansion" active="yes"&gt;
    &lt;lemmas active="yes" parts_of_speech="NA" /&gt;
&lt;/standard_lemmatizer&gt;</pre><p></p>
<p>The language is specified with an ISO 639 two-letter code, and other accepted codes are listed in the &#8220;alt&#8221; attribute. Language, parts_of_speech and mode together refer to the lemmatization automaton used for this language, residing in the lemmatization automaton directory:</p>
<div style="margin-left:30px">
<code>~/esp/etc/resources/dictionaries/lemmatization/no_NA_exp.aut</code>
</div>
<p>The pipeline stage <code>Lemmatizer(webcluster)</code> must follow after the stage <code>Tokenizer(webcluster)</code>.</p>
<h3>Query transformation</h3>
<p>To be able to do lemmatized searches by setting a query property (instead of simply prefixing the composite field name with &#8220;lem&#8221;), lemmatization must be enabled in the query pipeline. The query transformation configuration file is <code>qtf_config.xml</code>. The pipeline in use is <code>scopesearch</code>. It should contain the entry</p>
<div style="margin-left:30px">
<code>&lt;instance-ref name="lemmatizer"/&gt;</code>
</div>
<p>After editing <code>qtf_config.xml</code>, issue this command to activate the changes:</p>
<div style="margin-left:30px">
<code>view-admin -m refresh</code>
</div>
<h3>Usage</h3>
<p>The search must be done with the property <code>qtf_lemmatize=true</code> in order to search the lemmatized version automatically, without altering the query. Alternatively, the lemma-enabled composite field can be searched directly, without this property, by addressing it with the &#8220;lem&#8221; prefix.</p>
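<p>For example, with a hypothetical composite field &#8220;namecomp&#8221;, the following two searches should address the same lemmatized content:</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">namecomp:string("mysearchword")      with qtf_lemmatize=true
lemnamecomp:string("mysearchword")   without the property</pre>
</div>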
<p><u>Relevance note:</u></p>
<p>When searching with <code>qtf_lemmatize=true</code>, there is no way to differentiate the rank contribution of an exact hit from that of a hit in a lemmatized variant. Without the property set (or with it set to false), you can reduce the weight of the expanded version like this:</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">and(
    namecomp:string("mysearchword", weight=100)
    lemnamecomp:string("mysearchword", weight=10)
)</pre>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>FAST ESP Advanced Linguistics &#8212; Part 2/3</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/#comments</comments>
		<pubDate>Wed, 25 Jun 2014 15:40:13 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2688</guid>
		<description><![CDATA[This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization. For the individual parts, see Part 1: Tokenization and Character Normalization Part 2: Phonetic Normalization Part 3: Lemmatization Phonetic Normalization Note: The following was done for a directory service, hence the names in the examples [...]]]></description>
				<content:encoded><![CDATA[<p>This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.<br />
<span id="more-2688"></span></p>
<p>For the individual parts, see</p>
<div style="margin-left:30px">
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/">Part 1: Tokenization and Character Normalization</a><br />
Part 2: Phonetic Normalization<br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/">Part 3: Lemmatization</a>
</div>
<h2>Phonetic Normalization</h2>
<p>Note: The following was done for a directory service, hence the names in the examples used.</p>
<h3>Key info</h3>
<p><font size="-1"></p>
<table style="margin-left:30px">
<tr>
<td>index profile:</td>
<td><code>~/esp/index-profiles/index-profile.xml</code></td>
</tr>
<tr>
<td>query pipeline config:</td>
<td><code>~/esp/etc/config_data/QRServer/webcluster/etc/qrserver/qtf_config.xml</code></td>
</tr>
<tr>
<td>normalization config:</td>
<td><code>~/esp/etc/phonetic/my_phonetics.xml</code></td>
</tr>
<tr>
<td>content pipeline stages:</td>
<td><code>myPhoneticNormalizer (PhoneticNormalizer)</code></td>
</tr>
</table>
<p></font></p>
<h3>How it works</h3>
<p>Phonetic transformations are performed during content processing and stored in dedicated fields in the index. The query is transformed in the query transformation pipeline into phonetic versions and searched against the phonetic index fields.</p>
<p>How character combinations are transformed into phonetic codes is specified in the phonetic normalization config file (<code>my_phonetics.xml</code>).</p>
<h3>Index transformation</h3>
<p>Phonetic transformations are performed on selected fields from the index profile and stored in specified fields in the search index. Field mapping is specified in the content pipeline stage like this:</p>
<div style="margin-left:45px">
<code>default:etc/phonetic/my_phonetics.xml:name:phonname</code><br />
<code>default:etc/phonetic/my_phonetics.xml:industrynames:phonindustrynames</code><br />
<code>default:etc/phonetic/my_phonetics.xml:streetname:phonstreetname</code>
</div>
<p>The phonetic fields are included in certain composite fields in the index profile. These fields typically have low context weight.</p>
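<p>A sketch of how a phonetic field could be included in a composite with low context weight (hypothetical names and weights; the exact index profile syntax may vary):</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">&lt;composite-field name="namecomp"&gt;
    &lt;field-ref name="name" weight="100"/&gt;
    &lt;field-ref name="phonname" weight="10"/&gt;
&lt;/composite-field&gt;</pre>
</div>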
<p>(Note: The Phonetic Normalizer pipeline stage has to come before the Tokenizer and Lemmatizer, making it impossible to get phonetic transformations of the lemmatized variants for lemmatization by document expansion.)</p>
<h3>Query transformation</h3>
<p>The query transformation configuration file is <code>qtf_config.xml</code>. The pipeline in use is <code>scopesearch</code>. It should contain the entry</p>
<div style="margin-left:30px">
<code>&lt;instance-ref name="phoneticnormalizer"/&gt;</code>
</div>
<p>The instance configuration looks like this:</p>
<p></p><pre class="crayon-plain-tag">&lt;instance name="phoneticnormalizer" type="external" resource="qt_phonetic"&gt;
  &lt;parameter-list name="qt.phonetic"&gt;
    &lt;parameter name="enable" value="1"/&gt;
    &lt;parameter name="feedback_on_term_rejection" value="0"/&gt;
    &lt;parameter name="languages" value="no.10,sv.10"/&gt;
    &lt;parameter name="no.10.configuration" value="etc/phonetic/my_phonetics.xml"/&gt;
    &lt;parameter name="sv.10.configuration" value="etc/phonetic/my_phonetics.xml"/&gt;
    &lt;parameter name="fields" value="namecomp,industrycomp,streetcomp"/&gt;
    &lt;parameter name="namecomp.no.10.map" value="namecomp:phonnamecomp"/&gt;
    &lt;parameter name="namecomp.sv.10.map"	value="namecomp:phonnamecomp"/&gt;
    &lt;parameter name="industrycomp.no.10.map" value="industrycomp:phonindustrycomp"/&gt;
    &lt;parameter name="industrycomp.sv.10.map" value="industrycomp:phonindustrycomp"/&gt;
    &lt;parameter name="streetcomp.no.10.map" value="streetcomp:phonstreetcomp"/&gt;
    &lt;parameter name="streetcomp.sv.10.map" value="streetcomp:phonstreetcomp"/&gt;
    &lt;parameter name="default" value="no"/&gt;
    &lt;parameter name="defaultweight" value="100"/&gt;
  &lt;/parameter-list&gt;
&lt;/instance&gt;</pre><p></p>
<p>The "languages" are specified in iso-639 two-letter codes. Language value "no.10" is one of transformation, here with the "10" referring to using context fields of weight 10, but it could be any string (I think!). The "no.10.configuration" refers to the file with the phonetical normalization configuration. (There could for example be another for a "no.20", with less tolerance.)</p>
<p>"fields" state which query fields that are subject to phonetic transformations in the query pipeline (actually only which map name), and the "fieldname.language.strength.map" values state which query scope (field or composite field) should be mapped to which index scope.</p>
<p>Example with the above specification: A query of</p>
<div style="margin-left:30px">
<code>namecomp:something</code>
</div>
<p>would be rewritten to also search for the phonetic transformation "2o5eti5" in the mapped field, like this:</p>
<div style="margin-left:30px">
<code>any(namecomp:something, phonnamecomp:2o5eti5);</code>
</div>
<p>After editing <code>qtf_config.xml</code>, issue this command to activate the changes:</p>
<div style="margin-left:30px">
<code>view-admin -m refresh</code>
</div>
<h3>Usage</h3>
<p>Simply search like you would normally do. The query pipeline rewrites the query for you.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>FAST ESP Advanced Linguistics &#8212; Part 1/3</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/#comments</comments>
		<pubDate>Wed, 25 Jun 2014 14:57:21 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2643</guid>
		<description><![CDATA[This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization. Part 1: Tokenization and Character Normalization Part 2: Phonetic Normalization Part 3: Lemmatization Tokenization and Character Normalization Key info index profile: ~/esp/index-profiles/index-profile.xml config: ~/esp/etc/tokenizer/tokenization.xml content pipeline stages: Tokenizer(webcluster) (automatically generated) How it works Tokenization is [...]]]></description>
				<content:encoded><![CDATA[<p>This series of blog posts covers how to set up FAST ESP special tokenization, character normalization, phonetic normalization and lemmatization.<br />
<span id="more-2643"></span></p>
<div style="margin-left:30px">
Part 1: Tokenization and Character Normalization<br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-2/">Part 2: Phonetic Normalization</a><br />
<a href="http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-3/">Part 3: Lemmatization</a>
</div>
<h2>Tokenization and Character Normalization</h2>
<h3>Key info</h3>
<p><font size="-1"></p>
<table style="margin-left:30px">
<tr>
<td>index profile:</td>
<td><code>~/esp/index-profiles/index-profile.xml</code></td>
</tr>
<tr>
<td>config:</td>
<td><code>~/esp/etc/tokenizer/tokenization.xml</code></td>
</tr>
<tr>
<td>content pipeline stages:</td>
<td><code>Tokenizer(webcluster)</code>      (automatically generated)</td>
</tr>
</table>
<p></font></p>
<h3>How it works</h3>
<p>Tokenization is the process of splitting a string into searchable tokens. This is done both to create the index and to identify which parts of the query to consider a token when searching. (E.g. one tokenization could consider &#8220;13.2&#8221; one token, while another could consider it &#8220;13&#8221; and &#8220;2&#8221;, so that a search for &#8220;13&#8221; alone would produce a hit.)</p>
<p>Character Normalization is a transformation of certain characters into certain codes. (E.g. transforming &#8220;ø&#8221;, &#8220;ö&#8221; and &#8220;oe&#8221; into &#8220;ø&#8221;, so that both &#8220;ørjan&#8221; and &#8220;örjan&#8221; would produce hits for each other.) Character Normalization can be set up using content and query pipeline stages or through the tokenizer. It does not alter the available text in the index field. (I.e. name:&#8220;andré&#8221; still looks like that even though it is normalized and searchable as &#8220;andre&#8221;.)</p>
<h3>Configuration</h3>
<p>The character normalization is set up to do lowercasing, accent removal, and a few safe normalizations such as &#8220;ï&#8221; to &#8220;i&#8221;. Most notably, it normalizes &#8220;aa&#8221; into &#8220;å&#8221;. The earlier translation of Swedish into Norwegian characters has been removed.<br />
If we want this back, we will have to</p>
<div style="margin-left:30px">
<ol type="a">
<li>Set up the normalizations in the skip-set in tokenization.xml.</li>
<li>Alter all lemmatization dictionaries to consider the new normalizations and recompile them.</li>
<li>Alter the file <code>~/esp/etc/character_normalization.xml</code> accordingly. This is used by the normalizing completion matchers and the generation script for their automatons. Then regenerate all those automatons.</li>
</ol>
</div>
<h3>Document processing</h3>
<p>The content pipeline needs the stage <code>Tokenizer(webcluster)</code> in order to do tokenization and character normalization. (Lemmatization should come after this stage.)</p>
<p>Default tokenization is &#8220;delimiters&#8221;, but in order to activate the special tokenization in <code>tokenization.xml</code> (such as &#8220;aa&#8221; =&gt; &#8220;å&#8221;), one needs to set the attribute <code>tokenize="auto"</code> on each field in the index profile. The composite fields should also have <code>query-tokenize="auto"</code> set to activate the tokenizations set on each field.</p>
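<p>A sketch of the index profile attributes (hypothetical field names; the exact syntax may vary):</p>
<div style="margin-left:30px">
<pre class="crayon-plain-tag">&lt;field name="name" tokenize="auto" ... /&gt;

&lt;composite-field name="namecomp" query-tokenize="auto" ...&gt;</pre>
</div>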
<p>Character normalization is carried out regardless of what kind of tokenization is specified.</p>
<h3>Query tokenization</h3>
<p>The query transformation configuration file is <code>qtf_config.xml</code>. The pipeline in use is <code>scopesearch</code>. It should contain the entry</p>
<div style="margin-left:30px">
<code>&lt;instance-ref name="tokenize" critical="1"/&gt;</code>
</div>
<p>After editing <code>qtf_config.xml</code>, issue this command to activate the changes:</p>
<div style="margin-left:30px">
<code>view-admin -m refresh</code>
</div>
<h3>Usage</h3>
<p>Simply search as normal. Note that punctuation marks are considered delimiters, so unless special tokenization is set up for certain fields (requiring at least non-default content pipeline Tokenizer stage(s)), a search for &#8220;13&#8221; against a string field containing &#8220;13.2&#8221; would require a field with <code>boundary-match="yes"</code> and an FQL query using the equals operator instead of &#8220;string&#8221;. (That is, unless you also want a hit in &#8220;13.1&#8221;.)</p>
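<p>For example, with a boundary-matched field (hypothetical name &#8220;version&#8221;), an exact match could be expressed in FQL as:</p>
<div style="margin-left:30px">
<code>version:equals("13")</code>
</div>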
<h3>Beware!</h3>
<p>Changes to the index profile are reflected in the document processing and query pipelines regarding which fields to apply the tokenization (etc) to. So make sure you do not overwrite these changes from deployment scripts (such as <code>espdeploy</code>) without first applying the changes to the deployment source tree.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/25/fast-esp-advanced-linguistics-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using Parent/Child Relationships for Document Security in Elasticsearch</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/18/using-parentchild-relationships-document-security-elasticsearch/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/18/using-parentchild-relationships-document-security-elasticsearch/#comments</comments>
		<pubDate>Wed, 18 Jun 2014 09:15:42 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2513</guid>
		<description><![CDATA[Problem description Our documents have a huge content part with text, images, title, author, etc. that rarely or never changes. Other metadata fields can change more often and are only used for filtering, not relevancy scoring. The example meta data I will use here is a &#8216;classification&#8217;. When the classification or other status changes, we [...]]]></description>
				<content:encoded><![CDATA[<h3>Problem description</h3>
<p>Our documents have a huge content part with text, images, title, author, etc. that rarely or never changes.</p>
<p>Other metadata fields can change more often and are only used for filtering, not relevancy scoring. The example meta data I will use here is a &#8216;classification&#8217;.</p>
<p>When the classification or other status changes, we would prefer not to have to refeed and reindex the entire document. That would cause Elasticsearch to invalidate the old version by marking it as deleted and to create a new version, using a lot of disk space and requiring a lot of unchanged data to be fed again.</p>
<h3>Solution</h3>
<p>One way to solve this is by utilising the new parent/child relationship feature that came with Elasticsearch 1.0. We can split a document in two parts: one containing the big static payload, and one containing the metadata we want to use for document security filtering. One will be the parent and one will be the child, and since we only have one of each in this scenario, it doesn&#8217;t matter which is which. I will choose the metadata as the parent and the main content part as the child in this example.</p>
<p>Say we have two confidentiality <em>classification</em> levels: &#8216;public&#8217; and &#8216;secret&#8217;, and our documents contain a <em>title</em> and a <em>body</em>. Let&#8217;s call the different parts of our document <em>meta</em> and <em>content</em>. What could have been one document with three fields</p><pre class="crayon-plain-tag">one_doc:&nbsp;classification, title, body</pre><p>will now be split into</p><pre class="crayon-plain-tag">meta:&nbsp;classification
content:&nbsp;title, body</pre><p>First we need to create an index and a mapping where we set up the parent/child relationship between our two document parts as two individual types. (Here using Marvel/Sense syntax for readability. You can always use cURL if you don&#8217;t have Sense.)</p><pre class="crayon-plain-tag">POST&nbsp;/dsdemo
{
    &quot;settings&quot;: {
        &quot;number_of_shards&quot;: 1,
        &quot;number_of_replicas&quot;: 0
    },
    &quot;mappings&quot;: {
        &quot;meta&quot;: {
            &quot;properties&quot;: {
                &quot;classification&quot;: { &quot;type&quot;: &quot;string&quot; }
            }
        },
        &quot;content&quot;: {
            &quot;_parent&quot;: { &quot;type&quot;: &quot;meta&quot; },
            &quot;properties&quot;: {
                &quot;title&quot;: { &quot;type&quot;: &quot;string&quot; },
                &quot;body&quot; : { &quot;type&quot;: &quot;string&quot; }
            }
        }
    }
}</pre><p>Now let&#8217;s create some of the main content documents. We will use the same ID for the <em>content</em> document part and the <em>meta</em> document part. (This is not a requirement; it simply makes it easier to keep track of the related parts.) The parent ID that we refer to here is the ID of the parent document, which is of the parent document type. These parent documents do not yet exist, and they don&#8217;t have to. So I create the content documents first, just to emphasise that point.</p><pre class="crayon-plain-tag">POST&nbsp;/dsdemo/content/1?parent=1
{
    &quot;title&quot;: &quot;The first document&quot;,
    &quot;body&quot;: &quot;This could be huge #1&quot;
}
POST&nbsp;/dsdemo/content/2?parent=2
{
    &quot;title&quot;: &quot;The second document&quot;,
    &quot;body&quot;: &quot;This could be huge #2&quot;
}
POST&nbsp;/dsdemo/content/3?parent=3
{
    &quot;title&quot;: &quot;The third document&quot;,
    &quot;body&quot;: &quot;This could be huge #3&quot;
}
POST&nbsp;/dsdemo/content/4?parent=4
{
    &quot;title&quot;: &quot;The fourth document&quot;,
    &quot;body&quot;: &quot;This could be huge #4&quot;
}</pre><p>Now let&#8217;s create the parent documents that hold the classification field. Start by setting them all to public:</p><pre class="crayon-plain-tag">POST&nbsp;/dsdemo/meta/1
{
    &quot;classification&quot;: &quot;public&quot;
}
POST&nbsp;/dsdemo/meta/2
{
    &quot;classification&quot;: &quot;public&quot;
}
POST&nbsp;/dsdemo/meta/3
{
    &quot;classification&quot;: &quot;public&quot;
}
POST&nbsp;/dsdemo/meta/4
{
    &quot;classification&quot;: &quot;public&quot;
}</pre><p>Then see what happens when we search for all &#8216;public&#8217; documents containing the word &#8216;huge&#8217;:</p><pre class="crayon-plain-tag">GET&nbsp;/dsdemo/content/_search?q=huge
{
    &quot;filter&quot;: {
        &quot;has_parent&quot;: {
            &quot;parent_type&quot;: &quot;meta&quot;,
            &quot;query&quot;: {
                &quot;term&quot;: {&quot;classification&quot;: &quot;public&quot;}
            }
        }
    }
}</pre><p>This reveals all four documents.</p>
<p>Now let&#8217;s change the classification for the first document:</p><pre class="crayon-plain-tag">POST&nbsp;/dsdemo/meta/1
{
    &quot;classification&quot;: &quot;secret&quot;
}</pre><p>Search again, and there should now be only three public documents.</p>
<h3>Closing notes</h3>
<ul>
<li>Use numeric/enumerated classification levels instead, so we can more easily return all documents at or below a certain level.</li>
<li>Another document security dimension could be an access control list consisting of user names or user groups.</li>
<li>It seems unnecessary that we have to specify &#8220;parent_type&#8221;: &#8220;meta&#8221; in the query, as this is already set up in the mapping. But if you searched not only <em>content</em> type documents but rather any document type, it would be needed.</li>
</ul>
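<p>As a sketch of the first closing note: with a numeric <code>level</code> field on the <em>meta</em> type instead of the string classification (the field name is hypothetical), the security filter could use a range query:</p><pre class="crayon-plain-tag">GET&nbsp;/dsdemo/content/_search?q=huge
{
    &quot;filter&quot;: {
        &quot;has_parent&quot;: {
            &quot;parent_type&quot;: &quot;meta&quot;,
            &quot;query&quot;: {
                &quot;range&quot;: { &quot;level&quot;: { &quot;lte&quot;: 1 } }
            }
        }
    }
}</pre>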
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/18/using-parentchild-relationships-document-security-elasticsearch/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Fast ESP 5.3 index profile oddities</title>
		<link>http://blog.comperiosearch.com/blog/2010/11/10/fast-esp-5-3-index-profile-oddities/</link>
		<comments>http://blog.comperiosearch.com/blog/2010/11/10/fast-esp-5-3-index-profile-oddities/#comments</comments>
		<pubDate>Wed, 10 Nov 2010 15:27:42 +0000</pubDate>
		<dc:creator><![CDATA[Hans Terje Bakke]]></dc:creator>
				<category><![CDATA[Technology]]></category>
		<category><![CDATA[fast esp]]></category>
		<category><![CDATA[relevance tuning]]></category>

		<guid isPermaLink="false">http://nuggets.comperiosearch.com/?p=85</guid>
		<description><![CDATA[The following mentions a few nuisances in the Fast ESP 5.3 index profile. Composite field context weight sums Weights are not relative to their containing element. I.e. the field-weights within the context part of a composite-rank do not sum up to a 100% which most people find intuitive. If you have two fields weighted 100 [...]]]></description>
				<content:encoded><![CDATA[<p>The following mentions a few nuisances in the Fast ESP 5.3 index profile.</p>
<p><strong>Composite field context weight sums<br />
</strong>Weights are not relative to their containing element. I.e. the field-weights within the context part of a composite-rank do not sum to the 100% that most people would find intuitive. If you have two fields both weighted 100 and get a hit in both, the hit in the composite becomes 200, instead of the 100 one could expect from a &#8220;100% hit&#8221;.</p>
<p><strong>The single-field-composite warning<br />
</strong>A search on non-composite fields does not generate dynamic rank. For this you will need to wrap the field in a composite field, which is perfectly OK. However, when you bliss (upload) an index profile, ESP will spew out warnings. These can be ignored. (But remember to add the composite field reference to the rank profile, as usual, or it will have no effect.)</p>
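<p>Such a wrapper could look like this in the index profile (hypothetical names; the exact syntax may vary):</p><pre class="crayon-plain-tag">&lt;composite-field name=&quot;titlecomp&quot;&gt;
    &lt;field-ref name=&quot;title&quot; weight=&quot;100&quot;/&gt;
&lt;/composite-field&gt;</pre>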
<p><strong>The context/occurrence oddity<br />
</strong>An often encountered problem is a composite with fields whose values contain the same tokens (words). A search will then get context hits in multiple fields and rank contributions from each. However, in most cases you would like a rank contribution only from the hit in the highest weighted field. I have tried to turn off whatever I could find in the config files, but have not been able to solve this problem. There are, however, a few ways around it.</p>
<p>One involves ripping out duplicate words from the lower weighted fields during document processing, but that makes those fields useless for other purposes.</p>
<p>Another solution involves splitting the fields into singular composite fields and rewriting the query to contain a lot of parts searching each field with the same words, and joining them with the ANY operator.</p>
<p>A third solution is to join the fields in question in a field-ref-group in the composite field. This will count multiple field hits as a single hit in the group and assign a rank contribution according to the field-ref-group&#8217;s weight. But you will no longer be able to assign individual weights to each field.</p>
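<p>A sketch of the field-ref-group approach (hypothetical names; the exact syntax may vary):</p><pre class="crayon-plain-tag">&lt;composite-field name=&quot;namecomp&quot;&gt;
    &lt;field-ref-group weight=&quot;100&quot;&gt;
        &lt;field-ref name=&quot;name&quot;/&gt;
        &lt;field-ref name=&quot;altname&quot;/&gt;
    &lt;/field-ref-group&gt;
&lt;/composite-field&gt;</pre>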
<p><strong>The default oddity<br />
</strong>One composite field must be tagged with</p><pre class="crayon-plain-tag">default=&quot;yes&quot;</pre><p>If you have no default composite field, ESP can start to act funny, such as swapping int32 and string types in the result(!)</p>
<p><strong>The quality oddity<br />
</strong>This refers to the static boost contribution to the result defined by the &#8220;quality&#8221; element of the rank-profile.</p>
<p>If the quality element is not present, the default weight is 50, and the default quality field is &#8220;hwboost&#8221;. This is a magic field that is hard coded and not defined in the index profile. However, try to specify</p><pre class="crayon-plain-tag">&lt;quality weight=&quot;50&quot; field-ref=&quot;hwboost&quot;/&gt;</pre><p>and you will get an error that the field is not defined. The field can seemingly be safely defined explicitly in the index profile as</p><pre class="crayon-plain-tag">&lt;field name=&quot;hwboost&quot; type=&quot;uint32&quot; index=&quot;yes&quot; sort=&quot;yes&quot;/&gt;</pre><p>The default start value of hwboost is 10000, and this can be added to or subtracted from during document processing.</p>
<p>The quality weight is limited to steps of 50 (0, 50, 100, 150, &#8230;). These values are actually transformed to multipliers 0, 1, 2, &#8230; So a weight value of &#8220;50&#8221; does not mean half or 50%. With a default hwboost, a weight of &#8220;50&#8221; transforms to multiplier 1, i.e. 1*10000=10000.</p>
<p>You can specify your own quality field. It must be of type uint32 (not int32!), with index and sort set to &#8220;yes&#8221;.</p>
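<p>A custom quality field could then be set up like this (hypothetical name &quot;myboost&quot;; the field definition mirrors the hwboost example above):</p><pre class="crayon-plain-tag">&lt;field name=&quot;myboost&quot; type=&quot;uint32&quot; index=&quot;yes&quot; sort=&quot;yes&quot;/&gt;

&lt;quality weight=&quot;50&quot; field-ref=&quot;myboost&quot;/&gt;</pre>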
<p>After changing the values, run</p><pre class="crayon-plain-tag">bliss-core -C index-profile.xml
view-admin -m refresh</pre><p>and then wait a minute or so for the views to refresh.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2010/11/10/fast-esp-5-3-index-profile-oddities/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>
