<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; analyzers</title>
	<atom:link href="http://blog.comperiosearch.com/blog/tag/analyzers/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Searching for &#8220;miljø&#8221; inside of &#8220;arbeidsmiljø&#8221; using Elasticsearch and the ngram tokenizer</title>
		<link>http://blog.comperiosearch.com/blog/2014/06/11/searching-miljo-inside-arbeidsmiljo-using-elasticsearch-ngram-tokenizer/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/06/11/searching-miljo-inside-arbeidsmiljo-using-elasticsearch-ngram-tokenizer/#comments</comments>
		<pubDate>Tue, 10 Jun 2014 22:59:58 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[analyzers]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[language]]></category>
		<category><![CDATA[ngram]]></category>
		<category><![CDATA[norwegian]]></category>
		<category><![CDATA[underbukser]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2495</guid>
		<description><![CDATA[Compound words are a big problem for Norwegians. Young people don&#8217;t know how to use them, and search engines struggle with them as well. Elasticsearch and its ngram tokenizer offer one possible solution. There is a Facebook group dedicated to spreading this knowledge, using images that show the difference between, for instance, &#8220;underbukser&#8221; (underwear) and &#8220;under [...]]]></description>
				<content:encoded><![CDATA[<p>Compound words are a big problem for Norwegians. Young people don&#8217;t know how to use them, and search engines struggle with them as well. Elasticsearch and its <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html">ngram tokenizer</a> offer one possible solution.<br />
<span id="more-2495"></span></p>
<p>There is a <a href="https://www.facebook.com/photo.php?fbid=687044034678144&amp;set=a.499093253473224.1073741826.499091746806708&amp;type=1&amp;theater">Facebook</a> group dedicated to spreading this knowledge, using images that show the difference between, for instance, &#8220;underbukser&#8221; (underwear) and &#8220;under bukser&#8221; (positioned below trousers).</p>
<div style="width: 301px" class="wp-caption alignright"><a href="https://www.facebook.com/photo.php?fbid=687044034678144&amp;set=a.499093253473224.1073741826.499091746806708&amp;type=1&amp;theater"><img src="https://fbcdn-sphotos-a-a.akamaihd.net/hphotos-ak-xpa1/t1.0-9/q71/s720x720/10294335_687044034678144_6234252359337064096_n.jpg" alt="" width="291" height="206" /></a><p class="wp-caption-text">Underwear  or under wear. Not the same thing! <br />Photo: André Ulveseter</p></div>
<p>Elasticsearch offers a wide range of <a href="http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis.html">analysis options</a>. The <a href="http://en.wikipedia.org/wiki/N-gram">ngram</a> tokenizer splits a string into overlapping sequences of contiguous characters. For instance, with an ngram size of two, &#8220;underbukser&#8221; is split into &#8220;un&#8221;, &#8220;nd&#8221;, &#8220;de&#8221;, &#8220;er&#8221;, &#8220;rb&#8221;, &#8220;bu&#8221;, &#8220;uk&#8221;, &#8220;ks&#8221;, &#8220;se&#8221;, &#8220;er&#8221;. By default, Elasticsearch will <a href="http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/analysis-intro.html">use the same analyzer</a> when querying the field, so if we search for &#8220;bukser&#8221; the query is split into &#8220;bu&#8221;, &#8220;uk&#8221;, and so on, and these ngrams match the ones stored for &#8220;underbukser&#8221;.</p>
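<p>To make the splitting concrete, here is a minimal sketch in plain Python. It is not Elasticsearch code: the helper name char_ngrams is made up for this illustration, and it only mimics the size-two case described above.</p>

```python
def char_ngrams(text, n):
    """Return every contiguous substring of length n, in order of position."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Bigrams stored for the indexed word.
indexed = char_ngrams("underbukser", 2)

# Bigrams produced when the same splitting is applied to the query.
query = char_ngrams("bukser", 2)

# Query bigrams that also occur among the indexed bigrams.
shared = [gram for gram in query if gram in indexed]

print(indexed)
print(shared)
```

<p>Every bigram of the query also occurs among the bigrams of the indexed word, which is why the search for &#8220;bukser&#8221; finds &#8220;underbukser&#8221;.</p>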
<p>Well, enough chit chat, time for some code. Using the excellent <a href="https://found.no/play/">Play</a> tool created by the Elasticsearch experts at found.no, we can even test it all out in the browser. There is no need for a server: you can do it at home on your iPad, Chromebook, or even a Windows Phone.</p>
<div id="attachment_2508" style="width: 212px" class="wp-caption alignright"><img class="wp-image-2508 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2014/06/myAnalyzer-in-Action-202x300.png" alt="myAnalyzer in Action" width="202" height="300" /><p class="wp-caption-text">myAnalyzer in Action, showing how the word is split into ngrams</p></div>
<p>Here is <a href="https://found.no/play/gist/4ec22e1e67c9c5f9bcc0#">a simple demo showing</a> the ngram tokenizer in action.<br />
The demo has three documents indexed, each containing the field &#8220;foo&#8221;, with the values &#8220;arbeidsmiljø&#8221;, &#8220;arbeid&#8221;, and &#8220;arbeidsmiljøloven&#8221; respectively.<br />
It uses an analyzer aptly called &#8220;myAnalyzer&#8221;, which uses a custom tokenizer called &#8220;my_toknizer&#8221; where the actual ngram tokenization takes place. The ngrams for this sample are created with sizes ranging from 2 to 3. Testing it out on found.no/play, you can see how the various stages of the analyzer modify the text. Neat!</p>
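<p>The analyzer setup described above could look roughly like the following sketch, written as a Python dictionary that serializes to the JSON settings body. The names &#8220;myAnalyzer&#8221; and &#8220;my_toknizer&#8221; and the sizes 2 and 3 come from the text; the rest is an assumption and not a copy of the demo gist. The helper ngrams_in_range is a hypothetical plain-Python approximation of what the tokenizer emits.</p>

```python
import json

# Assumed index settings. The tokenizer type was spelled "nGram" in the
# documentation of that era; newer Elasticsearch releases spell it "ngram".
settings = {
    "analysis": {
        "analyzer": {
            "myAnalyzer": {"type": "custom", "tokenizer": "my_toknizer"}
        },
        "tokenizer": {
            "my_toknizer": {"type": "nGram", "min_gram": 2, "max_gram": 3}
        },
    }
}


def ngrams_in_range(text, min_gram, max_gram):
    """Approximate the tokenizer: all substrings with a length in the given range."""
    grams = []
    for start in range(len(text)):
        longest = min(max_gram, len(text) - start)
        for size in range(min_gram, longest + 1):
            grams.append(text[start:start + size])
    return grams


tokenizer = settings["analysis"]["tokenizer"]["my_toknizer"]
grams = ngrams_in_range("miljø", tokenizer["min_gram"], tokenizer["max_gram"])

print(json.dumps(settings, indent=2))
print(grams)
```

<p>With sizes from 2 to 3, each starting position contributes up to two ngrams, so even a short word produces a fair number of tokens. That is the price paid for being able to match in the middle of a word.</p>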
<p>The mapping assigns &#8220;myAnalyzer&#8221; to the &#8220;foo&#8221; field. Finally, I create a query for the term &#8220;miljø&#8221;, which I expect to be found inside &#8220;arbeidsmiljø&#8221; and &#8220;arbeidsmiljøloven&#8221;, but not in &#8220;arbeid&#8221;. Pressing the &#8220;run&#8221; button executes the setup and displays the search results at the bottom right.</p>
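<p>As a rough sketch, the mapping and the query could be written as the Python dictionaries below, followed by a small simulation of which of the three documents share at least one ngram with the query. The request bodies are assumptions rather than the contents of the demo gist: the mapping is shown as a bare properties fragment, the field type &#8220;string&#8221; reflects Elasticsearch versions of that era (newer versions use &#8220;text&#8221;), and ngrams_in_range is a hypothetical helper, not an Elasticsearch API.</p>

```python
import json


def ngrams_in_range(text, min_gram=2, max_gram=3):
    """Return the set of substrings of text with a length from min_gram to max_gram."""
    grams = set()
    for start in range(len(text)):
        longest = min(max_gram, len(text) - start)
        for size in range(min_gram, longest + 1):
            grams.add(text[start:start + size])
    return grams


# Assumed mapping fragment: use "myAnalyzer" for the "foo" field.
mapping = {"properties": {"foo": {"type": "string", "analyzer": "myAnalyzer"}}}

# Assumed query body: a match query, which analyzes the query text
# with the analyzer of the field unless told otherwise.
search_body = {"query": {"match": {"foo": "miljø"}}}

# The three field values from the demo, in the order given in the text.
documents = ["arbeidsmiljø", "arbeid", "arbeidsmiljøloven"]

# A document is a candidate hit when it shares at least one ngram with the query.
query_grams = ngrams_in_range(search_body["query"]["match"]["foo"])
hits = [doc for doc in documents if not query_grams.isdisjoint(ngrams_in_range(doc))]

print(json.dumps(mapping, ensure_ascii=False))
print(json.dumps(search_body, ensure_ascii=False))
print(hits)
```

<p>The simulation only checks for shared ngrams and ignores scoring, but it shows why &#8220;arbeid&#8221; drops out: none of its ngrams occur in &#8220;miljø&#8221;.</p>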
<p>If you want to learn more about analyzers, try the <a href="http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/languages.html">Elasticsearch guide on languages</a>, which is getting better day by day, or the articles from <a href="http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams">qbox on autocomplete using ngrams</a> or <a href="http://www.found.no/foundation/">found.no on language analyzers</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/06/11/searching-miljo-inside-arbeidsmiljo-using-elasticsearch-ngram-tokenizer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
