<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; English</title>
	<atom:link href="http://blog.comperiosearch.com/blog/category/english/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Experimenting with Open Source Web Crawlers</title>
		<link>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/#comments</comments>
		<pubDate>Fri, 29 Apr 2016 11:03:42 +0000</pubDate>
		<dc:creator><![CDATA[Mridu Agarwal]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OpenWebSpider]]></category>
		<category><![CDATA[Scrapy]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4080</guid>
		<description><![CDATA[Whether you want to do market research, gather financial risk information, or just get news about your favorite footballer from various news sites, web scraping has many uses. In my quest to learn more about web crawling and scraping, I decided to test a couple of open source web crawlers which were not [...]]]></description>
				<content:encoded><![CDATA[<p lang="en-US">Whether you want to do market research, gather financial risk information, or just get news about your favorite footballer from various news sites, web scraping has many uses.</p>
<p lang="en-US">In my quest to learn more about web crawling and scraping, I decided to test a couple of open source web crawlers that are not only easily available but also quite powerful. In this article I mostly cover their basic features and how easy they are to get started with.</p>
<p lang="en-US">If you like to get started quickly when learning something new, I suggest you try <a href="http://www.openwebspider.org/">OpenWebSpider</a> first.</p>
<p lang="en-US">It is a browser-based open source crawler and search engine that is simple to install and use, which makes it a good fit for anyone getting acquainted with web crawling. It stores web pages in MySQL or MongoDB; I used MySQL for my testing. You can follow the steps <a href="http://www.openwebspider.org/documentation/openwebspider-js/">here</a> to install it. It&#8217;s pretty simple and basic.</p>
<p lang="en-US">Once you have installed everything, you just need to open a web browser at <a href="http://127.0.0.1:9999/">http://127.0.0.1:9999/</a> and you are ready to crawl and search. Check your database settings, type the URL of the site you want to crawl, and within a couple of minutes you have all the data you need. You can even search it by going to the search tab and typing in your query. Whoa! That was quick and compact, and needless to say, you don’t need any programming skills to use it.</p>
<p lang="en-US">If you are trying to create an offline copy of your data or your very own mini Wikipedia, I think this is the easiest way to do it.</p>
<p lang="en-US">Here are some screenshots:</p>
<p lang="en-US"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS1.png"><img class="alignleft wp-image-4083 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS1.png" alt="OpenWebSpider" width="613" height="438" /></a></p>
<p lang="en-US"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS2.png"><img class="alignleft wp-image-4086 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS2.png" alt="OpenSearchWeb" width="611" height="441" /></a></p>
<p lang="en-US" style="text-align: left"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS3.png"><img class="alignleft size-full wp-image-4087" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS3.png" alt="OpenSearchWeb" width="611" height="441" /></a></p>
<p lang="en-US" style="text-align: left">You can also try the search engine demo <a href="http://lab.openwebspider.org/search_engine/">here</a> before actually getting started.</p>
<p lang="en-US" style="text-align: left">After getting my hands dirty with web crawling, I was curious to do more sophisticated things, like extracting topics from a website that offers no RSS feed or API. Extracting this kind of structured data can be important in many business scenarios, such as following a competitor&#8217;s product news or gathering data for business intelligence. I decided to use <a href="http://scrapy.org/">Scrapy</a> for this experiment.</p>
<p lang="en-US" style="text-align: left">The good thing about Scrapy is that it is not only fast and simple, but very extensible as well. While installing it on my Windows environment, I had a few hiccups, mainly because of incompatible Python versions, but in the end, once you get it, it&#8217;s very simple. (Isn&#8217;t that how you always feel once things work? Anyway, forget it! :D) Follow these links if you are having trouble installing Scrapy like I did:</p>
<p lang="en-US" style="text-align: left"><a href="https://github.com/scrapy/scrapy/wiki/How-to-Install-Scrapy-0.14-in-a-64-bit-Windows-7-Environment">https://github.com/scrapy/scrapy/wiki/How-to-Install-Scrapy-0.14-in-a-64-bit-Windows-7-Environment</a></p>
<p lang="en-US" style="text-align: left"><a href="http://doc.scrapy.org/en/latest/intro/install.html#intro-install">http://doc.scrapy.org/en/latest/intro/install.html#intro-install</a></p>
<p lang="en-US" style="text-align: left">After installing, you need to create a Scrapy project. Since we are doing something more customized than just crawling an entire website, this requires more effort, some programming knowledge, and sometimes browser developer tools to understand the HTML DOM. You can follow <a href="http://doc.scrapy.org/en/latest/intro/overview.html">this</a> link to get started with your first Scrapy project.</p>
<p lang="en-US" style="text-align: left">Once you have crawled the data you need, it is interesting to feed it into a search engine. I have also been looking for open source web crawlers for Elasticsearch, and this looked like the perfect opportunity. Scrapy can be integrated with Elasticsearch through an add-on module, which is awesome. You just need to install the Elasticsearch module for Scrapy (Elasticsearch should of course be running somewhere) and configure the Item Pipeline for Scrapy. Follow <a href="http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html">this</a> link for a step-by-step guide. Once done, you have a fully integrated crawler and search system!</p>
<p lang="en-US" style="text-align: left">I crawled <a href="http://primehealthchannel.com">http://primehealthchannel.com</a> and indexed the scraped items into Elasticsearch under the type &#8220;healthitems&#8221; in the &#8220;scrapy&#8221; index.</p>
<p lang="en-US" style="text-align: left">To search the Elasticsearch index, I use the Chrome extension <span style="font-weight: bold">Sense</span> to send queries to Elasticsearch, and this is how it looks:</p>
<p lang="en-US" style="text-align: left">GET /scrapy/healthitems/_search</p>
<p style="text-align: left"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/ES1.png"><img class="alignleft wp-image-4082 size-large" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/ES1-1024x597.png" alt="Elastic Search" width="1024" height="597" /></a></p>
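<p lang="en-US" style="text-align: left">Outside Sense, the same search can be sent from a terminal with curl. This is only a minimal sketch: it assumes Elasticsearch is listening on localhost:9200 and that the crawled items have a field called &#8220;title&#8221;, which is a hypothetical field name, so use whatever fields your own Scrapy item defines. The search term is just an illustration as well.</p>
<pre class="crayon-plain-tag"># Assumption: Elasticsearch runs locally on the default port 9200.
# "title" is a hypothetical field name; replace it with a field from your Scrapy item.
curl -XGET 'http://localhost:9200/scrapy/healthitems/_search?pretty' -d '{"query": {"match": {"title": "diabetes"}}}'</pre>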
<p lang="en-US" style="text-align: left">I hope you had fun reading this and now want to try some of your own cool ideas. Do let us know how you used it and which crawler you like the most!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Content Enrichment Web Service SharePoint 2013 &#8211; Advantages and Challenges</title>
		<link>http://blog.comperiosearch.com/blog/2016/04/26/content-enrichment-web-service-sharepoint-2013/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/04/26/content-enrichment-web-service-sharepoint-2013/#comments</comments>
		<pubDate>Tue, 26 Apr 2016 11:23:22 +0000</pubDate>
		<dc:creator><![CDATA[Mridu Agarwal]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[CEWS]]></category>
		<category><![CDATA[Content Enrichment Web Service]]></category>
		<category><![CDATA[FAST Search for SharePoint]]></category>
		<category><![CDATA[SharePoint 2013 Search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4017</guid>
		<description><![CDATA[If you have worked with search solutions before, you will know that very often there is a need to process data before it can be displayed in search results. This processing might be required to address some of (but not limited to) these common issues: Missing metadata issues Inconsistent metadata issues Cleansing of content Integration of semantic [...]]]></description>
				<content:encoded><![CDATA[<p>If you have worked with search solutions before, you will know that very often there is a need to process data before it can be displayed in search results. This processing might be required to address some of (but not limited to) these common issues:</p>
<ul>
<li>Missing metadata issues</li>
<li>Inconsistent metadata issues</li>
<li>Cleansing of content</li>
<li>Integration of semantic layers/Automatic tagging</li>
<li>Integration with third-party services</li>
<li>Merging data from other sources</li>
</ul>
<p><strong>Content Enrichment Web Service</strong> in SharePoint 2013 is a SOAP-based service within the content processing component that can be used to achieve this. The figure below shows a part of the process that takes place in the content processing component of SharePoint search. <img src="https://i-msdn.sec.s-msft.com/dynimg/IC618173.gif" alt="Content enrichment within content processing" width="481" height="286" /></p>
<p>The Content Enrichment Web Service in SharePoint 2013 combines the strengths of both <strong>FAST Search for SharePoint</strong> and <strong>SharePoint Search</strong>, offering a whole new set of possibilities along with its own challenges. To see an implementation example, check the <a href="https://msdn.microsoft.com/en-us/library/office/jj163982.aspx">MSDN link</a>, which pretty much sums up the basic steps. In this post we look at some of the advantages and challenges of CEWS, coming from a FAST 2010 background:</p>
<p>1. <strong>CEWS is a service and you DON&#8217;T have to deploy it in your SharePoint environment:</strong> Perhaps this is the biggest architectural change from the content processing perspective. It means your code no longer runs in a sandbox environment within <strong>SharePoint Server</strong>. The web service can be hosted anywhere outside your SharePoint server, which reduces deployment headaches and the large number of approvals required to deploy executable files. I can see the operations and infrastructure teams and administrators smiling.</p>
<p>2. <strong>The web service processes and returns managed properties, not crawled properties: </strong>Managed properties correspond to what actually gets indexed and displayed in search results. This removes a common source of confusion about why updated results don&#8217;t show up (perhaps you forgot to map your crawled property to a managed property, and, wait, now you have to index everything AGAIN!). Nightmare!</p>
<p>3. <strong>You can define a trigger to limit the set of items that are processed by the web service: </strong>In FAST 2010, each item had to pass through the pipeline whether you wanted to process it or not, and the check had to be done in code. A trigger in 2013 lets us define this check outside the code, so that the web service is called only for selected content. If you only want to process a subset of the content, this improves overall performance and crawl time.</p>
<blockquote><p>So far, so good! But there are certain challenges we need to look at and see how we can overcome them. In fact, this is the most important part when you are architecting your CEWS solution:</p></blockquote>
<p>1. <strong>The content enrichment callout step can only be configured with a single web service endpoint:</strong> Now this sounds very limiting. I have multiple search applications, and earlier I maintained the logic in different solutions. Do I need to combine them all into a single service? What about maintenance and change requests? There are several technologies one could consider to solve this, but what I did in my project was to create a WCF routing service and let it dispatch to my multiple web services based on filters. You could also use it to implement load balancing and fault tolerance. In the following example, I have two content sources, &#8220;xmlfile&#8221; and &#8220;EpiFileShare&#8221;, and I want two different services, &#8220;xmlsvc&#8221; and &#8220;episvc&#8221;, to process them. This is how I configure the endpoints in my WCF routing service:</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/01/router.png"><img class="aligncenter  wp-image-4027" src="http://blog.comperiosearch.com/wp-content/uploads/2016/01/router-1024x278.png" alt="endpoints" width="708" height="192" /></a></p>
<p>2. <strong>Only one condition can be configured for the trigger, but different search applications will require different triggers: </strong>This can again be solved by using WCF routers and filters and configuring separate endpoints for separate triggers. Here I am using the default managed property &#8220;ContentSource&#8221; as a trigger/filter to determine my service endpoint.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/01/rouyer.png"><img class="aligncenter wp-image-4025 " src="http://blog.comperiosearch.com/wp-content/uploads/2016/01/rouyer-1024x286.png" alt="config file" width="737" height="206" /></a></p>
<p>To summarize, I have shown some of the advantages and challenges of the new CEWS architecture in SharePoint 2013 search and how you can overcome them. I hope you now want to try this soon and share your experience with us.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/04/26/content-enrichment-web-service-sharepoint-2013/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ELK stack deployment with Ansible</title>
		<link>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/#comments</comments>
		<pubDate>Thu, 26 Nov 2015 09:59:38 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[ansible]]></category>
		<category><![CDATA[deployment]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[elk]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3999</guid>
		<description><![CDATA[As human beings, we like to believe that each and every one of us is a special individual, and not easily replaceable. That may be fine, but please, don’t fall into the habit of treating your computer the same way. Ansible is a free software platform for configuring and managing computers, and I’ve been using [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignright" src="http://www.ansible.com/hs-fs/hub/330046/file-767051897-png/Official_Logos/ansible_circleA_red.png?t=1448391213471" alt="" width="251" height="251" />As human beings, we like to believe that each and every one of us is a special individual, and not easily replaceable. That may be fine, but please, don’t fall into the habit of treating your computer the same way.</p>
<p><span id="more-3999"></span></p>
<p><a href="https://en.wikipedia.org/wiki/Ansible_(software)"><b>Ansible</b> </a>is a <a href="https://en.wikipedia.org/wiki/Free_software">free software</a> platform for configuring and managing computers, and I’ve been using it a lot lately to manage the ELK stack: Elasticsearch, Logstash and Kibana.</p>
<p>I can define a list of servers I want to manage in an INI-style config file &#8211; the so-called inventory:</p><pre class="crayon-plain-tag">[elasticsearch-master]
es-master1.mydomain.com
es-master2.mydomain.com
es-master3.mydomain.com

[elasticsearch-data]
elk-data1.mydomain.com
elk-data2.mydomain.com
elk-data3.mydomain.com

[kibana]
kibana.mydomain.com</pre><p>And define the roles for the servers in a YAML config file &#8211; the so-called playbook:</p><pre class="crayon-plain-tag">- hosts: elasticsearch-master
  roles:
    - ansible-elasticsearch

- hosts: elasticsearch-data
  roles:
    - ansible-elasticsearch

- hosts: logstash
  roles:
    - ansible-logstash

- hosts: kibana
  roles:
    - ansible-kibana</pre><p>&nbsp;</p>
<p>Each group of servers may have their own files containing configuration variables.</p><pre class="crayon-plain-tag">elasticsearch_version: 2.1.0
elasticsearch_node_master: false
elasticsearch_heap_size: 1000G</pre><p>&nbsp;</p>
<p>Ansible is used for configuring the ELK stack vagrant box at <a href="https://github.com/comperiosearch/vagrant-elk-box-ansible">https://github.com/comperiosearch/vagrant-elk-box-ansible</a>, which was recently upgraded with Elasticsearch 2.1, Kibana 4.3 and Logstash 2.1.</p>
<p>The same set of Ansible roles can be applied when the configuration needs to move into production, by applying another set of variable files with modified host names, certificates and such. There are several ways to do this.</p>
<p><b>How does it work?</b></p>
<p>Ansible is agentless, meaning you do not install anything (an agent) on the machines you control. Ansible only needs to be installed on the controlling machine (Linux/OS X) and connects to the managed machines (there is even some support for Windows) using SSH. The only requirement on the managed machines is Python.</p>
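<p>As a rough sketch of how the inventory and playbook are put to work, the commands below first check that Ansible can reach every host over SSH, then do a dry run, and finally apply the roles. The file names inventory and playbook.yml are assumptions, so substitute whatever you called your own files.</p>
<pre class="crayon-plain-tag"># Assumption: the inventory is saved as "inventory" and the playbook as "playbook.yml".
# Verify SSH connectivity and Python on all managed hosts
ansible all -i inventory -m ping
# Dry run: report what would change without changing anything
ansible-playbook -i inventory playbook.yml --check
# Apply the roles for real
ansible-playbook -i inventory playbook.yml</pre>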
<p>Happy ansibling!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/11/26/elk-stack-deployment-with-ansible/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Elasticsearch: Shield protected Kibana with Active Directory</title>
		<link>http://blog.comperiosearch.com/blog/2015/08/21/elasticsearch-security-shield/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/08/21/elasticsearch-security-shield/#comments</comments>
		<pubDate>Fri, 21 Aug 2015 14:26:45 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[enterprise]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3245</guid>
		<description><![CDATA[Elasticsearch easily stores terabytes of data, but how can you make sure users only see the data they should? This post will explore how to use Shield, a plugin for Elasticsearch, to authenticate users with Active Directory. Elasticsearch will by default allow anyone access to all data. The Shield plugin allows locking down Elasticsearch using authentication [...]]]></description>
				<content:encoded><![CDATA[<p>Elasticsearch easily stores terabytes of data, but how can you make sure users only see the data they should? This post will explore how to use Shield, a plugin for Elasticsearch, to authenticate users with Active Directory.</p>
<p><span id="more-3245"></span><br />
<a title="NO TRESPASSING" href="https://www.flickr.com/photos/mike2099/2058021162/in/photolist-48RTZu-4ttdcn-4YPqqU-5WbRAP-8rYugF-XsCao-ftZ1hL-dpmFB-dqyeUE-bjV3VY-bEMba3-bEMb6w-84YCqg-rf5Yk1-8Yjaj3-chg68s-4KDN1M-4KDMWF-5MfWjA-tCJt6J-8nxBiZ-6YsUyh-KfDRK-54uLmy-bv1Pv-oChdLk-pL3X8t-4RTTjd-dhfUPn-cEkCFY-czjXiE-m1zThD-dzESFD-oj2KUM-c16MV-72dTxS-g4Yky4-kK9YR-p6DYnY-5HJvrX-8aovPQ-dhfVkP-bwB8c-gFzTXk-7zd9iF-eua6KC-2gzEc-8nxtcH-2gzEb-fnp3zH" data-flickr-embed="true"><img src="https://farm3.staticflickr.com/2059/2058021162_ed7b6e8d72_b.jpg" alt="NO TRESPASSING" width="600" /></a><script src="//embedr.flickr.com/assets/client-code.js" async="" charset="utf-8"></script></p>
<p>By default, Elasticsearch allows anyone access to all data. The <a href="https://www.elastic.co/guide/en/shield/current/introduction.html">Shield</a> plugin allows locking down Elasticsearch using authentication from the internal esusers realm, Active Directory (AD) or LDAP. Using AD, you can map groups defined in your Windows domain to roles in Elasticsearch. For instance, you can allow people in the Fishery department access only to fish indexes, and give complete control to anyone in the IT department.</p>
<p>To use Shield in production, you have to buy an Elasticsearch subscription; however, you get a 30-day trial when installing the license manager. So let&#8217;s hurry up and see how this works out in Kibana.</p>
<p>&nbsp;</p>
<p>In this post, we will install Shield and connect to Active Directory (AD) for authentication. After having made sure we can authenticate with AD, we will add SSL encryption everywhere possible. We will add authentication for the Kibana server using the built in authentication realm esusers, and if time allows at the end, we will create two user groups, each with access to its own index, and check how it all looks when accessed in Kibana 4.</p>
<p>&nbsp;</p>
<h3>Prerequisites</h3>
<p>You will need a previously installed Elasticsearch and Kibana. The most recent versions should work; I have used Elasticsearch 1.7 and Kibana 4.1.1. If you need a machine to test on, I can personally recommend the vagrant-elk-box you can find <a href="https://github.com/comperiosearch/vagrant-elk-box-ansible">here</a>. <strong>The following guide assumes the file locations of the vagrant-elk-box</strong>; if you install differently, you will probably know where to look. Ask an adult for help.</p>
<p>For Active Directory, you need to be on a domain that uses Active Directory. That would probably mean some kind of Windows work environment.</p>
<p>&nbsp;</p>
<h4>Installing Shield</h4>
<p>If you&#8217;re on the vagrant box you should begin the lesson by entering the vagrant box using the commands</p><pre class="crayon-plain-tag">vagrant up
vagrant ssh</pre><p>&nbsp;</p>
<p>Install the license manager</p><pre class="crayon-plain-tag"> sudo /usr/share/elasticsearch/bin/plugin -i elasticsearch/license/latest</pre><p>Install Shield</p><pre class="crayon-plain-tag"> sudo /usr/share/elasticsearch/bin/plugin -i elasticsearch/shield/latest</pre><p>Restart elasticsearch. (service elasticsearch restart)</p>
<p>Check out the logs,  you should find some information regarding when your Shield license will expire (logfile location:  /var/log/elasticsearch/vagrant-es.log)</p>
<h4>Integrating Active Directory</h4>
<p>The next step involves figuring out a thing or two about your Active Directory configuration. First of all, you need to know its address. On your Windows machine, open cmd.exe and type</p><pre class="crayon-plain-tag">set LOGONSERVER</pre><p>The name of your AD server should pop back. Add a section similar to the following to the elasticsearch.yml file (at /etc/elasticsearch/elasticsearch.yml)</p><pre class="crayon-plain-tag">shield.authc.realms:
  active_directory:
    type: active_directory
    domain_name: superdomain.com
    unmapped_groups_as_roles: true
    url: ldap://ad.superdomain.com</pre><p>Type in the address to your AD in the url: field (where it says url: ldap://ad.superdomain.com). If your logonserver is ad.cnn.com, you should type in url: ldap://ad.cnn.com</p>
<p>Also, you need to figure out your domain name and type it in correctly.</p>
<p>NB: Be careful with the indenting! Elasticsearch cares a lot about correct indenting, and may even refuse to start without telling you why if you make a mistake.</p>
<h5>Finding the Correct name for the Active Directory group</h5>
<p>Next step involves figuring out the name for the Group you wish to grant access to. You may have called your group &#8220;Fishermen&#8221;, but that is probably not exactly what it&#8217;s called in AD.</p>
<p>Microsoft has a very simple and nice tool called <a href="https://technet.microsoft.com/en-us/library/bb963907.aspx">Active Directory Explorer</a>. Open the tool and enter the address you just found from the LOGONSERVER (remember? it&#8217;s only 10 lines above).</p>
<p>You may have to click and explore a little to find the group you want. Once you find it, you need the value of the &#8220;distinguishedName&#8221; attribute. You can double-click on it and copy the value out from the &#8220;Object&#8221;.</p>
<p>This is an example from my AD</p><pre class="crayon-plain-tag">CN=Rolle IT,OU=Groups,OU=Oslo,OU=Comperiosearch,DC=comperiosearch,DC=com</pre><p>Now this value represents a group which we want to map to a role in elasticsearch.</p>
<p>Open the file /etc/elasticsearch/shield/role-mapping.yml. It should look similar to this</p><pre class="crayon-plain-tag"># Role mapping configuration file which has elasticsearch roles as keys
# that map to one or more user or group distinguished names

#roleA:   this is an elasticsearch role
#  - groupA-DN  this is a group distinguished name
#  - groupB-DN
#  - user1-DN   this is the full user distinguished name
power_user:
  - "CN=Rolle IT,OU=Groups,OU=Oslo,OU=Comperiosearch,DC=comperiosearch,DC=com"
#user:
# - "cn=admins,dc=example,dc=com" 
# - "cn=John Doe,cn=other users,dc=example,dc=com"</pre><p>I have uncommented the line with &#8220;power_user:&#8221; and added a line below containing the distinguishedName from above.</p>
<p>After restarting Elasticsearch, anyone in the &#8220;Rolle IT&#8221; group should be able to log in (and nobody else (yet)).</p>
<p>To test it out, open <a href="http://localhost:9200">http://localhost:9200</a> in your browser. You should be presented with a login box where you can type in your username/password. In case of failure, check out the elasticsearch logs (at /var/log/elasticsearch/vagrant-es.log).</p>
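<p>If you prefer the command line to a browser, the same check can be sketched with curl. Here your_ad_username is a placeholder for an account in the mapped AD group; curl will prompt for the password. With valid credentials you would expect the usual Elasticsearch JSON banner, while a request without credentials should be rejected with an HTTP 401.</p>
<pre class="crayon-plain-tag"># Placeholder: replace your_ad_username with an account from the mapped AD group.
# -u without a password makes curl prompt for it; -i prints the HTTP status line.
curl -i -u your_ad_username http://localhost:9200/
# For comparison: no credentials at all
curl -i http://localhost:9200/</pre>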
<p>If you were able to log in, that means Active Directory authentication works. Congratulations! You deserve a refreshment. Some strong coffee will go down well with the next sections, where we add encrypted communications everywhere we can.</p>
<h3>SSL  - Elasticsearch</h3>
<p>Authentication and encrypted communication go hand in hand. Without SSL, usernames and passwords are transferred in plaintext on the wire. For this demo we will use self-signed certificates. Keytool comes with Java, and is used to handle certificates for Elasticsearch. The following command will generate a self-signed certificate and put it in a JKS file named self-signed.jks (swap out $password with your preferred password).</p><pre class="crayon-plain-tag">keytool -genkey -keyalg RSA -alias selfsigned -keystore self-signed.jks -keypass $password -storepass $password -validity 360 -keysize 2048 -dname "CN=localhost, OU=orgUnit, O=org, L=city, S=state, C=NO"</pre><p>Copy the keystore file (self-signed.jks) into /etc/elasticsearch/</p>
<p>Modify  /etc/elasticsearch/elasticsearch.yml by adding the following lines:</p><pre class="crayon-plain-tag">shield.ssl.keystore.path: /etc/elasticsearch/self-signed.jks
shield.ssl.keystore.password: $password
shield.ssl.hostname_verification: false
shield.transport.ssl: true
shield.http.ssl: true</pre><p>(use the same password as you used when creating the self-signed certificate )</p>
<p>Restart Elasticsearch again, and watch the logs for failures.</p>
<p>Try to open https://localhost:9200 in your browser (NB: httpS not http)</p>
<div id="attachment_3905" style="width: 310px" class="wp-caption alignright"><img class="wp-image-3905 size-medium" src="http://blog.comperiosearch.com/wp-content/uploads/2015/08/your-connection-is-not-private-e1440146932126-300x181.png" alt="your connection is not private" width="300" height="181" /><p class="wp-caption-text">https://localhost:9200</p></div>
<p>You should see a screen warning you that something is wrong with the connection. This is a good sign! It means your certificate is actually working! For production use you could use your own CA or buy a proper certificate, either of which will avoid the ugly warning screen.</p>
<h4>SSL &#8211; Active Directory</h4>
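<p>The HTTPS endpoint can also be checked from a terminal. In this minimal sketch, the -k flag tells curl to accept the self-signed certificate (fine for a demo, not something to do in production), and your_ad_username is again a placeholder for an account Shield will accept.</p>
<pre class="crayon-plain-tag"># -k (--insecure) skips certificate verification, needed here only because the cert is self-signed.
# Placeholder: replace your_ad_username with a real account; curl prompts for the password.
curl -k -u your_ad_username https://localhost:9200/</pre>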
<p>Our current method of connecting to Active Directory is unencrypted &#8211; we need to enable SSL for the AD connections.</p>
<p>1. Fetch the certificate from your Active Directory server (replace ldap.example.com with the LOGONSERVER from above)</p><pre class="crayon-plain-tag">echo | openssl s_client -connect ldap.example.com:636 2&gt;/dev/null | openssl x509 &gt; ldap.crt</pre><p>2. Import the certificate into your keystore (located at /etc/elasticsearch/)</p><pre class="crayon-plain-tag">keytool -import -keystore self-signed.jks -file ldap.crt</pre><p>&nbsp;</p>
<p>3. Modify the AD URL in elasticsearch.yml:<br />
change the line</p><pre class="crayon-plain-tag">url: ldap://ad.superdomain.com</pre><p>to</p><pre class="crayon-plain-tag">url: ldaps://ad.superdomain.com</pre><p>Restart Elasticsearch and check the logs for failures.</p>
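<p>If the restart produces SSL errors, it can help to check the two pieces separately. This is a sketch under a few assumptions: the keystore sits at /etc/elasticsearch/self-signed.jks, $password is the keystore password chosen earlier, and ad.superdomain.com stands in for your own AD server. The first command lists the keystore entries, where the imported AD certificate should show up as a trusted certificate entry; the second opens a TLS connection to the standard LDAPS port 636 and prints the certificate chain the server presents.</p>
<pre class="crayon-plain-tag"># Assumptions: keystore path and $password as used earlier; ad.superdomain.com is a stand-in host name.
# List the entries in the keystore
keytool -list -keystore /etc/elasticsearch/self-signed.jks -storepass $password
# Open a TLS connection to the AD server on the LDAPS port
echo | openssl s_client -connect ad.superdomain.com:636</pre>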
<h4>Kibana authentication with esusers</h4>
<p>With Elasticsearch locked down by Shield, no services can search or post data either, including Kibana and Logstash.</p>
<p>Active Directory is great, but I&#8217;m not sure I want to use it for letting the Kibana server talk to Elasticsearch. We can use Shield&#8217;s built-in user management system, esusers. Shield comes with a set of predefined roles, including roles for Logstash, the Kibana 4 server and Kibana 4 users (defined in /etc/elasticsearch/shield/roles.yml on the vagrant-elk box, if you&#8217;re still on that one).</p>
<p>Add a new kibana4_server user, granting it the role kibana4_server, using this command:</p><pre class="crayon-plain-tag">cd /usr/share/elasticsearch/bin/shield  
./esusers useradd kibana4_server -p secret -r kibana4_server</pre><p></p>
<h4>Adding esusers realm</h4>
<p>The esusers realm is the default one, and does not need to be configured if that&#8217;s the only realm you use. Now since we added the Active Directory realm we must add another section to the elasticsearch.yml file from above.</p>
<p>It should end up looking like this</p><pre class="crayon-plain-tag">shield.authc.realms:
  esusers:
    type: esusers
    order: 0
  active_directory:
    order: 1
    type: active_directory
    domain_name: superdomain.com
    unmapped_groups_as_roles: true
    url: ldaps://ad.superdomain.com</pre><p>The order parameter defines the order in which Elasticsearch tries the various authentication mechanisms.</p>
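<p>After restarting Elasticsearch with both realms configured, a quick sanity check of the new service account might look like the sketch below. It assumes the kibana4_server user and the password &#8220;secret&#8221; created above, and that the kibana4_server role is allowed to read cluster health. A 401 response would mean the credentials were rejected; a 403 would mean the user authenticated but the role lacks the privilege for that request.</p>
<pre class="crayon-plain-tag"># Assumptions: user kibana4_server with password "secret" as created above; self-signed cert, hence -k.
cd /usr/share/elasticsearch/bin/shield
# List the users in the esusers realm together with their roles
./esusers list
# Authenticate as the Kibana server user over HTTPS
curl -k -u kibana4_server:secret 'https://localhost:9200/_cluster/health?pretty'</pre>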
<h4>Allowing Kibana to access Elasticsearch</h4>
<p>Kibana must be informed of the new user we just created. You will find the kibana configuration file at /opt/kibana/config/kibana.yml.</p>
<p>Add in the username and password you just created. You also need to change the address for Elasticsearch to use https</p><pre class="crayon-plain-tag"># The Elasticsearch instance to use for all your queries.
elasticsearch_url: "https://localhost:9200"

# If your Elasticsearch is protected with basic auth, these are the user credentials
# used by the Kibana server to perform maintenance on the kibana_index at startup. Your Kibana
# users will still need to authenticate with Elasticsearch (which is proxied through
# the Kibana server)
kibana_elasticsearch_username: kibana4_server
kibana_elasticsearch_password: secret</pre><p>Restart Kibana and Elasticsearch, and watch the logs for any errors. Try opening Kibana at http://localhost:5601 and type in your login and password. Provided you&#8217;re in the group you gave access earlier, you should be able to log in.</p>
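<p>The username and password end up in an HTTP basic auth header on every request the Kibana server sends to Elasticsearch. The Python sketch below shows roughly what that header looks like, using only the standard library. It builds the request without sending it, and the helper name <code>build_request</code> is an assumption for illustration. The URL and credentials are the ones used in this post.</p>

```python
import base64
import urllib.request


def build_request(url, username, password):
    """Build (but do not send) a request carrying HTTP basic auth credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", "Basic " + token)
    return request


req = build_request("https://localhost:9200/", "kibana4_server", "secret")
print(req.get_header("Authorization"))
```

<p>Basic auth is only base64 encoded, not encrypted, which is why the https address matters here. Actually sending the request against a self-signed certificate would also need an SSL context that trusts that certificate.</p>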
<h4>Creating SSL for Kibana</h4>
<p>Once you have enabled authorization for Elasticsearch, you really need to set SSL certificates for Kibana as well. This is also configured in kibana.yml</p><pre class="crayon-plain-tag">verify_ssl: false
# SSL for outgoing requests from the Kibana Server (PEM formatted)
ssl_key_file: "kibana_ssl_key_file"
ssl_cert_file: "kibana_ssl_cert_file"</pre><p>You can create a self-signed key and cert file for kibana using the following command:</p><pre class="crayon-plain-tag">openssl req -x509 -newkey rsa:2048 -keyout key.pem -out cert.pem -days 365 -nodes</pre><p>&nbsp;</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/08/kibana-auth.png"><img class="alignright size-medium wp-image-3920" src="http://blog.comperiosearch.com/wp-content/uploads/2015/08/kibana-auth-300x200.png" alt="kibana auth" width="300" height="200" /></a></p>
<h4>Configuring AD groups for Kibana access</h4>
<p>Unfortunately, this part of the post is going to be very sketchy, as we are desperately running out of time. This blog is much too long already.</p>
<p>Elasticsearch already comes with a list of predefined roles, among which you can find the kibana4 role. The kibana4 role allows read/write access to the .kibana index, in addition to search and read access to all indexes. We want to limit access to just one index for each AD group. The fishery group shall only access the fishery index, and the finance group shall only access the finance index. We can create roles that limit access to one index by copying the kibana4 role, giving it an appropriate name and changing the index:&#8217;*&#8217; section to map to only the preferred index.</p>
<p>The final step involves mapping the Elasticsearch role into an AD role. This is done in the role_mapping.yml file, as mentioned above.</p>
<p>Only joking of course, that wasn&#8217;t the last step. The last step is restarting Elasticsearch, and checking the logs for failures as you try to log in.</p>
<p>&nbsp;</p>
<h3>Securing Elasticsearch</h3>
<p>Shield brings enterprise authentication to Elasticsearch. You can easily manage access to various parts of Elasticsearch management and data by using Active Directory groups.</p>
<p>This has been a short dive into the possibilities. Contact Comperio if you need help creating a solution with Elasticsearch and Shield.</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/08/21/elasticsearch-security-shield/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How Elasticsearch calculates significant terms</title>
		<link>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/#comments</comments>
		<pubDate>Wed, 10 Jun 2015 11:02:28 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[aggregations]]></category>
		<category><![CDATA[lexical analysis]]></category>
		<category><![CDATA[relevance]]></category>
		<category><![CDATA[significant terms]]></category>
		<category><![CDATA[word analysis]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3785</guid>
		<description><![CDATA[Many of you who use Elasticsearch may have used the significant terms aggregation and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tends to be kept rather vague however and couched in terms like &#8220;magic&#8221; and the commonly uncommon. This is unfortunate since developing informative [...]]]></description>
				<content:encoded><![CDATA[<div id="attachment_3823" style="width: 310px" class="wp-caption aligncenter"><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/uncommonlycommon.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/uncommonlycommon-300x187.png" alt="The &quot;uncommonly common&quot;" width="300" height="187" class="size-medium wp-image-3823" /></a><p class="wp-caption-text">The magic of the &#8220;uncommonly common&#8221;.</p></div>
<p>Many of you who use Elasticsearch may have used the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html" title="significant terms">significant terms aggregation</a> and been intrigued by this example of fast and simple word analysis. The details and mechanism behind this aggregation tend to be kept rather vague, however, and couched in terms like &#8220;magic&#8221; and the &#8220;uncommonly common&#8221;. This is unfortunate, since developing informative analyses based on this aggregation requires some adaptation to the underlying documents, especially in the face of less structured text. Significant terms seems especially susceptible to garbage in &#8211; garbage out effects, and developing a robust analysis requires some understanding of the underlying data. In this blog post we will take a look at the default relevance score used by the significant terms aggregation, the mysteriously named JLH score, as it is implemented in Elasticsearch 1.5. This score was developed especially for this aggregation, and experience shows that it tends to be the most effective one available in Elasticsearch at this point.</p>
<p>The JLH relevance scoring function is not given in the documentation. A quick dive into the code, however, turns up the following scoring function.</p>
<img src='http://s0.wp.com/latex.php?latex=++JLH+%3D+%5Cleft%5C%7B%5Cbegin%7Bmatrix%7D++%28p_%7Bfore%7D+-+p_%7Bback%7D%29%5Cfrac%7Bp_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+%26+p_%7Bfore%7D+-+p_%7Bback%7D+%3E+0+%5C%5C++0++%26+elsewhere++%5Cend%7Bmatrix%7D%5Cright.++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  JLH = \left\{\begin{matrix}  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} &amp; p_{fore} - p_{back} &gt; 0 \\  0  &amp; elsewhere  \end{matrix}\right.  ' title='  JLH = \left\{\begin{matrix}  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} &amp; p_{fore} - p_{back} &gt; 0 \\  0  &amp; elsewhere  \end{matrix}\right.  ' class='latex' />
<p>Here the <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> is the frequency of the term in the foreground (or query) document set, while <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is the term frequency in the background document set which by default is the whole index.</p>
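<p>As a minimal sketch, the piecewise definition above translates to Python as follows. The function name <code>jlh_score</code> and the example frequencies are assumptions for illustration. This is not the Elasticsearch implementation, just the formula written out.</p>

```python
def jlh_score(p_fore: float, p_back: float) -> float:
    """JLH score as written in the piecewise formula above.

    p_fore: term frequency in the foreground (query) document set
    p_back: term frequency in the background document set, must be positive
    """
    if p_back <= 0:
        raise ValueError("p_back must be positive")
    diff = p_fore - p_back
    if diff <= 0:
        return 0.0  # the score is defined as zero here
    return diff * (p_fore / p_back)


# A term three times as frequent in the foreground as in the background.
print(jlh_score(0.3, 0.1))
# A term that is rarer in the foreground scores zero.
print(jlh_score(0.05, 0.1))
```

<p>The first call gives a score of roughly 0.6, the second exactly zero.</p>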
<p>Expanding the formula gives us the following which is quadratic in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />.</p>
<img src='http://s0.wp.com/latex.php?latex=++%28p_%7Bfore%7D+-+p_%7Bback%7D%29%5Cfrac%7Bp_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+%3D+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+-+p_%7Bfore%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}  ' title='  (p_{fore} - p_{back})\frac{p_{fore}}{p_{back}} = \frac{p_{fore}^2}{p_{back}} - p_{fore}  ' class='latex' />
<p>By keeping <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> fixed and keeping in mind that both it and <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> are positive we get the following function plot. Note that <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> is unnaturally large for illustration purposes.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-pb-fixed.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-pb-fixed-300x206.png" alt="JLH-pb-fixed" width="300" height="206" class="alignnone size-medium wp-image-3792"></a></p>
<p>On the face of it this looks bad for a scoring function. It can be undesirable that it changes sign, but more troublesome is the fact that this function is not monotonically increasing.</p>
<p>The gradient of the function:</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cnabla+JLH%28p_%7Bfore%7D%2C+p_%7Bback%7D%29+%3D+%5Cleft%28%5Cfrac%7B2+p_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D+-+1+%2C+-%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%5E2%7D%5Cright%29++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back}} - 1 , -\frac{p_{fore}^2}{p_{back}^2}\right)  ' title='  \nabla JLH(p_{fore}, p_{back}) = \left(\frac{2 p_{fore}}{p_{back}} - 1 , -\frac{p_{fore}^2}{p_{back}^2}\right)  ' class='latex' />
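<p>Setting the gradient to zero we see by looking at the second coordinate that the JLH score does not have a minimum, but approaches one as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> approach zero, where the function is undefined. While the second coordinate is never positive, so the score never increases with <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />, the first coordinate shows us where the function is not increasing in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />.</p>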
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++%5Cfrac%7B2+p_%7Bfore%7D%7D%7Bp_%7Bback%7D%7D++-+1+%26+%3C+0+%5C%5C++p_%7Bfore%7D+%26+%3C+%5Cfrac%7B1%7D%7B2%7Dp_%7Bback%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  \frac{2 p_{fore}}{p_{back}}  - 1 &amp; &lt; 0 \\  p_{fore} &amp; &lt; \frac{1}{2}p_{back}  \end{aligned}  ' title='  \begin{aligned}  \frac{2 p_{fore}}{p_{back}}  - 1 &amp; &lt; 0 \\  p_{fore} &amp; &lt; \frac{1}{2}p_{back}  \end{aligned}  ' class='latex' />
<p>Fortunately the decreasing part of the function is in an area where <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D+-+p_%7Bback%7D+%3C+0&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore} - p_{back} &lt; 0' title='p_{fore} - p_{back} &lt; 0' class='latex' /> and the JLH score is explicitly defined as zero. By symmetry of the parabola around its minimum at <img src='http://s0.wp.com/latex.php?latex=%5Cfrac%7B1%7D%7B2%7Dp_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\frac{1}{2}p_{back}' title='\frac{1}{2}p_{back}' class='latex' />, where the first coordinate of the gradient is zero, we also see that the entire area where the score is below zero is in this region.</p>
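<p>A quick numeric check of this argument, as a sketch: evaluate the unclipped quadratic on a grid for one fixed background frequency and collect the points where it is negative and where it is decreasing. The value 0.5 for the background frequency and the grid resolution are arbitrary choices for illustration.</p>

```python
def raw_jlh(p_fore, p_back):
    """Expanded JLH formula without the clipping to zero."""
    return p_fore ** 2 / p_back - p_fore


p_back = 0.5  # unnaturally large, as in the plot, for illustration
grid = [i / 1000 for i in range(1, 1000)]  # p_fore values 0.001 .. 0.999

# Points where the unclipped score is below zero.
negative = [p for p in grid if raw_jlh(p, p_back) < 0]
# Points where the derivative with respect to p_fore is below zero.
decreasing = [p for p in grid if 2 * p / p_back - 1 < 0]

print(max(negative), max(decreasing))
```

<p>Both lists stop short of the background frequency, and the decreasing part stops at half of it, so everything problematic lies where the score is clipped to zero anyway.</p>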
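<p>With this it seems sensible to just drop the linear term of the JLH score and just use the quadratic part. The two scores differ by exactly <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' />, so this gives much the same ranking once <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> is large relative to <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> and the quadratic term dominates, although terms with similar scores can swap places.</p>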
<img src='http://s0.wp.com/latex.php?latex=++JLH_%7Bmod%7D+%3D+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  JLH_{mod} = \frac{p_{fore}^2}{p_{back}}  ' title='  JLH_{mod} = \frac{p_{fore}^2}{p_{back}}  ' class='latex' />
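<p>To get a feel for how the simplified score relates to the full one, the sketch below ranks three made-up terms with both. The term names and frequencies are invented for illustration and do not come from any real index.</p>

```python
def jlh(p_fore, p_back):
    """Full JLH score, clipped to zero when p_fore is not above p_back."""
    diff = p_fore - p_back
    return diff * p_fore / p_back if diff > 0 else 0.0


def jlh_mod(p_fore, p_back):
    """Simplified score: only the quadratic part."""
    return p_fore ** 2 / p_back


# Hypothetical (p_fore, p_back) pairs.
terms = {"a": (0.5, 0.25), "b": (0.3, 0.1), "c": (0.2, 0.01)}

rank_jlh = sorted(terms, key=lambda t: jlh(*terms[t]), reverse=True)
rank_mod = sorted(terms, key=lambda t: jlh_mod(*terms[t]), reverse=True)

print(rank_jlh)
print(rank_mod)
```

<p>The clear winner, term c with its very small background frequency, stays on top under both scores. Terms a and b have close scores and trade places, because the simplified score is larger than the full one by the foreground frequency and that favours a.</p>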
<p>Looking at the level sets for the JLH score there is a quadratic relationship between the <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />. Solving for a fixed level <img src='http://s0.wp.com/latex.php?latex=k&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='k' title='k' class='latex' /> we get:</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++JLH+%3D+%26+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+-+p_%7Bfore%7D+%3D+k+%5C%5C+++%26+p_%7Bfore%7D%5E2+-+p_%7Bback%7D%5Ccdot+p_%7Bfore%7D+-+k%5Ccdot+p_%7Bback%7D++%3D+0+%5C%5C+++%26+p_%7Bfore%7D+%3D+%5Cfrac%7Bp_%7Bback%7D%7D%7B2%7D+%5Cpm+%5Cfrac%7B%5Csqrt%7Bp_%7Bback%7D%5E2+%2B+4+%5Ccdot+k+%5Ccdot+p_%7Bback%7D%7D%7D%7B2%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\   &amp; p_{fore}^2 - p_{back}\cdot p_{fore} - k\cdot p_{back}  = 0 \\   &amp; p_{fore} = \frac{p_{back}}{2} \pm \frac{\sqrt{p_{back}^2 + 4 \cdot k \cdot p_{back}}}{2}  \end{aligned}  ' title='  \begin{aligned}  JLH = &amp; \frac{p_{fore}^2}{p_{back}} - p_{fore} = k \\   &amp; p_{fore}^2 - p_{back}\cdot p_{fore} - k\cdot p_{back}  = 0 \\   &amp; p_{fore} = \frac{p_{back}}{2} \pm \frac{\sqrt{p_{back}^2 + 4 \cdot k \cdot p_{back}}}{2}  \end{aligned}  ' class='latex' />
<p>Here the negative root lies outside the area where the function is defined.<br />
The relationship is far easier to see in the simplified formula.</p>
<img src='http://s0.wp.com/latex.php?latex=++%5Cbegin%7Baligned%7D++JLH_%7Bmod%7D+%3D+%26+%5Cfrac%7Bp_%7Bfore%7D%5E2%7D%7Bp_%7Bback%7D%7D+%3D+k+%5C%5C+++%26+p_%7Bfore%7D+%3D+%5Csqrt%7Bk+%5Ccdot+p_%7Bback%7D%7D++%5Cend%7Baligned%7D++&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='  \begin{aligned}  JLH_{mod} = &amp; \frac{p_{fore}^2}{p_{back}} = k \\   &amp; p_{fore} = \sqrt{k \cdot p_{back}}  \end{aligned}  ' title='  \begin{aligned}  JLH_{mod} = &amp; \frac{p_{fore}^2}{p_{back}} = k \\   &amp; p_{fore} = \sqrt{k \cdot p_{back}}  \end{aligned}  ' class='latex' />
<p>An increase in <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> must be offset by approximately a square root increase in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> to  retain the same score.</p>
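<p>The square root relationship can be tried out with a few numbers. This sketch uses the simplified score and solves for the foreground frequency needed to stay on a fixed level. The function name <code>p_fore_for_level</code>, the level 1.0 and the background frequencies are arbitrary choices for illustration.</p>

```python
import math


def p_fore_for_level(k, p_back):
    """Foreground frequency that gives the simplified score k for a given p_back."""
    return math.sqrt(k * p_back)


k = 1.0
for p_back in (0.01, 0.04, 0.16):
    p_fore = p_fore_for_level(k, p_back)
    # Plug the result back in to confirm we stay on the same level.
    print(p_back, p_fore, p_fore ** 2 / p_back)
```

<p>Each time the background frequency is multiplied by four, the foreground frequency only needs to double to keep the same score.</p>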
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-contour.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-contour-300x209.png" alt="JLH-contour" width="300" height="209" class="alignnone size-medium wp-image-3791"></a></p>
<p>As we see the score increases sharply as <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> increases in a quadratic manner against <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' />. As <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> becomes small compared to <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> the growth goes from linear in <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> to squared.</p>
<p>Finally a 3D plot of the score function.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-3d.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/JLH-3d-300x203.png" alt="JLH-3d" width="300" height="203" class="alignnone size-medium wp-image-3790"></a></p>
<p>So what can we take away from all this? I think the main practical consideration is the squared relationship between <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> which means once there is a significant difference between the two the <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> will dominate the score ranking. The <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> factor primarily makes the score sensitive when this factor is small and for reasonably similar <img src='http://s0.wp.com/latex.php?latex=p_%7Bback%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{back}' title='p_{back}' class='latex' /> the <img src='http://s0.wp.com/latex.php?latex=p_%7Bfore%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='p_{fore}' title='p_{fore}' class='latex' /> decides the ranking. There are some obvious consequences from this which would be interesting to explore in real data. First, you would like to have a large background document set if you want more fine-grained sensitivity to background frequency. Second, foreground frequencies can dominate the score to such an extent that peculiarities of the implementation may show up in the significant terms ranking, which we will look at in more detail as we try to apply the significant terms aggregation to single documents.</p>
<p>The results and visualizations in this blog post are also available as an <a href="https://github.com/andrely/ipython-notebooks/blob/master/JLH%20score%20characteristics.ipynb" title="JLH score characteristics">iPython notebook</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Impressions from Berlin Buzzwords 2015</title>
		<link>http://blog.comperiosearch.com/blog/2015/06/08/impressions-from-berlin-buzzwords-2015/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/06/08/impressions-from-berlin-buzzwords-2015/#comments</comments>
		<pubDate>Mon, 08 Jun 2015 13:34:53 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Apache Flink]]></category>
		<category><![CDATA[bbuzz]]></category>
		<category><![CDATA[berlin buzzwords]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Kafka]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3720</guid>
		<description><![CDATA[May 31 &#8211; June 3 2015 Stream processing, Internet of things, Real time analytics, Big data, Recommendations, Machine learning. Berlin Buzzwords undoubtedly lives up to its name by presenting the frontlines of data technology trends. The conference is focused on three core concepts &#8211; search, data and scale, bringing together a diverse range of people [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/andre-bbuzz-beyond-significant-terms.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/andre-bbuzz-beyond-significant-terms-300x194.png" alt="andre-bbuzz-beyond-significant-terms" width="300" height="194" class="alignright size-medium wp-image-3741" /></a>May 31 &#8211; June 3 2015</p>
<p>Stream processing, Internet of things, Real time analytics, Big data, Recommendations, Machine learning. <a href="http://berlinbuzzwords.de/">Berlin Buzzwords</a> undoubtedly lives up to its name by presenting the frontlines of data technology trends.<br />
<span id="more-3720"></span><br />
The conference is focused on three core concepts &#8211; search, data and scale, bringing together a diverse range of people and with presentations touching the perimeter of the buzzword range.<br />
Berlin Buzzwords kicked off on Sunday evening with a Barcamp, Monday and Tuesday contained full day conferences, while Wednesday was filled with hackathons and workshops.</p>
<h3>Comperio</h3>
<p>Comperio was one of the many companies sponsoring the conference, and came to Berlin bringing two speakers. André Lynum talked about “Beyond Significant terms” &#8211; a deep dive into how to utilize Elasticsearch’s built-in indexes and APIs for improved lexical analysis, topic management and trend information. André’s talk went far beyond what the well-known Elasticsearch significant terms aggregation provides. Christoffer Vig captured a spot on the informal Open Stage, giving a funny and off-kilter presentation and demo of the analytics and visualization capabilities of Kibana 4 based on a beer product catalogue.</p>
<h3>The talks</h3>
<p>Many people attended the comparison of Solr and Elasticsearch Performance &#038; Scalability with Radu Gheorghe &#038; Rafał Kuć from Sematext. This was a fast-paced run-through of how they were able to create tests reproducing the same conditions on both search engines. Elasticsearch outperformed Solr on text search using Wikipedia data, while, surprisingly, Solr outperformed Elasticsearch on aggregations. Solr has recently started catching up with Elasticsearch on providing nested aggregations and perhaps the improved performance comes as a result of a slimmed down implementation? It will be very interesting to follow the developments of both platforms into the future, and as consumers of the products we see competition as a good thing driving innovation and performance.</p>
<p>Two other interesting technical talks were Adrien Grand’s explanation of some of the algorithms behind Elasticsearch’s aggregations and Ted Dunning’s presentation of the t-digest algorithm. Both were a window into how approximations can yield fast algorithms for complex statistics with provable bounds, and both speakers managed to keep the material approachable to the casual listener.</p>
<h3>SQL?</h3>
<p>Another theme threatening to return from the basement was how to properly support SQL-style joins in search engines. Real life use cases sometimes demand objects with relations. The stock answer from the NoSQL world is to denormalize your data before inserting it, but Lucene/Elasticsearch/Solr did get limited Join support a while ago. Taking this further, Mikhail Khludnev showed how the new Global Ordinal Join aims to provide a Join with improved performance.</p>
<h3>Talking the talk</h3>
<p>As search consultants one of our main challenges at Comperio is communicating about technical topics with customers who need to connect technical topics to their own competence and background. Ellen Friedman from MapR explained how such communication can be beneficial to almost any team or team member and shared some experiences and ideas regarding how you can try this at home. At its core it boils down to understanding and describing your technical work across several layers and showing respect for the perspective and background of your conversation partner.<br />
She also shared a very funny parrot joke. Not going to reveal that one here, watch the video if you’d like a good laugh.</p>
<h3>Hackathon</h3>
<p>Comperio also attended the Apache Flink workshop hosted at Google’s offices in Berlin by the talented developers at data Artisans. Apache Flink is in some ways similar to Apache Spark and other recent distributed computing frameworks, and is an alternative to Hadoop&#8217;s MapReduce component. It represents a novel approach to data processing, modelling all data as streams and exposing both batch and stream APIs. Apache Flink has a built-in optimizer that optimizes memory, network traffic and processing power. This leaves the developer to implement core functionality in Java, Scala or Python.</p>
<h3>The buzz</h3>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/06/berlinbuzzwordsLogo.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/06/berlinbuzzwordsLogo-300x176.png" alt="berlinbuzzwordsLogo" width="300" height="176" class="alignright size-small wp-image-3726" /></a><br />
Berlin Buzzwords is a great opportunity to surf the crest of the big data wave with the most interesting people in the field. The city of Berlin, with its sense of being on the edge of new developments, provides the perfect backdrop for a conference on the latest “Buzzwords”. Comperio will certainly be back next year.</p>
<p>Videos from most talks are available at <a href="https://www.youtube.com/playlist?list=PLq-odUc2x7i-_qWWixXHZ6w-MxyLxEC7s">youtube.com</a></p>
<p><b>Beyond significant terms</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/yYFFlyHPGlg?feature=oembed" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></p>
<p><b>Algorithms and data-structures that power Lucene and Elasticsearch</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/eQ-rXP-D80U?feature=oembed" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe></p>
<p><b>Practical t-digest Applications</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/CR4-aVvjE6A?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p><b>Talk the Talk: How to Communicate with the Non-Coder</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/Je-X850t_L8?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p><b>Side by Side with Elasticsearch &#038; Solr part 2</b></p>
<p><iframe width="500" height="281" src="https://www.youtube.com/embed/01mXpZ0F-_o?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/06/08/impressions-from-berlin-buzzwords-2015/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Analyzing web server logs with Elasticsearch in the cloud</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/#comments</comments>
		<pubDate>Tue, 26 May 2015 21:12:34 +0000</pubDate>
		<dc:creator><![CDATA[Christoffer Vig]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[found by elastic]]></category>
		<category><![CDATA[Kibana]]></category>
		<category><![CDATA[logstash]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3702</guid>
		<description><![CDATA[Using Logstash and Kibana on Found by Elastic, Part 1 This is part one of a two post blog series, aiming to demonstrate how to feed logs from IIS into Elasticsearch and Kibana via Logstash, using the hosted services provided by Found by Elastic. This post will deal with setting up the basic functionality and [...]]]></description>
				<content:encoded><![CDATA[<h2>Using Logstash and Kibana on Found by Elastic, Part 1</h2>
<p>This is part one of a two post blog series, aiming to demonstrate how to feed logs from IIS into Elasticsearch and Kibana via Logstash, using the hosted services provided by Found by Elastic. This post will deal with setting up the basic functionality and securing connections. Part 2 will show how to configure Logstash to read from IIS log files, and how to use Kibana 4 to visualize web traffic. Originally published on the <a href="https://www.found.no/foundation/analyzing-weblogs-with-elasticsearch/">Elastic Blog</a><br />
<span id="more-3702"></span></p>
<h4>Getting the Bits</h4>
<p>For this demo I will be running Logstash and Kibana from my Windows laptop.<br />
If you want to follow along, download and extract Logstash 1.5.RC4 or later, and Kibana 4.0.2 or later from <a href="https://www.elastic.co/downloads">https://www.elastic.co/downloads</a>.</p>
<h4>Creating an Elasticsearch Cluster</h4>
<p>Creating a new trial cluster in Found is just a matter of logging in and pressing a button. It takes a few seconds until the cluster is ready, and a screen with some basic information on how to connect pops up. We need the address for the HTTPS endpoint, so copy that out.</p>
<h4>Configuring Logstash</h4>
<p>Now, with the brand new SSL connection option in Logstash, connecting to Found is as simple as this Logstash configuration</p><pre class="crayon-plain-tag">input { stdin{} }

output {
  elasticsearch {
    protocol =&gt; http
    host =&gt; REPLACE_WITH_FOUND_CLUSTER_HOSTNAME
    port =&gt; "9243" # Check the port also
    ssl =&gt; true
  }

  stdout { codec =&gt; rubydebug }
}</pre><p>&nbsp;</p>
<p>Save the file as found.conf</p>
<p>Start up Logstash using</p><pre class="crayon-plain-tag">bin\logstash.bat agent --verbose -f found.conf</pre><p>You should see a message similar to</p><pre class="crayon-plain-tag">Create client to elasticsearch server on `https://....foundcluster.com:9243`: {:level=&gt;:info}</pre><p>Once you see &#8220;Logstash startup completed&#8221; type in your favorite test term on the terminal. Mine is &#8220;fisk&#8221; so I type that.<br />
You should see output on your screen showing what Logstash intends to pass on to elasticsearch.</p>
<p>We want to make sure this actually hits the cloud, so open a browser window and paste the HTTPS link from before, append <code>/_search</code> to the URL and hit enter.<br />
You should now see the search results from your newly created Elasticsearch cluster, containing the favorite term you just typed in. We have a functioning connection from Logstash on our machine to Elasticsearch in the cloud! Congratulations!</p>
<h4>Configuring Kibana 4</h4>
<p>Kibana 4 comes with a built-in webserver. The configuration is done in a kibana.yml file in the config directory. Connecting to Elasticsearch in the cloud comes down to inserting the address of the Elasticsearch instance.</p><pre class="crayon-plain-tag"># The Elasticsearch instance to use for all your queries.
elasticsearch_url: "https://....foundcluster.com:9243"</pre><p>Of course, we need to verify that this really works, so we open up Kibana on <a href="http://localhost:5601">http://localhost:5601</a>, select the Logstash index template, with the @timestamp data field as suggested, and open up the discover panel. Now, if less than 15 minutes have passed since you inserted your favorite test term in Logstash (previous step), you should see it already. Otherwise, change the date range by clicking on the selector in the top right corner.</p>
<p><img class="alignleft" src="https://raw.githubusercontent.com/babadofar/MyOwnRepo/master/images/kibanatest.png" alt="Kibana test" width="1090"  /></p>
<h4>Locking it down</h4>
<p>Found by Elastic has worked hard to make the previous steps easy. We created an Elasticsearch cluster, fed data into it and displayed in Kibana in less than 5 minutes. We must have forgotten something!? And yes, of course! Something about security. We made sure to use secure connections with SSL, and the address generated for our cluster contains a 32 character long, randomly generated list of characters, which is pretty hard to guess. Should, however, the address slip out of our hands, hackers could easily delete our entire cluster. And we don’t want that to happen. So let’s see how we can make everything work when we add some basic security measures.</p>
<h4>Access Control Lists</h4>
<p>Found by Elastic has support for access control lists, where you can set up lists of usernames and passwords, with lists of rules that deny/allow access to various paths within Elasticsearch. This makes it easy to create a &#8220;read only&#8221; user, for instance, by creating a user with a rule that only allows access to the <code>/_search</code> path. Found by Elastic has a sample configuration with the users searchonly and readwrite. We will use these as a starting point, but first we need to figure out what Kibana needs.</p>
<h4>Kibana 4 Security</h4>
<p>Kibana 4 stores its configuration in a special index, by default named &#8220;.kibana&#8221;. The Kibana webserver needs write access to this index. In addition, all Kibana users need write access to this index, for storing dashboards, visualizations and searches, and read access to all the indices that they will query. More details about the access demands of Kibana 4 can be found on the <a href="http://www.elastic.co/guide/en/shield/current/_shield_with_kibana_4.html">elastic blog</a>.</p>
<p>For this demo, we will simply copy the “readwrite” user from the sample twice, naming one kibanaserver, the other kibanauser.</p><pre class="crayon-plain-tag"># Setting Access Control in Found:
# Allow everything for the readwrite-user, kibanauser and kibanaserver
- paths: ['.*']
  conditions:
    - basic_auth:
        users:
          - readwrite
          - kibanauser
          - kibanaserver
    - ssl:
        require: true
  action: allow</pre><p>Press save and the changes take effect immediately. Try to reload Kibana at <a href="http://localhost:5601">http://localhost:5601</a>; you should be denied access.</p>
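<p>To make the effect of such a rule concrete, here is a small Python sketch of how a rule list like the one above could be evaluated. It is an illustration only, not Found's actual implementation: the data layout, the <code>decide</code> function, the full-path regular expression match and the default of denying everything that no rule allows are all assumptions made for this example.</p><pre class="crayon-plain-tag">import re

# Hypothetical in-memory copy of the rule above; not Found's actual implementation.
RULES = [
    {
        "paths": [".*"],
        "users": {"readwrite", "kibanauser", "kibanaserver"},
        "require_ssl": True,
        "action": "allow",
    },
]
DEFAULT_ACTION = "deny"  # assumed default when no rule matches


def decide(path, user, uses_ssl, rules=RULES):
    """Return the action of the first rule whose path and conditions all match."""
    for rule in rules:
        path_matches = any(re.fullmatch(p, path) for p in rule["paths"])
        user_matches = user in rule["users"]
        ssl_matches = uses_ssl or not rule["require_ssl"]
        if path_matches and user_matches and ssl_matches:
            return rule["action"]
    return DEFAULT_ACTION</pre><p>With these assumptions, <code>decide("/_search", "kibanauser", True)</code> is allowed, while an unknown user, or a known user on a plain HTTP connection, falls through to the default.</p>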
<p>Open up the kibana.yml file from before and modify it:</p><pre class="crayon-plain-tag"># If your Elasticsearch is protected with basic auth, these are the user credentials
# used by the Kibana server to perform maintenance on the kibana_index at startup. Your Kibana
# users will still need to authenticate with Elasticsearch (which is proxied through
# the Kibana server)
kibana_elasticsearch_username: kibanaserver
kibana_elasticsearch_password: "KIBANASERVER_USER_PASSWORD"</pre><p>Restart Kibana for the settings to take effect.<br />
Now when Kibana starts up, you will be presented with a login box for HTTP authentication.<br />
Type in kibanauser as the username, along with the password you chose for that user. You should now again be presented with the Discover screen, showing the previously entered favorite test term. Again, you may have to expand the time range to see your entry.</p>
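<p>The login box uses plain HTTP Basic authentication, so the credentials can also be checked from a script. The sketch below, using only the Python 3 standard library, shows how the <code>Authorization</code> header is built; the helper names are hypothetical, and the request is constructed but not sent.</p><pre class="crayon-plain-tag">import base64
from urllib.request import Request


def basic_auth_header(username, password):
    """Return the value of an HTTP Basic Authorization header."""
    token = base64.b64encode("{}:{}".format(username, password).encode("utf-8"))
    return "Basic " + token.decode("ascii")


def authenticated_request(url, username, password):
    """Build (but do not send) a request that carries the credentials."""
    request = Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request</pre><p>Passing the returned request to <code>urllib.request.urlopen</code> would send it. Since Basic authentication only encodes the password and does not encrypt it, this is one more reason to require SSL in the access control rule.</p>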
<h4>Logstash Security</h4>
<p>Logstash will also need to supply credentials when connecting to Found by Elastic. We reuse the permissions of the readwrite user once again, this time naming the user &#8220;logstash&#8221;.<br />
It is simply a matter of supplying the username and password in the configuration file.</p><pre class="crayon-plain-tag">output {
  elasticsearch {
    # …. (the settings from the earlier step stay as they are)
    user =&gt; "logstash"
    password =&gt; "LOGSTASH_USER_PASSWORD"
  }
}</pre><p></p>
<h4>Wrapping it up</h4>
<p>This has been a short dive into Logstash and Kibana with Found by Elastic. The recent changes made to support the Shield plugin for Elasticsearch, Logstash and Kibana make it very easy to use the secure features of Found by Elastic. In the next post we will look into feeding logs from IIS into Elasticsearch via Logstash, and visualizing the most used query terms in Kibana.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/26/analyzing-weblogs-with-elasticsearch-in-the-cloud/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enhancing Web Analytics with Search Analytics</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/20/enhancing-web-analytics-search-analytics/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/20/enhancing-web-analytics-search-analytics/#comments</comments>
		<pubDate>Wed, 20 May 2015 12:43:27 +0000</pubDate>
		<dc:creator><![CDATA[Mridu Agarwal]]></dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Google Analytics]]></category>
		<category><![CDATA[Search Analytics]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3694</guid>
		<description><![CDATA[Web Analytics is the process of measuring and analyzing web data to assess and improve the effectiveness of a website. Tracking and improving search (search analytics) is an important part of web analytics which is often forgotten by many site owners. Website search analytics should not be underestimated as it can provide valuable insights into what [...]]]></description>
				<content:encoded><![CDATA[<p>Web Analytics is the process of measuring and analyzing web data to assess and improve the effectiveness of a website. Tracking and improving search (search analytics) is an important part of web analytics which is often forgotten by many site owners. Website search analytics should not be underestimated, as it can provide valuable insights into what users are looking for or what they are not able to find on the site. Recently, I read somewhere about an organization which increased their conversion rates by just increasing the size of their search box and working on the searches with zero results. Therefore, measuring and analyzing search could be a very important aspect of improving a website&#8217;s effectiveness.</p>
<p>In this post, I will show you 5 quick steps to get started with Search Analytics:</p>
<p><strong>1. Get Search Logs: </strong>There are various tools you can use to track and analyze your search. You may choose any one of them to get started, depending on your business domain and organizational policies. I am using Google Analytics for this post simply because it is both very powerful and very easy to start with. You can just create a Google account and get started for &#8220;free&#8221;. No infrastructure setup is required. Place the JavaScript code in your website pages and you are ready to start measuring.</p>
<p>Please note that there are many other tools in the market for more specific purposes and with higher complexities. It really depends on what your needs are. Here, we will continue with Google Analytics.</p>
<p><strong>2. Understand your Site Search Usage</strong>: Many people underestimate the use of search on their website. So, the first step is to measure how many people are using search. Even if 5-10% of your visitors are using search, it is not a bad figure (depending on your business domain and site setup). This could very well be the most used navigation on your website.</p>
<p><img class="alignnone wp-image-3697 " src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/1-1024x282.png" alt="Site Search Usage" width="544" height="150" /></p>
<p><strong>3. Analyze your Search Terms:</strong> Get a list of searched terms to start with. Most often, you will see that there is a pattern: a few terms are searched far more often than the others. It is important to analyze the top 10 searched terms individually. You might want to group the other terms (the long tail) and analyze them separately.</p>
<p>&nbsp;</p>
<p><img class="alignnone size-full wp-image-3698" src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/2.png" alt="Search Terms" width="188" height="165" /></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/3.png"><img class="alignnone size-full wp-image-3699" src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/3.png" alt="Searched Terms List" width="255" height="219" /></a></p>
<p style="text-align: left">Try conducting searches for these searched terms yourself and see if you are satisfied with the results. Did you get what you were expecting as the number one result? If not, you might want to make changes to your site to improve your search results.</p>
<p style="text-align: left">Some additional things to ponder:</p>
<p style="text-align: left">Looking at the searched terms, do you see what you were expecting? Are there unknown terms? For example, if you have a product support site, you might expect users to search more for some newly launched products. Do you see the product names or numbers in your searched terms report? Are people looking for product comparisons?</p>
<p style="text-align: left">If you have some unexpected terms in your top 10 searched terms, then you might want to consider adding additional content related to those terms.</p>
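<p>If you export the searched terms, the split into top terms and long tail takes only a few lines of code. The Python sketch below works on a made-up list of logged searches; the sample data and the function name are hypothetical, and lower-casing and trimming the terms is an assumption about how you might want to group them.</p><pre class="crayon-plain-tag">from collections import Counter

# Hypothetical sample of logged search terms, one entry per search.
searches = ["pricing", "login", "pricing", "Pricing ", "api docs", "login", "pricing", "careers"]


def split_top_and_long_tail(terms, top_n=10):
    """Count normalised terms and split them into the top N and the long tail."""
    counts = Counter(term.strip().lower() for term in terms)
    ranked = counts.most_common()
    return ranked[:top_n], ranked[top_n:]


top, long_tail = split_top_and_long_tail(searches, top_n=2)
print(top)
print(long_tail)</pre><p>The top list is what you would review term by term, while the long tail can be grouped and analyzed as a whole.</p>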
<p style="text-align: left"><strong>4. Evaluate User Experience: </strong>Are users happy or frustrated by the time they leave the site? Can they find what they are looking for, or are they leaving immediately after performing a search? This is the toughest part, because you are not sitting with the user and you can only learn as much as the reports tell you. But the good news is that there are some metrics to watch out for which can provide valuable insights related to user experience. A couple of these metrics are shown in the picture below.</p>
<p> <img class="alignnone  wp-image-3700" src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/4-300x139.png" alt="Search Metrics" width="339" height="157" /></p>
<p><strong>Result Pageviews/Search</strong> tells you how many search result pages users viewed, on average, per search for the term. If the number is too high, it is taking users too long to find what they are looking for, and they will not be happy about it.</p>
<p><strong>% Search Exits</strong> is equivalent to bounce rate in web analytics. This number tells you about the percentage of people who left immediately after performing a search for that term without clicking on any of the search results. We would want this number to be as low as possible.</p>
<p>It is also important to evaluate <strong>search terms that produced 0 results</strong>. Users will not be happy to find zero results for their searches. There is no out-of-the-box metric in Google Analytics to track this, but there are various ways to get around it using Events or Custom Variables.</p>
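<p>To see how these numbers come about, here is a small Python sketch that computes them from a made-up search log. The record layout is hypothetical and is not a Google Analytics export format; it only illustrates the arithmetic behind Result Pageviews/Search, % Search Exits and the zero-result check.</p><pre class="crayon-plain-tag">from collections import defaultdict

# Hypothetical search log: one record per search, not a Google Analytics export format.
SEARCH_LOG = [
    {"term": "vacation list", "result_pageviews": 1, "results": 0, "exited": True},
    {"term": "vacation list", "result_pageviews": 1, "results": 0, "exited": True},
    {"term": "holiday list", "result_pageviews": 1, "results": 12, "exited": False},
    {"term": "holiday list", "result_pageviews": 3, "results": 12, "exited": True},
]


def summarize(log):
    """Return per-term Result Pageviews/Search, % Search Exits and a zero-result flag."""
    grouped = defaultdict(list)
    for record in log:
        grouped[record["term"]].append(record)
    summary = {}
    for term, records in grouped.items():
        searches = len(records)
        summary[term] = {
            "searches": searches,
            "result_pageviews_per_search": sum(r["result_pageviews"] for r in records) / searches,
            "pct_search_exits": 100.0 * sum(r["exited"] for r in records) / searches,
            "zero_results": all(r["results"] == 0 for r in records),
        }
    return summary


summary = summarize(SEARCH_LOG)</pre><p>In this sample, &#8220;vacation list&#8221; never returns a hit and every search for it ends in an exit, which is exactly the kind of term worth looking at first.</p>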
<p><strong>5. Improve Search Experience: </strong>What would you do if you found that there is a <strong>35% Search Exit</strong> for your top keyword or <strong>20</strong> <strong>Result Pageviews/Search? </strong>In some cases this might be because the term is not spelled correctly, or simply because the user is using a term which is not an exact match with your content. For example, on an organization&#8217;s intranet, an employee searches for &#8220;vacation list&#8221; but does not get any hits because you have &#8220;holiday list&#8221; in your content. Here, you might consider adding synonyms or best bets for these frequently searched terms. Adding spelling correction like &#8220;Did you mean&#8221; or providing &#8220;related searches&#8221; for results could further help improve user experience and keep the visitors engaged on your website. If there are zero results for a search term, you might want to consider adding additional content as well.</p>
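<p>Most search engines have their own synonym support, but the idea can be sketched in a few lines of Python. The synonym table and the <code>expand_query</code> function below are hypothetical and only illustrate how a query using the user&#8217;s wording can be expanded to the wording used in the content.</p><pre class="crayon-plain-tag"># Hypothetical synonym table mapping what users type to the wording used in the content.
SYNONYMS = {"vacation": ["holiday"]}


def expand_query(query, synonyms=SYNONYMS):
    """Return the original query plus variants with synonyms substituted word by word."""
    words = query.lower().split()
    variants = [" ".join(words)]
    for index, word in enumerate(words):
        for alternative in synonyms.get(word, []):
            variants.append(" ".join(words[:index] + [alternative] + words[index + 1:]))
    return variants


print(expand_query("Vacation list"))</pre><p>Searching for all returned variants would let the employee in the example find the &#8220;holiday list&#8221; page.</p>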
<p>To conclude, I would say that there are a lot of ways in which you can improve your site search analytics, but the important part is to get started. It is not as tough as it may sound, and it is worth the effort considering the amount of valuable information you get and the direct insight into the user&#8217;s mind.</p>
<p>&nbsp;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/20/enhancing-web-analytics-search-analytics/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Search: better user experience with one line of JavaScript</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/18/search-better-user-experience-with-one-line-of-javascript/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/18/search-better-user-experience-with-one-line-of-javascript/#comments</comments>
		<pubDate>Mon, 18 May 2015 13:59:01 +0000</pubDate>
		<dc:creator><![CDATA[Espen Klem]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[User Experience]]></category>
		<category><![CDATA[enterprise search]]></category>
		<category><![CDATA[focus]]></category>
		<category><![CDATA[Javascript]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[search box]]></category>
		<category><![CDATA[user experience]]></category>
		<category><![CDATA[ux]]></category>
		<category><![CDATA[website search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3680</guid>
		<description><![CDATA[What&#8217;s the cheapest trick you can do to get a better user experience on your search solution, and make your users do better search queries? Add a small line of JavaScript in your template&#8217;s document ready function: $("#MySearchBox").focus(); This will do two things for the user: It&#8217;ll be easier to see the search box. [...]]]></description>
				<content:encoded><![CDATA[<p>What&#8217;s the cheapest trick you can do to get a better user experience on your search solution, and make your users do better search queries?</p>
<p><img class="alignnone wp-image-3683 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/search-box.png" alt="Illustration: A standard search box" width="598" height="66" /></p>
<p>Add a small line of JavaScript in your template&#8217;s document ready function:</p><pre class="crayon-plain-tag">$("#MySearchBox").focus();</pre><p>This will do two things for the user:</p>
<ol>
<li>It&#8217;ll be easier to see the search box.</li>
<li>The user can start typing without having to click inside the search box.</li>
</ol>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/anim-just-cursor.gif"><img class="alignnone wp-image-3682 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/anim-just-cursor.gif" alt="Illustration: Better user experience by setting focus on the search box" width="598" height="66" /></a></p>
<p>The next issue is that most intranets and websites are more than just a search solution. Maybe you don&#8217;t want that much attention on the search box on your homepage. The solution is then to do this on your search result page.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/anim-one-word.gif"><img class="alignnone wp-image-3685 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/anim-one-word.gif" alt="Illustration: Better user experience by setting focus on the search box" width="598" height="66" /></a></p>
<p>This will make it easier for your users to enhance their search query when they&#8217;re not happy with the search result at hand.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/anim-more-words.gif"><img class="alignnone wp-image-3684 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/anim-more-words.gif" alt="Illustration: Better user experience by setting focus on the search box" width="598" height="66" /></a></p>
<p>Do you have any examples of other quick fixes that could make an even better user experience for your search solution?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/18/search-better-user-experience-with-one-line-of-javascript/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Using iPython notebooks and Pycharm together</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/11/using-ipython-notebooks-and-pycharm-together/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/11/using-ipython-notebooks-and-pycharm-together/#comments</comments>
		<pubDate>Mon, 11 May 2015 11:12:24 +0000</pubDate>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[autoreload]]></category>
		<category><![CDATA[ipython notebook]]></category>
		<category><![CDATA[pycharm]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3626</guid>
		<description><![CDATA[IPython notebooks have become an indispensable tool for many Python developers. They are a reasonably good environment for interactive computing, can contain inline data visualisations and can be hosted remotely for sharing results or working together with other developers. In many academic environments and increasingly in industry IPython notebooks are used for data visualisation work [...]]]></description>
				<content:encoded><![CDATA[<p><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-7.png" alt="ipython-blog-7" width="282" height="70" class="aligncenter size-large wp-image-3628" /><br />
<a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-6.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-6.png" alt="ipython-blog-6" width="600" class="aligncenter size-full wp-image-3627" /></a></p>
<p />
IPython notebooks have become an indispensable tool for many Python developers. They are a reasonably good environment for interactive computing, can contain inline data visualisations and can be hosted remotely for sharing results or working together with other developers. In many academic environments and increasingly in industry IPython notebooks are used for data visualisation work and exploratory programming, depending on the IPython interactive environment for fast prototyping of ideas.</p>
<p />
As nice an environment as we have in IPython, I often wish for the features of a full-fledged IDE. Here at Comperio we use PyCharm a lot, which has excellent code editing, semantic completion, a graphical debugger and efficient code navigation capabilities. In this blog post I’m going to show how you can simultaneously work on code in both the IDE and an IPython notebook or interactive shell while keeping the running notebook and IDE project in sync.</p>
<p />
Hey, PyCharm already has IPython notebook integration. What about that? Personally I find that the IPython notebook integration in the latest PyCharm (version 4.0.6) still isn’t adequate for serious work. You get the completion and code navigation from PyCharm, but editing and navigation are reduced to half a dozen buttons. Furthermore, some functionality, such as debugging, appears to be plainly non-functional. Regardless, there are other very nice IDEs for Python such as Wing or Eclipse, and the approach here will work equally well with them.</p>
<p />
This cunning recipe consists of two spicy ingredients. Both are neat tricks on their own, but together they form a smooth workflow bridging exploratory programming and more structured software engineering. We are going to:</p>
<p />
<ul>
<li>Install our code as an editable Pip package.</li>
<li>Use the IPython autoreload extension to dynamically reload code.</li>
</ul>
<p />
So let’s get cooking!</p>
<p />
<h2>Editable Pip packages</h2>
<p />
We are going to organise our code in a Python package and install it with Pip using the <code>-e</code> or <code>--editable</code> option. This installs the package as a link pointing to our project directory, so that we are always importing the code that we are editing. We could also accomplish this with some hacking on <code>sys.path</code> or <code>PYTHONPATH</code>, but having our code available as a package is a lot more seamless. It makes sense to use <code>virtualenv</code> (or EPD/Anaconda environments) to isolate your system Python from your development packages.</p>
<p />
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-1.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-1-300x190.png" alt="ipython-blog-1" width="300" height="190" class="aligncenter size-medium wp-image-3620" /></a></p>
<p />
First we create a Python project in PyCharm and add a source folder with a <code>setup.py</code> defining a basic Python package.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-2.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-2-300x194.png" alt="ipython-blog-2" width="300" height="194" class="aligncenter size-medium wp-image-3621" /></a></p>
<p />
Then we create a stub file with the following code in our Python module, and a folder for our notebooks.</p>
<p />
<p></p><pre class="crayon-plain-tag">def get_page():
    # Stub: a placeholder until we know how to fetch the page.
    print("Don't know how to do this yet.")</pre><p> </p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-3.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-3-300x158.png" alt="ipython-blog-3" width="300" height="158" class="aligncenter size-medium wp-image-3622" /></a></p>
<p />
And we activate our <code>virtualenv</code>/Conda environment and run <code>pip install -e .</code> in the folder containing <code>setup.py</code>.</p>
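<p>For readers who cannot see the screenshots, the project layout described above can be sketched with a short Python script. Everything named here is a placeholder chosen for this illustration (the package name <code>crawler_demo</code>, the version number, the <code>notebooks</code> folder name); the actual project in the screenshots may be organised differently.</p><pre class="crayon-plain-tag">import os
import tempfile

# All names below (crawler_demo, the version number) are hypothetical placeholders.
SETUP_PY = '''from setuptools import setup, find_packages

setup(
    name="crawler_demo",
    version="0.1.0",
    packages=find_packages(),
)
'''

STUB = '''def get_page():
    print("Don't know how to do this yet.")
'''


def create_skeleton(root):
    """Write a minimal package, its setup.py and a notebooks folder under root."""
    package_dir = os.path.join(root, "crawler_demo")
    os.makedirs(package_dir)
    os.makedirs(os.path.join(root, "notebooks"))
    files = {
        os.path.join(root, "setup.py"): SETUP_PY,
        os.path.join(package_dir, "__init__.py"): STUB,
    }
    for path, content in files.items():
        with open(path, "w") as handle:
            handle.write(content)
    return sorted(os.listdir(root))


if __name__ == "__main__":
    print(create_skeleton(tempfile.mkdtemp()))</pre><p>Running <code>pip install -e .</code> in the generated root folder should then make <code>from crawler_demo import get_page</code> work from any notebook started in the same environment.</p>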
<p />
<strong>Pip and Git: </strong>If you install your package with <code>-e</code> from a Git repository it may think that you want to install from Git even if you&#8217;re giving it a file path. This is usually not what you want when you&#8217;re developing since you would have to commit your code for the package to update itself. An ugly but practical way to avoid this behavior is to move the .git folder out of the way when installing the package.</p>
<p />
<h2>%autoreloading code</h2>
<p />
Now to the important part which is dynamically loading the IDE project into our IPython notebook. Let’s first fire up the notebook.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-4.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-4-300x71.png" alt="ipython-blog-4" width="300" height="71" class="aligncenter size-medium wp-image-3623" /></a></p>
<p />
Start IPython and create a notebook.</p>
<p />
You have probably used <code>reload(module)</code> to update the Python environment at runtime. This hardly ever works for more than five minutes and results in an inscrutable mess of old and new stuff in your modules and classes. There is, however, a bag of neat tricks that takes care of at least the majority of the problems around reloading Python code or modules (see http://pyunit.sourceforge.net/notes/reloading.html), and the IPython developers have collected these into their autoreload extension. Let’s look at it in action.</p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-5.png"><img src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/ipython-blog-5-300x137.png" alt="ipython-blog-5" width="300" height="137" class="aligncenter size-medium wp-image-3624" /></a></p>
<p />
Here we set up the autoreload module and import our stub function in the first two cells. In the third we run our function. We then change the function definition in the IDE and save the file.</p>
<p></p><pre class="crayon-plain-tag">def get_page():
    # Edited in the IDE while the notebook keeps running.
    print("Hey I'm updated.")</pre><p> </p>
<p />
<p>And when we run the function again in cell four the updated code is run.</p>
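<p>Since the notebook cells are only visible in the screenshot, here is a hedged sketch of what is going on. In the notebook, the autoreload extension is enabled with the two magics shown in the comments. The rest of the block is a plain Python 3 illustration of the reload step that autoreload performs for you; it uses a throwaway module named <code>demo_stub</code> (a hypothetical name) whose function returns its message instead of printing it, so the effect is easy to check.</p><pre class="crayon-plain-tag">import importlib
import os
import sys
import tempfile

# In a notebook, the two magics below do this automatically before each cell runs:
#   %load_ext autoreload
#   %autoreload 2
# The rest is a plain-Python illustration using a throwaway module (demo_stub is hypothetical).

sys.dont_write_bytecode = True  # avoid a stale bytecode cache in this quick demo
workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, "demo_stub.py")


def write_module(message):
    """(Re)write the module so that get_page() returns the given message."""
    with open(module_path, "w") as handle:
        handle.write("def get_page():\n    return {!r}\n".format(message))


write_module("Don't know how to do this yet.")
sys.path.insert(0, workdir)
importlib.invalidate_caches()
import demo_stub

before = demo_stub.get_page()
write_module("Hey I'm updated.")   # this is the edit you would make in the IDE
importlib.reload(demo_stub)        # autoreload performs this step for you
after = demo_stub.get_page()
print(before)
print(after)</pre><p>A manual reload like this only refreshes the module object itself, which is why the extension, with its extra bookkeeping for functions and classes that were already imported, is the more practical choice in a notebook.</p>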
<p />
<h2>From notebook to project and back</h2>
<p />
Combining editable Pip packages and the autoreload module, we have a way to seamlessly load our project code in the notebook. When we are ready to move our exploratory programming back to the project, we can move our code over, import any new definitions and refine our implementation while using it in our notebook. In this way we can quickly move from noodling around in the notebook to developing and testing in the IDE, and then move back to the notebook to use our project code in further unstructured meanderings.</p>
<p />
In the next post we will demonstrate this in more detail.</p>
<p />
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/11/using-ipython-notebooks-and-pycharm-together/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
