<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; crawl</title>
	<atom:link href="http://blog.comperiosearch.com/blog/tag/crawl/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Idea: Your life searchable through Norch &#8211; NOde seaRCH, IFTTT and Google Drive</title>
		<link>http://blog.comperiosearch.com/blog/2014/11/26/idea-your-life-searchable-norch-node-search-ifttt-google-drive/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/11/26/idea-your-life-searchable-norch-node-search-ifttt-google-drive/#comments</comments>
		<pubDate>Wed, 26 Nov 2014 14:33:08 +0000</pubDate>
		<dc:creator><![CDATA[Espen Klem]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[User Experience]]></category>
		<category><![CDATA[crawl]]></category>
		<category><![CDATA[Document Processing]]></category>
		<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[Google Drive]]></category>
		<category><![CDATA[IFTTT]]></category>
		<category><![CDATA[Index]]></category>
		<category><![CDATA[Json]]></category>
		<category><![CDATA[Life Index]]></category>
		<category><![CDATA[Lifeindex]]></category>
		<category><![CDATA[node]]></category>
		<category><![CDATA[Node Search]]></category>
		<category><![CDATA[node.js]]></category>
		<category><![CDATA[nodejs]]></category>
		<category><![CDATA[norch]]></category>
		<category><![CDATA[Personal Search Engine]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[search engine]]></category>
		<category><![CDATA[search-index]]></category>
		<category><![CDATA[sharepoint]]></category>
		<category><![CDATA[Small Data]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[technology]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3069</guid>
		<description><![CDATA[First some disclaimers: This has been posted earlier on lab.klemespen.com. Even though some of these ideas are not what you&#8217;d normally implement in a business environment, some of the concepts can obviously be transferred over to businesses trying to provide an efficient workplace for their employees. Norch is developed by Fergus McDowall, an employee of [...]]]></description>
				<content:encoded><![CDATA[<p><strong>First some disclaimers</strong>:</p>
<ul>
<li>This has been posted earlier on <a href="http://lab.klemespen.com/2014/11/25/idea-your-life-searchable-with-norch-node-search-ifttt-and-google-drive-spreadsheets/">lab.klemespen.com</a>.</li>
<li>Even though some of these ideas are not what you&#8217;d normally implement in a business environment, some of the concepts can obviously be transferred over to businesses trying to provide an efficient workplace for their employees.</li>
<li><a href="https://github.com/fergiemcdowall/norch">Norch</a> is developed by <a href="http://blog.comperiosearch.com/blog/author/fmcdowall/">Fergus McDowall</a>, an employee of Comperio.</li>
</ul>
<p>What if you could index your whole life and make this lifeindex available through search? What would that look like, and how could it help you? Refinding information is obviously one of the use cases for this type of search. I&#8217;m guessing there are a lot more, and I&#8217;m curious to figure them out.</p>
<h2>Actions and reactions instead of web pages</h2>
<p>I&#8217;ve had the lifeindex idea for a little while now. Originally the idea was to index everything I browsed. From what I know and where <a href="https://github.com/fergiemcdowall/norch">Norch</a> is, it would take a while before I was anywhere close to achieving that goal. <a href="http://codepen.io/nickmoreton/blog/using-ifttt-and-google-drive-to-create-a-json-api">Then I thought of IFTTT</a>, and saw it as a &#8216;next best thing&#8217;. But then it hit me that now I&#8217;m indexing actions, and that&#8217;s way better than pages. What I&#8217;m missing from most sources now are the reactions to my actions. If I have a question, I also want to crawl and index the answer. If I have a statement, I want to get the critique indexed.<span id="more-3069"></span></p>
<p>IFTTT and similar services (like Zapier) are quite limited in their choice of triggers. I&#8217;m not sure if this is because of choices made by those services or limitations from the sites they crawl/pull information from.</p>
<p>A quick fix for this, and a generally good idea for Search Engines, would be to switch from a preview of your content to the actual content in the form of an embed-view. Here exemplified:</p>
<blockquote class="twitter-tweet" data-width="500"><p lang="en" dir="ltr">Will embed-view of your content replace the preview-pane in modern <a href="https://twitter.com/hashtag/search?src=hash&amp;ref_src=twsrc%5Etfw">#search</a>  <a href="https://twitter.com/hashtag/engine?src=hash&amp;ref_src=twsrc%5Etfw">#engine</a> solutions? Why preview when you can have the real deal?</p>
<p>&mdash; Espen Klem (@eklem) <a href="https://twitter.com/eklem/status/536866049078333440?ref_src=twsrc%5Etfw">November 24, 2014</a></p></blockquote>
<p><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></p>
<h2>Technology: Hello IFTTT, Google SpreadSheet and Norch</h2>
<p>IFTTT is triggered by my actions, and stores some data to a series of spreadsheets on Google Drive. <a href="http://jsonformatter.curiousconcept.com/#https://spreadsheets.google.com/feeds/list/1B-OFzKIMVNk_3xMX_jBToGGyxSKv6FoyFYTHpGEy5O0/od6/public/values?alt=json">These spreadsheets can deliver JSON</a>. After a little document processing these JSON-files can be fed to the <a href="https://github.com/fergiemcdowall/norch#norch-indexer">Norch-indexer</a>.</p>
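<p>A minimal sketch of what that &#8220;little document processing&#8221; could look like, in Python. It assumes the JSON has the list-feed shape where every row is an entry and every cell sits under a <code>gsx$</code>-prefixed column key holding a <code>$t</code> text value. The function name and the sample rows are made up for illustration, and what the Norch-indexer actually expects as input should be checked against its README.</p>

```python
import json


def flatten_list_feed(feed_json):
    """Turn a list-feed style JSON string into a list of flat dictionaries."""
    feed = json.loads(feed_json)
    docs = []
    for entry in feed.get("feed", {}).get("entry", []):
        doc = {}
        for key, value in entry.items():
            # Assumed shape: cells live under "gsx$column" keys,
            # each one holding {"$t": "cell text"}.
            if key.startswith("gsx$"):
                doc[key[len("gsx$"):]] = value.get("$t", "")
        docs.append(doc)
    return docs


# Made-up sample rows, only here to show the transformation.
sample = json.dumps({
    "feed": {
        "entry": [
            {"id": {"$t": "row1"},
             "gsx$date": {"$t": "November 24, 2014"},
             "gsx$text": {"$t": "Hello search"}},
            {"id": {"$t": "row2"},
             "gsx$date": {"$t": "November 25, 2014"},
             "gsx$text": {"$t": "Norch on a Pi"}},
        ]
    }
})

docs = flatten_list_feed(sample)
print(json.dumps(docs, indent=2))
```

<p>Everything that is not a column (ids, links, timestamps added by the feed) is dropped, so each document only contains what was written to the spreadsheet.</p>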
<h2>Why hasn&#8217;t this idea popped up earlier?</h2>
<p>Search engines used to be hardware-guzzling technology. With Norch, the &#8220;NOde seaRCH&#8221; engine, that has changed. Elasticsearch and Solr are easy and small compared to, e.g., SharePoint Search, but they still need a lot of hardware. Norch can run on a Raspberry Pi, and soon it will be able to run in your browser. Maybe data sets closer to <a href="http://en.wikipedia.org/wiki/Small_data">small data</a> are more interesting than <a href="http://en.wikipedia.org/wiki/Big_data">big data</a>?</p>
<p><a href="http://youtu.be/ijLtk5TgvZg"><img src="http://blog.comperiosearch.com/wp-content/uploads/2014/11/Screen-Shot-2014-11-26-at-16.42.27-300x180.png" alt="Video: Norch running on a Raspberry Pi" width="300" height="180" class="alignnone size-medium wp-image-3075" />Norch running on a Raspberry Pi</a></p>
<h2>Why use a search engine?</h2>
<p>It&#8217;s cheap and quick. I&#8217;m not a developer, and I&#8217;ll still be able to glue all these sources together. Search engines are often a good choice when you have multiple sources. IFTTT and Google SpreadSheet make it even easier, normalising the input and delivering it as JSON.</p>
<h2>How far in the process have I come?</h2>
<p><a href="https://testlab3.files.wordpress.com/2014/11/15140752323_1f69685449_o.png"><img class="alignnone size-full wp-image-118" src="https://testlab3.files.wordpress.com/2014/11/15140752323_1f69685449_o.png" alt="Illustration: Setting up sources in IFTTT." width="660" height="469" /></a></p>
<p>So far, I&#8217;ve set up a lot of triggers/sources at IFTTT.com:</p>
<ul>
<li>Instagram: When posting or liking both photos and videos.</li>
<li>Flickr: When posting an image, creating a set or linking a photo.</li>
<li>Google Calendar: When adding something to one of my calendars.</li>
<li>Facebook: When I post a link, am tagged or post a status message.</li>
<li>Twitter: When I tweet, retweet, reply or if somebody mentions me.</li>
<li>YouTube: When I post or like a video.</li>
<li>GitHub: When I create an issue, get assigned to an issue, or an issue that I take part in is closed.</li>
<li>WordPress: When there are new posts or comments on posts.</li>
<li>Android location tracking: When I enter and exit certain areas.</li>
<li>Android phone log: Placed, received and missed calls.</li>
<li>Gmail: Starred emails.</li>
</ul>
<p><a href="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-27-57.png"><img class="alignnone size-full wp-image-127" src="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-27-57.png" alt="Screen Shot 2014-11-24 at 13.27.57" width="660" height="572" /></a></p>
<p><a href="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-31-46.png"><img class="alignnone size-full wp-image-128" src="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-31-46.png" alt="Screen Shot 2014-11-24 at 13.31.46" width="660" height="194" /></a></p>
<p>And I&#8217;ve gotten a good chunk of data. Indexing my SMSes felt a bit creepy, so I stopped doing that. And storing email just sounded too excessive, but I think starred emails would suit the purpose of the project.</p>
<p>Those Google Drive documents are giving me JSON. It&#8217;s not JSON that I can feed directly to the Norch-indexer, though; it needs a little trimming.</p>
<h2>Issues discovered so far</h2>
<h3>Manual work</h3>
<p>This search solution needs a lot of manual setup. Every trigger needs to be set up manually. Every time a new trigger is triggered, I get a new spreadsheet that needs a title row added. Otherwise, the JSON variable names will look funny, since the first row is used for variable names.</p>
<p>The spreadsheets only accept 2000 rows. After that a new file is created. Either I need to delete content, rename the file or reconfigure some stuff.</p>
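<p>Both issues could probably be worked around in the document processing step instead of by hand. The sketch below, in Python, attaches column names to rows that have no title row and merges the rows from several roll-over files into one list. The column names, the helper names and the sample rows are all hypothetical.</p>

```python
def rows_to_records(rows, column_names):
    """Attach column names to raw rows that came without a title row."""
    records = []
    for row in rows:
        # Pad short rows so every record ends up with the same keys.
        padded = list(row) + [""] * (len(column_names) - len(row))
        records.append(dict(zip(column_names, padded)))
    return records


def merge_sheets(sheets, column_names):
    """Concatenate the rows of several roll-over files into one list."""
    merged = []
    for rows in sheets:
        merged.extend(rows_to_records(rows, column_names))
    return merged


# Hypothetical column names for one trigger, plus made-up rows.
COLUMNS = ["date", "user", "text", "link"]
first_file = [["November 24, 2014", "@eklem", "A tweet", "https://example.com/1"]]
second_file = [["November 25, 2014", "@eklem", "A tweet without a link"]]

records = merge_sheets([first_file, second_file], COLUMNS)
for record in records:
    print(record)
```

<p>This way the variable names are decided once per trigger in code, and it matters less how many files the 2000-row limit has produced.</p>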
<h3>Level of maturity</h3>
<p><a href="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-41-34.png"><img class="alignnone size-full wp-image-129" src="https://testlab3.files.wordpress.com/2014/11/screen-shot-2014-11-24-at-13-41-34.png" alt="Screen Shot 2014-11-24 at 13.41.34" width="660" height="664" /></a></p>
<p>IFTTT is a really nice service, and they treat their users well. But, for now, it&#8217;s not something you can trust fully.</p>
<h3>Cleaning up duplicates and obsolete stuff</h3>
<p>I have no way of removing stuff from the index automatically at this point. If I delete something I&#8217;ve added/written/created, it will not be reflected in the index.</p>
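<p>One possible way to approach this, sketched in Python under some assumptions: every indexed document has a stable id, and there is some way to get the list of ids that still exist at the source. IFTTT only reports new actions, so that list would have to come from somewhere else. The function names are made up, and the actual delete call against the index is left out.</p>

```python
def stale_ids(indexed_ids, current_ids):
    """Ids that are in the index but no longer exist at the source."""
    return sorted(set(indexed_ids) - set(current_ids))


def drop_duplicates(docs, key="link"):
    """Keep the first document seen for each value of `key`."""
    seen = set()
    unique = []
    for doc in docs:
        marker = doc.get(key)
        if marker is not None and marker in seen:
            continue  # Same value seen before, so skip this copy.
        if marker is not None:
            seen.add(marker)
        unique.append(doc)
    return unique


# Made-up ids and documents to show both helpers.
to_remove = stale_ids(["a1", "a2", "a3"], ["a1", "a3"])
unique_docs = drop_duplicates([
    {"link": "https://example.com/1", "text": "first"},
    {"link": "https://example.com/1", "text": "same link again"},
    {"text": "no link at all"},
])
print(to_remove)
print(len(unique_docs))
```

<p>Documents without the chosen key are always kept, since there is nothing to compare them on.</p>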
<h3>Missing sources</h3>
<p>Books I buy, music I listen to, movies and TV-series I watch. In other words: Amazon, Spotify, Netflix and HBO. Apart from that, there are no Norwegian services available through IFTTT.</p>
<h3>History</h3>
<p>The crawling is triggered by my actions. That leaves me without history. So, e.g., indexing new contacts on LinkedIn is meaningless when I don&#8217;t get to index the existing ones.</p>
<h2>Next steps</h2>
<h3>JSON clean-up</h3>
<p>I need to make a document processing step. <a href="https://github.com/fergiemcdowall/norch-document-processor">Norch-document-processor</a> would be nice if it handled JSON in addition to HTML. <a href="https://github.com/fergiemcdowall/norch-document-processor/issues/6">Not yet, but maybe in the future</a>? Anyway, there&#8217;s just a small amount of JSON clean-up left before I get my data into an index.</p>
<p>When this step is done, a first version can be demoed.</p>
<h3>UX and front-end code</h3>
<p>To show the full potential, I need some interaction design sketches of the idea. For now they&#8217;re all in my head. And these sketches need to be converted to HTML, CSS and Angular views.</p>
<h3>Embed codes</h3>
<p>Figure out how to embed Instagram, Flickr, Facebook and LinkedIn-posts, Google Maps, federated phonebook search etc.</p>
<h3>OAuth configuration</h3>
<p>Set up the <a href="https://github.com/ciaranj/node-oauth">OAuth npm package</a> to access non-public spreadsheets on Google Drive. Then I can add some of the less open information I have stored.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/11/26/idea-your-life-searchable-norch-node-search-ifttt-google-drive/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Crawl interfaces for Forage running inside your browser</title>
		<link>http://blog.comperiosearch.com/blog/2014/05/21/crawl-interfaces-for-forage-running-inside-your-browser/</link>
		<comments>http://blog.comperiosearch.com/blog/2014/05/21/crawl-interfaces-for-forage-running-inside-your-browser/#comments</comments>
		<pubDate>Wed, 21 May 2014 14:55:25 +0000</pubDate>
		<dc:creator><![CDATA[Espen Klem]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[User Experience]]></category>
		<category><![CDATA[crawl]]></category>
		<category><![CDATA[crawler]]></category>
		<category><![CDATA[forage]]></category>
		<category><![CDATA[Forage Fetch]]></category>
		<category><![CDATA[Forage Search Engine]]></category>
		<category><![CDATA[nodejs]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[user interface]]></category>
		<category><![CDATA[user interfaces]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=2337</guid>
		<description><![CDATA[Got an idea a while back on how we could use the JavaScript/Nodejs Search Engine Forage so that the users would have their own search server inside the browser. The main takeaway from this would be that you don&#8217;t need to install anything to test the search engine. Since last time, I&#8217;ve made a quick [...]]]></description>
				<content:encoded><![CDATA[<p>Got an idea a while back on how we could use the JavaScript/Nodejs Search Engine <a href="https://github.com/fergiemcdowall/forage/">Forage</a> so that the users would <a href="http://blog.comperiosearch.com/blog/2014/04/29/idea-search-server-running-inside-your-browser/">have their own search server inside the browser</a>. The main takeaway from this would be that you don&#8217;t need to install anything to test the search engine. Since last time, I&#8217;ve made a quick logo for Forage, and drawn some more user interfaces. The mockups are mainly crawl interfaces for setting up the crawler, which in Forage terms is called Forage Fetch.</p>
<h2>Crawl interfaces, suggested</h2>
<p><a href="https://www.flickr.com/photos/eklem/14257669113/in/photostream/">Initial Crawl-window</a><br />
<a href="https://www.flickr.com/photos/eklem/14257669113/in/photostream/"><img class="alignnone" style="border: 1px solid black" src="https://farm3.staticflickr.com/2906/14257669113_822d5b524b.jpg" alt="javascript crawl interfaces" width="500" height="313" /></a></p>
<p>To crawl most pages elegantly and easily, you need five information elements:</p>
<ol>
<li>Somewhere to start. Where do you want your crawler to start? You don&#8217;t have to specify the domain; we pick the domain name from the page you&#8217;re visiting.</li>
<li>Which links to follow. This is not necessarily the pages you want to crawl. Typically these pages have lists of pages you want to crawl.</li>
<li>Which links not to follow. To not make the crawler go wild, you set some boundaries. Often a page has several URLs.</li>
<li>Which links to crawl. These are the actual pages you&#8217;re looking for.</li>
<li>Which links not to crawl.</li>
</ol>
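<p>The five elements above can be sketched as a small rule set. This is a rough Python illustration, not Forage Fetch&#8217;s actual API: the class name, the field names and the regular expressions are all assumptions made up for this example.</p>

```python
import re
from dataclasses import dataclass, field


@dataclass
class CrawlRules:
    """Hypothetical container for the five information elements."""
    start_url: str
    follow: list = field(default_factory=list)
    dont_follow: list = field(default_factory=list)
    crawl: list = field(default_factory=list)
    dont_crawl: list = field(default_factory=list)


def _matches(patterns, url):
    """True if any of the regular expressions is found in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


def classify(rules, url):
    """Return (should_follow, should_crawl) for a URL."""
    should_follow = _matches(rules.follow, url) and not _matches(rules.dont_follow, url)
    should_crawl = _matches(rules.crawl, url) and not _matches(rules.dont_crawl, url)
    return should_follow, should_crawl


# Made-up rules for a blog with paginated list pages and dated posts.
rules = CrawlRules(
    start_url="/blog/",
    follow=[r"/blog/page/\d+/"],
    dont_follow=[r"\?"],
    crawl=[r"/blog/\d{4}/\d{2}/\d{2}/"],
    dont_crawl=[r"/feed/$"],
)

for url in [
    "/blog/page/2/",
    "/blog/page/2/?replytocom=5",
    "/blog/2014/05/21/some-post/",
    "/blog/2014/05/21/some-post/feed/",
]:
    print(url, classify(rules, url))
```

<p>Keeping &#8220;follow&#8221; and &#8220;crawl&#8221; as separate decisions mirrors the list: a list page can be followed without being indexed, and a post can be indexed without its links being followed.</p>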
<p>A simple illustration of the above rules. Forage Fetch doesn&#8217;t have all these features yet, but they&#8217;re <a href="https://github.com/fergiemcdowall/forage-fetch/issues/6">suggested as enhancements</a>.<br />
<a href="https://github.com/fergiemcdowall/forage-fetch/issues/6"><img class="alignnone" src="https://farm4.staticflickr.com/3731/12933582163_509b0e56ed.jpg" alt="" width="500" height="325" /></a></p>
<p><a href="https://www.flickr.com/photos/eklem/14257669223/in/set-72157643790505944">Selecting which rule type to add<br />
</a><a href="https://www.flickr.com/photos/eklem/14257669223/in/set-72157643790505944"><img class="alignnone" style="border: 1px solid black" src="https://farm3.staticflickr.com/2912/14257669223_bf7c7f179e.jpg" alt="javascript crawl interfaces" width="500" height="313" /></a></p>
<p>To ensure you&#8217;re adding valid rules, <a href="https://www.flickr.com/photos/eklem/14050882680/in/set-72157643790505944/">it&#8217;s a good thing to test first.<br />
</a><a href="https://www.flickr.com/photos/eklem/14050882680/in/set-72157643790505944/"><img class="alignnone" style="border: 1px solid black" src="https://farm6.staticflickr.com/5594/14050882680_4f6168e1c0.jpg" alt="javascript crawl interfaces" width="500" height="313" /></a></p>
<p><a href="https://www.flickr.com/photos/eklem/14050906527/in/set-72157643790505944/">Start URL added<br />
</a><a href="https://www.flickr.com/photos/eklem/14050906527/in/set-72157643790505944/"><img class="alignnone" style="border: 1px solid black" src="https://farm3.staticflickr.com/2897/14050906527_6b77d3977b.jpg" alt="javascript crawl interfaces" width="500" height="313" /></a></p>
<p><a href="https://www.flickr.com/photos/eklem/14235224692/in/set-72157643790505944/">The minimum number of rules needed to start the crawler<br />
</a><a href="https://www.flickr.com/photos/eklem/14235224692/in/set-72157643790505944/"><img class="alignnone" style="border: 1px solid black" src="https://farm3.staticflickr.com/2898/14235224692_f79d48b310.jpg" alt="javascript crawl interfaces" width="500" height="313" /></a></p>
<p>Next tasks will be to make a clickable prototype in HTML/CSS and read up on HTML5 local storage/web storage.</p>
<p><strong>All comments on the idea are welcome! </strong>Here&#8217;s <a href="http://blog.comperiosearch.com/blog/tag/forage/">what we&#8217;ve blogged about Forage</a> so far.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2014/05/21/crawl-interfaces-for-forage-running-inside-your-browser/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
