<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Search Nuggets &#187; Mridu Agarwal</title>
	<atom:link href="http://blog.comperiosearch.com/blog/author/mridu/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 13 Jun 2016 08:59:45 +0000</lastBuildDate>
	<language>en-US</language>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>Experimenting with Open Source Web Crawlers</title>
		<link>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/#comments</comments>
		<pubDate>Fri, 29 Apr 2016 11:03:42 +0000</pubDate>
		<dc:creator><![CDATA[Mridu Agarwal]]></dc:creator>
				<category><![CDATA[Elasticsearch]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[OpenWebSpider]]></category>
		<category><![CDATA[Scrapy]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[Web Crawling]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4080</guid>
	<description><![CDATA[Whether you want to do market research, gather financial risk information, or just get news about your favorite footballer from various news sites, web scraping has many uses. In my quest to learn more about web crawling and scraping, I decided to test a couple of open source web crawlers which were not [...]]]></description>
				<content:encoded><![CDATA[<p lang="en-US">Whether you want to do market research, gather financial risk information, or just get news about your favorite footballer from various news sites, web scraping has many uses.</p>
<p lang="en-US">In my quest to learn more about web crawling and scraping, I decided to test a couple of open source web crawlers that are not only easily available but quite powerful as well. In this article I will mostly cover their basic features and how easy they are to start with.</p>
<p lang="en-US">If you are one of those people who like to get started quickly when learning something, I would suggest trying <a href="http://www.openwebspider.org/">OpenWebSpider</a> first.</p>
<p lang="en-US">It is a simple browser-based open source crawler and search engine that is easy to install and use, and it is a good fit for anyone getting acquainted with web crawling. It stores webpages in MySQL or MongoDB; I used MySQL for my testing. You can follow the steps <a href="http://www.openwebspider.org/documentation/openwebspider-js/">here</a> to install it. It&#8217;s pretty simple and basic.</p>
<p lang="en-US">So, once you have installed everything, you just need to open a web browser at <a href="http://127.0.0.1:9999/">http://127.0.0.1:9999/</a> and you are ready to crawl and search. Just check your database settings, type the URL of the site you want to crawl, and within a couple of minutes you have all the data you need. You can even search it by going to the search tab and typing in your query. Whoa! That was quick and compact, and needless to say you don&#8217;t need any programming skills to use it.</p>
<p lang="en-US">If you are trying to create an offline copy of your data or your very own mini Wikipedia, go for this, as it&#8217;s the easiest way to do it.</p>
<p lang="en-US">Following are some screenshots:</p>
<p lang="en-US"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS1.png"><img class="alignleft wp-image-4083 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS1.png" alt="OpenWebSpider" width="613" height="438" /></a></p>
<p lang="en-US"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS2.png"><img class="alignleft wp-image-4086 size-full" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS2.png" alt="OpenSearchWeb" width="611" height="441" /></a></p>
<p lang="en-US" style="text-align: left"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS3.png"><img class="alignleft size-full wp-image-4087" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/OS3.png" alt="OpenSearchWeb" width="611" height="441" /></a></p>
<p lang="en-US" style="text-align: left">You can also see this search engine demo <a href="http://lab.openwebspider.org/search_engine/">here</a> before actually getting started.</p>
<p lang="en-US" style="text-align: left">Ok, after getting my hands dirty with web crawling, I was curious to do more sophisticated things, like extracting topics from a website where I do not have any RSS feed or API. Extracting this kind of structured data can be quite valuable in many business scenarios, for example when you are following a competitor&#8217;s product news or gathering data for business intelligence. I decided to use <a href="http://scrapy.org/">Scrapy</a> for this experiment.</p>
<p lang="en-US" style="text-align: left">The good thing about Scrapy is that it is not only fast and simple, but very extensible as well. While installing it on my Windows environment, I had a few hiccups, mainly around finding a compatible version of Python, but in the end, once you get it working, it&#8217;s very simple (isn&#8217;t that how you feel anyway, once things work?). If you are having trouble installing Scrapy like I did, follow these links:</p>
<p lang="en-US" style="text-align: left"><a href="https://github.com/scrapy/scrapy/wiki/How-to-Install-Scrapy-0.14-in-a-64-bit-Windows-7-Environment">https://github.com/scrapy/scrapy/wiki/How-to-Install-Scrapy-0.14-in-a-64-bit-Windows-7-Environment</a></p>
<p lang="en-US" style="text-align: left"><a href="http://doc.scrapy.org/en/latest/intro/install.html#intro-install">http://doc.scrapy.org/en/latest/intro/install.html#intro-install</a></p>
<p lang="en-US" style="text-align: left">After installing, you need to create a Scrapy project. Since we are doing something more customized than just crawling an entire website, this requires more effort, some programming skills, and sometimes browser developer tools to understand the HTML DOM. You can follow <a href="http://doc.scrapy.org/en/latest/intro/overview.html">this</a> link to get started with your first Scrapy project. Once you have crawled the data that you need, it is interesting to feed it into a search engine. I had also been looking for open source web crawlers for Elasticsearch, and this looked like the perfect opportunity. Scrapy integrates with Elasticsearch through a ready-made item pipeline, which is awesome. You just need to install the Elasticsearch module for Scrapy (of course, Elasticsearch should be running somewhere) and configure the item pipeline. Follow <a href="http://blog.florian-hopf.de/2014/07/scrapy-and-elasticsearch.html">this</a> link for the step-by-step guide. Once done, you have a fully integrated crawler and search system!</p>
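<p lang="en-US" style="text-align: left">As a rough sketch of what that item pipeline configuration looks like, here is a hypothetical <code>settings.py</code> fragment. The setting names follow the third-party scrapy-elasticsearch package from the guide linked above (the exact pipeline path varies between package versions), and the values are illustrative:</p>

```python
# settings.py fragment (sketch) -- assumes the third-party
# scrapy-elasticsearch package from the linked guide is installed;
# the pipeline path below varies between package versions.
ITEM_PIPELINES = {
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500,
}

ELASTICSEARCH_SERVER = 'localhost'   # host where Elasticsearch runs
ELASTICSEARCH_PORT = 9200            # default Elasticsearch HTTP port
ELASTICSEARCH_INDEX = 'scrapy'       # index the crawled items go into
ELASTICSEARCH_TYPE = 'healthitems'   # document type for the items
ELASTICSEARCH_UNIQ_KEY = 'url'       # field used to de-duplicate items
```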
<p lang="en-US" style="text-align: left">I crawled <a href="http://primehealthchannel.com">http://primehealthchannel.com</a> and indexed the items as a &#8220;healthitems&#8221; type under the &#8220;scrapy&#8221; index.</p>
<p lang="en-US" style="text-align: left">To search the Elasticsearch index, I am using the Chrome extension <span style="font-weight: bold">Sense</span> to send queries, and this is how it looks:</p>
<p lang="en-US" style="text-align: left">GET /scrapy/healthitems/_search</p>
<p style="text-align: left"><a href="http://blog.comperiosearch.com/wp-content/uploads/2016/04/ES1.png"><img class="alignleft wp-image-4082 size-large" src="http://blog.comperiosearch.com/wp-content/uploads/2016/04/ES1-1024x597.png" alt="Elastic Search" width="1024" height="597" /></a></p>
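<p lang="en-US" style="text-align: left">If you would rather query the index from a script than from Sense, the same search can go over Elasticsearch&#8217;s HTTP API directly. Below is a minimal Python sketch using only the standard library; the <code>title</code> field and the query text are assumptions about the crawled items, not something mandated by Scrapy or Elasticsearch:</p>

```python
import json
from urllib import request

ES_URL = "http://localhost:9200/scrapy/healthitems/_search"

def build_query(text):
    # The same kind of match query you would type into Sense;
    # "title" is an assumed field name on the crawled items.
    return {"query": {"match": {"title": text}}}

def search(text):
    # POST the query body to the _search endpoint and decode the hits.
    body = json.dumps(build_query(text)).encode("utf-8")
    req = request.Request(ES_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)
```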
<p lang="en-US" style="text-align: left">I hope you had fun reading this and now want to try some of your own cool ideas. Do let us know how you used these tools and which crawler you like the most!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/04/29/experimenting-with-open-source-web-crawlers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Content Enrichment Web Service SharePoint 2013 &#8211; Advantages and Challenges</title>
		<link>http://blog.comperiosearch.com/blog/2016/04/26/content-enrichment-web-service-sharepoint-2013/</link>
		<comments>http://blog.comperiosearch.com/blog/2016/04/26/content-enrichment-web-service-sharepoint-2013/#comments</comments>
		<pubDate>Tue, 26 Apr 2016 11:23:22 +0000</pubDate>
		<dc:creator><![CDATA[Mridu Agarwal]]></dc:creator>
				<category><![CDATA[English]]></category>
		<category><![CDATA[Technology]]></category>
		<category><![CDATA[CEWS]]></category>
		<category><![CDATA[Content Enrichment Web Service]]></category>
		<category><![CDATA[FAST Search for SharePoint]]></category>
		<category><![CDATA[SharePoint 2013 Search]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=4017</guid>
	<description><![CDATA[If you have worked with search solutions before, you will know that very often there is a need to process data before it can be displayed in search results. This processing might be required to address some of (but not limited to) these common issues: Missing metadata issues Inconsistent metadata issues Cleansing of content Integration of semantic [...]]]></description>
				<content:encoded><![CDATA[<p>If you have worked with search solutions before, you will know that very often there is a need to process data before it can be displayed in search results. This processing might be required to address some of (but not limited to) these common issues:</p>
<ul>
<li>Missing metadata issues</li>
<li>Inconsistent metadata issues</li>
<li>Cleansing of content</li>
<li>Integration of semantic layers/Automatic tagging</li>
<li>Integration with third-party services</li>
<li>Merging data from other sources</li>
</ul>
<p><strong>Content Enrichment Web Service</strong> in SharePoint 2013 is a SOAP-based service within the content processing component that can be used to achieve this. The figure below shows a part of the process that takes place in the content processing component of SharePoint search. <img src="https://i-msdn.sec.s-msft.com/dynimg/IC618173.gif" alt="Content enrichment within content processing" width="481" height="286" /></p>
<p>The Content Enrichment Web Service in SharePoint 2013 combines the goodness of both <strong>FAST for SharePoint Search</strong> and <strong>SharePoint Search</strong> to offer a whole new set of possibilities, but it also has its own challenges. To see an implementation example, check the <a href="https://msdn.microsoft.com/en-us/library/office/jj163982.aspx">MSDN link</a>, which pretty much sums up the basic steps. In this post, coming from a FAST 2010 background, we are going to look at some of the advantages and challenges of CEWS:</p>
<p>1. <strong>CEWS is a service and you DON&#8217;T have to deploy it in your SharePoint environment</strong>: Perhaps this is the biggest architectural change from the content processing perspective. It means that your code no longer runs in a sandboxed environment within <strong>SharePoint Server</strong>. The web service can be hosted anywhere outside your SharePoint server, reducing deployment headaches and the huge number of approvals required to deploy executable files. I can see operations teams, infrastructure teams, and administrators smiling.</p>
<p>2. <strong>The web service processes and returns managed properties, not crawled properties:</strong> Managed properties correspond to what actually gets indexed and displayed in search results. This removes some of the confusion about why you can&#8217;t see updated results (perhaps you forgot to map your crawled property to a managed property, and, wait, you will have to index it AGAIN!). Nightmare!</p>
<p>3. <strong>You can define a trigger to limit the set of items that are processed by the web service:</strong> In FAST 2010, each item had to pass through the pipeline whether you wanted to process it or not, and the check had to be done in code. A trigger in 2013 lets you define this check outside the code, so that the web service is called only for selected content. If you only want to process a subset of the content, this optimizes overall performance and improves crawling time.</p>
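<p>For context, registering a CEWS endpoint together with a trigger is done through PowerShell against the search service application. The following is only a sketch modeled on the MSDN example linked above; the endpoint URL, the property names, and the <code>IsNull(Author)</code> trigger expression are illustrative values, not configuration from my project:</p>

```powershell
# Sketch of CEWS registration (all values are illustrative).
$ssa = Get-SPEnterpriseSearchServiceApplication
$config = New-SPEnterpriseSearchContentEnrichmentConfiguration
$config.Endpoint = "http://localhost:8081/ContentEnrichmentService.svc"
$config.InputProperties = "Author", "Filename"
$config.OutputProperties = "Author"
$config.Trigger = "IsNull(Author)"   # call the service only when Author is missing
Set-SPEnterpriseSearchContentEnrichmentConfiguration -SearchApplication $ssa `
    -ContentEnrichmentConfiguration $config
```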
<blockquote><p>So far, so good! But there are certain challenges we need to look at and see how we can overcome them. In fact, this is the most important part when you are architecting your CEWS solution:</p></blockquote>
<p>1. <strong>The content enrichment callout step can only be configured with a single web service endpoint:</strong> Now this sounds very limiting. I have multiple search applications, and earlier I maintained the logic in different solutions. Do I need to combine them all into a single service? What about maintenance and change requests? There are several possible technologies one could consider here, but what I did in my project was to create a WCF routing service and let it route to my multiple web services based on filters. You could also use it to implement load balancing and fault tolerance. In the following example, I have two content sources, &#8220;xmlfile&#8221; and &#8220;EpiFileShare&#8221;, and I want two different services, &#8220;xmlsvc&#8221; and &#8220;episvc&#8221;, to process them. This is how I configure the endpoints in my WCF routing service: <a href="http://blog.comperiosearch.com/wp-content/uploads/2016/01/router.png"><img class="aligncenter  wp-image-4027" src="http://blog.comperiosearch.com/wp-content/uploads/2016/01/router-1024x278.png" alt="endpoints" width="708" height="192" /></a></p>
<p>2. <strong>Only one condition can be configured for the trigger, but different search applications will require different triggers:</strong> This can again be solved by using WCF routers and filters and configuring separate endpoints for separate triggers. Here I am using the default managed property &#8220;ContentSource&#8221; as a trigger/filter to determine my service endpoint. <a href="http://blog.comperiosearch.com/wp-content/uploads/2016/01/rouyer.png"><img class="aligncenter wp-image-4025 " src="http://blog.comperiosearch.com/wp-content/uploads/2016/01/rouyer-1024x286.png" alt="config file" width="737" height="206" /></a></p>
<p>To summarize, I have shown some of the advantages and challenges of the new CEWS architecture in SharePoint 2013 search, and how you can overcome them. I hope you now want to try this soon and share your experience with us.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2016/04/26/content-enrichment-web-service-sharepoint-2013/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Enhancing Web Analytics with Search Analytics</title>
		<link>http://blog.comperiosearch.com/blog/2015/05/20/enhancing-web-analytics-search-analytics/</link>
		<comments>http://blog.comperiosearch.com/blog/2015/05/20/enhancing-web-analytics-search-analytics/#comments</comments>
		<pubDate>Wed, 20 May 2015 12:43:27 +0000</pubDate>
		<dc:creator><![CDATA[Mridu Agarwal]]></dc:creator>
				<category><![CDATA[Business]]></category>
		<category><![CDATA[English]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Google Analytics]]></category>
		<category><![CDATA[Search Analytics]]></category>

		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3694</guid>
	<description><![CDATA[Web analytics is the process of measuring and analyzing web data to assess and improve the effectiveness of a website. Tracking and improving search (search analytics) is an important part of web analytics which is often forgotten by many site owners. Website search analytics should not be underestimated as it can provide valuable insights into what [...]]]></description>
				<content:encoded><![CDATA[<p>Web analytics is the process of measuring and analyzing web data to assess and improve the effectiveness of a website. Tracking and improving search (search analytics) is an important part of web analytics that is often forgotten by many site owners. Website search analytics should not be underestimated, as it can provide valuable insights into what users are looking for or what they are not able to find on the site. Recently, I read about an organization which increased its conversion rates just by increasing the size of its search box and working on the searches with zero results. Measuring and analyzing search can therefore be a very important part of improving a website&#8217;s effectiveness.</p>
<p>In this post, I will show you 5 quick steps to get started with Search Analytics:</p>
<p><strong>1. Get Search Logs: </strong>There are various tools you can use to track and analyze your search. You may choose any one of them to get started, depending on your business domain and organizational policies. I am using Google Analytics for this post simply because it is not only very powerful but also very easy to start with. You can just create a Google account and get started for &#8220;free&#8221;. No infrastructure setup is required: place the JavaScript tracking code in your website pages and you are ready to start measuring.</p>
<p>Please note that there are many other tools in the market for more specific purposes and with higher complexity. It really depends on what your needs are. Here, we will continue with Google Analytics.</p>
<p><strong>2. Understand your Site Search Usage</strong>: Many people underestimate the use of search on their website. So, the first step is to measure how many people are using search. Even if 5-10% of your visitors are using search, it is not a bad figure (depending on your business domain and site setup). This could very well be the most used navigation on your website.</p>
<p><img class="alignnone wp-image-3697 " src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/1-1024x282.png" alt="Site Search Usage" width="544" height="150" /></p>
<p><strong>3. Analyze your Search Terms:</strong> Get a list of searched terms to start with. Most often, you will see that there is a pattern: a few terms are searched much more than the others. It is important to analyze the top 10 searched terms individually. You might want to group the other terms (the long tail) and analyze them separately.</p>
<p><img class="alignnone size-full wp-image-3698" src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/2.png" alt="Search Terms" width="188" height="165" /></p>
<p><a href="http://blog.comperiosearch.com/wp-content/uploads/2015/05/3.png"><img class="alignnone size-full wp-image-3699" src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/3.png" alt="Searched Terms List" width="255" height="219" /></a></p>
<p style="text-align: left">Try conducting searches for these terms yourself and see if you are satisfied with the results. Did you get what you were expecting as the number one result? If not, you might want to make changes to your site to improve your search results.</p>
<p style="text-align: left">Some additional things to ponder:</p>
<p style="text-align: left">Looking at the searched terms, do you see what you were expecting? Are there unknown terms? For example, if you have a product support site, you might expect users to search more for some newly launched products. Do you see those product names or numbers in your searched-terms report? Are people looking for product comparisons?</p>
<p style="text-align: left">If you have some unexpected terms in your top 10 searched terms, then you might want to consider adding additional content related to those terms.</p>
<p style="text-align: left"><strong>4. Evaluate User Experience: </strong>Are users happy or frustrated by the time they leave the site? Can they find what they are looking for, or are they leaving immediately after performing a search? This is the toughest part, because you are not sitting with the user and you can make only as much sense as the reports allow. But the good news is that there are some metrics to watch which can provide valuable insights into user experience. A couple of these metrics are shown in the picture below.</p>
<p> <img class="alignnone  wp-image-3700" src="http://blog.comperiosearch.com/wp-content/uploads/2015/05/4-300x139.png" alt="Search Metrics" width="339" height="157" /></p>
<p><strong>Result Pageviews/Search</strong> tells you the number of result pages a user viewed for the search term. If the number is too high, you know that it is taking too long for users to find what they are looking for, and they will not be happy about it.</p>
<p><strong>% Search Exits</strong> is equivalent to the bounce rate in web analytics. This number tells you the percentage of people who left immediately after performing a search for that term, without clicking on any of the search results. We want this number to be as low as possible.</p>
<p>It is also important to evaluate <strong>search terms that produced 0 results</strong>. Users will not be happy to find zero results for their searches. There is no out-of-the-box metric in Google Analytics to track this, but there are various ways to get around it using Events or Custom Variables.</p>
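<p>To make these metrics concrete, here is how they could be computed by hand from raw search-log records. The log format below is entirely hypothetical and for illustration only; Google Analytics derives these numbers for you:</p>

```python
# Hypothetical per-search log records (illustrative data only).
searches = [
    {"term": "vacation list", "result_pages_viewed": 1, "exited": True,  "results": 0},
    {"term": "holiday list",  "result_pages_viewed": 3, "exited": False, "results": 12},
    {"term": "vacation list", "result_pages_viewed": 1, "exited": True,  "results": 0},
    {"term": "payroll",       "result_pages_viewed": 2, "exited": False, "results": 5},
]

def result_pageviews_per_search(rows):
    # Average number of result pages viewed per search.
    return sum(r["result_pages_viewed"] for r in rows) / len(rows)

def pct_search_exits(rows):
    # Percentage of searches after which the user left immediately.
    return 100.0 * sum(r["exited"] for r in rows) / len(rows)

def zero_result_terms(rows):
    # Distinct terms that returned no results at all.
    return sorted({r["term"] for r in rows if r["results"] == 0})
```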
<p><strong>5. Improve Search Experience: </strong>What would you do if you found a <strong>35% Search Exit</strong> rate for your top keyword, or <strong>20 Result Pageviews/Search</strong>? In some cases this might be because the term is not spelled correctly, or simply because the user is using a term which is not an exact match for your content. For example, in an organization&#8217;s intranet, an employee searching for &#8220;vacation list&#8221; gets no hits because your content says &#8220;holiday list&#8221;. Here, you might consider adding synonyms or best bets for these frequently searched terms. Adding spelling correction like &#8220;Did you mean&#8221; or providing &#8220;related searches&#8221; could further improve the user experience and keep visitors engaged on your website. If there are zero results for a search term, you might want to consider adding additional content as well.</p>
<p>To conclude, I would say that there are a lot of ways in which you can improve your site search with analytics, but the important part is to get started. It is not as tough as it may sound, and it is worth the effort considering the amount of valuable information you get and the direct insight into the user&#8217;s mind.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.comperiosearch.com/blog/2015/05/20/enhancing-web-analytics-search-analytics/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
