Property Extraction in FS4SP

Property extraction (previously called entity extraction in FAST ESP) is a process that extracts information from the visible textual content of an item and stores that information as additional crawled properties for the document.

In this blog post I will show how this can be automated in any given FAST Search for SharePoint installation. But first, just a short introduction to the extractors that we have out of the box:

  • Companies – extracts company names based on a generic dictionary.
  • Locations – extracts names of geographical locations based on a generic dictionary.
  • Person names – extracts names of persons based on a generic dictionary.

In most cases, you will have companies, locations and person names that are specific to your company or organization. However, you may want to modify the built-in property extractors by adding inclusion lists and exclusion lists to improve the quality of these extractors. Typically you can use customer lists from your CRM system, employee information from your ERP system, and product listings you might have.

In order to accomplish this in an easy manner, we need some PowerShell magic.

First, just a quick overview of the input paramenters to the script:

where the file parameter is the list of properties to extract, the type parameter specifies which property we’re dealing with and the addto parameter sets if the properties will be added to the include list or the exclude list.

Below is a snippet of where the fun takes place in code.

An example company list file could look like:

If you did a crawl before starting on your white and black lists, you can use any noise in your navigators as input to the black list, and also entities missing in the white lists. An easy way to get all navigator values is to search for “#”, which will do a blank search showing all data you have access to.

Now that you have added the new white and/or black lists, you should schedule a new crawl and the quality of the entity refiners should be improved. Voila!

7 response to: «Property Extraction in FS4SP»

  1. March 2, 2011 at 22:49 | Permalink

    Great article – In addition to the out of the box extractors I have used the custom extractor Wholewords1 where you can build your own dictionary from scratch. Compared to the out of the box extractors I found a least one big disadvantage using a custom extractor. When you for example have two dictionary entries (key, value) Wine,Wine and Chardonnay,Chardonay and you use both keys in a document, you only get a refiner values for one of the values which in my opinion makes it useless for searching. I know that dictionary keys are case sensitive and ensured to have an exactly match in my document.

    I have tried to add some new keys to the Person names extractor and it works fine with multiple key values in a document.

    Do you have any experience with custom extractors and do you know how to get around the problem described above?

  2. March 3, 2011 at 09:21 | Permalink

    Steen, make sure the managed property where you map your wholeword crawled properties have MergeCrawledProperties = true. If not you will only get one value.

  3. March 3, 2011 at 09:44 | Permalink

    Thanks for the comment, Steen. Please let me know if Mikael’s approach doesn’t work. I might just have another trick up my sleeve.

  4. March 3, 2011 at 11:09 | Permalink

    Thanks Mikael, with MergeCrawledProperties = true it works fine. It is a little bit confusing that MergeCrawledProperties is called “Include values from all crawled properties mapped” in the GUI when you only want to map to one crawled property.

  5. March 3, 2011 at 12:33 | Permalink

    I agree that it’s not an easy find. But if you read the remarks on TechNet it says: This property must also be set to [true] to include all values from a multi-valued crawled property. If set to [false] it would only include the first value from a multi-valued crawled property.

    I learned it myself by looking at the config for location and people managed props, not by reading the docs ;)

  6. March 3, 2011 at 18:31 | Permalink

    It’s a little bit tricky to see whether a crawled property is multi-valued as the multi-valued field in the GUI is set to false even the property has a multi-valued value. Looking at the CrawledProperty Class, it does not contain a multi-valued property. Retrieving a crawled property through PowerShell you can see that it has an IsMultiValued property that is read only but it is still set to false even the content is multi-valued. The safe way to see if crawled property is multi-valued is to check the content and see if it contains a separator char i.e. a 0×2029

  7. March 14, 2011 at 02:28 | Permalink

    Hi, thanks for posting! I had been using dictionary extractors with custom people and locations lists, this is great.

    I did run across a strange behaviour, though, when I try to have a refiner for a property mapped to wholewordsx on the same refinement panel as one mapped to a pipeline extensiblity extracted property. It seems unbelievable but I reproduced it in two completely different environments. I can have my three refiners mapped to the wholewords dictionary extractors OR I can have my 6 refiners mapped to custom properties extracted with an extensibility component. The data is there in both cases, and it fails if I have just one of each kind.

    Just wondering if you’ve seen anything like that?

    Many thanks,

    John.



Leave a response





XHTML: These tags are allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">

301 Moved Permanently

Moved Permanently

The document has moved here.


OSLO

Comperio AS
Øvre Slottsgate 27
NO-0157 Oslo,
Norway
+47 22 33 71 00
View map

STOCKHOLM

Search Provider Sverige AB
Gamla Brogatan 34
SE-11 120 Stockholm
Sweden
+46 8-21 49 00
View map