Property Extraction in FS4SP

Trond Øivind Eriksen — Wed, 02 Mar 2011 07:40:04 +0000

Property extraction (previously called entity extraction in FAST ESP) is a process that extracts information from the visible textual content of an item and stores that information as additional crawled properties for the document.

In this blog post I will show how this can be automated in any given FAST Search for SharePoint installation. But first, just a short introduction to the extractors that we have out of the box:

Companies – extracts company names based on a generic dictionary.
Locations – extracts names of geographical locations based on a generic dictionary.
Person names – extracts names of persons based on a generic dictionary.

In most cases, you will have companies, locations and person names that are specific to your company or organization, so you will want to tune the built-in property extractors by adding inclusion lists and exclusion lists to improve their quality. Typical sources are customer lists from your CRM system, employee information from your ERP system, and any product listings you might have.

In order to accomplish this in an easy manner, we need some PowerShell magic.

First, just a quick overview of the input parameters to the script:

.\PropertyExtraction.ps1 -file [fileName] -type [companies|personnames|locations] -addto [include|exclude]

where the file parameter points to the list of entities (one per line), the type parameter specifies which extractor we’re dealing with, and the addto parameter determines whether the entities are added to the include list or the exclude list.
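The parameter declaration itself is not shown in this post. As a rough sketch of how the three parameters could be declared and validated, the function below uses ValidateSet to reject unknown values before anything touches the extractors. The function name Test-ExtractionArguments is hypothetical, and the same param block could just as well sit at the top of the script.

```powershell
# Hypothetical sketch: this is not the original parameter block of PropertyExtraction.ps1.
function Test-ExtractionArguments
{
    param(
        # Path to the text file holding one entity per line
        [Parameter(Mandatory = $true)]
        [string]$file,

        # Which built-in extractor to modify
        [Parameter(Mandatory = $true)]
        [ValidateSet("companies", "personnames", "locations")]
        [string]$type,

        # Whether the entities go on the include list or the exclude list
        [Parameter(Mandatory = $true)]
        [ValidateSet("include", "exclude")]
        [string]$addto
    )

    # Echo the validated values back to the caller
    "type=$type addto=$addto file=$file"
}
```

With this in place, a call such as Test-ExtractionArguments -file companies.txt -type products -addto include fails at parameter binding, because "products" is not one of the allowed values.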

Below is a snippet of where the fun takes place in code.

function ImportEntities()
{
    #Setting term entity dictionary to be "companies", "locations" or "personnames"
    $entityExtractorContext = New-Object -TypeName Microsoft.SharePoint.Search.Extended.Administration.EntityExtractorContext
    $entityExtractors = $entityExtractorContext.TermEntityExtractors
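    foreach ($extractor in $entityExtractors)
    {
        if ($extractor.Name -eq $type)
        {
            $entityExtractor = $extractor
            log VERBOSE "Setting extractor to: $type"
        }
    }
    #Stop early if no extractor matched, otherwise the calls below would fail on $null
    if ($null -eq $entityExtractor)
    {
        log ERROR "No extractor found with name: $type"
        return
    }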

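    #Reading the list once; -ErrorAction Stop turns a failed read into a catchable error
    log VERBOSE "Reading file: $file"
    try { $entities = Get-Content $file -ErrorAction Stop }
    catch
    {
        log ERROR "Failed to read file: $file"
        return
    }

    #Iterating over the entities and adding them if they haven't been added before
    #($input is a PowerShell automatic variable, so the list is stored in $entities)
    foreach ($entity in $entities)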
    {
        if ($addto -eq "include")
        {
            if ($entityExtractor.Inclusions.Contains($entity))
            {
                log WARNING "Entity already added ($type): $entity"
                continue
            }
            else
            {
                try {
                    $entityExtractor.Inclusions.Add($entity)
                    log VERBOSE "Added entity ($type): $entity"
                } catch {
                    log ERROR "Failed to add entity ($type): $entity"
                }
            }
        }
        elseif ($addto -eq "exclude")
        {
            if ($entityExtractor.Exclusions.Contains($entity))
            {
                log WARNING "Entity already excluded ($type): $entity"
                continue
            }
            else
            {
                try {
                    $entityExtractor.Exclusions.Add($entity)
                    log VERBOSE "Excluded entity ($type): $entity"
                } catch {
                    log ERROR "Failed to exclude entity ($type): $entity"
                }
            }
        }
    }
    log VERBOSE "Finished loading $type properties."
}
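The snippet calls a log helper that is not included in this post. A minimal stand-in, assuming it only needs to print a timestamped line per level, might look like the following. The name Format-LogLine and the output format are assumptions made for illustration, not the original implementation.

```powershell
# Hypothetical stand-in for the log helper used above; the original is not shown.
function Format-LogLine([string]$level, [string]$message)
{
    # Build a line such as "2011-03-02 07:40:04 [VERBOSE] Reading file: companies.txt"
    "{0} [{1}] {2}" -f (Get-Date -Format "yyyy-MM-dd HH:mm:ss"), $level, $message
}

function log([string]$level, [string]$message)
{
    $line = Format-LogLine $level $message
    switch ($level)
    {
        "ERROR"   { Write-Host $line -ForegroundColor Red }
        "WARNING" { Write-Host $line -ForegroundColor Yellow }
        default   { Write-Host $line }
    }
}
```

Keeping the formatting in its own function makes it easy to redirect the same lines to a file later if the import runs unattended.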

An example company list file could look like:

Apple
Comperio
Microsoft
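Exports from a CRM or ERP system tend to contain stray whitespace, blank lines and duplicates. As a small sketch of how such a list could be tidied before it is fed to the script, the function below trims each line, drops empty ones and removes duplicates. The function name Get-CleanEntityList and the file names in the usage comment are hypothetical.

```powershell
# Hypothetical helper for tidying a raw export before import; not part of the original script.
function Get-CleanEntityList
{
    param([string[]]$lines)

    $lines |
        ForEach-Object { $_.Trim() } |      # remove leading and trailing whitespace
        Where-Object { $_ -ne "" } |        # skip blank lines
        Sort-Object -Unique                 # sort and drop duplicates (case-insensitive)
}

# Example usage (file names are placeholders):
# Get-CleanEntityList (Get-Content .\crm-export.txt) | Set-Content .\companies.txt
```

Note that Sort-Object -Unique compares case-insensitively by default, so "Apple" and "apple" collapse into a single entry.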

If you ran a crawl before building your white lists (inclusion lists) and black lists (exclusion lists), you can use any noise in your navigators as input to the black list, and any entities missing from the navigators as input to the white list. An easy way to get all navigator values is to search for “#”, which performs a blank search that returns all the data you have access to.

Now that you have added the new white and/or black lists, schedule a new crawl, and the quality of the entity refiners should improve. Voila!
