How FS4SP primary keys work
Just like in most systems that contain data, each indexed content item in FAST Search for SharePoint (FS4SP) is associated with a primary key. No surprises here: in order to update or remove an item from the index, the system must be able to uniquely identify it.
For the most part, Microsoft has done a great job integrating FAST into the SharePoint world, but there are certainly some areas where you notice there are actually two paradigms behind the scenes: how things work in SharePoint, and how things (used to) work in FAST.
The primary key of the FS4SP index is one of those areas. In this post, I’ll try to make this a little more understandable.
First of all, the place where most people will notice that a primary key actually exists is in the crawl log on the FAST Content SSA.
Each indexed item is assigned an auto-incremented integer referred to as Item ID. All right, so let’s dig a little deeper.
If you’ve been reading up on FS4SP, you probably already know that there is an internal FAST-process called “qrserver”. It is responsible for receiving queries from the FAST Query SSA and internally forwarding the query to the actual index. You might also know that this process exposes a small web interface. For security reasons, it’s only available from within the server it runs on. More specifically: http://localhost:13280.
Searching for something here will return results in an internal XML format. The actual result items are listed under the <RESULTSET> tag a bit down in the XML. There are lots of things to talk about here, e.g. that the naming convention used internally in FAST is quite different from what is used when the results come back through the FAST Query SSA. A managed property, for example, is referred to as a “field” internally. But let’s not fret over that now. Instead, let’s look at the first few properties of the first result:
After a few quick comparisons of the crawl log on the FAST Content SSA and the search results from the qrserver, it’s clear that the Item ID is stored in FS4SP’s index inside the property contentid. When comparing with the Item IDs listed on the FAST Content SSA, we also notice that SharePoint prefixes the Item ID with “ssic://” when it’s stored in the index. In other words, the true primary key as used internally is based on the pattern “ssic://[auto-incremented integer]”.
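The mapping between the two values can be sketched in a few lines of Python. This is only an illustration of the pattern described above; the function names are made up for this example and are not part of any FS4SP API.

```python
# Hypothetical helpers illustrating the "ssic://[integer]" pattern.
# The names are invented for this sketch; FS4SP exposes no such functions.
PREFIX = "ssic://"


def item_id_to_contentid(item_id: int) -> str:
    """Build the contentid stored in the index from a crawl-log Item ID."""
    return f"{PREFIX}{item_id}"


def contentid_to_item_id(contentid: str):
    """Return the Item ID as an int, or None if the contentid does not
    follow the SharePoint pattern (for example, a plain URL)."""
    if not contentid.startswith(PREFIX):
        return None
    suffix = contentid[len(PREFIX):]
    # Only accept plain ASCII digits so that int() cannot fail.
    if suffix.isascii() and suffix.isdigit():
        return int(suffix)
    return None
```

The second helper returns None rather than raising, since a contentid is not guaranteed to follow the SharePoint pattern.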
But as anyone who’s been using FAST pre-Microsoft can tell you, the contentid is actually not the primary key of the index. The real primary key is what’s stored inside the property internalid. The value of this property is the MD5 digest of the contentid, concatenated with the name of the content collection it is stored in. Let’s double-check, using our example. We had these two:
Calculating the MD5 digest of the contentid correctly yields the internalid (sans the collection suffix):
md5(“ssic://33”) == “8a832873c701c00135ce827d6c64c09c”
Since the internalid is suffixed with the name of the collection, we can actually put several items with the same contentid into the index, as long as they’re stored in separate collections so that the concatenated internalid value becomes unique. In FS4SP, however, we often use only the default “sp” collection. Luckily, SharePoint makes sure to assign the Item IDs so that they’re unique across all collections, hence creating unique internalid:s even though the items are in the same collection.
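As a rough sketch of the derivation, the internalid can be reproduced with Python’s standard hashlib module. Two details here are assumptions and not taken from the post: that the contentid is hashed as UTF-8 bytes, and that an underscore separates the digest from the collection name. Compare the result with what your own qrserver reports before relying on it.

```python
import hashlib


def derive_internalid(contentid: str, collection: str = "sp", sep: str = "_") -> str:
    """Sketch: MD5 hex digest of the contentid, suffixed with the collection.

    Assumptions (not confirmed by the post): UTF-8 encoding of the
    contentid, and "_" as the separator before the collection name.
    """
    digest = hashlib.md5(contentid.encode("utf-8")).hexdigest()
    return f"{digest}{sep}{collection}"


# The same contentid in two collections shares the digest part
# but gets a different internalid because of the suffix.
in_sp = derive_internalid("ssic://33", "sp")
in_sp2 = derive_internalid("ssic://33", "sp2")
print(in_sp)
print(in_sp2)
```

The point of the example is the suffix: the digest part is identical for both calls, and only the collection name keeps the two internalid values apart.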
However, there are ways in FS4SP to index data without going through the FAST Content SSA, i.e. you can index data without letting SharePoint know about it. This happens when you’re using any of the FAST Search specific connectors or the command-line tool docpush. These tools talk directly with the index, bypassing SharePoint completely. Thus, the item won’t be assigned a contentid using the “ssic://” pattern.
So, what happens instead? Let’s try it out. Using the docpush tool, we can send an arbitrary web page into the index:
Using the qrserver web interface, we inspect what was indexed:
In this case, the contentid property is the URL of the web page we specified. This makes sense as the URL is unique for the whole web, and as such it is also a good candidate for being a primary key in the index. A URL is just a special case of a URI, which is what many of the FS4SP command-line tools use when referring to the primary key of the index. Examples are the docpush tool (when deleting an item from the index with the -d switch) and the waadmin tool (used for retrieving link cardinality data for an indexed item).
To sum up with some key points:
- The primary key of the index is stored in a property called contentid, though in the SharePoint GUIs it is referred to as an Item ID and looks slightly different. They relate to one another as: [contentid] = ssic://[item id]
- Items that are indexed using the connectors of the FAST Content SSA are assigned a contentid of the form “ssic://”…
- Items that are indexed with the FAST Search specific connectors or the docpush tool get a contentid that does not follow the same pattern; it is typically a proper URL or a value from a database.
- If a command-line tool calls for a “URI” to an indexed item, use whatever is stored in the item’s contentid property.
Good explanation Marcus!
The good thing about using MD5 digests is that the index can be independent of any crawler framework and still generate an internal ID to represent the item.
The bad part, however, is that there is a chance of ID overlap, although a minuscule one, as it’s a checksum.
Storing an integer in the search index also takes less space than an MD5 digest and will in most cases be more optimal. Time will tell if we can still use multiple crawler frameworks in the future, or if MS optimizes it by forcing everything through the SP crawler framework. Having one crawler framework makes maintenance a bit easier imo.
Timely tip indeed. I was trying to figure out why some items in my index have a contentid that is not an integer. Microsoft has a PowerShell script (GetFiXML) that seems to require an integer contentid. So I guess I am out of luck when it comes to getting FiXML for content indexed by FAST Web Crawler.
Glad it was useful!
I guess you’re thinking of this tool http://gallery.technet.microsoft.com/scriptcenter/14105abb-29da-43fd-90f4-ac12f1a0233a ?
It asks for the internalid and the contentid, so in your case the contentid should be the full URL that was crawled, and the internalid is derived from the contentid as explained in the post above.
Just to elaborate on “Luckily, SharePoint makes sure to assign the Item IDs so that they’re unique across all collections, hence creating unique internalid:s even though the items are in the same collection.”
SharePoint will generate this unique ID with a counter. The counter is stored in the Content SSA, and this is the reason why you cannot have more than one Content SSA: they would generate the same ID for different documents.
This is somewhat correct but also wrong.
If you add another Content SSA and point it towards a different collection than the first one, for example “sp2” instead of “sp”, this will work just fine as the collection name is appended to the internal id in FS4SP, and you will not get a collision on IDs in FS4SP. Yes, the same ID will appear in two Content SSAs, but this works just fine.
Thanks for explaining in depth the FS4SP primary key concept with respect to internalid and contentid.