<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments on: How Elasticsearch calculates significant terms</title>
	<atom:link href="http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/</link>
	<description>A blog about Search as THE solution</description>
	<lastBuildDate>Mon, 26 Oct 2015 18:07:52 +0000</lastBuildDate>
		<sy:updatePeriod>hourly</sy:updatePeriod>
		<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=3.9.40</generator>
	<item>
		<title>By: André Lynum</title>
		<link>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/#comment-19128</link>
		<dc:creator><![CDATA[André Lynum]]></dc:creator>
		<pubDate>Thu, 11 Jun 2015 10:02:39 +0000</pubDate>
		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3785#comment-19128</guid>
		<description><![CDATA[Ayup, thanks for letting us in on the name and for mentioning the new sampler aggregation :) That looks quite interesting.

The precision/recall viewpoint is indeed illuminating. I tend to shy away from it since it feels tied to a specific task or goal. I&#039;m not really sure I would call bgC3 more precise than Microsoft in this case, although bgC3 probably points to a more limited set of documents. It&#039;s all in the context, I guess.

If you can structure your data in a manner that supports your task, that is certainly a good idea. Chunking up words, I feel, is something that has to be done with more care. Pure collocational or n-gram approaches can end up emphasizing turns of phrase without interesting content; you might call them phrase stopwords. I&#039;d be more inclined to extract meaningful multiword units such as verb complexes or entities.

Doing sentence/paragraph docs incurs indexing costs and seems to me to be a poor man&#039;s estimate of the term frequency. It&#039;s certainly a practical solution, but it would be nice if there were an option for using the TF in the significant terms aggregation for foreground sets where that is feasible.

Anyhow, thanks for putting cool stuff like this into Elasticsearch, and I hope you&#039;ll find ways to make significant terms into an even more precise and flexible tool than it is today.

Talk soon!

André]]></description>
		<content:encoded><![CDATA[<p>Ayup, thanks for letting us in on the name and for mentioning the new sampler aggregation :) That looks quite interesting.</p>
<p>The precision/recall viewpoint is indeed illuminating. I tend to shy away from it since it feels tied to a specific task or goal. I&#8217;m not really sure I would call bgC3 more precise than Microsoft in this case, although bgC3 probably points to a more limited set of documents. It&#8217;s all in the context, I guess.</p>
<p>If you can structure your data in a manner that supports your task, that is certainly a good idea. Chunking up words, I feel, is something that has to be done with more care. Pure collocational or n-gram approaches can end up emphasizing turns of phrase without interesting content; you might call them phrase stopwords. I&#8217;d be more inclined to extract meaningful multiword units such as verb complexes or entities.</p>
<p>Doing sentence/paragraph docs incurs indexing costs and seems to me to be a poor man&#8217;s estimate of the term frequency. It&#8217;s certainly a practical solution, but it would be nice if there were an option for using the TF in the significant terms aggregation for foreground sets where that is feasible.</p>
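<p>For anyone who wants to try that practical solution anyway, here is a minimal sketch of the re-indexing step: each document becomes one doc per sentence, so that doc frequency over the foreground set approximates term frequency. The index name, field names, and the naive sentence splitting are all invented for illustration:</p>
<pre><code>import re
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def sentence_docs(doc_id, text):
    # Naive split on sentence-ending punctuation; real text wants a
    # proper sentence segmenter.
    for i, sentence in enumerate(re.split(r"[.!?]+\s+", text)):
        if sentence.strip():
            yield {
                "_index": "news_sentences",   # invented index name
                "_type": "sentence",          # mapping type, as in the 1.x-era API
                "_id": "%s-%d" % (doc_id, i),
                "_source": {"parent_doc": doc_id, "content": sentence},
            }

# Doc frequency over these sentence-docs now approximates term frequency.
helpers.bulk(es, sentence_docs(
    "doc42", "Bill Gates founded bgC3. Decades earlier he founded Microsoft."))
</code></pre>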
<p>Anyhow, thanks for putting cool stuff like this into Elasticsearch, and I hope you&#8217;ll find ways to make significant terms into an even more precise and flexible tool than it is today.</p>
<p>Talk soon!</p>
<p>André</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Mark</title>
		<link>http://blog.comperiosearch.com/blog/2015/06/10/how-elasticsearch-calculates-significant-terms/#comment-19127</link>
		<dc:creator><![CDATA[Mark]]></dc:creator>
		<pubDate>Thu, 11 Jun 2015 08:39:20 +0000</pubDate>
		<guid isPermaLink="false">http://blog.comperiosearch.com/?p=3785#comment-19127</guid>
		<description><![CDATA[Nice write-up. I plotted some of the other scoring heuristics here: https://twitter.com/elasticmark/status/513320986956292096

For me, the choice of scoring function is essentially a question of emphasis between precision and recall. If I analyse the results of a search for &quot;Bill Gates&quot;, do I suggest &quot;Microsoft&quot; or &quot;bgC3&quot; as the most significant term? The rarer bgC3 is more closely allied with the query (so high precision) but perhaps of less practical use due to poor recall. The inverse is true of Microsoft.

Putting aside the question of precision/recall emphasis, I have found the following features do the most to improve significance suggestions on text:
1) Removal of duplicate/near-duplicate text in the results being analysed
2) Discovery of phrases
3) Analysing top-matching samples, not all content (see the new &quot;sampler&quot; agg)
4) &quot;Chunking&quot; documents, e.g. into sentence-docs, if the foreground sample of docs is low in number.
Additionally, text analysis in Elasticsearch should ideally not rely on field data, to avoid memory issues.
These approaches are a little too detailed to get into here but are important parts of improving Elasticsearch&#039;s significance algos on free text.

As for the mystery of &quot;JLH&quot;: those are my daughter&#039;s initials :)
Cheers,
Mark]]></description>
		<content:encoded><![CDATA[<p>Nice write-up. I plotted some of the other scoring heuristics here: <a href="https://twitter.com/elasticmark/status/513320986956292096" rel="nofollow">https://twitter.com/elasticmark/status/513320986956292096</a></p>
<p>For me, the choice of scoring function is essentially a question of emphasis between precision and recall. If I analyse the results of a search for &#8220;Bill Gates&#8221;, do I suggest &#8220;Microsoft&#8221; or &#8220;bgC3&#8221; as the most significant term? The rarer bgC3 is more closely allied with the query (so high precision) but perhaps of less practical use due to poor recall. The inverse is true of Microsoft.</p>
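<p>To make that trade-off concrete: the default JLH heuristic multiplies the absolute change in a term&#8217;s popularity (foreground% minus background%) by the relative change (foreground% divided by background%), so a rare but tightly allied term gets a large boost from the ratio. A toy calculation, with all counts invented:</p>
<pre><code>def jlh(fg_count, fg_size, bg_count, bg_size):
    # JLH-style score: absolute change in popularity times relative change.
    fg = float(fg_count) / fg_size
    bg = float(bg_count) / bg_size
    return (fg - bg) * (fg / bg)

# Invented numbers: "microsoft" appears everywhere, "bgc3" is rare but
# concentrated in the "Bill Gates" foreground set.
print(jlh(60, 100, 100000, 1000000))  # microsoft: (0.6 - 0.1) * 6 = 3.0
print(jlh(4, 100, 8, 1000000))        # bgc3: roughly 0.04 * 5000 = 200
</code></pre>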
<p>Putting aside the question of precision/recall emphasis, I have found the following features do the most to improve significance suggestions on text:<br />
1) Removal of duplicate/near-duplicate text in the results being analysed<br />
2) Discovery of phrases<br />
3) Analysing top-matching samples, not all content (see the new &#8220;sampler&#8221; agg; a sketch follows after this list)<br />
4) &#8220;Chunking&#8221; documents, e.g. into sentence-docs, if the foreground sample of docs is low in number.<br />
Additionally, text analysis in Elasticsearch should ideally not rely on field data, to avoid memory issues.<br />
These approaches are a little too detailed to get into here but are important parts of improving Elasticsearch&#8217;s significance algos on free text.</p>
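<p>For point 3, a minimal sketch of wrapping significant_terms in the sampler agg might look like this (the &#8220;news&#8221; index, the &#8220;content&#8221; field, and the shard_size value are invented for illustration):</p>
<pre><code>from elasticsearch import Elasticsearch

es = Elasticsearch()
resp = es.search(index="news", body={
    "query": {"match": {"content": "Bill Gates"}},
    "size": 0,
    "aggs": {
        "best_hits": {
            # Only the top-scoring docs per shard feed the inner agg.
            "sampler": {"shard_size": 200},
            "aggs": {
                "keywords": {"significant_terms": {"field": "content"}}
            }
        }
    }
})
for bucket in resp["aggregations"]["best_hits"]["keywords"]["buckets"]:
    print(bucket["key"], bucket["score"])
</code></pre>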
<p>As for the mystery of &#8220;JLH&#8221;: those are my daughter&#8217;s initials :)<br />
Cheers,<br />
Mark</p>
]]></content:encoded>
	</item>
</channel>
</rss>
