A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.
Example use cases:
significant_terms
Example:
A query on StackOverflow data for the popular term javascript OR the rarer term
kibana will match many documents - most of them missing the word Kibana. To focus
the significant_terms aggregation on top-scoring documents that are more likely to match
the most interesting parts of our query we use a sample.
POST /stackoverflow/_search?size=0
{
"query": {
"query_string": {
"query": "tags:kibana OR tags:javascript"
}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 200
},
"aggs": {
"keywords": {
"significant_terms": {
"field": "tags",
"exclude": ["kibana", "javascript"]
}
}
}
}
}
}Response:
{
...
"aggregations": {
"sample": {
"doc_count": 200,
"keywords": {
"doc_count": 200,
"bg_count": 650,
"buckets": [
{
"key": "elasticsearch",
"doc_count": 150,
"score": 1.078125,
"bg_count": 200
},
{
"key": "logstash",
"doc_count": 50,
"score": 0.5625,
"bg_count": 50
}
]
}
}
}
}200 documents were sampled in total. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded. |
Without the sampler aggregation the request query considers the full "long tail" of low-quality matches and therefore identifies
less significant terms such as jquery and angular rather than focusing on the more insightful Kibana-related terms.
POST /stackoverflow/_search?size=0
{
"query": {
"query_string": {
"query": "tags:kibana OR tags:javascript"
}
},
"aggs": {
"low_quality_keywords": {
"significant_terms": {
"field": "tags",
"size": 3,
"exclude":["kibana", "javascript"]
}
}
}
}Response:
{
...
"aggregations": {
"low_quality_keywords": {
"doc_count": 600,
"bg_count": 650,
"buckets": [
{
"key": "angular",
"doc_count": 200,
"score": 0.02777,
"bg_count": 200
},
{
"key": "jquery",
"doc_count": 200,
"score": 0.02777,
"bg_count": 200
},
{
"key": "logstash",
"doc_count": 50,
"score": 0.0069,
"bg_count": 50
}
]
}
}
}The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a terms aggregation which has the collect_mode switched from the default depth_first mode to breadth_first as this discards scores.
In this situation an error will be thrown.