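The cardinality aggregation computes an approximate count of distinct values. But why does it return an incorrect value even for an index that is stored in a single shard?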
GET /jobs/_settings
{
"jobs": {
"settings": {
"index": {
"number_of_shards": "1",
...
position_id is mapped as long.
GET /jobs/_search
{
"size": 0,
"aggs": {
"count_position_id": {
"value_count": {
"field": "position_id"
}
},
"unique_position_id": {
"cardinality": {
"field": "position_id",
"precision_threshold": 40000
}
}
}
}
{
"took": 44,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 52836,
"max_score": 0,
"hits": []
},
"aggregations": {
"unique_position_id": {
"value": 52930
},
"count_position_id": {
"value": 52836
}
}
}
1 Answer
This has more to do with the algorithm used to compute the cardinality than with the number of shards involved.
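The ES cardinality aggregation is built on HLL (HyperLogLog; Elasticsearch uses the HyperLogLog++ variant), which is an approximate counting algorithm: it hashes each value and inspects the binary representation of the hashes to estimate the number of unique values.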
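To make the idea concrete, here is a minimal sketch of a plain HyperLogLog estimator in Python. It is not Elasticsearch's implementation (that is HyperLogLog++, with extra bias correction and a sparse encoding); the function name `hll_estimate`, the register count `p=14`, and the use of SHA-1 as the hash are assumptions chosen for illustration only. It shows that the estimate is approximate even when all data sits in one place.

```python
import hashlib
import math


def hll_estimate(values, p=14):
    """Estimate the number of distinct values with a basic HyperLogLog.

    Illustrative only: hypothetical helper, not Elasticsearch's code.
    """
    m = 1 << p                       # number of registers (2^p)
    registers = [0] * m
    for v in values:
        # Take 64 bits of a hash of the value (SHA-1 is a stand-in hash).
        digest = hashlib.sha1(str(v).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                  # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)         # standard constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


if __name__ == "__main__":
    n = 52836
    estimate = hll_estimate(range(n))
    print(f"true distinct: {n}, estimated: {estimate:.0f}")
```

The estimate typically lands within about a percent of the true count but rarely matches it exactly, and it can fall on either side of it, which is the same behaviour seen in the question.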
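You can trade memory for accuracy by raising precision_threshold, although 40000, the value used in the query above, is already the maximum Elasticsearch supports. With 52836 values the count is above that threshold, so the result is, by definition, an "approximate count" and not really incorrect.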
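For a sense of scale, the short calculation below works out how far off the reported cardinality is, using only the two numbers from the response in the question. The variable names are illustrative.

```python
value_count = 52836   # from the value_count aggregation in the response
cardinality = 52930   # from the cardinality aggregation in the response

# A distinct count can never exceed the number of values, so any excess
# over value_count is pure estimation error.
overshoot = cardinality - value_count
relative_error = overshoot / value_count

print(f"overshoot: {overshoot}, relative error: {relative_error:.4%}")
```

The overshoot is 94 values, or roughly 0.18% of the total, which is a small error for an approximate counter.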