ElasticSearch Cardinality Question - Java 学习之路

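The cardinality aggregation computes an approximate count of distinct values. But why does it return an incorrect value even for an index that is stored in a single shard?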

GET /jobs/_settings

    {
      "jobs": {
        "settings": {
          "index": {
            "number_of_shards": "1",
    ...


The position_id field is of type long.

    GET /jobs/_search
    {
      "size": 0,
      "aggs": {
        "count_position_id": {
          "value_count": {
            "field": "position_id"
          }
        },
        "unique_position_id": {
          "cardinality": {
            "field": "position_id",
            "precision_threshold": 40000
          }
        }
      }
    }

    {
      "took": 44,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "failed": 0
      },
      "hits": {
        "total": 52836,
        "max_score": 0,
        "hits": []
      },
      "aggregations": {
        "unique_position_id": {
          "value": 52930
        },
        "count_position_id": {
          "value": 52836
        }
      }
    }

1 Answer


This has more to do with the algorithm used to compute the cardinality than with the fact that only a single shard is involved.

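The ES cardinality aggregation is built on HLL (HyperLogLog), an approximate counting algorithm. It hashes every value and estimates the number of unique values from patterns in the binary representation of those hashes, so the result carries a small error no matter how many shards the index has.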
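To make the idea concrete, here is a minimal sketch of a plain HyperLogLog counter in Python. It is not the code Elasticsearch runs; the class name, the SHA-1 hash, and the choice of p=12 (4096 registers) are assumptions made only for illustration. It shows why the estimate can land slightly above or below the true count even when all the data sits in one place.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog with 2**p registers (illustrative, not the ES implementation)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)          # linear counting for small sets
        return raw


if __name__ == "__main__":
    hll = HyperLogLog(p=12)
    for position_id in range(52836):                # 52836 distinct values
        hll.add(position_id)
    print("exact: 52836, estimated:", round(hll.estimate()))
```

Adding the same values a second time leaves the registers unchanged, which is why the structure counts distinct values; the price is that the result is an estimate whose typical relative error shrinks as the number of registers grows.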

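You can control the accuracy by raising precision_threshold. Note, however, that 40000 is the maximum supported value, and the query above already uses it; since the index holds more than 40000 values, some error is to be expected. So, by definition, this is an "approximate count", not really an incorrect one.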
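To put the size of the discrepancy in perspective, the numbers from the response in the question can be compared directly. This is only a rough sketch: it assumes the best case in which every document has a different position_id, so it gives a lower bound on the error rather than the exact error.

```python
value_count = 52836   # count_position_id from the response above
cardinality = 52930   # unique_position_id from the response above

# The number of distinct values can never exceed the number of values,
# so the estimate overshoots by at least this many.
min_overshoot = cardinality - value_count
min_relative_error = min_overshoot / value_count

print(f"overshoot >= {min_overshoot} ({min_relative_error:.2%})")
# prints: overshoot >= 94 (0.18%)
```

An error of this size is well within what an approximate counting algorithm is expected to produce.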

Answered on 2024-04-26T06:18:24+08:00
