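Elasticsearch inconsistent facet counts

Time: 2014-02-03 22:54:04

Tags: elasticsearch

I seem to be getting inconsistent facet counts, and I would like to understand why. I ran the two queries below, which differ only in size, and at least one term comes back with a different count: term 21 (near the bottom) is 948 in the first result and 1035 in the second. Term 43, at the very bottom of the first result, also differs (513 vs 910).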

Query #1:

{'facets': {'primary_country_id': {'terms': {'field': 'primary_country_id', 'size': '20'}}}}

Query #2:

{'facets': {'primary_country_id': {'terms': {'field': 'primary_country_id', 'size': '30'}}}}

Results of query #1:

{
  "primary_country_id": {
    "_type": "terms",
    "missing": 3475,
    "total": 312111,
    "other": 4460,
    "terms": [
      {
        "term": 41,
        "count": 187293
      },
      {
        "term": 9,
        "count": 24177
      },
      {
        "term": 50,
        "count": 17200
      },
      {
        "term": 15,
        "count": 13015
      },
      {
        "term": 30,
        "count": 10296
      },
      {
        "term": 32,
        "count": 8824
      },
      {
        "term": 6,
        "count": 7703
      },
      {
        "term": 23,
        "count": 7502
      },
      {
        "term": 2,
        "count": 5614
      },
      {
        "term": 33,
        "count": 5214
      },
      {
        "term": 16,
        "count": 4691
      },
      {
        "term": 24,
        "count": 3560
      },
      {
        "term": 31,
        "count": 3126
      },
      {
        "term": 7,
        "count": 2748
      },
      {
        "term": 12,
        "count": 1430
      },
      {
        "term": 19,
        "count": 1403
      },
      {
        "term": 8,
        "count": 1342
      },
      {
        "term": 46,
        "count": 1052
      },
      {
        "term": 21,
        "count": 948
      },
      {
        "term": 43,
        "count": 513
      }
    ]
  }
}

Results of query #2:

{
  "primary_country_id": {
    "_type": "terms",
    "missing": 3475,
    "total": 312111,
    "other": 0,
    "terms": [
      {
        "term": 41,
        "count": 187293
      },
      {
        "term": 9,
        "count": 24177
      },
      {
        "term": 50,
        "count": 17200
      },
      {
        "term": 15,
        "count": 13015
      },
      {
        "term": 30,
        "count": 10296
      },
      {
        "term": 32,
        "count": 8824
      },
      {
        "term": 6,
        "count": 7703
      },
      {
        "term": 23,
        "count": 7502
      },
      {
        "term": 2,
        "count": 5614
      },
      {
        "term": 33,
        "count": 5214
      },
      {
        "term": 16,
        "count": 4691
      },
      {
        "term": 24,
        "count": 3560
      },
      {
        "term": 31,
        "count": 3126
      },
      {
        "term": 7,
        "count": 2748
      },
      {
        "term": 12,
        "count": 1430
      },
      {
        "term": 19,
        "count": 1403
      },
      {
        "term": 8,
        "count": 1342
      },
      {
        "term": 46,
        "count": 1052
      },
      {
        "term": 21,
        "count": 1035
      },
      {
        "term": 43,
        "count": 910
      },
      {
        "term": 22,
        "count": 906
      },
      {
        "term": 13,
        "count": 717
      },
      {
        "term": 28,
        "count": 690
      },
      {
        "term": 38,
        "count": 415
      },
      {
        "term": 26,
        "count": 352
      },
      {
        "term": 37,
        "count": 295
      },
      {
        "term": 25,
        "count": 208
      },
      {
        "term": 34,
        "count": 207
      },
      {
        "term": 4,
        "count": 94
      },
      {
        "term": 48,
        "count": 92
      }
    ]
  }
}
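To spot which terms differ without reading both lists by eye, a small comparison helper can be used. This is only a sketch: the function name diff_facet_counts is hypothetical, and the two inputs are abbreviated copies of the results above rather than full responses.

```python
def diff_facet_counts(first, second):
    """Return {term: (count_in_first, count_in_second)} for terms that
    appear in both facet results but with different counts."""
    first_counts = {entry["term"]: entry["count"] for entry in first["terms"]}
    diffs = {}
    for entry in second["terms"]:
        term = entry["term"]
        if term in first_counts and first_counts[term] != entry["count"]:
            diffs[term] = (first_counts[term], entry["count"])
    return diffs


# Abbreviated copies of the two results above (tail of each terms list only).
result_size_20 = {
    "terms": [
        {"term": 46, "count": 1052},
        {"term": 21, "count": 948},
        {"term": 43, "count": 513},
    ]
}
result_size_30 = {
    "terms": [
        {"term": 46, "count": 1052},
        {"term": 21, "count": 1035},
        {"term": 43, "count": 910},
        {"term": 22, "count": 906},
    ]
}

differences = diff_facet_counts(result_size_20, result_size_30)
print(differences)  # {21: (948, 1035), 43: (513, 910)}
```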

2 answers:

Answer 0 (score: 1)

Answer 1 (score: 1)

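This can happen in any distributed system and, as mentioned in the other answer, there is a GitHub issue for it. The only solution that is 100% guaranteed is to use a single shard, but that does not scale.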
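To see why merging per-shard results can undercount a term, here is a simplified model in Python. It is not Elasticsearch code: the function sharded_top_terms and the shard contents are made up for illustration, and the model only assumes that each shard reports its own most frequent terms and that the reports are then summed.

```python
from collections import Counter


def sharded_top_terms(shards, size, shard_size):
    """Merge per-shard top lists the way a coordinating node might.

    Each shard reports only its `shard_size` most frequent terms; the
    summed totals are then cut down to the `size` most frequent terms.
    """
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Made-up data: three shards. True totals are a=30, b=17, c=12.
shards = [
    ["a"] * 10 + ["b"] * 6 + ["c"] * 5,
    ["a"] * 10 + ["c"] * 6 + ["b"] * 5,
    ["a"] * 10 + ["b"] * 6 + ["c"] * 1,
]

# The second shard ranks "b" third, so with shard_size=2 its 5 hits are never reported.
print(sharded_top_terms(shards, size=2, shard_size=2))  # [('a', 30), ('b', 12)]

# Asking every shard for all 3 distinct terms gives the exact totals.
print(sharded_top_terms(shards, size=2, shard_size=3))  # [('a', 30), ('b', 17)]
```

In this model a term near the cutoff loses whatever counts sit on shards where it did not make the local top list, which is the same shape of error as the 948 vs 1035 difference in the question.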

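The problem shows up with high-cardinality fields, that is, fields with many unique terms. You can use the shard_size parameter to control how many facet entries are requested from each shard; it can differ from size (default 10), which is the number of entries returned to you. For example, setting size to 10 and shard_size to 100 should improve things, but it does not guarantee that every count is exact; it only reduces the chance of seeing wrong counts. Whether you still get wrong counts depends on the cardinality of the field you are faceting on. As you can imagine, if a field has 100 unique terms, setting shard_size to at least that number guarantees that the counts are always exact.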
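As a sketch, the request body for the field in the question with size 10 and shard_size 100 could be built as follows. The helper terms_facet_request is hypothetical, and this assumes an Elasticsearch version whose terms facet accepts shard_size; it only builds and prints the body and does not contact a cluster.

```python
import json


def terms_facet_request(field, size, shard_size):
    """Build a terms facet request body (hypothetical helper).

    `size` is the number of entries returned to the caller;
    `shard_size` is the number of entries requested from each shard.
    """
    return {
        "facets": {
            field: {
                "terms": {
                    "field": field,
                    "size": size,
                    "shard_size": shard_size,
                }
            }
        }
    }


body = terms_facet_request("primary_country_id", size=10, shard_size=100)
print(json.dumps(body, indent=2))
```

Note that size and shard_size are passed as integers here rather than as quoted strings.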