Question

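I have a large data set and I want a sample of it that I can use in a graph. For this I don't need all of the data, just every Nth item.

For instance, if I have 4000 results and only need 800, I want to be able to get every 5th result.

So something like: get, skip, skip, skip, skip, get, skip, skip, skip, ...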
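
Outside Elasticsearch, the pattern I'm after is just a stride over an ordered list. A minimal Python sketch, where `results` is a hypothetical stand-in for the 4000 hits:

```python
# Hypothetical stand-in for 4000 ordered results.
results = list(range(4000))

# Keep one item, skip the next four: every 5th item, starting with the first.
step = len(results) // 800
sample = results[::step]

print(step, len(sample), sample[:3])
```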

Is such a thing possible in Elasticsearch?

Answer 1

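You're better off using a script filter. Otherwise you're computing scores you don't need. Filters are just like queries, except that they don't do any scoring.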

POST /test_index/_search
{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc['unique_counter'].value % n == 0",
          "params" : {
            "n" : 5
          }
        }
      }
    }
  }
}
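
The `filtered` query above is from the Elasticsearch 1.x era; it was deprecated in 2.0 and removed in 5.0. As a rough sketch for later versions, the same filter can be expressed with a `bool` query and a Painless script, where parameters are read through `params`. The helper name `every_nth_filter` is made up for illustration, `unique_counter` is assumed to be a numeric field holding a gap-free sequence number, and the body should be checked against the version you actually run.

```python
import json

def every_nth_filter(field, n):
    """Build a search body that keeps documents whose counter is divisible by n.

    Hypothetical helper: it only assembles the request body and assumes
    `field` is a numeric field with doc values.
    """
    return {
        "query": {
            "bool": {
                "filter": {
                    "script": {
                        "script": {
                            "lang": "painless",
                            # Parameters are accessed through `params` in Painless.
                            "source": f"doc['{field}'].value % params.n == 0",
                            "params": {"n": n},
                        }
                    }
                }
            }
        }
    }

body = every_nth_filter("unique_counter", 5)
print(json.dumps(body, indent=2))
```

Passing `n` as a parameter rather than baking it into the script text lets Elasticsearch reuse the compiled script across different values of `n`.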

You're better off not using dynamic scripting in real-world usage.

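That said, you probably want to look at aggregations for graphing analytical information about your data, rather than taking an arbitrary sample.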
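
As one possible illustration of the aggregation route, a `histogram` aggregation can bucket documents by the counter and return one averaged point per bucket instead of a raw sample. The field name `value` and the helper `downsample_agg` are hypothetical, since the question does not say what is being charted.

```python
def downsample_agg(counter_field, value_field, interval):
    """Build a search body that returns one averaged point per bucket.

    Hypothetical helper: `value_field` stands for whatever numeric field
    is being charted.
    """
    return {
        "size": 0,  # only the buckets are needed, not the hits themselves
        "aggs": {
            "buckets": {
                "histogram": {"field": counter_field, "interval": interval},
                "aggs": {"avg_value": {"avg": {"field": value_field}}},
            }
        },
    }

# 4000 consecutive counter values with interval 5 would give roughly 800 buckets.
agg_body = downsample_agg("unique_counter", "value", 5)
print(agg_body)
```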

Answer 2

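One way you could do it is with random scoring. It won't give you precisely every Nth item according to a rigid ordering, but if you can relax that requirement, this trick should do nicely.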

To test it I set up a simple index (I mapped "doc_id" to "_id" just so the documents would have some content, so that part isn't required, in case that isn't obvious):

PUT /test_index
{
   "mappings": {
      "doc": {
         "_id": {
            "path": "doc_id"
         }
      }
   }
}

Then I indexed ten simple documents:

POST /test_index/doc/_bulk
{"index":{}}
{"doc_id":1}
{"index":{}}
{"doc_id":2}
{"index":{}}
{"doc_id":3}
{"index":{}}
{"doc_id":4}
{"index":{}}
{"doc_id":5}
{"index":{}}
{"doc_id":6}
{"index":{}}
{"doc_id":7}
{"index":{}}
{"doc_id":8}
{"index":{}}
{"doc_id":9}
{"index":{}}
{"doc_id":10}

Now I can pull back three random documents like this:

POST /test_index/_search
{
   "size": 3,
   "query": {
      "function_score": {
         "functions": [
            {
               "random_score": {
                  "seed": "some seed"
               }
            }
         ]
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 10,
      "max_score": 0.93746644,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.93746644,
            "_source": {
               "doc_id": 1
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "10",
            "_score": 0.926947,
            "_source": {
               "doc_id": 10
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "5",
            "_score": 0.79400194,
            "_source": {
               "doc_id": 5
            }
         }
      ]
   }
}

Or a different random three like this:

POST /test_index/_search
{
   "size": 3,
   "query": {
      "function_score": {
         "functions": [
            {
               "random_score": {
                  "seed": "some other seed"
               }
            }
         ]
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 10,
      "max_score": 0.817295,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "4",
            "_score": 0.817295,
            "_source": {
               "doc_id": 4
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "8",
            "_score": 0.469319,
            "_source": {
               "doc_id": 8
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 0.4374538,
            "_source": {
               "doc_id": 3
            }
         }
      ]
   }
}

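Hopefully it's clear how to generalize this method to what you need. Just pull back as many documents as you want, in however many chunks keep it performant.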
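
One way to read "however many chunks" is paging with `from` and `size` while holding the seed fixed, so that every page is cut from the same random ordering. A sketch under that assumption (and assuming the index does not change between requests); `random_sample_pages` is a hypothetical helper that only builds request bodies in the same shape as the queries above:

```python
def random_sample_pages(total, chunk, seed):
    """Yield search bodies that together fetch `total` randomly scored hits."""
    for start in range(0, total, chunk):
        yield {
            "from": start,
            # The last page may be shorter than the others.
            "size": min(chunk, total - start),
            "query": {
                "function_score": {
                    "functions": [{"random_score": {"seed": seed}}]
                }
            },
        }

pages = list(random_sample_pages(800, 250, "some seed"))
print([(p["from"], p["size"]) for p in pages])
```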

Here is all the code I used to test:

http://sense.qbox.io/gist/a02d4da458365915f5e9cf6ea80546d2dfabc75d

EDIT: Actually, now that I think about it, you could also use scripted scoring to get precisely every Nth item, if you set it up right. Maybe something like this:

POST /test_index/_search
{
   "size": 3,
   "query": {
      "function_score": {
         "functions": [
            {
               "script_score": {
                  "script": "if(doc['doc_id'].value % 3 == 0){ return 1 }; return 0;"
               }
            }
         ]
      }
   }
}
...
{
   "took": 13,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 10,
      "max_score": 1,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "3",
            "_score": 1,
            "_source": {
               "doc_id": 3
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "6",
            "_score": 1,
            "_source": {
               "doc_id": 6
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "9",
            "_score": 1,
            "_source": {
               "doc_id": 9
            }
         }
      ]
   }
}
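
Note that this query scores documents rather than filtering them, so it returns exactly the every-third documents only because `size` equals the number of matches; a larger `size` would start pulling in documents that scored 0. A small local Python simulation of that behaviour (no Elasticsearch involved; the tie-breaking order among equal scores is an assumption here, and Elasticsearch makes no such promise):

```python
# The ten test documents, identified by doc_id.
doc_ids = list(range(1, 11))

def score(doc_id, n=3):
    # Mirrors the script: 1 for every n-th document, 0 otherwise.
    return 1 if doc_id % n == 0 else 0

def top_hits(size):
    # Sort by descending score; Python's sort is stable, so ties keep id order.
    ranked = sorted(doc_ids, key=lambda d: -score(d))
    return ranked[:size]

print(top_hits(3))  # [3, 6, 9]
print(top_hits(5))  # [3, 6, 9, 1, 2]
```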

Get every Nth result in Elasticsearch
