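I have a large dataset and I want a sample of it that I can use in a chart. For that I don't need all the data; I need every Nth item.
For example, if I have 4000 results and only need 800, I want to be able to get every 5th result.
So something like: get, skip, skip, skip, skip, get, skip, skip, skip, ...
Is it possible to do this in Elasticsearch?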
Answer 0 (score: 4)
You're better off using a script filter. Otherwise you compute scores needlessly. Filters are like queries, except that they don't do any scoring.
POST /test_index/_search
{
  "query": {
    "filtered": {
      "filter": {
        "script": {
          "script": "doc['unique_counter'].value % n == 0",
          "params": {
            "n": 5
          }
        }
      }
    }
  }
}
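You are better off not using dynamic scripting in a real-world deployment, though.
That said, rather than taking an arbitrary sample, you may want to look at aggregations for plotting analytic information about your data.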
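As a rough illustration of the script filter above, the Python sketch below derives the step n from the total hit count and the desired sample size, then builds the same request body as a plain dict. The helper names (`sampling_step`, `build_every_nth_query`) are made up for this example, and it assumes, as the answer does, that every document carries a gap-free integer field called `unique_counter`. Note that the `filtered` query was deprecated in Elasticsearch 2.0 and removed in 5.0, so on newer clusters the same filter would go in a `bool` query's `filter` clause instead.

```python
import json


def sampling_step(total_hits, wanted):
    """Return n such that keeping every nth document yields roughly `wanted` hits."""
    if wanted <= 0:
        raise ValueError("wanted must be positive")
    return max(1, total_hits // wanted)


def build_every_nth_query(n, field="unique_counter"):
    """Build the script-filter request body from the answer as a dict.

    Assumption: `field` holds a gap-free integer counter on every document.
    """
    return {
        "query": {
            "filtered": {
                "filter": {
                    "script": {
                        "script": f"doc['{field}'].value % n == 0",
                        "params": {"n": n},
                    }
                }
            }
        }
    }


n = sampling_step(4000, 800)  # the 4000 -> 800 case from the question
body = build_every_nth_query(n)
print(json.dumps(body, indent=2))

# Local check of what the modulo test keeps for counters 0..3999.
kept = [counter for counter in range(4000) if counter % n == 0]
print(len(kept))
```

If the counter has gaps (for example after deletes), the filter still works but the sample is no longer exactly one document in every n.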
Answer 1 (score: 2)
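One way you could do it is with random scoring. It won't give you exactly every nth item in strict order, but if you can relax that requirement, this trick should do nicely.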
To test it I set up a simple index (I mapped "doc_id" to "_id" just so the documents would have some content, so that part isn't required, in case that's not obvious):
PUT /test_index
{
  "mappings": {
    "doc": {
      "_id": {
        "path": "doc_id"
      }
    }
  }
}
Then I indexed ten simple documents:
POST /test_index/doc/_bulk
{"index":{}}
{"doc_id":1}
{"index":{}}
{"doc_id":2}
{"index":{}}
{"doc_id":3}
{"index":{}}
{"doc_id":4}
{"index":{}}
{"doc_id":5}
{"index":{}}
{"doc_id":6}
{"index":{}}
{"doc_id":7}
{"index":{}}
{"doc_id":8}
{"index":{}}
{"doc_id":9}
{"index":{}}
{"doc_id":10}
Now I can pull back three random documents:
POST /test_index/_search
{
  "size": 3,
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": "some seed"
          }
        }
      ]
    }
  }
}
...
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 10,
    "max_score": 0.93746644,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.93746644,
        "_source": {
          "doc_id": 1
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "10",
        "_score": 0.926947,
        "_source": {
          "doc_id": 10
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "5",
        "_score": 0.79400194,
        "_source": {
          "doc_id": 5
        }
      }
    ]
  }
}
Or a different random three, like this:
POST /test_index/_search
{
  "size": 3,
  "query": {
    "function_score": {
      "functions": [
        {
          "random_score": {
            "seed": "some other seed"
          }
        }
      ]
    }
  }
}
...
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 10,
    "max_score": 0.817295,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "4",
        "_score": 0.817295,
        "_source": {
          "doc_id": 4
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "8",
        "_score": 0.469319,
        "_source": {
          "doc_id": 8
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "3",
        "_score": 0.4374538,
        "_source": {
          "doc_id": 3
        }
      }
    ]
  }
}
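Hopefully it's clear how to generalize this method to what you need. Just pull back however many documents you want, in however many chunks keeps it performant.
Here is all the code I used to test: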
http://sense.qbox.io/gist/a02d4da458365915f5e9cf6ea80546d2dfabc75d
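To generalize the random-scoring request to any sample size, a small helper can build the body with `size` set to the number of points you want. This is only a sketch: `build_random_sample_query` and `simulate_random_sample` are hypothetical names, and the simulation merely mimics the idea of giving each document a seeded random score and keeping the top `size`. It does not reproduce the scores Elasticsearch would actually compute.

```python
import random


def build_random_sample_query(size, seed):
    """Build the function_score/random_score request body as a dict."""
    return {
        "size": size,
        "query": {
            "function_score": {
                "functions": [{"random_score": {"seed": seed}}]
            }
        },
    }


def simulate_random_sample(doc_ids, size, seed):
    """Local stand-in: score every id with a seeded RNG, keep the top `size`."""
    rng = random.Random(seed)
    scored = [(rng.random(), doc_id) for doc_id in doc_ids]
    scored.sort(reverse=True)  # highest score first, like search hits
    return [doc_id for _score, doc_id in scored[:size]]


body = build_random_sample_query(3, "some seed")
doc_ids = list(range(1, 11))  # the ten test documents
first = simulate_random_sample(doc_ids, 3, "some seed")
again = simulate_random_sample(doc_ids, 3, "some seed")
print(first, again)
```

Reusing a seed reproduces the same sample, which is handy if the chart should not change between page loads; changing the seed draws a different one.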
EDIT: Actually, now that I think about it, you could also use scripted scoring to get precisely every nth item, if you set it up right. Maybe something like:
POST /test_index/_search
{
  "size": 3,
  "query": {
    "function_score": {
      "functions": [
        {
          "script_score": {
            "script": "if(doc['doc_id'].value % 3 == 0){ return 1 }; return 0;"
          }
        }
      ]
    }
  }
}
...
{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 10,
    "max_score": 1,
    "hits": [
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "3",
        "_score": 1,
        "_source": {
          "doc_id": 3
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "6",
        "_score": 1,
        "_source": {
          "doc_id": 6
        }
      },
      {
        "_index": "test_index",
        "_type": "doc",
        "_id": "9",
        "_score": 1,
        "_source": {
          "doc_id": 9
        }
      }
    ]
  }
}
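One thing to keep in mind with this approach: `script_score` only ranks documents, it does not exclude any, so `size` has to equal the number of documents that pass the modulo test. The sketch below simulates that locally in Python. `script_score` and `top_hits` here are made-up helpers that mirror the script above, and the order of equally scored documents in real Elasticsearch results may differ from this simulation.

```python
def script_score(doc_id, n):
    """Mirror of the scoring script: 1 for every nth doc_id, otherwise 0."""
    return 1 if doc_id % n == 0 else 0


def top_hits(doc_ids, n, size):
    """Rank all ids by score (highest first) and keep the first `size`."""
    ranked = sorted(doc_ids, key=lambda d: script_score(d, n), reverse=True)
    return ranked[:size]


doc_ids = list(range(1, 11))  # the ten test documents

exact = top_hits(doc_ids, n=3, size=3)     # size matches the 3 multiples of 3
too_many = top_hits(doc_ids, n=3, size=5)  # size exceeds the number of matches

print(exact)
print(too_many)  # the last two entries have score 0 and are not multiples of 3
```

With 4000 documents and n = 5, that means asking for `size` 800; if the exact count is not known in advance, the script filter from the other answer avoids the problem because non-matching documents are dropped rather than ranked last.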