What is the fastest way to get all the _ids of a certain index from ElasticSearch? Is it possible with a simple query? One of my indexes has around 20,000 documents.
Answer 0 (score: 56)
Edit: please also read @Aleck Landgraf's answer.
Do you just want the elasticsearch-internal _id field, or the id field from within your documents?
For the former, try:
curl 'http://localhost:9200/index/type/_search?pretty=true' -d '
{
    "query" : {
        "match_all" : {}
    },
    "stored_fields": []
}
'
Note, 2017 update: the post originally had "fields": [], but the name has since changed and stored_fields is the new value.
The result will only contain the "metadata" of your documents:
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "index",
      "_type" : "type",
      "_id" : "36",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "38",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "39",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "34",
      "_score" : 1.0
    } ]
  }
}
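If you just need those ids as a plain list, the response above can be post-processed in a couple of lines; a minimal sketch, assuming the curl output has been captured into a string named raw:
import json

# `raw` is assumed to hold the JSON body returned by the curl request above
response = json.loads(raw)
ids = [hit["_id"] for hit in response["hits"]["hits"]]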
For the latter, if you want to include a field from your document, simply add it to the fields array:
curl 'http://localhost:9200/index/type/_search?pretty=true' -d '
{
    "query" : {
        "match_all" : {}
    },
    "fields": ["document_field_to_be_returned"]
}
'
Answer 1 (score: 39)
It is better to use scroll and scan to get the result list, so that elasticsearch doesn't have to rank and sort the results.
With the elasticsearch-dsl python lib, this can be accomplished by:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)
s = s.fields([]) # only get ids, otherwise `fields` takes a list of field names
ids = [h.meta.id for h in s.scan()]
Console log:
GET http://localhost:9200/my_index/my_doc/_search?search_type=scan&scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
...
Note: scroll pulls batches of results from a query and keeps the cursor open for a given amount of time (1 minute, 2 minutes, which you can adjust); scan disables sorting. The scan helper function returns a python generator which can be safely iterated through.
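If the default keep-alive or batch size needs tuning, extra parameters can presumably be forwarded through the Search object before calling scan(); a sketch, assuming a newer elasticsearch-dsl where source() replaces fields() and params() is passed through to the underlying scroll helper (my_index is a placeholder):
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()

# keep each scroll context alive for 10 minutes and pull larger batches;
# source(False) skips the document body so only hit metadata comes back
s = Search(using=es, index="my_index").params(scroll="10m", size=1000)
s = s.source(False)
ids = [h.meta.id for h in s.scan()]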
Answer 2 (score: 14)
Another option:
curl 'http://localhost:9200/index/type/_search?pretty=true&fields='
This will return only _index, _type, _id and _score.
Answer 3 (score: 13)
For elasticsearch 5.x, you can use the "_source" field:
GET /_search
{
    "_source": false,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}
"fields"
已被弃用。
(错误:“不再支持字段[字段],如果字段未存储,请使用[stored_fields]检索存储的字段或_source过滤”)
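A minimal Python sketch of the same idea for 5.x and later (my_index is a placeholder; filter_path, which the client forwards as a query parameter, trims the response down to just the ids):
from elasticsearch import Elasticsearch

es = Elasticsearch()

# "_source": False returns only hit metadata; filter_path strips
# everything from the response except the _id values themselves.
# 10,000 is the default result-window cap; use scroll/scan beyond that.
resp = es.search(
    index="my_index",  # placeholder index name
    body={"query": {"match_all": {}}, "_source": False, "size": 10000},
    filter_path="hits.hits._id",
)
ids = [hit["_id"] for hit in resp.get("hits", {}).get("hits", [])]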
Answer 4 (score: 3)
You can also do this in python, which gives you a proper list:
import elasticsearch

es = elasticsearch.Elasticsearch()
res = es.search(
    index=your_index,
    body={"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]})
ids = [d['_id'] for d in res['hits']['hits']]
Answer 5 (score: 2)
Inspired by @Aleck-Landgraf's answer, for me it worked by using the scan function directly from the standard elasticsearch python API:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()
for dobj in scan(es,
                 query={"query": {"match_all": {}}, "fields": []},
                 index="your-index-name", doc_type="your-doc-type"):
    print(dobj["_id"])
Answer 6 (score: 1)
Elaborating on the two answers from @Robert-Lujo and @Aleck-Landgraf (anyone with the permissions is welcome to turn this into a comment): if you don't want to print, but want to get everything from the returned generator into a list, use the following:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts=[YOUR_ES_HOST])
a = helpers.scan(es, query={"query": {"match_all": {}}}, scroll='1m', index=INDEX_NAME)  # like others so far
IDs = [aa['_id'] for aa in a]
Answer 7 (score: 0)
For Python users: the python elasticsearch client provides a convenient abstraction for the scroll API:
from elasticsearch import Elasticsearch, helpers

client = Elasticsearch()
query = {
    "query": {
        "match_all": {}
    }
}
scan = helpers.scan(client, index=index, query=query, scroll='1m', size=100)
for doc in scan:
    # do something with each hit, e.g. collect doc['_id']
    ...
Answer 8 (score: 0)
I know this post has a lot of answers, but I want to combine several of them to document what I've found to be the fastest (in Python, anyway). I'm dealing with hundreds of millions of documents, rather than thousands.
The helpers class can be used with a sliced scroll, which allows multi-threaded execution. In my case I also have a high-cardinality field to supply (acquired_at). You'll see I set max_workers to 14, but you may want to change this depending on your machine.
Additionally, I store the doc ids in compressed format. If you're curious, you can check how many bytes your doc ids will be and estimate the final dump size.
import gzip
from concurrent import futures
from elasticsearch import helpers

# note: below I have es, index, and cluster_name variables already set
max_workers = 14
scroll_slice_ids = list(range(0, max_workers))

def get_doc_ids(scroll_slice_id):
    count = 0
    with gzip.open('/tmp/doc_ids_%i.txt.gz' % scroll_slice_id, 'wt') as results_file:
        query = {"sort": ["_doc"],
                 "slice": {"field": "acquired_at", "id": scroll_slice_id, "max": len(scroll_slice_ids) + 1},
                 "_source": False}
        scan = helpers.scan(es, index=index, query=query, scroll='10m', size=10000, request_timeout=600)
        for doc in scan:
            count += 1
            results_file.write(doc['_id'] + '\n')
            results_file.flush()
    return count

if __name__ == '__main__':
    print('attempting to dump doc ids from %s in %i slices' % (cluster_name, len(scroll_slice_ids)))
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        doc_counts = executor.map(get_doc_ids, scroll_slice_ids)
If you want to keep track of how many ids ended up in the files, you can use unpigz -c /tmp/doc_ids_4.txt.gz | wc -l.
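The same count can also be done from Python across all slices at once; a small sketch, assuming the per-slice dump files were written by get_doc_ids() above:
import glob
import gzip

# sum the ids written by every scroll slice
total = 0
for path in glob.glob("/tmp/doc_ids_*.txt.gz"):
    with gzip.open(path, "rt") as f:
        total += sum(1 for _ in f)
print(total)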
Answer 9 (score: 0)
This is working:
def select_ids(self, **kwargs):
    """
    :param kwargs: params from modules
    :return: array of incidents
    """
    index = kwargs.get('index')
    if not index:
        return None

    # print("Params", kwargs)
    query = self._build_query(**kwargs)
    # print("Query", query)

    # get results
    results = self._db_client.search(body=query, index=index, stored_fields=[], filter_path="hits.hits._id")
    print(results)
    ids = [_['_id'] for _ in results['hits']['hits']]
    return ids
Answer 10 (score: -1)
Url -> http://localhost:9200/<index>/<type>/_search
http method -> GET
Query -> {"query": {"match_all": {}}, "size": 30000, "fields": ["_id"]}