I have millions of records in Elasticsearch. Today I realized that some of them are duplicated. Is there a way to delete these duplicate records?
This is my query:
{
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "must": [
                        { "match": { "sensorId": "14FA084408" } },
                        { "match": { "variableName": "FORWARD_FLOW" } }
                    ]
                }
            },
            "filter": {
                "range": {
                    "timestamp": {
                        "gt": "2015-07-04",
                        "lt": "2015-07-06"
                    }
                }
            }
        }
    }
}
And this is what I get back from it:
{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 21,
        "max_score": 8.272615,
        "hits": [
            {
                "_index": "iotsens-summarizedmeasures",
                "_type": "summarizedmeasure",
                "_id": "AU5isxVcMpd7AZtvmZcK",
                "_score": 8.272615,
                "_source": {
                    "id": null,
                    "sensorId": "14FA084408",
                    "variableName": "FORWARD_FLOW",
                    "rawValue": "0.2",
                    "value": "0.2",
                    "timestamp": 1436047200000,
                    "summaryTimeUnit": "DAYS"
                }
            },
            {
                "_index": "iotsens-summarizedmeasures",
                "_type": "summarizedmeasure",
                "_id": "AU5isxVnMpd7AZtvmZcL",
                "_score": 8.272615,
                "_source": {
                    "id": null,
                    "sensorId": "14FA084408",
                    "variableName": "FORWARD_FLOW",
                    "rawValue": "0.2",
                    "value": "0.2",
                    "timestamp": 1436047200000,
                    "summaryTimeUnit": "DAYS"
                }
            },
            {
                "_index": "iotsens-summarizedmeasures",
                "_type": "summarizedmeasure",
                "_id": "AU5isxV6Mpd7AZtvmZcN",
                "_score": 8.0957,
                "_source": {
                    "id": null,
                    "sensorId": "14FA084408",
                    "variableName": "FORWARD_FLOW",
                    "rawValue": "0.2",
                    "value": "0.2",
                    "timestamp": 1436047200000,
                    "summaryTimeUnit": "DAYS"
                }
            },
            {
                "_index": "iotsens-summarizedmeasures",
                "_type": "summarizedmeasure",
                "_id": "AU5isxWOMpd7AZtvmZcP",
                "_score": 8.0957,
                "_source": {
                    "id": null,
                    "sensorId": "14FA084408",
                    "variableName": "FORWARD_FLOW",
                    "rawValue": "0.2",
                    "value": "0.2",
                    "timestamp": 1436047200000,
                    "summaryTimeUnit": "DAYS"
                }
            },
            {
                "_index": "iotsens-summarizedmeasures",
                "_type": "summarizedmeasure",
                "_id": "AU5isxW8Mpd7AZtvmZcT",
                "_score": 8.0957,
                "_source": {
                    "id": null,
                    "sensorId": "14FA084408",
                    "variableName": "FORWARD_FLOW",
                    "rawValue": "0.2",
                    "value": "0.2",
                    "timestamp": 1436047200000,
                    "summaryTimeUnit": "DAYS"
                }
            },
            {
                "_index": "iotsens-summarizedmeasures",
                "_type": "summarizedmeasure",
                "_id": "AU5isxXFMpd7AZtvmZcU",
                "_score": 8.0957,
                "_source": {
                    "id": null,
                    "sensorId": "14FA084408",
                    "variableName": "FORWARD_FLOW",
                    "rawValue": "0.2",
                    "value": "0.2",
                    "timestamp": 1436047200000,
                    "summaryTimeUnit": "DAYS"
                }
            },
            {
                "_index": "iotsens-summarizedmeasures",
                "_type": "summarizedmeasure",
                "_id": "AU5isxXbMpd7AZtvmZcW",
                "_score": 8.0957,
                "_source": {
                    "id": null,
                    "sensorId": "14FA084408",
                    "variableName": "FORWARD_FLOW",
                    "rawValue": "0.2",
                    "value": "0.2",
                    "timestamp": 1436047200000,
                    "summaryTimeUnit": "DAYS"
                }
            },
            {
                "_index": "iotsens-summarizedmeasures",
                "_type": "summarizedmeasure",
                "_id": "AU5isxUtMpd7AZtvmZcG",
                "_score": 8.077545,
                "_source": {
                    "id": null,
                    "sensorId": "14FA084408",
                    "variableName": "FORWARD_FLOW",
                    "rawValue": "0.2",
                    "value": "0.2",
                    "timestamp": 1436047200000,
                    "summaryTimeUnit": "DAYS"
                }
            },
            {
                "_index": "iotsens-summarizedmeasures",
                "_type": "summarizedmeasure",
                "_id": "AU5isxXPMpd7AZtvmZcV",
                "_score": 8.077545,
                "_source": {
                    "id": null,
                    "sensorId": "14FA084408",
                    "variableName": "FORWARD_FLOW",
                    "rawValue": "0.2",
                    "value": "0.2",
                    "timestamp": 1436047200000,
                    "summaryTimeUnit": "DAYS"
                }
            },
            {
                "_index": "iotsens-summarizedmeasures",
                "_type": "summarizedmeasure",
                "_id": "AU5isxUZMpd7AZtvmZcE",
                "_score": 7.9553676,
                "_source": {
                    "id": null,
                    "sensorId": "14FA084408",
                    "variableName": "FORWARD_FLOW",
                    "rawValue": "0.2",
                    "value": "0.2",
                    "timestamp": 1436047200000,
                    "summaryTimeUnit": "DAYS"
                }
            }
        ]
    }
}
As you can see, I have 21 duplicate records for the same day. How can I delete the duplicates so that only one record per day remains? Thanks.
Answer 0 (score: 2)
Get the count (use the Count API for this), then issue a delete-by-query whose size is one less than that count (use the Delete By Query API together with from/size to do this).
In that case you should write the query so that it matches only the duplicate records.
Or simply query for the ids and call a bulk delete on all but one of them. However, I guess you can't do that since you don't have ids. IMHO, I don't see any other clever way to do this.
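As a rough sketch of that last idea (collect the ids, keep one, bulk-delete the rest) using the official Python client: the index name, fields and date range come from the question, while the connection URL, the bool/filter form of the query and the scan/bulk helpers are assumptions of this example, not something the answer prescribes.

from elasticsearch import Elasticsearch, helpers

# Assumed connection; the index name comes from the question's response.
es = Elasticsearch("http://localhost:9200")
INDEX = "iotsens-summarizedmeasures"

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"sensorId": "14FA084408"}},
                {"match": {"variableName": "FORWARD_FLOW"}},
            ],
            "filter": {
                "range": {"timestamp": {"gt": "2015-07-04", "lt": "2015-07-06"}}
            },
        }
    }
}

# Collect the _id of every matching document.
ids = [hit["_id"] for hit in helpers.scan(es, index=INDEX, query=query)]

# Keep the first document and bulk-delete the rest.
actions = ({"_op_type": "delete", "_index": INDEX, "_id": doc_id} for doc_id in ids[1:])
helpers.bulk(es, actions)

You would still have to run this once per sensor/variable/day combination (or widen the query) to cover all duplicate groups.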
Answer 1 (score: 0)
This is just an off-the-cuff idea and may not exactly fit your needs, but it is what came to mind when I first read your question.
What about re-indexing the whole data set with any Elasticsearch client library? While doing so, simply compute a hash code for each object (document, I mean) and set it as the document's id. Any documents with exactly the same fields get re-indexed to the same id, so the duplicates are gone once re-indexing completes.
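A minimal sketch of that re-indexing idea with the Python client, assuming a hypothetical target index iotsens-summarizedmeasures-dedup and a SHA-1 over the fields that are supposed to define a duplicate (sensorId, variableName, timestamp); which fields to hash is a choice you would have to make for your own data.

import hashlib
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
SOURCE = "iotsens-summarizedmeasures"          # index from the question
TARGET = "iotsens-summarizedmeasures-dedup"    # hypothetical new index

def dedup_actions():
    for hit in helpers.scan(es, index=SOURCE):
        src = hit["_source"]
        # Hash the fields that define "the same" document; identical documents
        # produce identical _ids, so only one copy survives in the target index.
        key = json.dumps(
            [src["sensorId"], src["variableName"], src["timestamp"]],
            sort_keys=True,
        )
        doc_id = hashlib.sha1(key.encode("utf-8")).hexdigest()
        yield {"_op_type": "index", "_index": TARGET, "_id": doc_id, "_source": src}

helpers.bulk(es, dedup_actions())

Once the new index is verified, you can drop the old one or switch an alias over to the deduplicated index.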
Answer 2 (score: 0)
Using aggregation queries you can find duplicated values in your ES index, e.g. find documents that share the same value of the field Uuid (here up to 3 such Uuid values, returning at most 5 duplicate documents for each Uuid):
curl -XPOST http://localhost:9200/logstash-2017.03.17/_search -d '
{
    "size": 0,
    "aggs": {
        "duplicateCount": {
            "terms": {
                "field": "Uuid",
                "min_doc_count": 2,
                "size": 3
            },
            "aggs": {
                "duplicateDocuments": {
                    "top_hits": {
                        "size": 5
                    }
                }
            }
        }
    }
}'
From the output you can easily filter out the document _ids and delete them. Using jq:
cat es_response.json | jq -r '.aggregations.duplicateCount.buckets[].duplicateDocuments.hits.hits[]._id'
A naive approach would then be to delete them with individual DELETE requests:
curl -XDELETE http://localhost:9200/{index}/{document type}/{_id value}
However, this would delete all of the duplicated documents without leaving a single unique copy in the index (usually not what you want; see below). Moreover, separate DELETE requests are extremely inefficient.
I wrote an es-deduplicator tool that keeps one document from each group of duplicates and deletes the rest through the Bulk API.
That way thousands of documents can be deleted within minutes:
ES query took 0:01:44.922958, retrieved 10000 unique docs
Deleted 232539 duplicates, in total 1093490. Batch processed in 0:00:07.550461, running time 0:09:03.853110
ES query took 0:01:38.117346, retrieved 10000 unique docs
Deleted 219259 duplicates, in total 1312749. Batch processed in 0:00:07.351001, running time 0:10:50.322695
ES query took 0:01:40.111385, retrieved 10000 unique docs
NOTE: when deleting documents in a loop, it is very important to refresh the index after each bulk request, otherwise the next query may return documents that have already been deleted.
By design, aggregation queries are approximate, so it is quite likely that a few documents will be missed (it depends on how many shards and nodes you have). With multiple nodes (a typical cluster setup), it pays off to query once more by the unique field afterwards (and delete any extra copies).
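For completeness, a small sketch of that keep-one-per-group idea with the Python client. This is not the es-deduplicator tool itself; the index name logstash-2017.03.17 and the Uuid field come from the example above, while the bucket size of 1000 and the rest of the code are assumptions for illustration only.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")
INDEX = "logstash-2017.03.17"   # index from the example above

# Same aggregation as the curl example: buckets of Uuid values that occur
# more than once, with up to 5 of the duplicate documents per bucket.
body = {
    "size": 0,
    "aggs": {
        "duplicateCount": {
            "terms": {"field": "Uuid", "min_doc_count": 2, "size": 1000},
            "aggs": {"duplicateDocuments": {"top_hits": {"size": 5}}},
        }
    },
}

response = es.search(index=INDEX, body=body)
buckets = response["aggregations"]["duplicateCount"]["buckets"]

# For each group of duplicates keep the first document and delete the rest.
actions = [
    {"_op_type": "delete", "_index": INDEX, "_id": hit["_id"]}
    for bucket in buckets
    for hit in bucket["duplicateDocuments"]["hits"]["hits"][1:]
]
helpers.bulk(es, actions)

# Refresh so that the next aggregation round does not see the deleted documents.
es.indices.refresh(index=INDEX)

In practice you would loop this until the aggregation returns no buckets, refreshing between rounds as noted above.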