Question

I have two documents indexed as follows.

Document 1

{
  "_index": "custom-design",
  "_type": "cars",
  "_id": "porche129",
  "_score": 1.2413527,
  "_source": {
    "clientID": "ps1233443",
    "customisation": "yes",
    "userType": "heavy",
    "totalBilling": 3000
  }
}

Document 2

{
  "_index": "custom-design",
  "_type": "cars",
  "_id": "porche232",
  "_score": 1.2413527,
  "_source": {
    "clientID": "ps1233443",
    "customisation": "yes",
    "userType": "heavy",
    "totalBilling": 3000
  }
}

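As you can see, both documents are indexed under different IDs but have identical content. Is it possible to detect and delete duplicate documents after they have been indexed?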

Answer 1

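Ideally, you would create a hash of each document's content at index time. Since that is no longer possible for documents already in the index, we can use a scripted aggregation to find the duplicates instead.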

curl -XGET 'http://localhost:9200/Index/IndexType/_search?pretty=true' -d '{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "script": "doc[\"clientID\"].value + doc[\"customisation\"].value + doc[\"userType\"].value + doc[\"totalBilling\"].value",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}'

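In the results, each bucket lists documents that share the same field values, i.e. the duplicates. Collect the IDs of the duplicates and remove them with a bulk delete.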
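
As a rough illustration of that last step, the following Python sketch walks a response from the aggregation above, keeps the first hit in every bucket, and builds a request body for the `_bulk` endpoint that deletes the rest. The helper names `ids_to_delete` and `bulk_delete_body` are hypothetical, and the sketch assumes the aggregation names `duplicateCount` and `duplicateDocuments` used in the query. Note that `top_hits` returns only 3 hits per bucket by default, so a larger `size` is needed if a document may have more than three copies.

```python
import json


def ids_to_delete(response, agg_name="duplicateCount",
                  hits_name="duplicateDocuments"):
    """Return (index, type, id) for every hit except the first in each bucket."""
    doomed = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        hits = bucket[hits_name]["hits"]["hits"]
        # Keep hits[0] as the surviving copy; everything after it is a duplicate.
        for hit in hits[1:]:
            doomed.append((hit["_index"], hit["_type"], hit["_id"]))
    return doomed


def bulk_delete_body(doomed):
    """Build a newline-delimited body of delete actions for the _bulk endpoint."""
    lines = [
        json.dumps({"delete": {"_index": index, "_type": doc_type, "_id": doc_id}})
        for index, doc_type, doc_id in doomed
    ]
    # The bulk API expects every action line, including the last, to end in "\n".
    return "".join(line + "\n" for line in lines)
```

The resulting string can be sent as the body of a POST to `http://localhost:9200/_bulk`. It is worth reviewing the list of IDs before sending, since which copy appears first in a bucket is not something this sketch controls.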

You can read more about these approaches here: https://qbox.io/blog/minimizing-document-duplication-in-elasticsearch
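
For documents indexed in the future, the hashing idea mentioned at the start of this answer could look roughly like the sketch below. It derives an ID from the document's content, so indexing the same content twice overwrites the existing document instead of creating a second one. The function name `content_id` and the choice of SHA-1 over a sorted-key JSON serialization are assumptions made for illustration, not something the question or the linked article prescribes.

```python
import hashlib
import json


def content_id(source):
    """Return a stable hex digest of a document's content, usable as its _id."""
    # Sorting keys and fixing separators makes the serialization independent
    # of the order in which the fields were written.
    canonical = json.dumps(source, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


doc = {
    "clientID": "ps1233443",
    "customisation": "yes",
    "userType": "heavy",
    "totalBilling": 3000,
}
doc_id = content_id(doc)  # use this as the document ID when indexing
```

With this scheme, both documents from the question would map to the same ID, because their `_source` content is identical.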
