Finding duplicate documents in Elasticsearch

Date: 2017-10-07 14:16:17

Tags: elasticsearch

I am looking for a way to find duplicate (exact) documents in Elasticsearch. I have read https://qbox.io/blog/minimizing-document-duplication-in-elasticsearch and tried it, but the result is not what I expected. Here is my simple sample query:

GET /last_month_ads/_search
{
  "size": 0,
  "fields": [
     "title"
  ], 
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "title",
        "size": 3
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}

and the result is:

{
   "took": 981,
   "timed_out": false,
   "_shards": {
      "total": 2,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 482909,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "duplicateCount": {
         "doc_count_error_upper_bound": 11667,
         "sum_other_doc_count": 1958146,
         "buckets": [
            {
               "key": "CM",
               "doc_count": 46867,
               "duplicateDocuments": {
                  "hits": {
                     "total": 46867,
                     "max_score": 1,
                     "hits": [
                        {
                           "_index": "last_month_ads",
                           "_type": "ads",
                           "_id": "AV73EtoBQTqkjEa7YQG1",
                           "_score": 1,
                           "_source": {
                              "id": "20642316",
                              "cat_id": "43606",
                              "user_id": "1825875",
                              "title": "125 CM HOME",
                              "desc": "DESC"
                           }
                        },
                        {
                           "_index": "last_month_ads",
                           "_type": "ads",
                           "_id": "AV73EtpdQTqkjEa7YQHc",
                           "_score": 1,
                           "_source": {
                              "id": "20642379",
                              "cat_id": "43604",
                              "user_id": "4642299",
                              "title": "Home with Big CM",
                              "desc": "DESC"
                           }
                        },
                        {
                           "_index": "last_month_ads",
                           "_type": "ads",
                           "_id": "AV73Etp6QTqkjEa7YQHp",
                           "_score": 1,
                           "_source": {
                              "id": "20642409",
                              "cat_id": "43607",
                              "user_id": "4813303",
                              "title": "100 of live CM is here ",
                              "desc": "DESC"
                           }
                        }
                     ]
                  }
               }
            },

            }
         ]
      }
   }
}

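I am looking for titles that are exactly (or nearly) the same, not titles that merely share a frequent word. How can I get duplicate (similar) documents in Elasticsearch?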
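
One likely explanation, assuming `title` is mapped as an analyzed text field: a terms aggregation on such a field buckets by individual tokens, so every title containing the token CM lands in the same bucket, which matches the response shown. The Python sketch below only simulates the two bucketing strategies in memory and does not call Elasticsearch. The fourth title is a hypothetical duplicate added for illustration, the whitespace split is a stand-in for the real analyzer, and `token_buckets` and `exact_duplicates` are made-up helper names.

```python
from collections import Counter

# Titles from the sample response, plus one hypothetical exact repeat.
titles = [
    "125 CM HOME",
    "Home with Big CM",
    "100 of live CM is here ",
    "125 CM HOME",  # hypothetical duplicate, not in the original data
]


def token_buckets(titles):
    """Count documents per token, roughly as a terms aggregation
    on an analyzed field would (assumption: whitespace tokenization)."""
    counts = Counter()
    for title in titles:
        # set() so a token repeated inside one title counts that document once
        for token in set(title.split()):
            counts[token] += 1
    return counts


def exact_duplicates(titles, min_doc_count=2):
    """Count documents per whole title (surrounding whitespace trimmed)
    and keep only titles seen at least min_doc_count times."""
    counts = Counter(title.strip() for title in titles)
    return {title: n for title, n in counts.items() if n >= min_doc_count}


if __name__ == "__main__":
    print(token_buckets(titles).most_common(3))
    print(exact_duplicates(titles))
```

If that assumption holds, aggregating on a non-analyzed copy of the title with `min_doc_count` set to 2 would be the query-side counterpart of `exact_duplicates`. Whether such a field exists in this mapping cannot be told from the question.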

0 answers:

No answers