Finding similar documents with Elasticsearch

Date: 2014-01-26 03:20:21

Tags: lucene elasticsearch similarity tf-idf

I am building a service on top of Elasticsearch that stores uploaded files and web pages as attachments (the file is one field of the document). That part works: I can search those files using like_text as input. The second part of the service, however, should compare a freshly uploaded file against the existing ones in order to detect duplicates or very similar files, so that the user can be warned before adding the same file or the same web page again. The problem is that I cannot get the expected results for identical documents. The similarity score between two identical files varies, but never exceeds 0.4. Worse, sometimes a merely similar file scores higher than an exact copy. The Java code below also returns the result set in the same order regardless of the input, and the like_text extracted from the uploaded file always looks the same.

String mapping = copyToStringFromClasspath("/org/prosolo/services/indexing/documents-mapping.json");
byte[] txt = org.elasticsearch.common.io.Streams.copyToByteArray(file);
Client client = ElasticSearchFactory.getClient();
client.admin().indices().putMapping(putMappingRequest(indexName).type(indexType).source(mapping)).actionGet();

// Index the uploaded file together with its metadata.
IndexResponse iResponse = client.index(indexRequest(indexName).type(indexType)
        .source(jsonBuilder()
            .startObject()
                .field("file", txt)
                .field("title", title)
                .field("visibility", visibilityType.name().toLowerCase())
                .field("ownerId", ownerId)
                .field("description", description)
                .field("contentType", DocumentType.DOCUMENT.name().toLowerCase())
                .field("dateCreated", dateCreated)
                .field("url", link)
                .field("relatedToType", relatedToType)
                .field("relatedToId", relatedToId)
            .endObject()))
        .actionGet();
client.admin().indices().refresh(refreshRequest()).actionGet();

// Ask for documents similar to the one just indexed, based on the "file" field.
MoreLikeThisRequestBuilder mltRequestBuilder =
        new MoreLikeThisRequestBuilder(client, ESIndexNames.INDEX_DOCUMENTS, ESIndexTypes.DOCUMENT, iResponse.getId());
mltRequestBuilder.setField("file");
SearchResponse response = client.moreLikeThis(mltRequestBuilder.request()).actionGet();

SearchHits searchHits = response.getHits();
System.out.println("getTotalHits:" + searchHits.getTotalHits());
Iterator<SearchHit> hitsIter = searchHits.iterator();
while (hitsIter.hasNext()) {
    SearchHit searchHit = hitsIter.next();
    System.out.println("FOUND DOCUMENT:" + searchHit.getId()
            + " title:" + searchHit.getSource().get("title")
            + " score:" + searchHit.score());
}
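One detail that may matter here: as far as I know, the More Like This API samples only a limited number of terms from the source field (max_query_terms defaults to 25 in this Elasticsearch generation), so even identical documents are compared on a small subset of their terms. As a sketch, the same request could be issued with those limits relaxed; the parameter values below are illustrative assumptions, not tested settings:

http://localhost:9200/documents/document/m2HZM3hXS1KFHOwvGY1pVQ/_mlt?mlt_fields=file&min_term_freq=1&min_doc_freq=1&max_query_terms=500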

The query from the browser:

http://localhost:9200/documents/document/m2HZM3hXS1KFHOwvGY1pVQ/_mlt?mlt_fields=file&min_doc_freq=1

gives me this result:

   {"took":120,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
   "hits":{"total":4,
   "max_score":0.41059873,
    "hits":
      [{"_index":"documents","_type":"document",
      "_id":"gIe6NDEWRXWTMi4kMPRbiQ",
      "_score":0.41059873, 
      "_source" : 
             {"file":"PCFET0NUWVBFIGh..._skiping_the_file_content_here...",
              "title":"Univariate Analysis",
              "visibility":"public",
              "description":"Univariate Analysis Simple Tools for Description ",
              "contentType":"webpage",
              "dateCreated":"null",
              "url":"http://www.slideshare.net/christineshearer/univariate-analysis"}}

This is exactly the same web page, so I expected a score of 1.0 rather than 0.41, since apart from the _id there is no difference between the two documents. The results for files are even worse.

The mapping I am using is:

{
  "document": {
    "properties": {
      "title": {
        "type": "string",
        "store": true
      },
      "description": {
        "type": "string",
        "store": true
      },
      "contentType": {
        "type": "string",
        "store": true
      },
      "dateCreated": {
        "type": "date",
        "store": true
      },
      "url": {
        "type": "string",
        "store": true
      },
      "visibility": {
        "type": "string",
        "store": true
      },
      "ownerId": {
        "type": "long",
        "store": true
      },
      "relatedToType": {
        "type": "string",
        "store": true
      },
      "relatedToId": {
        "type": "long",
        "store": true
      },
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "author": { "type": "string" },
          "title": { "type": "string", "store": true },
          "keywords": { "type": "string" },
          "file": {
            "type": "string",
            "store": true,
            "term_vector": "with_positions_offsets"
          },
          "name": { "type": "string" },
          "content_length": { "type": "integer" },
          "date": { "type": "date", "format": "dateOptionalTime" },
          "content_type": { "type": "string" }
        }
      }
    }
  }
}
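Since the extracted-text sub-field is stored, it may also be worth checking what the attachment plugin actually indexed; if text extraction failed, More Like This would be comparing nearly empty fields. A sketch of such a check against the GET API of this Elasticsearch generation (whether the stored sub-field is addressed as file or file.file is my assumption from the mapping above, not something I have verified):

http://localhost:9200/documents/document/m2HZM3hXS1KFHOwvGY1pVQ?fields=file.file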

Does anyone have an idea of what could be going wrong here?

Thanks

0 answers:

No answers yet