Question

I am using this query to search for occurrences of a phrase in a field.

"query": {
    "match_phrase": {
       "content": "my test phrase"
  }
 }

I need to count how many times each phrase matched in each document (if that is even possible).

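I considered aggregations, but I don't think they fit, because they would give me the number of matches across the whole index rather than per document.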

Thanks.

Answer 1

This can be achieved with Script Fields and a Painless script.

You can count the occurrences in each field and add them up for the document.

Example:

## Here's my test index with some sample values

POST t1/doc/1  <-- this document has one occurrence
{
  "content" : "my test phrase"
}

POST t1/doc/2    <-- this document has 5 occurrences
{
   "content": "my test phrase ",
   "content1" : "this is my test phrase 1",
   "content2" : "this is my test phrase 2",
   "content3" : "this is my test phrase 3",
   "content4" : "this is my test phrase 4"

}

POST t1/doc/3
{
  "content" : "my test new phrase"
}

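Now, using a script, I can count the phrase matches in each field. I count each field at most once, but you can modify the script to count multiple matches per field.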

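The obvious drawback is that you need to mention every field of the document in the script, unless there is a way to iterate over the doc fields that I am not aware of.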

POST t1/_search
{
  "script_fields": {
    "phrase_Count": {
      "script": {
        "lang": "painless",
        "source": """
          int count = 0;

          if (doc['content.keyword'].size() > 0 && doc['content.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content1.keyword'].size() > 0 && doc['content1.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content2.keyword'].size() > 0 && doc['content2.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content3.keyword'].size() > 0 && doc['content3.keyword'].value.indexOf(params.phrase) != -1) count++;
          if (doc['content4.keyword'].size() > 0 && doc['content4.keyword'].value.indexOf(params.phrase) != -1) count++;

          return count;
""",
        "params": {
          "phrase": "my test phrase"
        }
      }
    }
  }
}

This gives me the phrase count for each document as a script field:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "2",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            5                 <--- count of occurrences of the phrase in the document
          ]
        }
      },
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            1
          ]
        }
      },
      {
        "_index" : "t1",
        "_type" : "doc",
        "_id" : "3",
        "_score" : 1.0,
        "fields" : {
          "phrase_Count" : [
            0
          ]
        }
      }
    ]
  }
}
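
The script above adds at most one to the count per field. To illustrate what counting several matches per field could look like, here is a minimal Python sketch of the counting logic only. It is not Painless and does not talk to Elasticsearch; the function names `count_phrase` and `count_in_document` are made up for this example, and the documents are plain dicts mirroring the sample index. The same repeated substring search could in principle be ported into the script with a loop around `indexOf`.

```python
def count_phrase(text: str, phrase: str) -> int:
    """Count non-overlapping occurrences of phrase in text."""
    if not phrase:
        return 0
    count = 0
    start = text.find(phrase)
    while start != -1:
        count += 1
        # Continue searching right after the match that was just found.
        start = text.find(phrase, start + len(phrase))
    return count


def count_in_document(doc: dict, phrase: str) -> int:
    """Sum the phrase occurrences over every string field of a document."""
    return sum(count_phrase(value, phrase)
               for value in doc.values()
               if isinstance(value, str))


# Plain dicts mirroring the three sample documents (hypothetical stand-ins
# for what is stored in the t1 index).
docs = {
    "1": {"content": "my test phrase"},
    "2": {
        "content": "my test phrase ",
        "content1": "this is my test phrase 1",
        "content2": "this is my test phrase 2",
        "content3": "this is my test phrase 3",
        "content4": "this is my test phrase 4",
    },
    "3": {"content": "my test new phrase"},
}

counts = {doc_id: count_in_document(doc, "my test phrase")
          for doc_id, doc in docs.items()}
```

Unlike the script, this sketch walks over all string fields of the dict, so no field has to be listed by name. Whether the same is possible inside a script field is the open question raised in the answer.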

Answer 2

You can use term vectors to achieve this. Have a look at Term Vectors.
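
As a rough illustration of this idea: a term vectors response lists, for each term of a field, the token positions at which it occurs, so a phrase occurrence is a run of consecutive positions. The Python sketch below counts such runs. It assumes positions were requested, that the phrase has been split into tokens the same way the field's analyzer would produce them (for example lowercased), and it works on a hand-written `terms` dict shaped like the `term_vectors.<field>.terms` part of a response, not on real output. The function name `count_phrase_from_terms` is hypothetical.

```python
def count_phrase_from_terms(terms: dict, phrase_tokens: list) -> int:
    """Count phrase occurrences from per-term token positions."""
    if not phrase_tokens:
        return 0
    # One set of positions per phrase token, in phrase order.
    positions = [
        {tok["position"] for tok in terms.get(term, {}).get("tokens", [])}
        for term in phrase_tokens
    ]
    # A match starts wherever token i of the phrase sits at position start + i.
    return sum(
        1
        for start in positions[0]
        if all(start + i in positions[i] for i in range(1, len(positions)))
    )


# Hand-written sample for the text "my test phrase and my test phrase".
sample_terms = {
    "my": {"term_freq": 2, "tokens": [{"position": 0}, {"position": 4}]},
    "test": {"term_freq": 2, "tokens": [{"position": 1}, {"position": 5}]},
    "phrase": {"term_freq": 2, "tokens": [{"position": 2}, {"position": 6}]},
    "and": {"term_freq": 1, "tokens": [{"position": 3}]},
}

matches = count_phrase_from_terms(sample_terms, ["my", "test", "phrase"])
```

Term vectors are retrieved one document (or a batch of documents) at a time, so this counting would happen on the client side after the search, rather than inside the query itself.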

Elasticsearch - number of matches per document
