Question

Test data:

curl -XPUT 'localhost:9200/customer/external/1?pretty' -d '{ "body": "this is a test" }'
curl -XPUT 'localhost:9200/customer/external/2?pretty' -d '{ "body": "and this is another test" }'
curl -XPUT 'localhost:9200/customer/external/3?pretty' -d '{ "body": "this thing is a test" }'

My goal is to get the frequency of a phrase in a document.

I know how to get the frequency of terms in a document:

curl -g "http://localhost:9200/customer/external/1/_termvectors?pretty" -d'
{
        "fields": ["body"],
        "term_statistics" : true
}'

I know how to count the documents that contain a given phrase (using a match_phrase or span_near query):

curl -g "http://localhost:9200/customer/_count?pretty" -d'
{
  "query": {
    "match_phrase": {
      "body": "this is"
    }
  }
}'

How can I access the frequency of a phrase?

Answer 1

You can use termvectors. As the documentation states:

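Return values

Three types of values can be requested: term information, term statistics and field statistics. By default, all term information and field statistics are returned for all fields, but no term statistics.

Term information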
term frequency in the field (always returned)
term positions (positions : true)
start and end offsets (offsets : true)
term payloads (payloads : true), as base64 encoded bytes
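
To illustrate where these values sit in a response, here is a minimal Python sketch that reads the term frequency and positions for each term. The dictionary is hand-written to resemble a `_termvectors` response for the body "this is a test" and is not output captured from a real cluster; the helper name `term_info` is hypothetical.

```python
# Hand-written sample shaped like a _termvectors response for the body
# "this is a test" (an assumption, not captured from a real cluster).
sample_response = {
    "term_vectors": {
        "body": {
            "terms": {
                "this": {"term_freq": 1, "tokens": [{"position": 0, "start_offset": 0, "end_offset": 4}]},
                "is": {"term_freq": 1, "tokens": [{"position": 1, "start_offset": 5, "end_offset": 7}]},
                "a": {"term_freq": 1, "tokens": [{"position": 2, "start_offset": 8, "end_offset": 9}]},
                "test": {"term_freq": 1, "tokens": [{"position": 3, "start_offset": 10, "end_offset": 14}]},
            }
        }
    }
}


def term_info(response, field):
    """Map each term of `field` to (term_freq, sorted positions). Hypothetical helper."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {
        term: (
            data.get("term_freq", 0),
            sorted(tok["position"] for tok in data.get("tokens", []) if "position" in tok),
        )
        for term, data in terms.items()
    }


info = term_info(sample_response, "body")
for term, (freq, positions) in info.items():
    print(term, freq, positions)
```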

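You have to read the term frequency from the response; in the documentation's example you can see the frequency of john doe in the doc. Be aware that storing term vectors duplicates the disk space occupied by the field they are applied to.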
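
Term vectors report frequencies per term rather than per phrase, so one possible way to arrive at a phrase count is to combine the returned positions on the client side. The Python sketch below counts the places where the phrase terms occur at consecutive positions. It assumes that positions are included in the response, that the phrase has already been split into terms the same way the field was analyzed (for example lowercased), and that only exact, in-order matches are wanted. The sample dictionary is hand-written and the function name `phrase_frequency` is hypothetical.

```python
def phrase_frequency(response, field, phrase_terms):
    """Count exact, in-order occurrences of phrase_terms using token positions."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    if not phrase_terms:
        return 0
    # One set of positions per phrase term; a term that is missing from the
    # document yields an empty set, so the phrase cannot match.
    position_sets = [
        {tok["position"] for tok in terms.get(term, {}).get("tokens", [])}
        for term in phrase_terms
    ]
    # The phrase occurs at `start` if term i sits at position start + i.
    return sum(
        1
        for start in position_sets[0]
        if all(start + i in positions for i, positions in enumerate(position_sets))
    )


# Hand-written sample shaped like a _termvectors response for the body
# "this is a test this is" (an assumption, not real cluster output).
sample = {
    "term_vectors": {
        "body": {
            "terms": {
                "this": {"term_freq": 2, "tokens": [{"position": 0}, {"position": 4}]},
                "is": {"term_freq": 2, "tokens": [{"position": 1}, {"position": 5}]},
                "a": {"term_freq": 1, "tokens": [{"position": 2}]},
                "test": {"term_freq": 1, "tokens": [{"position": 3}]},
            }
        }
    }
}

print(phrase_frequency(sample, "body", ["this", "is"]))  # 2
print(phrase_frequency(sample, "body", ["is", "a"]))     # 1
```

This sketch ignores slop and position gaps, so it behaves like a strict phrase match only.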

Elasticsearch: getting the phrase frequency in a given document
