Question

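I am trying to get the total number of tokens in the documents that match a query. I have not defined any custom mapping, and the field I want the token count for is of type 'string'.

I tried the following query, but it returns a very large number, on the order of 10^20, which cannot be the correct answer for my dataset.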

curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
   "query": {
      "match_all": {}
   },
   "aggs": {
      "tk_count": {
         "sum": {
            "script": "_index[\"body\"].sumttf()"
         }
      }
   },
   "size": 0
}'

Any idea how to get the correct count of all tokens? (I don't need the count per term, just the total count.)

Answer 1

It sounds like you want to retrieve the cardinality of the tokens in the body field.

In that case, you can use the cardinality aggregation, as shown below.

curl -XPOST 'localhost:9200/nodename/comment/_search?pretty' -d '
{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "tk_count": {
            "cardinality" : {
                "field" : "body"
            }
        }
    },
    "size": 0
}'

For details, see the official documentation for the cardinality aggregation.

Answer 2

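This worked for me; is it what you need?

Rather than computing the token count at query time (using the tk_count aggregation, as in the other answer), my solution stores the token count at index time using the token_count datatype. That way I can get the "name.stored_length" value back in the query results.

token_count is a "multi-field" that is attached to one field at a time (i.e., the "name" field or the "body" field). I modified the example slightly so that it stores "name.stored_length".

Note that in my example it does not count the cardinality of the tokens (i.e., distinct values); it counts the total number of tokens. "John John Doe" has 3 tokens, so "name.stored_length" === 3, even though there are only 2 distinct tokens. Also note that I request the specific "stored_fields" : ["name.stored_length"].

Finally, you may need to re-update your documents (i.e., send a PUT), or use whatever technique re-indexes the values in the field you attach token_count to (in this case, PUT "John John Doe" again, even if it has already been POSTed/PUT!).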
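To make the difference between total tokens and distinct tokens concrete, here is a minimal Python sketch. It assumes that lowercasing and splitting on whitespace is a reasonable stand-in for the standard analyzer on simple input like this; the real analyzer handles punctuation and Unicode differently. The function name token_stats is made up for illustration.

```python
def token_stats(text):
    """Return (total token count, distinct token count) for a piece of text."""
    # Assumption: lowercase + whitespace split approximates the standard
    # analyzer for plain space-separated words such as "John John Doe".
    tokens = text.lower().split()
    return len(tokens), len(set(tokens))


total, distinct = token_stats("John John Doe")
print(total, distinct)  # 3 2
```

The first number corresponds to what token_count stores; the second is closer to what a cardinality aggregation would report for that single document.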

PUT test_token_count
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": { 
          "type": "text",
          "fields": {
            "stored_length": { 
              "type":     "token_count",
              "analyzer": "standard",
     //------------------v
              "store": true
            }
          }
        }
      }
    }
  }
}

PUT test_token_count/_doc/1
{
    "name": "John John Doe" 
}

Now we can query or search, and configure the results to include the name.stored_length field (which is both a multi-field and a stored field!):

GET/POST test_token_count/_search
{
      //------------------v
    "stored_fields" : ["name.stored_length"]
}

The search results should include the total token count as name.stored_length ...

{
  ...
  "hits": {
     ...
    "hits": [
      {
        "_index": "test_token_count",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "fields": {
 //------------------v
          "name.stored_length": [
            3
          ]
        }
      }
    ]
  }
}
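Since the original question asks for a total across all matching documents, the per-document values still need to be added up. A minimal client-side sketch in Python, assuming the response has the shape shown above; total_token_count and the sample dictionary are illustrative, not part of any Elasticsearch client:

```python
def total_token_count(response, field="name.stored_length"):
    """Sum a stored token_count field over the hits in a search response."""
    total = 0
    for hit in response.get("hits", {}).get("hits", []):
        # Stored fields are returned as lists of values; a hit that lacks
        # the field contributes nothing to the total.
        total += sum(hit.get("fields", {}).get(field, []))
    return total


# Sample response trimmed to the parts used here.
sample = {"hits": {"hits": [{"_id": "1", "fields": {"name.stored_length": [3]}}]}}
print(total_token_count(sample))  # 3
```

This only covers the hits actually returned in the response, so a large result set would need paging. A sum aggregation on the same field may be a better fit in that case, though that is not something tested in this answer.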

How to get the total token count in documents in Elasticsearch