Elasticsearch: Counting terms in a document

Time: 2018-12-01 14:21:14

Tags: elasticsearch

I am new to Elasticsearch and am using version 6.5. My database contains website pages and their content, for example:

Url      Content
abc.com  There is some content about cars here. Lots of cars!
def.com  This page is all about cars.
ghi.com  Here it tells us something about insurances.
jkl.com  Another page about cars and how to buy cars.

I have been able to run a simple query that returns all documents whose content contains the word "cars" (using Python):

current_app.elasticsearch.search(index=index, doc_type=index, 
    body={"query": {"multi_match": {"query": "cars", "fields": ["*"]}}, 
    "from": 0, "size": 100})

The result looks like this:

{'took': 2521,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 29, 'max_score': 3.0240571,
          'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17277',
                    '_score': 3.0240571, '_source': {'content': '....'}}]}}

The "_id" refers to the domain, so what I basically get back is:

  • abc.com
  • def.com
  • jkl.com

But now I want to know how often the search term ("cars") occurs in each document, for example:

  • abc.com: 2
  • def.com: 1
  • jkl.com: 2
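
To make the target numbers concrete, the counts listed above can be reproduced client-side from the sample table. This is only a sketch: the regex tokenizer is an assumption that roughly stands in for Elasticsearch's analyzer, and `pages` and `count_term` are illustrative names, not part of any API.

```python
import re

# Sample pages copied from the table in the question.
pages = {
    "abc.com": "There is some content about cars here. Lots of cars!",
    "def.com": "This page is all about cars.",
    "ghi.com": "Here it tells us something about insurances.",
    "jkl.com": "Another page about cars and how to buy cars.",
}


def count_term(text, term):
    """Count whole-word, case-insensitive occurrences of `term` in `text`."""
    # Assumption: lowercasing and splitting on non-word characters is a
    # rough approximation of how the text would be tokenized at index time.
    tokens = re.findall(r"\w+", text.lower())
    return tokens.count(term.lower())


counts = {url: count_term(content, "cars") for url, content in pages.items()}
for url, count in counts.items():
    print(url, count)
```

This does not scale to a real index, since it needs the full content on the client; it only pins down what "frequency within a document" means here.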

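I found several solutions for getting the number of documents that contain a search term, but none that explains how to get the number of occurrences of the term within a document. I also could not find anything in the official documentation, although I am fairly sure it is in there somewhere and I may simply not have recognized it as the solution to my problem.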

Update

As suggested by @Curious_MInd, I tried a terms aggregation:

current_app.elasticsearch.search(index=index, doc_type=index,
    body={"aggs": {"cars_count": {"terms": {"field": "Content"}}}})

Result:

{'took': 729,
 'timed_out': False,
 '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0},
 'hits': {'total': 48, 'max_score': 1.0,
          'hits': [{'_index': 'pages', '_type': 'pages', '_id': '17252',
                    '_score': 1.0, '_source': {'content': '...'}}]},
 'aggregations': {'skala_count': {'doc_count_error_upper_bound': 0,
                                  'sum_other_doc_count': 0, 'buckets': []}}}

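I don't see where this would show a count per document, but I assume that is because "buckets" is empty? Another thing to note: the results found by the terms aggregation are noticeably worse than those from the multi_match query. Is there any way to combine the two?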

2 Answers:

Answer 0: (score: 1)

I think you need a terms aggregation, as shown below:

GET /_search
{
    "aggs" : {
        "cars_count" : {
            "terms" : { "field" : "Content" }
        }
    }
}

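Answer 1: (score: 1)

What you are trying to achieve cannot be done in a single query. The first query filters the index and returns the IDs of the documents whose terms need to be counted. Suppose you have the following mapping: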

{
  "test": {
    "mappings": {
      "_doc": {
        "properties": {
          "details": {
            "type": "text",
            "store": true,
            "term_vector": "with_positions_offsets_payloads"
          },
          "name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

Suppose your query returned the following two documents:

{
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "details": "There is some content about cars here. Lots of cars!",
          "name": "n1"
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "2",
        "_score": 1,
        "_source": {
          "details": "This page is all about cars",
          "name": "n2"
        }
      }
    ]
  }
}

From the response above you can get all the document IDs that matched your query. In this case we have "_id": "1" and "_id": "2".
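
In Python, collecting those IDs from the search response could look roughly like the following. It is a minimal sketch that assumes the response is the plain dict returned by the client's `search` call; `matching_ids` and `sample_response` are illustrative names, and the sample is a trimmed copy of the response shown above.

```python
def matching_ids(search_response):
    """Return the _id of every hit in a search response dict."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


# Trimmed copy of the sample response above (only the keys used here).
sample_response = {
    "hits": {
        "total": 2,
        "max_score": 1,
        "hits": [
            {"_index": "test", "_type": "_doc", "_id": "1", "_score": 1},
            {"_index": "test", "_type": "_doc", "_id": "2", "_score": 1},
        ],
    }
}

ids = matching_ids(sample_response)
print(ids)
```

Only the hits on the current page are covered, so the "size" of the search has to be large enough (or the search paginated) to see every matching document.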

Now we use the _mtermvectors API to get the frequency (count) of each term in the given field:

POST test/_doc/_mtermvectors
{
  "docs": [
    {
      "_id": "1",
      "fields": [
        "details"
      ]
    },
    {
      "_id": "2",
      "fields": [
        "details"
      ]
    }
  ]
}

The request above returns the following result:

{
  "docs": [
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 8,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....
            ,
            "cars": {
              "term_freq": 2,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 28,
                  "end_offset": 32
                },
                {
                  "position": 9,
                  "start_offset": 47,
                  "end_offset": 51
                }
              ]
            },
            ....
          }
        }
      }
    },
    {
      "_index": "test",
      "_type": "_doc",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 2,
      "term_vectors": {
        "details": {
          "field_statistics": {
            "sum_doc_freq": 15,
            "doc_count": 2,
            "sum_ttf": 16
          },
          "terms": {
            ....
            ,
            "cars": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 5,
                  "start_offset": 23,
                  "end_offset": 27
                }
              ]
            },
            ....
          }
        }
      }
    }
  ]
}

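Note that the term vectors API returns details for every term in the field, so I have used .... to stand in for the data of the other terms. You can extract the information for the term you need from the response above. The term shown here is cars, and the field you are interested in is term_freq.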
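
The second step can be sketched in Python along these lines. It assumes the 6.x elasticsearch-py client, whose `mtermvectors` method accepts `index`, `doc_type` and `body`; the helper names (`build_mtermvectors_body`, `extract_term_freq`, `term_counts`) are hypothetical, and `sample` is a trimmed version of the response above.

```python
def build_mtermvectors_body(doc_ids, field):
    """Build the _mtermvectors request body for the given IDs and field."""
    return {"docs": [{"_id": doc_id, "fields": [field]} for doc_id in doc_ids]}


def extract_term_freq(mtv_response, field, term):
    """Map each document ID to the term_freq of `term` in `field`.

    Documents that were not found, or that do not contain the term,
    get a count of 0.
    """
    counts = {}
    for doc in mtv_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        counts[doc["_id"]] = terms.get(term, {}).get("term_freq", 0)
    return counts


def term_counts(es, index, doc_type, field, term, doc_ids):
    """Fetch term vectors through the client `es` and extract the counts."""
    body = build_mtermvectors_body(doc_ids, field)
    # Assumption: `es` is an elasticsearch-py 6.x client instance.
    response = es.mtermvectors(index=index, doc_type=doc_type, body=body)
    return extract_term_freq(response, field, term)


# Trimmed version of the response shown above.
sample = {
    "docs": [
        {"_id": "1", "found": True,
         "term_vectors": {"details": {"terms": {"cars": {"term_freq": 2}}}}},
        {"_id": "2", "found": True,
         "term_vectors": {"details": {"terms": {"cars": {"term_freq": 1}}}}},
    ]
}
print(extract_term_freq(sample, "details", "cars"))  # {'1': 2, '2': 1}
```

The lookup uses the token as it was indexed, so `term` has to be passed in its analyzed form (for example lowercase "cars"); a call such as `term_counts(es, "test", "_doc", "details", "cars", ["1", "2"])` would then return one count per document ID.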