Question

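I have about 15,000 scraped websites whose body text is stored in an Elasticsearch index. I need to get the top 100 most-used three-word phrases across all of these texts: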

Something like this:

Hello there sir: 203
Big bad pony: 92
First come first: 56
[...]

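I'm new to this. I looked into term vectors, but they seem to apply to single documents. So I suspect the solution is some combination of term vectors and an aggregation with n-gram analysis of some sort, but I have no idea how to implement it. Any pointers would be helpful.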

My current mapping and settings:

{
  "mappings": {
    "items": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
         }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

Answer 1

What you're looking for are called shingles. Shingles are like "word n-grams": consecutive combinations of more than one term in a string (e.g. "We all live", "all live in", "live in a", "in a yellow", "a yellow submarine").
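To make the idea concrete, here is a minimal sketch in plain Python of what three-word shingles look like. It only illustrates the concept and is not how Elasticsearch implements its shingle filter; the function name `word_shingles` is made up for this example, and the lowercasing and whitespace splitting are assumptions chosen to mirror the analyzer in the question.

```python
def word_shingles(text, size=3):
    """Return every run of `size` consecutive words in `text`.

    Assumption: tokens are lowercased and split on whitespace, roughly
    like a whitespace tokenizer followed by a lowercase filter.
    """
    tokens = text.lower().split()
    # A text with fewer than `size` tokens yields no shingles.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


if __name__ == "__main__":
    for shingle in word_shingles("We all live in a yellow submarine"):
        print(shingle)
```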

Have a look here: https://www.elastic.co/blog/searching-with-shingles

Basically, you need a field with a shingle analyzer that produces only three-term shingles:

Use the configuration from the Elastic blog post, but with:

"filter_shingle":{
   "type":"shingle",
   "max_shingle_size":3,
   "min_shingle_size":3,
   "output_unigrams":"false"
}
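For orientation, the sketch below shows one way the filter above might be combined with the question's whitespace-plus-lowercase analyzer into a full index body, built as a Python dict and printed as JSON. The analyzer name `shingle_analyzer` and the helper `build_index_body` are hypothetical; the `items` mapping type and `"type": "string"` are copied from the question. Newer Elasticsearch versions use `text` fields instead and may need `fielddata` enabled before such a field can be aggregated on, so treat this as a starting point under those assumptions rather than the blog post's exact configuration.

```python
import json


def build_index_body(field="body", analyzer="shingle_analyzer"):
    """Build index settings and mappings that emit only three-word shingles.

    `analyzer` is a hypothetical name; `filter_shingle` comes from the answer.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "filter_shingle": {
                        "type": "shingle",
                        "min_shingle_size": 3,
                        "max_shingle_size": 3,
                        # Drop single-word tokens so only trigrams are indexed.
                        "output_unigrams": False,
                    }
                },
                "analyzer": {
                    analyzer: {
                        "type": "custom",
                        "tokenizer": "whitespace",
                        "filter": ["lowercase", "filter_shingle"],
                    }
                },
            }
        },
        "mappings": {
            # Mapping type and field type as used in the question.
            "items": {
                "properties": {
                    field: {"type": "string", "analyzer": analyzer}
                }
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
```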

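After applying the shingle analyzer to the field in question (as in the blog post) and reindexing your data, you should be able to issue a query that returns a simple terms aggregation on your body field to see the top one hundred three-word phrases.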

{
  "size" : 0,
  "query" : {
    "match_all" : {}
  },
  "aggs" : {
    "three-word-phrases" : {
      "terms" : {
        "field" : "body",
        "size"  : 100  
      }
    }
  }
}
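A terms aggregation returns its results as a list of buckets, each with a `key` and a `doc_count`. The sketch below turns such a response into the "phrase: count" lines the question asks for. The helper name `format_top_phrases` and the sample response are hypothetical, with values borrowed from the question's example. Note that `doc_count` is the number of documents containing the phrase, not the total number of occurrences, so a phrase repeated within one page is counted once.

```python
def format_top_phrases(response, agg_name="three-word-phrases"):
    """Format terms-aggregation buckets as 'phrase: count' strings."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return ["{}: {}".format(b["key"], b["doc_count"]) for b in buckets]


if __name__ == "__main__":
    # Hypothetical response, trimmed to the parts this helper reads.
    sample_response = {
        "aggregations": {
            "three-word-phrases": {
                "buckets": [
                    {"key": "hello there sir", "doc_count": 203},
                    {"key": "big bad pony", "doc_count": 92},
                ]
            }
        }
    }
    for line in format_top_phrases(sample_response):
        print(line)
```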

Get the top 100 most-used three-word phrases across all documents
