Question

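My goal is to build an index that, for each document, breaks the text down into word n-grams (uni-, bi-, and trigrams) and then captures term vector statistics for all of those word n-grams. Is this possible with Elasticsearch?

For example, for a document field containing "The red car drives." I would be able to get the following information: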

red - 1 instance
car - 1 instance
drives - 1 instance
red car - 1 instance
car drives - 1 instance
red car drives - 1 instance

Thanks in advance!

Answer 1

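Assuming you already know about the Term Vectors API, you can apply the shingle token filter at index time so that those terms are added to the token stream as terms independent of each other.

Set min_shingle_size to 1 (instead of the default of 2) and max_shingle_size to at least 3 (instead of the default of 2).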

Since your list of expected terms leaves out "the", you should also apply a stop words filter before the shingle filter.

The analyzer settings would look like this:
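To make the expected result concrete, here is a minimal pure-Python sketch of what a stop-word step followed by shingling with sizes 1 to 3 would produce for the example sentence. It does not call Elasticsearch; the `shingle_counts` helper and the tiny `STOP_WORDS` set are illustrative assumptions, and the real tokenizer and the `_english_` stop list behave in more detail than this.

```python
from collections import Counter

# Tiny illustrative stop list (assumption); the _english_ list in
# Elasticsearch contains more words than this.
STOP_WORDS = {"the", "a", "an"}


def shingle_counts(text, min_size=1, max_size=3):
    """Count word n-grams (shingles) of min_size..max_size words."""
    # Rough stand-in for tokenizing and lowercasing: split on whitespace
    # and strip common punctuation from each word.
    tokens = [w.strip(".,!?;:").lower() for w in text.split()]
    # Drop stop words without leaving gaps between the remaining tokens.
    tokens = [t for t in tokens if t and t not in STOP_WORDS]

    counts = Counter()
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            counts[" ".join(tokens[start:start + size])] += 1
    return counts


counts = shingle_counts("The red car drives.")
for term, n in counts.items():
    print(f"{term} - {n} instance")
```

The six terms printed correspond to the list in the question: three unigrams, two bigrams, and one trigram.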

{
  "settings": {
    "analysis": {
      "analyzer": {
        "evolutionAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle"
          ]
        }
      },
      "filter": {
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_",
          "enable_position_increments": "false"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "1",
          "max_shingle_size": "3"
        }
      }
    }
  }
}

You can test the analyzer with the _analyze API endpoint.
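As a rough sketch of such a test, the snippet below builds a request for the `_analyze` endpoint using only the Python standard library. The node URL `http://localhost:9200` and the index name `my_index` are assumptions for illustration, the index is assumed to have been created with the settings above, and the exact request shape may differ between Elasticsearch versions.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node
INDEX = "my_index"                # hypothetical index name


def analyze_request(text, analyzer="evolutionAnalyzer"):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{INDEX}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(text):
    """Send the request and return the token strings from the response."""
    with urllib.request.urlopen(analyze_request(text)) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]
```

Calling `fetch_tokens("The red car drives.")` against a running node should list the tokens the analyzer emits, which you can compare with the terms you expect to see in the term vectors.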
