Question

我正在尝试使用path_hierarchy tokenizer索引路径，但它似乎只是标记化我提供的路径的一半。我试过不同的路径，结果似乎是一样的。

我的设置是 -

{
    "settings" : { 
        "number_of_shards" : 5,
        "number_of_replicas" : 0,
        "analysis":{
            "analyzer":{
                "keylower":{
                    "type": "custom",
                    "tokenizer":"keyword",
                    "filter":"lowercase"
                },
                "path_analyzer": {
                    "type": "custom",
                    "tokenizer": "path_tokenizer",
                    "filter": [ "lowercase", "asciifolding", "path_ngrams" ]
                },
                "code_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "asciifolding", "code_stemmer" ]
                },
                "not_analyzed": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [ "lowercase", "asciifolding", "code_stemmer" ]
                }
            },
            "tokenizer": {
                "path_tokenizer": {
                  "type": "path_hierarchy"
                }
            },
            "filter": {
                "path_ngrams": {
                    "type": "edgeNGram",
                    "min_gram": 3,
                    "max_gram": 15
                },
                "code_stemmer": {
                    "type": "stemmer",
                    "name": "minimal_english"
                }
            }
        }
    }
}

我的映射如下 -

{
  "dynamic": "strict",
  "properties": {
    "depot_path": {
      "type": "string",
      "analyzer": "path_analyzer"
    }
  },
  "_all": {
      "store": "yes",
      "analyzer": "english"
  }
}

我提供"//cm/mirror/v1.2/Kolkata/ixin-packages/builds/"作为depot_path进行分析我发现令牌形成如下 -

               "key": "//c",
               "key": "//cm",
               "key": "//cm/",
               "key": "//cm/m",
               "key": "//cm/mi",
               "key": "//cm/mir",
               "key": "//cm/mirr",
               "key": "//cm/mirro",
               "key": "//cm/mirror",
               "key": "//cm/mirror/",
               "key": "//cm/mirror/v",
               "key": "//cm/mirror/v1",
               "key": "//cm/mirror/v1.",

为什么整个路径没有被标记化？

我的预期输出是将令牌一直形成//cm/mirror/v1.2/Kolkata/ixin-packages/builds/

我尝试增加缓冲区大小但没有运气。有谁知道我做错了什么？

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html，

Answer 1

"max_gram": 15将令牌大小限制为15.如果增加"max_gram"，您会看到进一步的路径将被标记化。

以下是我的环境中的示例。

"max_gram" :15 
input path : /var/log/www/html/web/
path_analyser tokenized this path upto /var/log/www/ht i.e. 15 characters

 "max_gram" :100
    input path : /var/log/www/html/web/WANTED
    path_analyser tokenized this path upto /var/log/www/html/web/WANTED i.e. 28  characters <100

Answer 2

这是因为您已将"max_gram"的值设置为15。因此，您会注意到生成的最大标记（“// cm / mirror / v1。”）的长度为15。将其更改为非常大的数字，您将获得所需的令牌。

Elasticsearch path_hierarchy标记了路径的一半

2 个答案: