Dot in field is not treated as a term separator by the analyzer

Time: 2016-10-19 15:26:23

Tags: elasticsearch

I have the following mapping (simplified) for my indexed documents:

{
    "documents": {
        "mappings": {
            "document": {
                "properties": {
                    "filename": {
                        "type": "string",
                        "fields": {
                            "lower_case_sort": {
                                "type": "string",
                                "analyzer": "case_insensitive_sort"
                            },
                            "raw": {
                                "type": "string",
                                "index": "not_analyzed"
                            }
                        }
                    }
                }
            }
        }
    }
}

I put two documents into this index:

{
    "_index": "documents",
    "_type": "document",
    "_id": "777",
    "_source": {
        "filename": "text.txt"
    }
}

...

{
    "_index": "documents",
    "_type": "document",
    "_id": "888",
    "_source": {
        "filename": "text 123.txt"
    }
}

Running a query_string or simple_query_string query for "text", I expected to get both documents back. They should match, since the filenames are "text.txt" and "text 123.txt".

http://localhost:9200/defiant/_search?q=text

However, only the document named "text 123.txt" is found. "text.txt" only matches if I search for "text.*" or "text.txt" or "text.???", i.e. I have to include the dot in the filename.

Here is the explain result for document ID 777 (text.txt):

curl -XGET 'http://localhost:9200/documents/document/777/_explain' -d '{"query": {"query_string" : {"query" : "text"}}}'

->

{
    "_index": "documents",
    "_type": "document",
    "_id": "777",
    "matched": false,
    "explanation": {
        "value": 0.0,
        "description": "Failure to meet condition(s) of required/prohibited clause(s)",
        "details": [{
            "value": 0.0,
            "description": "no match on required clause (_all:text)",
            "details": [{
                "value": 0.0,
                "description": "no matching term",
                "details": []
            }]
        }, {
            "value": 0.0,
            "description": "match on required clause, product of:",
            "details": [{
                "value": 0.0,
                "description": "# clause",
                "details": []
            }, {
                "value": 0.47650534,
                "description": "_type:document, product of:",
                "details": [{
                    "value": 1.0,
                    "description": "boost",
                    "details": []
                }, {
                    "value": 0.47650534,
                    "description": "queryNorm",
                    "details": []
                }]
            }]
        }]
    }
}

Did I mess up the mapping? I would have expected '.' to be treated as a term separator when documents are indexed...

Edit: settings for case_insensitive_sort

{
    "documents": {
        "settings": {
            "index": {
                "creation_date": "1473169458336",
                "analysis": {
                    "analyzer": {
                        "case_insensitive_sort": {
                            "filter": [
                                "lowercase"
                            ],
                            "tokenizer": "keyword"
                        }
                    }
                }
            }
        }
    }
}
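For context, the case_insensitive_sort analyzer above combines the keyword tokenizer with a lowercase filter, so the whole filename becomes a single lowercased token. A minimal Python sketch of that behavior (the function name is mine, not an Elasticsearch API):

```python
def case_insensitive_sort(text):
    """Approximate a keyword tokenizer followed by a lowercase
    filter: the entire input becomes one lowercased token."""
    return [text.lower()]

print(case_insensitive_sort("Text 123.TXT"))  # ['text 123.txt']
```

This is why the lower_case_sort sub-field is useful for sorting but does not help with term-level search: it never splits the filename at all.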

1 Answer:

Answer 0 (score: 1)

This is the expected behavior of the standard analyzer (the default analyzer): it uses the standard tokenizer, and according to the algorithm that tokenizer follows, a dot between letters is not treated as a separator character.

You can verify this with the analyze API:
curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer" : "standard",
  "text" : "test.txt"
}'

which produces only a single token:

{
  "tokens": [
    {
      "token": "test.txt",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
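The standard tokenizer follows the Unicode word-break rules, under which a dot flanked by alphanumerics does not end a word, while whitespace always does. A very rough Python approximation of that rule (my own sketch, not the real Lucene algorithm, and it does not reproduce every edge case) illustrates why "test.txt" stays whole:

```python
import re

def approx_standard_tokens(text):
    """Rough sketch of the standard tokenizer's word-break behavior:
    a dot between alphanumerics does not split a token, whitespace
    does. Not the real Lucene implementation."""
    return re.findall(r"\w+(?:\.\w+)*", text)

print(approx_standard_tokens("test.txt"))  # a single token: ['test.txt']
```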

You can use a pattern replace char filter to replace the dots with spaces.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "replace_dot"
          ]
        }
      },
      "char_filter": {
        "replace_dot": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      }
    }
  }
}
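In effect, the char filter rewrites the input before tokenization, so the tokenizer then splits on the inserted spaces. A small Python sketch of the pipeline (my own helper names, assuming the pattern_replace settings above, with a plain whitespace split standing in for the standard tokenizer):

```python
import re

def replace_dot(text):
    """Mimic the pattern_replace char filter: '\\.' -> ' '."""
    return re.sub(r"\.", " ", text)

def my_analyzer_tokens(text):
    """Apply the char filter first, then split on whitespace as a
    stand-in for the standard tokenizer."""
    return replace_dot(text).split()

print(my_analyzer_tokens("text.txt"))      # ['text', 'txt']
print(my_analyzer_tokens("text 123.txt"))  # ['text', '123', 'txt']
```

With this analyzer on the filename field, a search for "text" would match both documents, since "text" is now indexed as its own term.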

You will have to reindex your documents; after that you should get the desired results. The analyze API is very handy for checking how your documents are stored in the inverted index.

Update:

You have to specify the name of the field you want to search. The request below looks for "text" in the _all field, which by default uses the standard analyzer.

http://localhost:9200/defiant/_search?q=text

I think the following query should give you the result you want:

curl -XGET 'http://localhost:9200/twitter/_search?q=filename:text'