I have the following mapping for my documents index (simplified):
{
"documents": {
"mappings": {
"document": {
"properties": {
"filename": {
"type": "string",
"fields": {
"lower_case_sort": {
"type": "string",
"analyzer": "case_insensitive_sort"
},
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
}
I put two documents into this index:
{
"_index": "documents",
"_type": "document",
"_id": "777",
"_source": {
"filename": "text.txt"
}
}
...
{
"_index": "documents",
"_type": "document",
"_id": "888",
"_source": {
"filename": "text 123.txt"
}
}
Running a query_string or simple_query_string query for "text", I expected to get both documents back. They should both match, since the filenames are "text.txt" and "text 123.txt".
http://localhost:9200/defiant/_search?q=text
However, I only find the file named "text 123.txt". The document "text.txt" matches only when I search for "text.*", "text.txt", or "text.???", i.e. when I include the dot from the filename in the query.
Here is the explain result for document id 777 (text.txt):
curl -XGET 'http://localhost:9200/documents/document/777/_explain' -d '{"query": {"query_string" : {"query" : "text"}}}'
which returns:
{
"_index": "documents",
"_type": "document",
"_id": "777",
"matched": false,
"explanation": {
"value": 0.0,
"description": "Failure to meet condition(s) of required/prohibited clause(s)",
"details": [{
"value": 0.0,
"description": "no match on required clause (_all:text)",
"details": [{
"value": 0.0,
"description": "no matching term",
"details": []
}]
}, {
"value": 0.0,
"description": "match on required clause, product of:",
"details": [{
"value": 0.0,
"description": "# clause",
"details": []
}, {
"value": 0.47650534,
"description": "_type:document, product of:",
"details": [{
"value": 1.0,
"description": "boost",
"details": []
}, {
"value": 0.47650534,
"description": "queryNorm",
"details": []
}]
}]
}]
}
}
Did I mess up the mapping? I expected the '.' to be treated as a term separator when the document is analyzed at index time...
EDIT: settings for case_insensitive_sort
{
"documents": {
"settings": {
"index": {
"creation_date": "1473169458336",
"analysis": {
"analyzer": {
"case_insensitive_sort": {
"filter": [
"lowercase"
],
"tokenizer": "keyword"
}
}
}
}
}
}
}
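As an aside, the case_insensitive_sort analyzer above cannot help with this kind of search either. A minimal Python sketch of what that analyzer does (the keyword tokenizer emits the whole input as a single token, then the lowercase filter lowercases it; the function name here is just illustrative):

```python
def case_insensitive_sort(text):
    # The keyword tokenizer emits the ENTIRE input as one token;
    # the lowercase filter then lowercases it. This is useful for
    # sorting, but not for matching individual words inside the value.
    return [text.lower()]

print(case_insensitive_sort("Text 123.TXT"))  # ['text 123.txt']
```

Because the whole filename stays one token, a word-level query like "text" can never match against this sub-field.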
Answer 0 (score: 1)
This is the expected behavior of the standard analyzer (the default analyzer): it uses the standard tokenizer, and according to the algorithm it follows (Unicode Text Segmentation), a dot between letters or digits is not treated as a separator character.
You can verify this with the analyze API:
curl -XGET 'localhost:9200/_analyze' -d '
{
"analyzer" : "standard",
"text" : "test.txt"
}'
which produces only a single token:
{
"tokens": [
{
"token": "test.txt",
"start_offset": 0,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 0
}
]
}
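A rough Python sketch of why this happens (this is only an approximation for these examples; the real standard tokenizer implements full Unicode Text Segmentation):

```python
import re

def standard_like_tokens(text):
    # Approximation of the standard tokenizer + lowercase filter:
    # a dot BETWEEN alphanumeric characters does not split the token,
    # while whitespace and other punctuation do.
    return re.findall(r"\w+(?:\.\w+)*", text.lower())

print(standard_like_tokens("text.txt"))      # ['text.txt']
print(standard_like_tokens("text 123.txt"))  # ['text', '123.txt']
```

This explains the observed behavior: "text 123.txt" yields a standalone token "text" and therefore matches the query, while "text.txt" is indexed as a single token that only a query containing the dot can match.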
You can use a pattern_replace character filter to replace dots with spaces:
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"char_filter": [
"replace_dot"
]
}
},
"char_filter": {
"replace_dot": {
"type": "pattern_replace",
"pattern": "\\.",
"replacement": " "
}
}
}
}
}
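A minimal sketch of the effect of this analyzer (assuming the settings above; this emulates, not reproduces, the real analysis chain):

```python
import re

def analyze_with_replace_dot(text):
    # The pattern_replace char filter rewrites every '.' to a space
    # BEFORE tokenization, so the standard tokenizer then splits
    # "text.txt" into two separate tokens.
    filtered = re.sub(r"\.", " ", text)
    return filtered.split()

print(analyze_with_replace_dot("text.txt"))  # ['text', 'txt']
```

With "text" indexed as its own token, a plain query for "text" now matches both documents.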
You will have to reindex your documents to get the desired result. The analyze API is very handy for checking how your documents are stored in the inverted index.
UPDATE
You have to specify the name of the field you want to search in. The request below looks for "text" in the _all field, which is analyzed with the standard analyzer by default:
http://localhost:9200/defiant/_search?q=text
I think the following query should give you the result you want:
curl -XGET 'http://localhost:9200/twitter/_search?q=filename:text'
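For reference, the same field-scoped search can be written as a request-body query (a sketch assuming the mapping shown above; it targets the top-level filename field, not the raw or lower_case_sort sub-fields):

```json
{
  "query": {
    "query_string": {
      "query": "text",
      "fields": ["filename"]
    }
  }
}
```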