Regular expression in Elasticsearch?

Date: 2015-02-13 11:59:53

Tags: elasticsearch

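The regular expression pattern for the tokenizer in Elasticsearch should match c# and c++ as separate terms. We currently have an analyzer for this, but when we search for c#, c++ also shows up as a match, and vice versa.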

1 answer:

Answer 0 (score: 1)

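Assuming I understand you correctly, one thing you can do is set up an analyzer that tokenizes only on whitespace. The default standard analyzer tokenizes on symbols as well as whitespace, so "c++" and "c#" are both converted to the term "c", and both documents will match a search for either one.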
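To make the collision concrete, here is a minimal Python sketch. It does not call Elasticsearch; `standard_like_tokens` and `whitespace_tokens` are hypothetical helpers that only approximate the two tokenization strategies (the real standard tokenizer follows Unicode text segmentation rules, which this regex does not reproduce).

```python
import re

def standard_like_tokens(text):
    # Rough stand-in for the standard analyzer: keep runs of ASCII
    # letters/digits, drop symbols, then lowercase.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def whitespace_tokens(text):
    # Whitespace tokenizer plus lowercase filter: split on whitespace only,
    # so symbols stay attached to the token.
    return [t.lower() for t in text.split()]

for term in ("C++", "C#"):
    print(term, standard_like_tokens(term), whitespace_tokens(term))
# prints:
# C++ ['c'] ['c++']
# C# ['c'] ['c#']
```

With symbol-stripping tokenization both inputs collapse to the same term, while whitespace-only tokenization keeps them distinct.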

One way to work around this (although it may cause other problems) is to use an analyzer like this:

"whitespace_analyzer": {
   "type": "custom",
   "tokenizer": "whitespace",
   "filter": [
      "lowercase",
      "asciifolding"
   ]
}

Or, as a complete toy example, I can set up an index like this:

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
         "analyzer": {
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "analyzer": "whitespace_analyzer"
            }
         }
      }
   }
}

Then add a few documents with the bulk API:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc", "_id":1}}
{"text_field": "some text with C++"}
{"index":{"_index":"test_index","_type":"doc", "_id":2}}
{"text_field": "some text with C#"}
{"index":{"_index":"test_index","_type":"doc", "_id":3}}
{"text_field": "some text with Objective-C"}

Now a search for "C++" returns only the document that contains that term:

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "C++"
        }
    }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.70273256,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.70273256,
            "_source": {
               "text_field": "some text with C++"
            }
         }
      ]
   }
}

Likewise for "C#":

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "C#"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.70273256,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 0.70273256,
            "_source": {
               "text_field": "some text with C#"
            }
         }
      ]
   }
}

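This solution may or may not end up giving you what you want, because it will not tokenize on punctuation either. For example, "C++." at the end of a sentence is indexed as the single token "c++.", which a search for "C++" will not match.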

Here is the code I used:

http://sense.qbox.io/gist/92871671ea7313356cbbd1ea900c3d55944bd20b
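The punctuation caveat can be illustrated with a small Python sketch that imitates the analyzer chain (whitespace tokenizer, lowercase, asciifolding) without contacting Elasticsearch. The helpers `fold_ascii` and `analyze` are hypothetical, and the folding step is only an approximation of the real asciifolding filter.

```python
import unicodedata

def fold_ascii(token):
    # Approximation of asciifolding: decompose accented characters and
    # drop the combining marks that result.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text):
    # whitespace tokenizer -> lowercase -> (approximate) asciifolding
    return [fold_ascii(tok.lower()) for tok in text.split()]

doc_tokens = analyze("some text with Objective-C, C#, and C++.")
query_tokens = analyze("C++")

print(doc_tokens)
# prints ['some', 'text', 'with', 'objective-c,', 'c#,', 'and', 'c++.']
print(any(q in doc_tokens for q in query_tokens))
# prints False
```

Because the trailing comma and period stay attached to their tokens, a sentence that mentions all three languages would not be found by a search for any of them.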

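Edit: Here is a slightly more advanced solution that helps with the punctuation problem. I got the idea from this article. The basic idea is that you can declare certain symbol characters to be alphanumeric characters.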

So I created the index with a custom token filter, then added the same three documents plus one more that the previous solution would not handle correctly:

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
         "filter": {
            "symbol_filter": {
               "type": "word_delimiter",
               "type_table": [
                  "# => ALPHANUM",
                  "+ => ALPHANUM",
                  "@ => ALPHANUM"
               ]
            }
         },
         "analyzer": {
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "symbol_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "text_field": {
               "type": "string",
               "analyzer": "whitespace_analyzer"
            }
         }
      }
   }
}

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc", "_id":1}}
{"text_field": "some text with C++"}
{"index":{"_index":"test_index","_type":"doc", "_id":2}}
{"text_field": "some text with C#"}
{"index":{"_index":"test_index","_type":"doc", "_id":3}}
{"text_field": "some text with Objective-C"}
{"index":{"_index":"test_index","_type":"doc", "_id":4}}
{"text_field": "some text with Objective-C, C#, and C++."}

Now a query for "C++" returns the documents that contain that token:

POST /test_index/_search
{
    "query": {
        "match": {
           "text_field": "C++"
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.643841,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.643841,
            "_source": {
               "text_field": "some text with C++"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "4",
            "_score": 0.40240064,
            "_source": {
               "text_field": "some text with Objective-C, C#, and C++."
            }
         }
      ]
   }
}

Here is the code for this example:

http://sense.qbox.io/gist/5c583b4e99b8f3b088925ccdb894695aa0c257cb
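For intuition about why documents 1 and 4 both match, here is a rough Python imitation of the second analyzer. It is a sketch under simplifying assumptions, not the actual word_delimiter implementation: it treats `#`, `+`, and `@` as word characters and splits on everything else, but it ignores the filter's other behaviors (such as splitting at letter/digit boundaries) and leaves out asciifolding. The names `PART`, `analyze`, and `matches` are hypothetical.

```python
import re

# A token part is a run of lowercase ASCII letters, digits, or one of the
# characters that the type_table maps to ALPHANUM (#, +, @).
PART = re.compile(r"[a-z0-9#+@]+")

def analyze(text):
    tokens = []
    for tok in text.lower().split():      # whitespace tokenizer + lowercase
        tokens.extend(PART.findall(tok))  # split on the remaining delimiters
    return tokens

def matches(query, doc):
    # Simplified match query: a hit when any query term is in the document.
    doc_terms = set(analyze(doc))
    return any(term in doc_terms for term in analyze(query))

docs = {
    1: "some text with C++",
    2: "some text with C#",
    3: "some text with Objective-C",
    4: "some text with Objective-C, C#, and C++.",
}

print(analyze(docs[4]))
# prints ['some', 'text', 'with', 'objective', 'c', 'c#', 'and', 'c++']
print([doc_id for doc_id, text in docs.items() if matches("C++", text)])
# prints [1, 4]
```

One side effect visible in the sketch: "Objective-C" is split into "objective" and "c", so a search for plain "C" would match those documents as well.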