CamelCase search with Elasticsearch

Date: 2017-01-18 11:06:22

Tags: elasticsearch full-text-search analyzer camelcasing

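I want to configure Elasticsearch so that a search for "JaFNam" gives "JavaFileName" a good score.

I tried to build an analyzer that combines a CamelCase pattern analyzer with an edge_ngram tokenizer. I expected it to produce terms like these: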

J F N Ja Fi Na Jav Fil Nam Java File Name

But the tokenizer seems to have no effect. I keep getting these terms:

Java File Name

What would the correct Elasticsearch configuration look like?

Example code:

curl -XPUT    'http://127.0.0.1:9010/hello?pretty=1' -d'
{
  "settings":{
    "analysis":{
      "analyzer":{
        "camel":{
          "type":"pattern",
          "pattern":"([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])",
          "filters": ["edge_ngram"]
        }
      }
    }
  }
}
'
curl -XGET    'http://127.0.0.1:9010/hello/_analyze?pretty=1' -d'
{
  "analyzer":"camel",
  "text":"JavaFileName"
}'

Result:

{
  "tokens" : [ {
    "token" : "java",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "file",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "name",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "word",
    "position" : 2
  } ]
}

1 answer:

Answer 0 (score: 2)

Your analyzer definition is not correct. You need a tokenizer and a filter array; as written, your analyzer does not work. Try it like this:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "tokenizer": "my_pattern",
          "filter": [
            "my_gram"
          ]
        }
      },
      "filter": {
        "my_gram": {
          "type": "edge_ngram",
          "max_gram": 10
        }
      },
      "tokenizer": {
        "my_pattern": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}
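
To make the intended token stream easier to see without a running cluster, here is a rough Python simulation of what the configuration above should do: split on CamelCase boundaries, then emit edge n-grams for each piece. It is only a sketch under assumptions, not Elasticsearch's implementation. Python's re module has no \p{L} classes or && intersection, so the pattern is an ASCII-only approximation, and the function names (tokenize, edge_ngrams, analyze) are made up for illustration. The sketch assumes min_gram defaults to 1 and keeps the original case, since the analyzer above lists no lowercase filter. Splitting on zero-width matches requires Python 3.7 or later.

```python
import re

# ASCII-only approximation of the "my_pattern" tokenizer's split pattern.
# (Assumption: [A-Z]/[a-z] stand in for \p{Lu} and "letter but not uppercase".)
SPLIT_PATTERN = re.compile(
    r"[^A-Za-z0-9]+"              # runs of characters that are neither letters nor digits
    r"|(?<=\D)(?=\d)"             # boundary: non-digit followed by digit
    r"|(?<=\d)(?=\D)"             # boundary: digit followed by non-digit
    r"|(?<=[a-z])(?=[A-Z])"       # boundary: lowercase followed by uppercase
    r"|(?<=[A-Z])(?=[A-Z][a-z])"  # boundary: end of an uppercase run (e.g. HTTP|Server)
)


def tokenize(text):
    """Split text at the boundaries above and drop empty pieces."""
    return [piece for piece in SPLIT_PATTERN.split(text) if piece]


def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of token with lengths min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(text):
    """Tokenize, then expand every token into its edge n-grams."""
    return [gram for token in tokenize(text) for gram in edge_ngrams(token)]


if __name__ == "__main__":
    print(tokenize("JavaFileName"))  # ['Java', 'File', 'Name']
    print(analyze("JavaFileName"))
    print(tokenize("JaFNam"))        # ['Ja', 'F', 'Nam']
```

In this simulation every piece of the query "JaFNam" (Ja, F, Nam) appears among the n-grams produced for "JavaFileName", which is the overlap the question is after. To confirm the real behavior, rerun the _analyze request from the question against an index created with the settings above.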