Question

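I'm struggling to make iPhone match when searching for iphone in Elasticsearch.

Since I have source code at stake, I definitely need a CamelCase tokenizer, but it appears to break iPhone into two terms, so iphone can't be found.

Does anyone know a way to add exceptions to the splitting of camelCase words into tokens (camel + case)?

Update: to make it clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].

Is there any other solution?

Update 2: @ChintanShah's answer suggests a different approach that gives us even more: NullPointerException is tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is definitely more useful from the point of view of whoever is searching. Indexing is also faster! The price to pay is index size, but it is a superior solution.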
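
To make the token list in Update 2 concrete, here is a minimal Python sketch that builds every run of adjacent sub-words and joins each run without a separator. It only illustrates the resulting token set and why the index grows (n parts give n(n+1)/2 tokens). It does not say which Elasticsearch filter produces the list, and the function name `contiguous_concatenations` is made up for this example.

```python
def contiguous_concatenations(parts):
    """Return every run of adjacent parts joined together, shortest runs first."""
    n = len(parts)
    tokens = []
    for size in range(1, n + 1):              # run length: 1, 2, ..., n
        for start in range(n - size + 1):     # every start position for that length
            tokens.append("".join(parts[start:start + size]))
    return tokens


print(contiguous_concatenations(["null", "pointer", "exception"]))
# ['null', 'pointer', 'exception', 'nullpointer', 'pointerexception', 'nullpointerexception']
```

With three sub-words this gives six tokens instead of three, which is the index-size cost mentioned above.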

Answer 1

You can meet your requirement with the word_delimiter token filter. This is my setup:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_filter",
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "camel_filter": {
          "type": "word_delimiter",
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "split_on_numerics": false,
          "protected_words": [
            "iPhone",
            "WiFi"
          ]
        }
      }
    }
  },
  "mappings": {
  }
}

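This splits words on case changes, so NullPointerException is tokenized as null, pointer and exception, while iPhone and WiFi stay as they are because they are protected. word_delimiter has a lot of options for flexibility. You can also set preserve_original, which keeps the original token alongside the split parts and will help you a lot.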

GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer

Result:

{
   "tokens": [
      {
         "token": "iphone",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 1
      }
   ]
}

Now with:

GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer

Result:

{
   "tokens": [
      {
         "token": "null",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 1
      },
      {
         "token": "pointer",
         "start_offset": 4,
         "end_offset": 11,
         "type": "word",
         "position": 2
      },
      {
         "token": "exception",
         "start_offset": 11,
         "end_offset": 20,
         "type": "word",
         "position": 3
      }
   ]
}
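
If it helps to see the logic outside Elasticsearch, the two results above can be imitated with a rough Python sketch. This is not the filter's real implementation: it only splits on lowercase-to-uppercase transitions, skips asciifolding and the other word_delimiter rules, and `camel_analyze` plus its parameters are names invented for this example. It needs Python 3.7 or later, because it splits on a zero-width regex match.

```python
import re

# Hypothetical stand-in for the filter's "protected_words" list.
PROTECTED_WORDS = {"iPhone", "WiFi"}

# Zero-width position between a lowercase letter and the uppercase letter after it.
_CASE_CHANGE = re.compile(r"(?<=[a-z])(?=[A-Z])")


def camel_analyze(text, protected=PROTECTED_WORDS, preserve_original=False):
    """Imitate whitespace tokenizer -> case-change split -> lowercase."""
    tokens = []
    for word in text.split():                 # whitespace tokenizer
        if word in protected:
            parts = [word]                    # protected words are never split
        else:
            parts = _CASE_CHANGE.split(word)  # split where the case changes
            if preserve_original and len(parts) > 1:
                parts = [word] + parts        # also keep the unsplit token
        tokens.extend(part.lower() for part in parts)  # lowercase filter
    return tokens


print(camel_analyze("iPhone"))                # ['iphone']
print(camel_analyze("NullPointerException"))  # ['null', 'pointer', 'exception']
```

Calling `camel_analyze("NullPointerException", preserve_original=True)` additionally yields `nullpointerexception` in front of the three parts, which is the idea behind the preserve_original option.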

Another approach is to analyze your field twice with different analyzers, but I feel word_delimiter will do the trick.
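
For completeness, analyzing a field twice is usually done with a multi-field mapping. The sketch below builds such a mapping as a Python dict and prints it as JSON. The field name `message` and the sub-field name `camel` are hypothetical, and the typeless `properties` layout with `"type": "text"` is an assumption that matches recent Elasticsearch versions; older releases expect a mapping type name and `"string"` instead.

```python
import json

# Assumptions: "message" and "camel" are made-up names; "camel_analyzer" is the
# analyzer from the settings above; the layout follows recent Elasticsearch versions.
mapping = {
    "mappings": {
        "properties": {
            "message": {
                "type": "text",
                "analyzer": "standard",           # first analysis of the field
                "fields": {
                    "camel": {                    # second analysis of the same text
                        "type": "text",
                        "analyzer": "camel_analyzer",
                    }
                },
            }
        }
    }
}

print(json.dumps(mapping, indent=2))
```

A query could then target both `message` and `message.camel`.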

Does this help?
