Question

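I am indexing messages from all over the world, but mainly from Thailand. The indexed messages will most likely contain English or Thai.

Does anyone know the best way to set up an ES index so that it returns good search results for both Thai and English searches?

I have tried using this setting: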

{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "type" : "cjk"
                }
            }
        }
    }
}

When searching in Thai, the results from the cjk analyzer are not great. I don't actually know why that is, but any help is much appreciated!

Answer 1

You can implement a custom Thai analyzer, as described here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html#thai-analyzer

To make it more useful, also add a new filter that uses org.apache.lucene.analysis.th.ThaiWordFilterFactory from Apache Lucene:

curl -X PUT http://localhost:9200/test -d '{
  "settings":{
    "analysis":{
      "analyzer":{
        "default":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":[ "standard","thai","lowercase", "stop", "kstem" ]
        }
      },
      "filter": {
        "thai": {
          "type": "org.apache.lucene.analysis.th.ThaiWordFilterFactory"
        }
      }
    }
  }
}'

Then you can test it:

http://localhost:9200/test/_analyze?analyzer=thai&text=สวัสดี+hello
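
Because the test text contains Thai characters, the query string has to be percent-encoded as UTF-8 before it is sent. Here is a minimal Python sketch of building that URL. The helper name build_analyze_url is made up for illustration, and the host, index name and analyzer name are simply the ones used in this answer. It only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def build_analyze_url(base_url, index, analyzer, text):
    """Return an _analyze URL whose query string is percent-encoded as UTF-8.

    Hypothetical helper for illustration; it only builds the string.
    """
    # urlencode percent-encodes non-ASCII characters and turns spaces into '+'.
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base_url}/{index}/_analyze?{query}"


url = build_analyze_url("http://localhost:9200", "test", "thai", "สวัสดี hello")
print(url)
```

Newer Elasticsearch releases may expect the analyzer and text in a JSON request body instead of query parameters, so check the documentation for your version.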

Hope this helps.

Answer 2

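The cjk analyzer generates bigrams for Chinese, Japanese and Korean, but not for Thai. Because Thai is written without spaces between words, that analyzer will not split Thai sentences into words. The recommended analyzer for Thai is the thai analyzer.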

{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "type" : "thai"
                }
            }
        }
    }
}
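
To make the point about bigrams and unspaced text concrete, here is a deliberately simplified Python illustration. It is not how Lucene implements the cjk analyzer; char_bigrams and whitespace_tokens are hypothetical helpers that only show the general idea: overlapping two-character tokens work reasonably for CJK text, while splitting on spaces leaves an unspaced Thai sentence as a single token.

```python
def char_bigrams(text):
    """Return overlapping two-character tokens, e.g. 'abc' -> ['ab', 'bc']."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def whitespace_tokens(text):
    """Split on whitespace only, which is all a naive tokenizer can do."""
    return text.split()


# Bigrams give searchable units for a CJK string.
print(char_bigrams("東京都"))

# A Thai sentence has no spaces, so whitespace splitting keeps it whole.
thai_sentence = "ฉันรักภาษาไทย"
print(whitespace_tokens(thai_sentence))
```

A Thai-aware tokenizer uses a dictionary or similar word-boundary rules to find the words inside such a sentence, which is why the thai analyzer is the better fit here.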

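You can also analyze Thai data with the ICU Analysis plugin, which provides icu_tokenizer. This tokenizer supports Thai, Lao, Chinese, Japanese and Korean. You can find the plugin via the following link: ICU Analysis Plugin

After installing the plugin, you can use the tokenizer like this: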

{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer"
                }
            }
        }
    }
}
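
Since the messages also contain English, it may be worth lowercasing tokens so that English searches are case-insensitive. Below is a minimal Python sketch that builds such a settings body. The build_index_settings helper is hypothetical, and adding the built-in lowercase token filter is my own assumption rather than part of the configuration above. The sketch only produces the JSON; sending it to the cluster is left out.

```python
import json


def build_index_settings(tokenizer="icu_tokenizer", token_filters=("lowercase",)):
    """Build an index settings body with a custom default analyzer.

    Hypothetical helper: the tokenizer default matches the config above,
    the lowercase filter is an added assumption for the English text.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "default": {
                        "type": "custom",
                        "tokenizer": tokenizer,
                        "filter": list(token_filters),
                    }
                }
            }
        }
    }


body = json.dumps(build_index_settings(), indent=4)
print(body)
```

The printed JSON can be used as the request body when creating the index, in the same way as the settings shown above.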
