Question

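I am doing text search with ElasticSearch and have run into a problem with the term query. What I do below is basically:

Index a document containing a Chinese string (你好).
Query with the text query: the document is returned.
Query with the term query: nothing is returned.

So why does this happen, and how can I fix it?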

➜  curl -XPOST 'http://localhost:9200/test/test/' -d '{ "name" : "你好" }'

{
  "ok": true,
  "_index": "test",
  "_type": "test",
  "_id": "VdV8K26-QyiSCvDrUN00Nw",
  "_version": 1
}

➜  curl -XGET 'http://localhost:9200/test/test/_mapping?pretty=1'

{
  "test" : {
    "properties" : {
      "name" : {
        "type" : "string"
      }
    }
  }
}

➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1'

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "VdV8K26-QyiSCvDrUN00Nw",
        "_score": 1.0,
        "_source": {
          "name": "你好"
        }
      }
    ]
  }
}

➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{
  "query": {
    "text": {
      "name": "你好"
    }
  }
}'

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.8838835,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "VdV8K26-QyiSCvDrUN00Nw",
        "_score": 0.8838835,
        "_source": {
          "name": "你好"
        }
      }
    ]
  }
}

➜  curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{
  "query": {
    "term": {
      "name": "你好"
    }
  }
}'

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

Answer 1

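From the ElasticSearch documentation on the term query:

Matches documents that have fields that contain a term (not analyzed).

The name field is analyzed by default, so the term query cannot find it: a term query does not analyze its input and only matches exact terms as they are stored in the index. You can try indexing another document with a different, non-Chinese name (for example one with several words or capital letters), and a term query for the full name will not find it either. If you are now wondering why the following search query does return a result: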

curl -XGET 'http://localhost:9200/test/test/_search?pretty=1' -d '{"query" : {"term" : { "name" : "好" }}}'

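That is because each token is stored as a separate, not-analyzed term. If you had indexed a document with the name "你好吗", a term query would not find it by "好吗" or "你好" either, but it would find it by "你", "好", or "吗".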
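One way to check this tokenization yourself is the _analyze endpoint. The following is a minimal sketch, assuming ElasticSearch is listening on localhost:9200 and that your version accepts the analyzer name as a URL parameter (as the older releases used in this question do). If the description above holds, the response should list one token per character.

```shell
# Text to run through the analyzer (same example as in the answer above).
TEXT='你好吗'

# Ask the standard analyzer how it would tokenize TEXT.
# Assumption: the server runs on localhost:9200; adjust the URL otherwise.
curl -s --max-time 5 -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty=1' -d "$TEXT" \
  || echo "ElasticSearch is not reachable on localhost:9200"
```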

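For Chinese, you may need to pay special attention to the analyzer used. For me, the standard analyzer seems good enough (it tokenizes Chinese phrases character by character rather than on spaces).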
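If you do need a term query to match the whole string, one possible fix is to map the field as not analyzed, so that the complete value is stored as a single term. This is only a sketch: the index name test2 is hypothetical, and the mapping uses the legacy string type with "index": "not_analyzed", which fits the ElasticSearch version shown in the question but not recent releases.

```shell
ES='http://localhost:9200'   # assumption: local default address

# Legacy string mapping: "not_analyzed" keeps the whole value as one term.
MAPPING='{
  "mappings": {
    "test": {
      "properties": {
        "name": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'

# Create the hypothetical index "test2" with that mapping.
curl -s --max-time 5 -XPUT "$ES/test2" -d "$MAPPING" \
  || echo "ElasticSearch is not reachable at $ES"

# Index the document; refresh=true makes it searchable right away.
curl -s --max-time 5 -XPOST "$ES/test2/test/?refresh=true" -d '{ "name" : "你好" }' \
  || echo "ElasticSearch is not reachable at $ES"

# The term query is now compared against the single stored term "你好".
curl -s --max-time 5 -XGET "$ES/test2/test/_search?pretty=1" \
  -d '{ "query": { "term": { "name": "你好" } } }' \
  || echo "ElasticSearch is not reachable at $ES"
```

The trade-off is that a not-analyzed field only matches the exact full value, so searching for "好" alone would no longer find the document.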

Answer 2

The default analyzer is not well suited to Asian languages. Try an analyzer such as this one: https://github.com/elasticsearch/elasticsearch-analysis-smartcn
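As a rough sketch of how such an analyzer could be applied to the name field: the example below assumes the plugin is already installed and that it registers an analyzer named smartcn (check the plugin's README for your version). The index name test3 is hypothetical, and the mapping uses the same legacy string type as the question.

```shell
ES='http://localhost:9200'   # assumption: local default address

# Assumption: the smartcn plugin is installed and exposes an analyzer
# called "smartcn"; the field is analyzed with it at index and search time.
SMARTCN_MAPPING='{
  "mappings": {
    "test": {
      "properties": {
        "name": { "type": "string", "analyzer": "smartcn" }
      }
    }
  }
}'

# Create the hypothetical index "test3" with that mapping.
curl -s --max-time 5 -XPUT "$ES/test3" -d "$SMARTCN_MAPPING" \
  || echo "ElasticSearch is not reachable at $ES"
```

With a word-based analyzer like this, the tokens stored in the index are words rather than single characters, so which term queries match depends on how the analyzer segments the text.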

Does ElasticSearch support Unicode / Chinese?
