Question

I am using ElasticSearch 0.90.7, so I don't think the answer to "What exactly does the Standard tokenfilter do in Elasticsearch?" applies (although what I am seeing is similar).

I create the index as follows:

curl -XDELETE "http://localhost:9200/testindex"
curl -XPOST "http://localhost:9200/testindex" -d'
{
  "mappings" : {
   "article" : {
     "properties" : {
       "text" : {
         "type" : "string"              
       }
     }
   }
 }
}'

I populate it with the following documents:

curl -XPUT "http://localhost:9200/testindex/article/1" -d'{
  "text": "file name. pdf"
}'

curl -XPUT "http://localhost:9200/testindex/article/2" -d'{
  "text": "file name.pdf"
}'

A search returns the following:

curl -XPOST "http://localhost:9200/testindex/_search" -d '{
  "fields": [],
  "query": {
    "query_string": {
      "default_field": "text",
      "query": "\"file name\""
    }
  }
}'

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.30685282,
    "hits": [
      {
        "_index": "testindex",
        "_type": "article",
        "_id": "1",
        "_score": 0.30685282
      }
    ]
  }
}

...given this, my guess is that the standard tokenizer is changing document #2 from "file name.pdf" to "file namepdf".

My questions are:

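Did I guess correctly?
If so: any ideas which tokenizer I can use to handle these cases? (Or do I need to process the text in my client before submitting it?)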

Answer 1

You can check this yourself using the Analyze API.
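A minimal sketch of such a check, assuming Elasticsearch 0.90.x is listening on localhost:9200 as in the question. The analyze helper and the ES_URL variable are names made up for this illustration; only the _analyze endpoint and its analyzer parameter come from Elasticsearch itself.

```shell
# Assumption: same host and port as the curl commands in the question.
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical helper: run a named analyzer over a piece of text.
#   $1 = analyzer name (for example: standard, simple)
#   $2 = text to analyze
analyze() {
  curl -s -XGET "$ES_URL/_analyze?analyzer=$1&pretty=true" -d "$2"
}
```

Running analyze standard 'file name.pdf' and analyze standard 'file name. pdf' prints one entry per token, which makes it easy to compare the two documents from the question.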

The standard analyzer produces the tokens file, name and pdf for "file name .pdf",

and the tokens file and name.pdf for "file name.pdf".

StandardAnalyzer, or more precisely StandardTokenizer, implements the Word Break rules of the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29, which include:

Do not break within sequences, such as "3.2"

Therefore, "name.pdf" is treated as a single word by StandardTokenizer.
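To see the effect without a running cluster, the rule can be roughly imitated with a regular expression. This is only an approximation, not Lucene's implementation: the hypothetical tokenize_standard_like function keeps a period that sits directly between two letters or digits and splits everywhere else, which is cruder than the actual UAX #29 rules.

```shell
# Rough stand-in for the standard analyzer (an assumption-laden sketch):
# lowercase the input, then emit runs of letters/digits, allowing a single
# period inside a token only when alphanumerics sit on both sides of it.
tokenize_standard_like() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | grep -oE '[[:alnum:]]+([.][[:alnum:]]+)*'
}

tokenize_standard_like 'file name. pdf'   # three lines: file, name, pdf
tokenize_standard_like 'file name.pdf'    # two lines: file, name.pdf
```

Because the second document contains the single token name.pdf rather than name, the phrase "file name" cannot match it.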

For your query, SimpleAnalyzer would work. You can use the Analyze API as well as the elasticsearch-inquisitor plugin to test the available analyzers.
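One way to try this out is to recreate the index from the question with the simple analyzer on the text field. The sketch below assumes the same testindex/article mapping and a node on localhost:9200; the function name recreate_index_with_simple_analyzer and the ES_URL variable are invented for this example.

```shell
# Assumption: same host and port as the curl commands in the question.
ES_URL="${ES_URL:-http://localhost:9200}"

# Hypothetical helper: drop testindex and recreate it so that the "text"
# field is analyzed with the built-in "simple" analyzer.
recreate_index_with_simple_analyzer() {
  curl -s -XDELETE "$ES_URL/testindex"
  curl -s -XPOST "$ES_URL/testindex" -d '{
    "mappings" : {
      "article" : {
        "properties" : {
          "text" : {
            "type" : "string",
            "analyzer" : "simple"
          }
        }
      }
    }
  }'
}
```

After calling the function, the two documents have to be indexed again before the phrase query is repeated; with this mapping both documents should be returned. Keep in mind that the simple analyzer splits at every non-letter character, so digits are dropped from the tokens as well.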

Elasticsearch standard tokenizer doesn't handle "a.b" entries?
