Question

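I am using ElasticSearch to store tweets that I receive from the Twitter Streaming API. Before storing them, I want to apply an English stemmer to the tweet content. I tried to do this with ElasticSearch analyzers, but with no luck.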

This is the template I am currently using:

PUT _template/twitter
{
  "template": "139*",
  "settings" : {
    "index":{
      "analysis":{
        "analyzer":{
          "english":{
            "type":"custom",
            "tokenizer":"standard",
            "filter":["lowercase", "en_stemmer", "stop_english", "asciifolding"]
          }
        },
        "filter":{
          "stop_english":{
            "type":"stop",
            "stopwords":["_english_"]
          },
          "en_stemmer" : {
            "type" : "stemmer",
            "name" : "english"
          }
        }
      }
    }
  },
  "mappings": {
    "tweet": {
      "_timestamp": {
        "enabled": true,
        "store": true,
        "index": "analyzed"
      },
      "_index": {
        "enabled": true,
        "store": true,
        "index": "analyzed"
      },
      "properties": {
        "geo": {
          "properties": {
            "coordinates": {
              "type": "geo_point"
            }
          }
        },
        "text": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}

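When I start streaming and the index is created, all the mappings I defined appear to be applied correctly, but the text is stored exactly as it comes from Twitter, completely raw. The index metadata shows: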

"settings" : {
    "index" : {
        "uuid" : "xIOkEcoySAeZORr7pJeTNg",
        "analysis" : {
            "filter" : {
                "en_stemmer" : {
                    "type" : "stemmer",
                    "name" : "english"
                 },
                 "stop_english" : {
                     "type" : "stop",
                     "stopwords" : [
                         "_english_"
                     ]
                 }
             },
             "analyzer" : {
                 "english" : {
                     "type" : "custom",
                     "filter" : [
                         "lowercase",
                         "en_stemmer",
                         "stop_english",
                         "asciifolding"
                     ],
                     "tokenizer" : "standard"
                 }
             }
         },
        "number_of_replicas" : "1",
        "number_of_shards" : "5",
        "version" : {
            "created" : "1010099"
        }
    }
},
"mappings" : {
    "tweet" : {
        [...]
        "text" : {
            "analyzer" : "english",
            "type" : "string"
        },
        [...]
    }
}

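What am I doing wrong? The analyzer seems to be applied correctly, but nothing happens :/

Thanks!

PS: This is the search query I use to see that the analyzer is not being applied: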

curl -XGET 'http://localhost:9200/_all/_search?pretty' -d '{
  "query": {
    "filtered": {
      "query": {
        "bool": {
          "should": [
            {
              "query_string": {
                "query": "_index:1397574496990"
              }
            }
          ]
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "match_all": {}
            },
            {
              "exists": {
                "field": "geo.coordinates"
              }
            }
          ]
        }
      }
    }
  },
  "fields": [
    "geo.coordinates",
    "text"
  ],
  "size": 50000
}'

This should return the stemmed text as a field, but the response is:

{
   "took": 29,
   "timed_out": false,
   "_shards": {
      "total": 47,
      "successful": 47,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.97402453,
      "hits": [
         {
            "_index": "1397574496990",
            "_type": "tweet",
            "_id": "456086643423068161",
            "_score": 0.97402453,
            "fields": {
               "geo.coordinates": [
                  -118.21122533,
                  33.79349318
               ],
               "text": [
                  "Happy turtle Tuesday ! The week is slowly crawling to Wednesday good morning everyone ☀️#turtles… http://t.co/wAVmcxnf76"
               ]
            }
         },
         {
            "_index": "1397574496990",
            "_type": "tweet",
            "_id": "456086701451259904",
            "_score": 0.97333175,
            "fields": {
               "geo.coordinates": [
                  -81.017636,
                  33.998741
               ],
               "text": [
                  "Tuesday is Twins Day over here, apparently (it's a far too often occurrence) #tuesdaytwinsday… http://t.co/Umhtp6SoX6"
               ]
            }
         }
      ]
   }
}

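The text field is exactly the same as the one that comes from Twitter (I am using the Streaming API). What I expected was for the text field to be stemmed, since the analyzer is applied.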

Answer 1

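Analyzers do not affect how data is stored. No matter which analyzer you use, you will get the same text back from the source and from stored fields. The analyzer is applied to the terms that go into the search index and to the query text at search time, not to the stored value. So if you search for something like text:twin and get back the records containing the word Twins, you will know the stemmer has been applied.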
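The check described above can be sketched with curl, in the same style as the query in the question. This is only a sketch under assumptions: it assumes an Elasticsearch 1.x node on localhost:9200 (the index metadata in the question reports version 1010099), and the helper names `stem_query`, `search_stemmed`, and `analyze_text` are hypothetical, introduced here for illustration. The `_analyze` call uses the 1.x form, where the analyzer is given as a URL parameter and the request body is the raw text; later versions expect a different request format.

```shell
#!/bin/sh
# Assumption: Elasticsearch 1.x is reachable here, as in the question.
ES_URL="${ES_URL:-http://localhost:9200}"

# Build a query_string search body for one term in the "text" field.
# (Hypothetical helper; the term is inserted as-is, so keep it a plain word.)
stem_query() {
  printf '{"query":{"query_string":{"query":"text:%s"}},"fields":["text"]}' "$1"
}

# Run the search. If searching for "twin" returns the tweet that contains
# "Twins", the stemmer was applied to the indexed terms.
search_stemmed() {
  curl -s -XGET "$ES_URL/_all/_search?pretty" -d "$(stem_query "$1")"
}

# Ask an index which tokens its "english" analyzer produces for a piece of text.
# $1 = index name, $2 = text to analyze (1.x style: body is the raw text).
analyze_text() {
  curl -s -XGET "$ES_URL/$1/_analyze?analyzer=english&pretty" -d "$2"
}

# Usage (not run automatically):
#   search_stemmed twin
#   analyze_text 1397574496990 'Tuesday is Twins Day over here'
```

The search shows the effect indirectly through matching, while `_analyze` lists the tokens the analyzer emits, which is the closest thing to "seeing" the stemmed text, since the stored field itself is never rewritten.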
