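I am using ElasticSearch to store the tweets I receive from the Twitter Streaming API. Before storing them, I would like to apply an English stemmer to the tweet content. To do that I tried using ElasticSearch analyzers, but with no luck.
This is the current template I am using: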
PUT _template/twitter
{
"template": "139*",
"settings" : {
"index":{
"analysis":{
"analyzer":{
"english":{
"type":"custom",
"tokenizer":"standard",
"filter":["lowercase", "en_stemmer", "stop_english", "asciifolding"]
}
},
"filter":{
"stop_english":{
"type":"stop",
"stopwords":["_english_"]
},
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
}
}
}
}
},
"mappings": {
"tweet": {
"_timestamp": {
"enabled": true,
"store": true,
"index": "analyzed"
},
"_index": {
"enabled": true,
"store": true,
"index": "analyzed"
},
"properties": {
"geo": {
"properties": {
"coordinates": {
"type": "geo_point"
}
}
},
"text": {
"type": "string",
"analyzer": "english"
}
}
}
}
}
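When I start streaming and the index is created, all the mappings I defined seem to be applied correctly, but the text is stored exactly as it comes from Twitter, completely raw. The index metadata shows: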
"settings" : {
"index" : {
"uuid" : "xIOkEcoySAeZORr7pJeTNg",
"analysis" : {
"filter" : {
"en_stemmer" : {
"type" : "stemmer",
"name" : "english"
},
"stop_english" : {
"type" : "stop",
"stopwords" : [
"_english_"
]
}
},
"analyzer" : {
"english" : {
"type" : "custom",
"filter" : [
"lowercase",
"en_stemmer",
"stop_english",
"asciifolding"
],
"tokenizer" : "standard"
}
}
},
"number_of_replicas" : "1",
"number_of_shards" : "5",
"version" : {
"created" : "1010099"
}
}
},
"mappings" : {
"tweet" : {
[...]
"text" : {
"analyzer" : "english",
"type" : "string"
},
[...]
}
}
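What am I doing wrong? The analyzer seems to be applied correctly, but nothing happens :/
Thanks!
PS: Here is the search query that made me realize the analyzer is not being applied: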
curl -XGET 'http://localhost:9200/_all/_search?pretty' -d '{
"query": {
"filtered": {
"query": {
"bool": {
"should": [
{
"query_string": {
"query": "_index:1397574496990"
}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"match_all": {}
},
{
"exists": {
"field": "geo.coordinates"
}
}
]
}
}
}
},
"fields": [
"geo.coordinates",
"text"
],
"size": 50000
}'
This should return the stemmed text as a field, but the response is:
{
"took": 29,
"timed_out": false,
"_shards": {
"total": 47,
"successful": 47,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.97402453,
"hits": [
{
"_index": "1397574496990",
"_type": "tweet",
"_id": "456086643423068161",
"_score": 0.97402453,
"fields": {
"geo.coordinates": [
-118.21122533,
33.79349318
],
"text": [
"Happy turtle Tuesday ! The week is slowly crawling to Wednesday good morning everyone ☀️#turtles… http://t.co/wAVmcxnf76"
]
}
},
{
"_index": "1397574496990",
"_type": "tweet",
"_id": "456086701451259904",
"_score": 0.97333175,
"fields": {
"geo.coordinates": [
-81.017636,
33.998741
],
"text": [
"Tuesday is Twins Day over here, apparently (it's a far too often occurrence) #tuesdaytwinsday… http://t.co/Umhtp6SoX6"
]
}
}
]
}
}
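The text field is exactly the same as the one that came from Twitter (I am using the Streaming API). What I expected was a stemmed text field, since the analyzer is applied.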
Answer 0 (score: 1)
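Analyzers do not affect the way data is stored. No matter which analyzer you use, you will get the same text back from the source and from stored fields. An analyzer only produces the terms that go into the index and is applied again to the query when you search. So by searching for something like text:twin and finding records that contain the word Twins, you will know that the stemmer was applied.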
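To make the distinction concrete, here is a minimal Python sketch of the idea, not of ElasticSearch itself. ToyIndex and toy_english_analyzer are hypothetical names made up for this illustration, and the "stemmer" is a crude strip-the-trailing-s rule standing in for the real English stemmer. The tweet text is a shortened version of the second hit from the question. The point is only that the stored text and the indexed terms are kept separately.

```python
import re


def toy_english_analyzer(text):
    """Toy stand-in for an analyzer chain: tokenize, lowercase, crude stemming."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Crude "stemmer": drop one trailing "s" from longer tokens.
    # This is NOT the real English stemmer, just enough to show the effect.
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


class ToyIndex:
    """Hypothetical miniature index: stored text and indexed terms are separate."""

    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.stored = {}    # doc id -> original text, returned verbatim
        self.inverted = {}  # analyzed term -> set of doc ids

    def index(self, doc_id, text):
        # The original text is kept untouched for retrieval.
        self.stored[doc_id] = text
        # Only the analyzed terms go into the inverted index.
        for term in self.analyzer(text):
            self.inverted.setdefault(term, set()).add(doc_id)

    def search(self, query):
        # The query goes through the same analyzer before term lookup.
        hits = set()
        for term in self.analyzer(query):
            hits |= self.inverted.get(term, set())
        # Results carry the stored (raw) text, never the analyzed terms.
        return [(doc_id, self.stored[doc_id]) for doc_id in sorted(hits)]


idx = ToyIndex(toy_english_analyzer)
idx.index("456086701451259904", "Tuesday is Twins Day over here, apparently")

results = idx.search("twin")
print(results)
```

Searching for twin finds the tweet because the index holds the term twin, yet the text that comes back still says Twins. On a real cluster, the _analyze API can be used to inspect the tokens an analyzer actually produces for a given piece of text.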