Is there a way to analyze the field that gets passed to the context suggester? For example, if I have this in my mapping:
mappings: {
  myitem: {
    title: {type: 'string'},
    content: {type: 'string'},
    user: {type: 'string', index: 'not_analyzed'},
    suggest_field: {
      type: 'completion',
      payloads: false,
      context: {
        user: {
          type: 'category',
          path: 'user'
        }
      }
    }
  }
}
and I index this document:
POST /myindex/myitem/1
{
  title: "The Post Title",
  content: ...,
  user: 123,
  suggest_field: {
    input: "The Post Title",
    context: {
      user: 123
    }
  }
}
I would like the input to be analyzed first, split into individual words and run through lowercase and stop word filters, so that the context suggester actually receives
suggest_field: {
  input: ["post", "title"],
  context: {
    user: 123
  }
}
I know I can pass an array to the suggest field, but I would like to avoid lowercasing the text, splitting it, and running the stop word filter in my application before passing it to ES. If possible, I would rather have ES do this for me. I did try adding an index_analyzer to the field mapping, but that did not seem to accomplish anything.
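For reference, the tokenization described above corresponds to a standard tokenizer followed by lowercase and stop token filters, which can be checked with the `_analyze` API. This snippet is my illustration, not part of the original question, and uses the ES 1.x query-string syntax:

```shell
# Ask ES to analyze the raw title with the desired filter chain
# (standard tokenizer, then lowercase and stop filters).
curl -s 'localhost:9200/_analyze?tokenizer=standard&filters=lowercase,stop' \
  -d 'The Post Title'
# The returned tokens should be "post" and "title";
# "the" is removed as a stopword.
```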
Alternatively, is there another way to get word-level autocomplete suggestions?
Answer 0 (score: 0)
Okay, this is fairly involved, but I think it will do more or less what you want. I'm not going to explain the whole thing, because that would take quite a bit of time. However, I will say that I started with this blog post and added a stop token filter. The "title" field has sub-fields (previously known as multi_field) that use different analyzers, or none at all. The query contains a couple of terms aggregations. Also note that the match query filters the aggregation results, so only results relevant to the text query get returned.
Here is the index setup (take some time to look it over; if you have specific questions I will try to answer them, but I encourage you to go through the blog post first):
DELETE /test_index

PUT /test_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "nGram_filter": {
          "type": "nGram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter",
            "digit",
            "punctuation",
            "symbol"
          ]
        },
        "stop_filter": {
          "type": "stop"
        }
      },
      "analyzer": {
        "nGram_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "stop_filter",
            "nGram_filter"
          ]
        },
        "whitespace_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "stop_filter"
          ]
        },
        "stopword_only_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "asciifolding",
            "stop_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "string",
          "index_analyzer": "nGram_analyzer",
          "search_analyzer": "whitespace_analyzer",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "stopword_only": {
              "type": "string",
              "analyzer": "stopword_only_analyzer"
            }
          }
        }
      }
    }
  }
}
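As a quick sanity check (my addition, not part of the original answer), the `_analyze` endpoint can show what each custom analyzer emits; the syntax below assumes ES 1.x:

```shell
# The whitespace_analyzer lowercases and then removes stopwords,
# so "The Little Mermaid" should come back as "little", "mermaid".
curl -s 'localhost:9200/test_index/_analyze?analyzer=whitespace_analyzer' \
  -d 'The Little Mermaid'

# The stopword_only_analyzer skips the lowercase step; because the
# default stopword list is lowercase and the stop filter is
# case-sensitive by default, the capitalized "The" survives here.
curl -s 'localhost:9200/test_index/_analyze?analyzer=stopword_only_analyzer' \
  -d 'The Little Mermaid'
```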
Then I added some documents:
PUT /test_index/_bulk
{"index": {"_index":"test_index", "_type":"doc", "_id":1}}
{"title": "The Lion King"}
{"index": {"_index":"test_index", "_type":"doc", "_id":2}}
{"title": "Beauty and the Beast"}
{"index": {"_index":"test_index", "_type":"doc", "_id":3}}
{"title": "Alladin"}
{"index": {"_index":"test_index", "_type":"doc", "_id":4}}
{"title": "The Little Mermaid"}
{"index": {"_index":"test_index", "_type":"doc", "_id":5}}
{"title": "Lady and the Tramp"}
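One detail worth knowing when reproducing this (my note, not from the original answer): bulk-indexed documents only become visible to search after a refresh, which happens automatically roughly once per second by default. When querying immediately after indexing, a refresh can be forced:

```shell
# Force a refresh so the just-indexed documents are searchable right away.
curl -s -XPOST 'localhost:9200/test_index/_refresh'
```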
Now I can search for documents by word prefix (or full words, capitalized or not) as required, and use aggregations to return both the intact titles of the matching documents and the intact (non-lowercased) words, minus stopwords:
POST /test_index/_search?search_type=count
{
  "query": {
    "match": {
      "title": {
        "query": "mer king",
        "operator": "or"
      }
    }
  },
  "aggs": {
    "word_tokens": {
      "terms": { "field": "title.stopword_only" }
    },
    "intact_titles": {
      "terms": { "field": "title.raw" }
    }
  }
}
...
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "intact_titles": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "The Lion King",
          "doc_count": 1
        },
        {
          "key": "The Little Mermaid",
          "doc_count": 1
        }
      ]
    },
    "word_tokens": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "The",
          "doc_count": 2
        },
        {
          "key": "King",
          "doc_count": 1
        },
        {
          "key": "Lion",
          "doc_count": 1
        },
        {
          "key": "Little",
          "doc_count": 1
        },
        {
          "key": "Mermaid",
          "doc_count": 1
        }
      ]
    }
  }
}
Notice that "The" gets returned. This seems to be because the default _english_ stopword list only contains the lowercase "the". I didn't immediately find a way around this.
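One possible workaround (my suggestion, not from the original answer, and untested here): the stop token filter accepts an ignore_case parameter, so defining the custom stop filter like this should make it drop capitalized stopwords too, even without a lowercase filter in the chain:

```json
"stop_filter": {
  "type": "stop",
  "ignore_case": true
}
```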
Here is the code I used:
http://sense.qbox.io/gist/2fbb8a16b2cd35370f5d5944aa9ea7381544be79
Let me know if this helps you solve your problem.
Answer 1 (score: 0)
You can set up an analyzer that does this for you.
If you follow the tutorial called you complete me, there is a section about stopwords.
The way elasticsearch works has changed since that article was written: the standard analyzer no longer does stopword removal, so you need to use the stop analyzer instead.
curl -X DELETE localhost:9200/hotels

curl -X PUT localhost:9200/hotels -d '
{
  "mappings": {
    "hotel": {
      "properties": {
        "name": { "type": "string" },
        "city": { "type": "string" },
        "name_suggest": {
          "type": "completion",
          "index_analyzer": "stop",   // NOTE HERE THE DIFFERENCE
          "search_analyzer": "stop",  // FROM THE ARTICLE!!
          "preserve_position_increments": false,
          "preserve_separators": false
        }
      }
    }
  }
}'
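The answer does not index any documents, so the suggest request will return nothing against a fresh index. A minimal indexing step for testing (the hotel name and city here are made up by me, not taken from the answer):

```shell
# Index one sample hotel so the suggester has something to match.
# With the "stop" index analyzer, "Motel One" is indexed as the
# tokens "motel" and "one", so a suggest text of "m" can match it.
curl -X PUT localhost:9200/hotels/hotel/1 -d '
{
  "name": "Motel One",
  "city": "Munich",
  "name_suggest": { "input": "Motel One" }
}'
```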
curl -X POST localhost:9200/hotels/_suggest -d '
{
  "hotels": {
    "text": "m",
    "completion": {
      "field": "name_suggest"
    }
  }
}'
Hope this helps. It took me a long time to find this answer myself.