Elasticsearch context suggester

Date: 2015-01-14 15:41:36

Tags: elasticsearch autosuggest

Is there a way to analyze a field that is passed to the context suggester? For example, say I have this in my mapping:

mappings: {
    myitem: {
        properties: {
            title: {type: 'string'},
            content: {type: 'string'},
            user: {type: 'string', index: 'not_analyzed'},
            suggest_field: {
                type: 'completion',
                payloads: false,
                context: {
                    user: {
                        type: 'category',
                        path: 'user'
                    }
                }
            }
        }
    }
}

and I index this document:

POST /myindex/myitem/1
{
    title: "The Post Title",
    content: ...,
    user: 123,
    suggest_field: {
        input: "The Post Title",
        context: {
            user: 123
        }
    }
}

I would like the input to be analyzed first: split into separate words and run through lowercase and stop-word filters, so that the context suggester actually receives

    suggest_field: {
        input: ["post", "title"],
        context: {
            user: 123
        }
    }

I know I can pass an array to the suggest field, but I would like to avoid lowercasing the text, splitting it, and running the stop-word filter in my application before passing it to ES. If possible, I would rather have ES do this for me. I did try adding an index_analyzer to the field mapping, but that did not seem to accomplish anything.
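To illustrate, this is the kind of analysis I have in mind; it can be previewed with the _analyze API (the tokenizer and filter names here are just the standard built-ins, not anything from my index):

    GET /myindex/_analyze?tokenizer=standard&filters=lowercase,stop&text=The Post Title

which should come back with just the tokens post and title.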

Alternatively, is there another way to get autocomplete suggestions for individual words?

2 answers:

Answer 0 (score: 0)

Okay, so this is fairly involved, but I think it does more or less what you want. I'm not going to explain the whole thing, because that would take quite a bit of time, but I will say that I started with this blog post and added a stop token filter. The "title" field has sub-fields (formerly known as multi_field) that use different analyzers, or none at all. The query contains a couple of terms aggregations. Note also that the match query filters the aggregation results down to only those relevant to the text query.

Here is the index setup (take some time to look it over; if you have specific questions I will try to answer them, but I encourage you to go through the blog post first):

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "nGram",
               "min_gram": 2,
               "max_gram": 20,
               "token_chars": [
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
               ]
            },
            "stop_filter": {
               "type": "stop"
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "stop_filter",
                  "nGram_filter"
               ]
            },
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "stop_filter"
               ]
            },
            "stopword_only_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "asciifolding",
                  "stop_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "title": {
               "type": "string",
               "index_analyzer": "nGram_analyzer",
               "search_analyzer": "whitespace_analyzer",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "stopword_only": {
                     "type": "string",
                     "analyzer": "stopword_only_analyzer"
                  }
               }
            }
         }
      }
   }
}
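You can sanity-check the custom analyzers with the _analyze API before indexing anything (the sample text here is arbitrary):

    GET /test_index/_analyze?analyzer=whitespace_analyzer&text=The Lion King

This should yield only the tokens lion and king, since "The" is lowercased before the stop filter runs. By contrast, stopword_only_analyzer preserves the original case, which is also why capitalized stop words survive it.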

Then I added some documents:

PUT /test_index/_bulk
{"index": {"_index":"test_index", "_type":"doc", "_id":1}}
{"title": "The Lion King"}
{"index": {"_index":"test_index", "_type":"doc", "_id":2}}
{"title": "Beauty and the Beast"}
{"index": {"_index":"test_index", "_type":"doc", "_id":3}}
{"title": "Alladin"}
{"index": {"_index":"test_index", "_type":"doc", "_id":4}}
{"title": "The Little Mermaid"}
{"index": {"_index":"test_index", "_type":"doc", "_id":5}}
{"title": "Lady and the Tramp"}

Now I can search for documents by word prefixes (or full words, capitalized or not), and use aggregations to return both the intact titles of the matching documents and the full (non-lowercased) words, minus the stop words:

POST /test_index/_search?search_type=count
{
    "query": {
      "match": {
         "title": {
            "query": "mer king",
            "operator": "or"
         }
      }
   }, 
    "aggs": {
        "word_tokens": {
            "terms": { "field": "title.stopword_only" }
        },
        "intact_titles": {
            "terms": { "field": "title.raw" }
        }
    }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "intact_titles": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "The Lion King",
               "doc_count": 1
            },
            {
               "key": "The Little Mermaid",
               "doc_count": 1
            }
         ]
      },
      "word_tokens": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "The",
               "doc_count": 2
            },
            {
               "key": "King",
               "doc_count": 1
            },
            {
               "key": "Lion",
               "doc_count": 1
            },
            {
               "key": "Little",
               "doc_count": 1
            },
            {
               "key": "Mermaid",
               "doc_count": 1
            }
         ]
      }
   }
}

Note that "The" is still returned. This appears to be because the default _english_ stop word list contains only the lowercase "the", and stopword_only_analyzer does not lowercase tokens before the stop filter runs. I didn't immediately find a workaround.
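One possible workaround (which I have not verified against this exact setup) is to make the stop filter case-insensitive via its ignore_case parameter, so capitalized stop words are removed without lowercasing the remaining tokens:

    "stop_filter": {
       "type": "stop",
       "ignore_case": true
    }

With that in the index settings, "The" should no longer appear in the word_tokens buckets, while "Lion", "King", etc. keep their original capitalization.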

Here is the code I used:

http://sense.qbox.io/gist/2fbb8a16b2cd35370f5d5944aa9ea7381544be79

Let me know if this helps you solve your problem.

Answer 1 (score: 0)

You can set up an analyzer that does this for you.

If you follow the tutorial called you complete me, there is a section about stop words.

The way Elasticsearch works has changed since that article was written: the standard analyzer no longer removes stop words, so you need to use the stop analyzer instead.
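You can see the difference directly with the _analyze API (the sample text here is arbitrary):

curl 'localhost:9200/_analyze?analyzer=standard' -d 'The Beach Hotel'
curl 'localhost:9200/_analyze?analyzer=stop' -d 'The Beach Hotel'

The first should return the tokens the, beach, and hotel (lowercased but with the stop word kept), while the second should return only beach and hotel.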

The mapping:

curl -X DELETE localhost:9200/hotels
curl -X PUT localhost:9200/hotels -d '
{
  "mappings": {
    "hotel" : {
      "properties" : {
        "name" : { "type" : "string" },
        "city" : { "type" : "string" },
        "name_suggest" : {
          "type" :            "completion",
          "index_analyzer" :  "stop",
          "search_analyzer" : "stop",
          "preserve_position_increments": false,
          "preserve_separators": false
        }
      }
    }
  }
}'

Note the index_analyzer and search_analyzer set to stop; that is the difference from the article.
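After creating the index, index some documents with the field to complete (the hotel name here is just sample data):

curl -X PUT localhost:9200/hotels/hotel/1 -d '
{
  "name" : "Mercure Hotel Munich",
  "city" : "Munich",
  "name_suggest" : "Mercure Hotel Munich"
}'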

Getting suggestions:

curl -X POST localhost:9200/hotels/_suggest -d '
{
  "hotels" : {
    "text" : "m",
    "completion" : {
      "field" : "name_suggest"
    }
  }
}'

Hope this helps. I spent a long time looking for this answer myself.