Elasticsearch context suggester

Date: 2015-01-14 15:41:36

Tags: elasticsearch autosuggest

Is there a way to analyze a field that is passed to the context suggester? For example, say I have this in my mapping:

mappings: {
    myitem: {
        properties: {
            title: {type: 'string'},
            content: {type: 'string'},
            user: {type: 'string', index: 'not_analyzed'},
            suggest_field: {
                type: 'completion',
                payloads: false,
                context: {
                    user: {
                        type: 'category',
                        path: 'user'
                    }
                }
            }
        }
    }
}

and I index this document:

POST /myindex/myitem/1
{
    title: "The Post Title",
    content: ...,
    user: 123,
    suggest_field: {
        input: "The Post Title",
        context: {
            user: 123
        }
    }
}

I would like the input to be analyzed first: split into separate words and run through lowercase and stop-word filters, so that the context suggester actually receives

    suggest_field: {
        input: ["post", "title"],
        context: {
            user: 123
        }
    }

I know I can pass an array to the suggest field, but I would like to avoid lowercasing the text, splitting it, and running the stop-word filter in my application before passing it to ES. If possible, I would rather have ES do this for me. I did try adding an index_analyzer to the field mapping, but that did not seem to accomplish anything.
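To illustrate, this is the kind of analysis I have in mind; it can be previewed with the _analyze API (the tokenizer and filter names here are just the standard built-ins, not anything from my index):

    GET /myindex/_analyze?tokenizer=standard&filters=lowercase,stop&text=The Post Title

which should come back with just the tokens post and title.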

Alternatively, is there another way to get autocomplete suggestions for individual words?

2 answers:

Answer 0 (score: 0)

Okay, so this is fairly involved, but I think it does more or less what you want. I'm not going to explain the whole thing, because that would take quite a bit of time, but I will say that I started with this blog post and added a stop token filter. The "title" field has sub-fields (formerly known as multi_field) that use different analyzers, or none at all. The query contains a couple of terms aggregations. Note also that the match query filters the aggregation results down to only those relevant to the text query.

Here is the index setup (take some time to look it over; if you have specific questions I will try to answer them, but I encourage you to go through the blog post first):

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "nGram",
               "min_gram": 2,
               "max_gram": 20,
               "token_chars": [
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
               ]
            },
            "stop_filter": {
               "type": "stop"
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "stop_filter",
                  "nGram_filter"
               ]
            },
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "stop_filter"
               ]
            },
            "stopword_only_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "asciifolding",
                  "stop_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "title": {
               "type": "string",
               "index_analyzer": "nGram_analyzer",
               "search_analyzer": "whitespace_analyzer",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "stopword_only": {
                     "type": "string",
                     "analyzer": "stopword_only_analyzer"
                  }
               }
            }
         }
      }
   }
}
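You can sanity-check the custom analyzers with the _analyze API before indexing anything (the sample text here is arbitrary):

    GET /test_index/_analyze?analyzer=whitespace_analyzer&text=The Lion King

This should yield only the tokens lion and king, since "The" is lowercased before the stop filter runs. By contrast, stopword_only_analyzer preserves the original case, which is also why capitalized stop words survive it.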

Then I added some documents:

PUT /test_index/_bulk
{"index": {"_index":"test_index", "_type":"doc", "_id":1}}
{"title": "The Lion King"}
{"index": {"_index":"test_index", "_type":"doc", "_id":2}}
{"title": "Beauty and the Beast"}
{"index": {"_index":"test_index", "_type":"doc", "_id":3}}
{"title": "Alladin"}
{"index": {"_index":"test_index", "_type":"doc", "_id":4}}
{"title": "The Little Mermaid"}
{"index": {"_index":"test_index", "_type":"doc", "_id":5}}
{"title": "Lady and the Tramp"}

Now I can search for documents by word prefixes (or full words, capitalized or not), and use aggregations to return both the intact titles of the matching documents and the full (non-lowercased) words, minus the stop words:

POST /test_index/_search?search_type=count
{
    "query": {
      "match": {
         "title": {
            "query": "mer king",
            "operator": "or"
         }
      }
   }, 
    "aggs": {
        "word_tokens": {
            "terms": { "field": "title.stopword_only" }
        },
        "intact_titles": {
            "terms": { "field": "title.raw" }
        }
    }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "intact_titles": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "The Lion King",
               "doc_count": 1
            },
            {
               "key": "The Little Mermaid",
               "doc_count": 1
            }
         ]
      },
      "word_tokens": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "The",
               "doc_count": 2
            },
            {
               "key": "King",
               "doc_count": 1
            },
            {
               "key": "Lion",
               "doc_count": 1
            },
            {
               "key": "Little",
               "doc_count": 1
            },
            {
               "key": "Mermaid",
               "doc_count": 1
            }
         ]
      }
   }
}

Note that "The" is still returned. This appears to be because the default _english_ stop word list contains only the lowercase "the", and stopword_only_analyzer does not lowercase tokens before the stop filter runs. I didn't immediately find a workaround.
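One possible workaround (which I have not verified against this exact setup) is to make the stop filter case-insensitive via its ignore_case parameter, so capitalized stop words are removed without lowercasing the remaining tokens:

    "stop_filter": {
       "type": "stop",
       "ignore_case": true
    }

With that in the index settings, "The" should no longer appear in the word_tokens buckets, while "Lion", "King", etc. keep their original capitalization.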

Here is the code I used:

http://sense.qbox.io/gist/2fbb8a16b2cd35370f5d5944aa9ea7381544be79

Let me know if this helps you solve your problem.

Answer 1 (score: 0)

You can set up an analyzer that does this for you.

If you follow the tutorial called you complete me, there is a section about stop words.

The way Elasticsearch works has changed since that article was written: the standard analyzer no longer removes stop words, so you need to use the stop analyzer instead.
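You can see the difference directly with the _analyze API (the sample text here is arbitrary):

curl 'localhost:9200/_analyze?analyzer=standard' -d 'The Beach Hotel'
curl 'localhost:9200/_analyze?analyzer=stop' -d 'The Beach Hotel'

The first should return the tokens the, beach, and hotel (lowercased but with the stop word kept), while the second should return only beach and hotel.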

The mapping:

curl -X DELETE localhost:9200/hotels
curl -X PUT localhost:9200/hotels -d '
{
  "mappings": {
    "hotel" : {
      "properties" : {
        "name" : { "type" : "string" },
        "city" : { "type" : "string" },
        "name_suggest" : {
          "type" :            "completion",
          "index_analyzer" :  "stop",
          "search_analyzer" : "stop",
          "preserve_position_increments": false,
          "preserve_separators": false
        }
      }
    }
  }
}'

Note the index_analyzer and search_analyzer set to stop; that is the difference from the article.
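After creating the index, index some documents with the field to complete (the hotel name here is just sample data):

curl -X PUT localhost:9200/hotels/hotel/1 -d '
{
  "name" : "Mercure Hotel Munich",
  "city" : "Munich",
  "name_suggest" : "Mercure Hotel Munich"
}'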

Getting suggestions:

curl -X POST localhost:9200/hotels/_suggest -d '
{
  "hotels" : {
    "text" : "m",
    "completion" : {
      "field" : "name_suggest"
    }
  }
}'

Hope this helps. I spent a long time looking for this answer myself.