Autocomplete in Elasticsearch with next-word suggestions

Time: 2018-05-23 09:47:47

Tags: elasticsearch autocomplete

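I want to implement autocomplete with Elasticsearch, but I cannot get it to work. I want something like the question here. I tried the suggested answer, but to no avail. I would like to get the following:

My indexed strings are, for example: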

  • “Developpeur Java”
  • “Developpeur C#”
  • “Je suis Developpeur”
  • “Je suis écrivan”
  • “Il est developpeur C++”

For the input “develop”, I want as output:

  • “Developpeur”
  • “Developpeur Java”
  • “Developpeur C#”
  • “Developpeur C++”

For the input “developpeur”, I want as output:

  • “developpeur Java”
  • “developpeur C#”
  • “developpeur C++”

For the input “suis”, I want as output:

  • “suis developpeur”
  • “suis écrivan”

I tried to achieve this with the completion suggester.

This is the Elasticsearch version I am using:

"number": "6.2.2",
"build_hash": "10b1edd",
"build_date": "2018-02-16T19:01:30.685723Z",
"build_snapshot": false,
"lucene_version": "7.2.1",
"minimum_wire_compatibility_version": "5.6.0",
"minimum_index_compatibility_version": "5.0.0"

Mapping:

{
"settings": {
    "number_of_shards": "1",
    "analysis": {
        "filter": {
            "prefix_filter": {
                "type": "edge_ngram",
                "min_gram": 1,
                "max_gram": 20
            },
            "ngram_filter": {
                "type": "nGram",
                "min_gram": "3",
                "max_gram": "3"
            },
            "synonym_filter": {
                "type": "synonym",
                "synonyms": [
                    "hackwillbereplacedatindexcreation,hackwillbereplacedatindexcreation"
                ]
            },
            "french_stop": {
                "type": "stop",
                "stopwords": "french"
            }
        },
        "analyzer": {
            "word": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "french_stop"
                ],
                "char_filter": []
            },
            "prefix": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "synonym_filter",
                    "prefix_filter"
                ],
                "char_filter": []
            },
            "ngram_with_synonyms": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "synonym_filter",
                    "ngram_filter"
                ],
                "char_filter": []
            },
            "ngram": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": [
                    "lowercase",
                    "asciifolding",
                    "ngram_filter"
                ],
                "char_filter": []
            }
        }
    }
},
"mappings": {
    "training": {
        "properties": {
            "id": {
                "type": "text",
                "index": false
            },
            "label": {
                "type": "text",
                "index_options": "docs",
                "copy_to": "full_label",
                "analyzer": "word",
                "fields": {
                    "prefix": {
                        "type": "text",
                        "index_options": "docs",
                        "analyzer": "prefix",
                        "search_analyzer": "word"
                    },
                    "ngram": {
                        "type": "text",
                        "index_options": "docs",
                        "analyzer": "ngram_with_synonyms",
                        "search_analyzer": "ngram"
                    }
                }
            },
            "labelSuggest": {
                "type": "completion",
                "analyzer": "word"
            }
        }
    }
}
}

Then, when I index my data, I do this (this is the body of the PUT call to the ES API; I am using Python):

body = {
    "label": r["title"],
    "labelSuggest": {
        "input": r["title"].ngrams()
    },
    "weight": 1.
}

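r["title"].ngrams() returns all the word n-grams of the title. For example, "development research biotech" gives: "development", "research", "biotech", "development research", "research biotech" and "development research biotech".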
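The ngrams() method itself is not shown in the question. A minimal sketch of what such a helper might look like is given here, assuming the title is split on whitespace; the name word_ngrams is hypothetical and stands in for the asker's method.

```python
def word_ngrams(title):
    """Return every contiguous word sequence in the title, shortest first."""
    words = title.split()
    grams = []
    # size is the number of words in each n-gram
    for size in range(1, len(words) + 1):
        # start is the index of the first word of the n-gram
        for start in range(len(words) - size + 1):
            grams.append(" ".join(words[start:start + size]))
    return grams


print(word_ngrams("development research biotech"))
```

With this input the sketch yields the same six strings listed in the question, in the same order.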

Then, to call the suggester, I do this:

   POST  http://localhost:9200/training/_search?pretty
{
    "suggest": {
        "labelSuggest": {
            "text": "developpeur",
            "completion": {
                "field": "labelSuggest",
                "skip_duplicates": true

            }
        }
    }
}

The result is:

{
    "text": "développement",
    "_index": "activity_20180518092449",
    "_type": "activity",
    "_id": "2031ce8b-6589-3270-afdf-7901aa21efa1",
    "_score": 1,
    "_source": {
        "id": "2031ce8b-6589-3270-afdf-7901aa21efa1",
        "name": "development research biotech",
        "labelSuggest": [
            "development",
            "research",
            "biotech",
            "development research",
            "research biotech",
            "development research biotech"
        ]
    }
}

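But I want something that gives me: "development", "development research" and "development research biotech" (assuming that document is the only one indexed).

What is wrong with my mapping/query? Is this the right approach? I hope my question is clear. I have searched a lot, in vain.

Thanks in advance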

1 answer:

Answer 0: (score: 0)

First, the ngram filter does not do what you describe.

This:

"ngram_filter": {
            "type": "nGram",
            "min_gram": "3",
            "max_gram": "3"
        },

will turn "developpeur Java" into -> dev, eve, vel, elo ... and so on.

See the documentation for the ngram token filter.
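To make the point concrete, here is a rough Python emulation of a character n-gram filter with min_gram and max_gram both set to 3. It is only a sketch of the idea, not Lucene's implementation, and the function name char_ngrams is made up for illustration.

```python
def char_ngrams(token, min_gram=3, max_gram=3):
    """Return all character n-grams of a single token, left to right."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            # keep only n-grams that fit entirely inside the token
            if start + size <= len(token):
                grams.append(token[start:start + size])
    return grams


print(char_ngrams("developpeur"))
```

Every token is a three-character slice taken from anywhere inside the word, so none of them is a whole word or a word prefix longer than three characters.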

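Second, for the result you want, I would simply use a custom analyzer with the "icu_folding" and "edge_ngram" filters and a whitespace tokenizer. For now, I would start with a min_gram of 2 and a max_gram of 20 to 25.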

From "developpeur Java" this will generate a token list like this -> de, dev, deve, devel, develo, develop, developp, developpe, developpeu, developpeur ... and so on.
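The edge n-gram expansion can be approximated in plain Python as follows. This is a simplified sketch: splitting on whitespace stands in for the whitespace tokenizer, lower() stands in for the case folding that icu_folding performs (accent folding is not emulated), and edge_ngrams is a hypothetical name.

```python
def edge_ngrams(text, min_gram=2, max_gram=20):
    """Return the prefixes of each whitespace-separated word in text."""
    tokens = []
    for word in text.lower().split():
        # prefixes from min_gram characters up to the whole word (capped at max_gram)
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:size])
    return tokens


print(edge_ngrams("developpeur Java"))
```

In this sketch a word shorter than min_gram produces no tokens at all, which is worth keeping in mind for one-character words such as "C".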

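Then you run a simple term search on that field. If this is for an autocomplete dropdown, you will get matching records back as the user types. I hope I understood your question and that this helps.

Update: try this: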

"suggester": {
    "type": "custom",
    "tokenizer": "whitespace",
    "filter": ["my_ngram_filter", "icu_folding"],
    "char_filter": []
}

where "my_ngram_filter" is:

"my_ngram_filter": {
    "type": "edge_ngram",
    "min_gram": "2",
    "max_gram": "20"
}

The mapping for that field should then look like this:

"labelSuggest": {
            "type": "text",
            "analyzer": "suggester"
        }

Then do a simple search:

  {
  "query": {
    "term": {
      "labelSuggest": "dev" 
    }
   }
  }
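
For reference, the fragments from this answer could be assembled into one index-creation body roughly as below. This is a sketch under some assumptions: the mapping type name "training" is taken from the question, the helper build_term_query is hypothetical, and the lowercasing step is an addition, included because a term query is not analyzed and so the typed text has to match the folded tokens in the index exactly. Note also that icu_folding comes from the analysis-icu plugin, which must be installed.

```python
import json

# Settings and mapping assembled from the fragments above.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "my_ngram_filter": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 20,
                }
            },
            "analyzer": {
                "suggester": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": ["my_ngram_filter", "icu_folding"],
                }
            },
        }
    },
    "mappings": {
        "training": {  # type name assumed from the question's mapping
            "properties": {
                "labelSuggest": {"type": "text", "analyzer": "suggester"}
            }
        }
    },
}


def build_term_query(typed):
    """Build the term query for the text typed so far (hypothetical helper)."""
    # Term queries are not analyzed, so lowercase the input ourselves.
    # Accents are not folded here; that is left out of this sketch.
    return {"query": {"term": {"labelSuggest": typed.lower()}}}


print(json.dumps(index_body, indent=2))
print(json.dumps(build_term_query("Dev")))
```

The first printed document would be sent when creating the index, and the second as the body of a _search request.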