I need to autocomplete phrases. For example, when I search for "dementia in alz", I want to get "dementia in alzheimer".
For that I configured an Edge NGram tokenizer. I tried both edge_ngram_analyzer and standard as the analyzer in the query body. Nevertheless, I cannot get results when I try to match a phrase.
What am I doing wrong?
My query:
{
  "query": {
    "multi_match": {
      "query": "dementia in alz",
      "type": "phrase",
      "analyzer": "edge_ngram_analyzer",
      "fields": ["_all"]
    }
  }
}
My mapping:
...
"type" : {
  "_all" : {
    "analyzer" : "edge_ngram_analyzer",
    "search_analyzer" : "standard"
  },
  "properties" : {
    "field" : {
      "type" : "string",
      "analyzer" : "edge_ngram_analyzer",
      "search_analyzer" : "standard"
    },
...
"settings" : {
  ...
  "analysis" : {
    "filter" : {
      "stem_possessive_filter" : {
        "name" : "possessive_english",
        "type" : "stemmer"
      }
    },
    "analyzer" : {
      "edge_ngram_analyzer" : {
        "filter" : [ "lowercase" ],
        "tokenizer" : "edge_ngram_tokenizer"
      }
    },
    "tokenizer" : {
      "edge_ngram_tokenizer" : {
        "token_chars" : [ "letter", "digit", "whitespace" ],
        "min_gram" : "2",
        "type" : "edgeNGram",
        "max_gram" : "25"
      }
    }
  }
}
...
My documents:
{
"_score": 1.1152233,
"_type": "Diagnosis",
"_id": "AVZLfHfBE5CzEm8aJ3Xp",
"_source": {
"@timestamp": "2016-08-02T13:40:48.665Z",
"type": "Diagnosis",
"Document_ID": "Diagnosis_1400541",
"Diagnosis": "F00.0 - Dementia in Alzheimer's disease with early onset",
"@version": "1",
},
"_index": "carenotes"
},
{
"_score": 1.1152233,
"_type": "Diagnosis",
"_id": "AVZLfICrE5CzEm8aJ4Dc",
"_source": {
"@timestamp": "2016-08-02T13:40:51.240Z",
"type": "Diagnosis",
"Document_ID": "Diagnosis_1424351",
"Diagnosis": "F00.1 - Dementia in Alzheimer's disease with late onset",
"@version": "1",
},
"_index": "carenotes"
}
Analysis of the "dementia in alzheimer" phrase:
{
"tokens": [
{
"end_offset": 2,
"token": "de",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "dem",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 4,
"token": "deme",
"type": "word",
"start_offset": 0,
"position": 2
},
{
"end_offset": 5,
"token": "demen",
"type": "word",
"start_offset": 0,
"position": 3
},
{
"end_offset": 6,
"token": "dement",
"type": "word",
"start_offset": 0,
"position": 4
},
{
"end_offset": 7,
"token": "dementi",
"type": "word",
"start_offset": 0,
"position": 5
},
{
"end_offset": 8,
"token": "dementia",
"type": "word",
"start_offset": 0,
"position": 6
},
{
"end_offset": 9,
"token": "dementia ",
"type": "word",
"start_offset": 0,
"position": 7
},
{
"end_offset": 10,
"token": "dementia i",
"type": "word",
"start_offset": 0,
"position": 8
},
{
"end_offset": 11,
"token": "dementia in",
"type": "word",
"start_offset": 0,
"position": 9
},
{
"end_offset": 12,
"token": "dementia in ",
"type": "word",
"start_offset": 0,
"position": 10
},
{
"end_offset": 13,
"token": "dementia in a",
"type": "word",
"start_offset": 0,
"position": 11
},
{
"end_offset": 14,
"token": "dementia in al",
"type": "word",
"start_offset": 0,
"position": 12
},
{
"end_offset": 15,
"token": "dementia in alz",
"type": "word",
"start_offset": 0,
"position": 13
},
{
"end_offset": 16,
"token": "dementia in alzh",
"type": "word",
"start_offset": 0,
"position": 14
},
{
"end_offset": 17,
"token": "dementia in alzhe",
"type": "word",
"start_offset": 0,
"position": 15
},
{
"end_offset": 18,
"token": "dementia in alzhei",
"type": "word",
"start_offset": 0,
"position": 16
},
{
"end_offset": 19,
"token": "dementia in alzheim",
"type": "word",
"start_offset": 0,
"position": 17
},
{
"end_offset": 20,
"token": "dementia in alzheime",
"type": "word",
"start_offset": 0,
"position": 18
},
{
"end_offset": 21,
"token": "dementia in alzheimer",
"type": "word",
"start_offset": 0,
"position": 19
}
]
}
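For reference, the token listing above can be reproduced with the _analyze API. The request below is only a sketch: it assumes the index is named carenotes (as in the search hits above), that it runs on localhost:9200, and that the Elasticsearch 2.x-style _analyze request body is available.

# Sketch: reproduce the token listing above (assumptions: index "carenotes",
# host localhost:9200, Elasticsearch 2.x _analyze body format).
curl -XPOST 'localhost:9200/carenotes/_analyze?pretty' -d '
{
  "analyzer": "edge_ngram_analyzer",
  "text": "dementia in alzheimer"
}'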
Answer 0 (score: 14)
Big thanks to rendel, who helped me find the right solution!
The solution of Andrei Stefan is not optimal.
Why? First, the absence of a lowercase filter in the search analyzer makes search inconvenient; the case has to match strictly. A custom analyzer with a lowercase filter is needed instead of "analyzer": "keyword".
Second, the analysis part is wrong!
At index time, the string "F00.0 - Dementia in Alzheimer's disease with early onset" is analyzed by the edge_ngram_analyzer. With this analyzer we get the following array of dictionaries for the analyzed string:
{
"tokens": [
{
"end_offset": 2,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 3,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 1
},
{
"end_offset": 6,
"token": "0 ",
"type": "word",
"start_offset": 4,
"position": 2
},
{
"end_offset": 9,
"token": " ",
"type": "word",
"start_offset": 7,
"position": 3
},
{
"end_offset": 10,
"token": " d",
"type": "word",
"start_offset": 7,
"position": 4
},
{
"end_offset": 11,
"token": " de",
"type": "word",
"start_offset": 7,
"position": 5
},
{
"end_offset": 12,
"token": " dem",
"type": "word",
"start_offset": 7,
"position": 6
},
{
"end_offset": 13,
"token": " deme",
"type": "word",
"start_offset": 7,
"position": 7
},
{
"end_offset": 14,
"token": " demen",
"type": "word",
"start_offset": 7,
"position": 8
},
{
"end_offset": 15,
"token": " dement",
"type": "word",
"start_offset": 7,
"position": 9
},
{
"end_offset": 16,
"token": " dementi",
"type": "word",
"start_offset": 7,
"position": 10
},
{
"end_offset": 17,
"token": " dementia",
"type": "word",
"start_offset": 7,
"position": 11
},
{
"end_offset": 18,
"token": " dementia ",
"type": "word",
"start_offset": 7,
"position": 12
},
{
"end_offset": 19,
"token": " dementia i",
"type": "word",
"start_offset": 7,
"position": 13
},
{
"end_offset": 20,
"token": " dementia in",
"type": "word",
"start_offset": 7,
"position": 14
},
{
"end_offset": 21,
"token": " dementia in ",
"type": "word",
"start_offset": 7,
"position": 15
},
{
"end_offset": 22,
"token": " dementia in a",
"type": "word",
"start_offset": 7,
"position": 16
},
{
"end_offset": 23,
"token": " dementia in al",
"type": "word",
"start_offset": 7,
"position": 17
},
{
"end_offset": 24,
"token": " dementia in alz",
"type": "word",
"start_offset": 7,
"position": 18
},
{
"end_offset": 25,
"token": " dementia in alzh",
"type": "word",
"start_offset": 7,
"position": 19
},
{
"end_offset": 26,
"token": " dementia in alzhe",
"type": "word",
"start_offset": 7,
"position": 20
},
{
"end_offset": 27,
"token": " dementia in alzhei",
"type": "word",
"start_offset": 7,
"position": 21
},
{
"end_offset": 28,
"token": " dementia in alzheim",
"type": "word",
"start_offset": 7,
"position": 22
},
{
"end_offset": 29,
"token": " dementia in alzheime",
"type": "word",
"start_offset": 7,
"position": 23
},
{
"end_offset": 30,
"token": " dementia in alzheimer",
"type": "word",
"start_offset": 7,
"position": 24
},
{
"end_offset": 33,
"token": "s ",
"type": "word",
"start_offset": 31,
"position": 25
},
{
"end_offset": 34,
"token": "s d",
"type": "word",
"start_offset": 31,
"position": 26
},
{
"end_offset": 35,
"token": "s di",
"type": "word",
"start_offset": 31,
"position": 27
},
{
"end_offset": 36,
"token": "s dis",
"type": "word",
"start_offset": 31,
"position": 28
},
{
"end_offset": 37,
"token": "s dise",
"type": "word",
"start_offset": 31,
"position": 29
},
{
"end_offset": 38,
"token": "s disea",
"type": "word",
"start_offset": 31,
"position": 30
},
{
"end_offset": 39,
"token": "s diseas",
"type": "word",
"start_offset": 31,
"position": 31
},
{
"end_offset": 40,
"token": "s disease",
"type": "word",
"start_offset": 31,
"position": 32
},
{
"end_offset": 41,
"token": "s disease ",
"type": "word",
"start_offset": 31,
"position": 33
},
{
"end_offset": 42,
"token": "s disease w",
"type": "word",
"start_offset": 31,
"position": 34
},
{
"end_offset": 43,
"token": "s disease wi",
"type": "word",
"start_offset": 31,
"position": 35
},
{
"end_offset": 44,
"token": "s disease wit",
"type": "word",
"start_offset": 31,
"position": 36
},
{
"end_offset": 45,
"token": "s disease with",
"type": "word",
"start_offset": 31,
"position": 37
},
{
"end_offset": 46,
"token": "s disease with ",
"type": "word",
"start_offset": 31,
"position": 38
},
{
"end_offset": 47,
"token": "s disease with e",
"type": "word",
"start_offset": 31,
"position": 39
},
{
"end_offset": 48,
"token": "s disease with ea",
"type": "word",
"start_offset": 31,
"position": 40
},
{
"end_offset": 49,
"token": "s disease with ear",
"type": "word",
"start_offset": 31,
"position": 41
},
{
"end_offset": 50,
"token": "s disease with earl",
"type": "word",
"start_offset": 31,
"position": 42
},
{
"end_offset": 51,
"token": "s disease with early",
"type": "word",
"start_offset": 31,
"position": 43
},
{
"end_offset": 52,
"token": "s disease with early ",
"type": "word",
"start_offset": 31,
"position": 44
},
{
"end_offset": 53,
"token": "s disease with early o",
"type": "word",
"start_offset": 31,
"position": 45
},
{
"end_offset": 54,
"token": "s disease with early on",
"type": "word",
"start_offset": 31,
"position": 46
},
{
"end_offset": 55,
"token": "s disease with early ons",
"type": "word",
"start_offset": 31,
"position": 47
},
{
"end_offset": 56,
"token": "s disease with early onse",
"type": "word",
"start_offset": 31,
"position": 48
}
]
}
As you can see, the whole string is tokenized with a token size from 2 to 25 characters. The string is tokenized in a linear way, together with all the whitespace, and the position is incremented by one for every new token.
There are several problems with it:
- The edge_ngram_analyzer produces useless tokens that will never be searched for, for example: "0 ", " ", " d", "s d", "s disease w" and so on.
- Because of "max_gram": "25" we "lose" some of the text in every field longer than 25 characters. That text can no longer be searched, because there are no tokens for it.
- The trim filter only papers over the problem of extra whitespace, which the tokenizer could handle instead.
- The edge_ngram_analyzer increments the position of every token, which is problematic for positional queries such as phrase queries. One should use the edge_ngram_filter instead, which preserves the position of the token when generating the ngrams.
The mapping settings to use:
...
"mappings": {
  "Type": {
    "_all": {
      "analyzer": "edge_ngram_analyzer",
      "search_analyzer": "keyword_analyzer"
    },
    "properties": {
      "Field": {
        "type": "string",
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "keyword_analyzer"
      },
...
...
"settings": {
  "analysis": {
    "filter": {
      "english_poss_stemmer": {
        "type": "stemmer",
        "name": "possessive_english"
      },
      "edge_ngram": {
        "type": "edgeNGram",
        "min_gram": "2",
        "max_gram": "25",
        "token_chars": ["letter", "digit"]
      }
    },
    "analyzer": {
      "edge_ngram_analyzer": {
        "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"],
        "tokenizer": "standard"
      },
      "keyword_analyzer": {
        "filter": ["lowercase", "english_poss_stemmer"],
        "tokenizer": "standard"
      }
    }
  }
}
...
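For clarity, here is a minimal sketch of how the settings and mappings above fit into one index-creation request. The index name carenotes, the type Diagnosis and the single Diagnosis field are assumptions taken from the documents shown in the question; the parts elided with "..." above would be added in the same places.

# Sketch only: create an index with the analysis settings and mapping above.
# Assumptions: index "carenotes", type "Diagnosis", one string field "Diagnosis",
# Elasticsearch 2.x on localhost:9200.
curl -XPUT 'localhost:9200/carenotes' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "english_poss_stemmer": {
          "type": "stemmer",
          "name": "possessive_english"
        },
        "edge_ngram": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25",
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"]
        },
        "keyword_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "english_poss_stemmer"]
        }
      }
    }
  },
  "mappings": {
    "Diagnosis": {
      "_all": {
        "analyzer": "edge_ngram_analyzer",
        "search_analyzer": "keyword_analyzer"
      },
      "properties": {
        "Diagnosis": {
          "type": "string",
          "analyzer": "edge_ngram_analyzer",
          "search_analyzer": "keyword_analyzer"
        }
      }
    }
  }
}'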
Take a look at the analysis:
{
"tokens": [
{
"end_offset": 5,
"token": "f0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 5,
"token": "f00.0",
"type": "word",
"start_offset": 0,
"position": 0
},
{
"end_offset": 17,
"token": "de",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dem",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "deme",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "demen",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dement",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementi",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 17,
"token": "dementia",
"type": "word",
"start_offset": 9,
"position": 2
},
{
"end_offset": 20,
"token": "in",
"type": "word",
"start_offset": 18,
"position": 3
},
{
"end_offset": 32,
"token": "al",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alz",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzh",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhe",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzhei",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheim",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheime",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 32,
"token": "alzheimer",
"type": "word",
"start_offset": 21,
"position": 4
},
{
"end_offset": 40,
"token": "di",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dis",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "dise",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disea",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "diseas",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 40,
"token": "disease",
"type": "word",
"start_offset": 33,
"position": 5
},
{
"end_offset": 45,
"token": "wi",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "wit",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 45,
"token": "with",
"type": "word",
"start_offset": 41,
"position": 6
},
{
"end_offset": 51,
"token": "ea",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "ear",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "earl",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 51,
"token": "early",
"type": "word",
"start_offset": 46,
"position": 7
},
{
"end_offset": 57,
"token": "on",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "ons",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onse",
"type": "word",
"start_offset": 52,
"position": 8
},
{
"end_offset": 57,
"token": "onset",
"type": "word",
"start_offset": 52,
"position": 8
}
]
}
At index time the text is tokenized by the standard tokenizer, and then the separate words are filtered by the lowercase, possessive_english and edge_ngram filters. Tokens are produced only for words.
At search time the text is tokenized by the standard tokenizer, and the separate words are filtered by the lowercase and possessive_english filters. The searched words are matched against the tokens that were created at index time.
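To see why this works, run the search string through the keyword_analyzer and compare the word tokens with the index-time ngrams listed above. A sketch, again assuming the carenotes index and a 2.x-style _analyze body:

# Sketch: how the query text is analyzed at search time
# (assumptions: index "carenotes" as above, Elasticsearch 2.x).
curl -XPOST 'localhost:9200/carenotes/_analyze?pretty' -d '
{
  "analyzer": "keyword_analyzer",
  "text": "dem in alzh"
}'
# Expected tokens: "dem", "in", "alzh" at positions 0, 1, 2. Each one matches an
# indexed edge-ngram token, and the matched tokens sit at consecutive positions
# (2, 3, 4 in the listing above), so the phrase query can succeed.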
Thus we make the incremental search possible!
Now, since we build ngrams on the separate words, we can even execute queries like the following:
{
  "query": {
    "multi_match": {
      "query": "dem in alzh",
      "type": "phrase",
      "fields": ["_all"]
    }
  }
}
and get correct results.
No text is "lost", everything is searchable, and there is no need to handle whitespace with a trim filter anymore.
Answer 1 (score: 8)
I think your query is wrong: while you need nGrams at indexing time, you don't need them at search time. At search time you need the text to be as "fixed" as possible. Try this query instead:
{
  "query": {
    "multi_match": {
      "query": "  dementia in alz",
      "analyzer": "keyword",
      "fields": ["_all"]
    }
  }
}
You'll notice the two whitespaces before dementia. Those are accounted for from the text by your analyzer. To get rid of them you need the trim token_filter:
"edge_ngram_analyzer": {
  "filter": [
    "lowercase", "trim"
  ],
  "tokenizer": "edge_ngram_tokenizer"
}
Then this query will work (without the spaces before dementia):
{
  "query": {
    "multi_match": {
      "query": "dementia in alz",
      "analyzer": "keyword",
      "fields": ["_all"]
    }
  }
}