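I am building an address matching engine for UK addresses in Elasticsearch and have found shingles very useful, but I am having some problems with punctuation. A query for "4 Walmley Close" is returning the following matches:
The true match is number 3, but 1 and 2 both match (incorrectly) because they both produce '4 walmley' when turned into shingles. I would like to tell the shingle analyzer not to generate shingles that span commas. So, for example, for 1) I currently get:
...when all I actually want is...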
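My current settings are below. I have tried switching the tokenizer from standard to whitespace, which helps in that it retains the commas and could avoid the situation above (i.e. I end up with '4, walmley' as my shingle in addresses 1 and 2), but then I end up with a lot of unusable shingles in my index, and with 70 million documents I need to keep the index size down.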
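The difference between the two tokenizers can be sketched in a few lines of Python. This is only a rough approximation: the regular expression standing in for the standard tokenizer and the helper names `standard_like`, `whitespace` and `shingles` are assumptions made for illustration, and the sample address is illustrative rather than one of the three results above.

```python
import re

def shingles(tokens, size=2):
    """Join every run of `size` adjacent tokens, like a shingle filter with no unigrams."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

def standard_like(text):
    # Rough stand-in for the standard tokenizer: keep letter/digit runs, drop punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

def whitespace(text):
    # Rough stand-in for the whitespace tokenizer: split on spaces, keep punctuation.
    return text.lower().split()

address = "Units 3 and 4, Walmley Chambers, 3 Walmley Cl"  # illustrative address

standard_shingles = shingles(standard_like(address))
whitespace_shingles = shingles(whitespace(address))

print(standard_shingles)    # the comma is gone, so a shingle spans it
print(whitespace_shingles)  # the comma survives inside the tokens
```

With the comma stripped, '4 walmley' appears as a shingle even though the two words sit on opposite sides of a comma. With whitespace tokenization the same pair becomes '4, walmley', which no longer matches the query shingle but still takes up space in the index.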
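As you can see in the index settings, I also have a street_sym filter that I would like to use in my shingles. For this example, in addition to generating 'walmley close' I would like 'walmley cl'; however, when I tried to include it, the results I got were not very helpful!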
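Any advice from more experienced Elasticsearch users would be greatly appreciated. I have read Gormley and Tong's excellent book, but I cannot get my head around this particular problem.
Thanks in advance for any help offered.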
"analysis": {
  "filter": {
    "shingle": {
      "type": "shingle",
      "output_unigrams": false
    },
    "street_sym": {
      "type": "synonym",
      "synonyms": [
        "st => street",
        "rd => road",
        "ave => avenue",
        "ct => court",
        "ln => lane",
        "terr => terrace",
        "cir => circle",
        "hwy => highway",
        "pkwy => parkway",
        "cl => close",
        "blvd => boulevard",
        "dr => drive",
        "ste => suite",
        "wy => way",
        "tr => trail"
      ]
    }
  },
  "analyzer": {
    "shingle": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [
        "lowercase",
        "shingle"
      ]
    }
  }
}
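Answer 0 (score: 0)
See my comment on your question for why this solution still won't prevent "4 Walmley Close" from matching all three of the matches you provided. However, it is at least possible to get the tokens you want. I'm not sure it is the most elegant or performant solution, but using Pattern Replace, Pattern Capture and Length filters on your shingles seems to do the trick: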
"analysis": {
  "filter": {
    "shingle": {
      "type": "shingle",
      "output_unigrams": false
    },
    "street_sym": {
      "type": "synonym",
      "synonyms": [
        "st => street",
        "rd => road",
        "ave => avenue",
        "ct => court",
        "ln => lane",
        "terr => terrace",
        "cir => circle",
        "hwy => highway",
        "pkwy => parkway",
        "cl => close",
        "blvd => boulevard",
        "dr => drive",
        "ste => suite",
        "wy => way",
        "tr => trail"
      ]
    },
    "no_middle_comma": {
      "type": "pattern_replace",
      "pattern": ".+,.+",
      "replacement": ""
    },
    "no_trailing_comma": {
      "type": "pattern_capture",
      "preserve_original": false,
      "patterns": [
        "(.*),"
      ]
    },
    "not_empty": {
      "type": "length",
      "min": 1
    }
  },
  "analyzer": {
    "test": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filter": [
        "lowercase",
        "street_sym",
        "shingle",
        "no_middle_comma",
        "no_trailing_comma",
        "not_empty"
      ]
    }
  }
}
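no_middle_comma: replaces any token that has a comma in the middle with an empty token
no_trailing_comma: replaces any token that ends with a comma with the part before the comma
not_empty: removes the empty tokens produced above
For example, "Units 3 and 4, Walmley Chambers, 3 Walmley Cl" becomes: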
{
  "tokens": [
    {
      "token": "units 3",
      "start_offset": 0,
      "end_offset": 7,
      "type": "shingle",
      "position": 0
    },
    {
      "token": "3 and",
      "start_offset": 6,
      "end_offset": 11,
      "type": "shingle",
      "position": 1
    },
    {
      "token": "and 4",
      "start_offset": 8,
      "end_offset": 14,
      "type": "shingle",
      "position": 2
    },
    {
      "token": "walmley chambers",
      "start_offset": 15,
      "end_offset": 32,
      "type": "shingle",
      "position": 4
    },
    {
      "token": "3 walmley",
      "start_offset": 33,
      "end_offset": 42,
      "type": "shingle",
      "position": 6
    },
    {
      "token": "walmley close",
      "start_offset": 35,
      "end_offset": 45,
      "type": "shingle",
      "position": 7
    }
  ]
}
Note that your synonym filter works: "Walmley Cl" became "walmley close".
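To see why the filter chain yields exactly those tokens, here is a minimal Python sketch that mimics each stage of the "test" analyzer. It is an approximation rather than Elasticsearch itself: the function name `analyze` is hypothetical, only a few of the street_sym rules are copied in, and Python's `re` module stands in for the Java regular expressions the real filters use.

```python
import re

# A few of the street_sym rules from the settings (abbreviation -> full word).
STREET_SYNONYMS = {"st": "street", "rd": "road", "ave": "avenue", "cl": "close"}

def strip_trailing_comma(token):
    # no_trailing_comma: keep only the captured part before the last comma;
    # tokens that do not match the pattern pass through unchanged.
    match = re.search(r"(.*),", token)
    return match.group(1) if match else token

def analyze(text):
    # whitespace tokenizer + lowercase filter
    tokens = text.lower().split()
    # street_sym: replace abbreviations with their full form
    tokens = [STREET_SYNONYMS.get(t, t) for t in tokens]
    # shingle: two-word shingles only, no unigrams
    shingles = [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
    # no_middle_comma: blank out any shingle with a comma between two words
    shingles = [re.sub(r".+,.+", "", s) for s in shingles]
    # no_trailing_comma: drop a comma left at the end of a shingle
    shingles = [strip_trailing_comma(s) for s in shingles]
    # not_empty: discard the blanked-out shingles (length filter with min 1)
    return [s for s in shingles if len(s) >= 1]

result = analyze("Units 3 and 4, Walmley Chambers, 3 Walmley Cl")
print(result)
```

The two shingles that straddle a comma ('4, walmley' and 'chambers, 3') are blanked and then dropped, which is why positions 3 and 5 are missing from the token list above.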