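I am trying to understand how Elasticsearch (1.6) calculates the fieldNorm for documents indexed with a shingle analyzer. It does not seem to include the shingled terms. If that is the case, is it possible to configure the calculation to include them? Specifically, this is the analyzer I used: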
{
"index" : {
"analysis" : {
"filter" : {
"shingle_filter" : {
"type" : "shingle",
"max_shingle_size" : 3
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["word_delimiter", "lowercase", "shingle_filter"]
}
}
}
}
}
This is the mapping used:
{
"docs": {
"properties": {
"text" : {"type": "string", "analyzer" : "my_analyzer"}
}
}
}
I posted a few documents:
{"text" : "the"}
{"text" : "the quick"}
{"text" : "the quick brown"}
{"text" : "the quick brown fox jumps"}
...
When I run the following query with the explain API,
{
"query": {
"match": {
"text" : "the"
}
}
}
I get the following fieldNorms (other details omitted for brevity):
"_source": {
"text": "the quick"
},
"_explanation": {
"value": 0.625,
"description": "fieldNorm(doc=0)"
}
"_source": {
"text": "the quick brown fox jumps over the"
},
"_explanation": {
"value": 0.375,
"description": "fieldNorm(doc=0)"
}
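The values seem to indicate that ES sees 2 terms for the first document ("the quick") and 7 terms for the second document ("the quick brown fox jumps over the"), excluding the shingles. Is it possible to configure ES to calculate the field norm with the shingled terms included (i.e. all terms returned by the analyzer)?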
Answer 0 (score: 1)
You need to customize the default similarity by disabling the discount_overlaps flag.
Example:
{
"index" : {
"similarity" : {
"no_overlap" : {
"type" : "default",
"discount_overlaps" : false
}
},
"analysis" : {
"filter" : {
"shingle_filter" : {
"type" : "shingle",
"max_shingle_size" : 3
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : ["word_delimiter", "lowercase", "shingle_filter"]
}
}
}
}
}
Mapping:
{
"docs": {
"properties": {
"text" : {"type": "string", "analyzer" : "my_analyzer", "similarity" : "no_overlap"}
}
}
}
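To expand on this further:
By default, overlaps (tokens with a position increment of 0) are ignored when the norm is computed. The example below shows the positions of the tokens generated by the "my_analyzer" analyzer described in the OP: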
get <index_name>/_analyze?field=text&text=the quick
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "the quick",
"start_offset": 0,
"end_offset": 9,
"type": "shingle",
"position": 1
},
{
"token": "quick",
"start_offset": 4,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 2
}
]
}
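To make the overlap counting concrete, here is a minimal Python sketch that takes the three tokens from the output above and counts them the way the discount setting is described. The function name `count_terms` and the tuple layout are illustrative assumptions, not part of Elasticsearch or Lucene; a token is treated as an overlap when it shares its position with the previous token (a position increment of 0).

```python
# Tokens copied from the _analyze output above as (token text, position).
tokens = [("the", 1), ("the quick", 1), ("quick", 2)]


def count_terms(tokens, discount_overlaps=True):
    """Return the number of terms that would feed the length norm.

    Hypothetical helper: a token at the same position as the previous
    token has a position increment of 0 and is counted as an overlap.
    """
    length = 0
    overlaps = 0
    prev_position = None
    for _text, position in tokens:
        length += 1
        if prev_position is not None and position == prev_position:
            overlaps += 1
        prev_position = position
    return length - overlaps if discount_overlaps else length


if __name__ == "__main__":
    print(count_terms(tokens, discount_overlaps=True))   # shingle "the quick" is skipped
    print(count_terms(tokens, discount_overlaps=False))  # every token is counted
```

With discounting enabled the shingle "the quick" is not counted, which leaves the 2 terms the OP inferred for that document; with it disabled all 3 tokens count.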
According to the Lucene documentation, the length norm computation for the default similarity is implemented as follows:
state.getBoost()*lengthNorm(numTerms)
where numTerms is
if setDiscountOverlaps(boolean) is false
FieldInvertState.getLength()
else
FieldInvertState.getLength() - FieldInvertState.getNumOverlap()
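
The fieldNorm values in the question can be reproduced from this formula. The sketch below assumes that lengthNorm is 1/sqrt(numTerms) and that the norm is stored in a single byte using Lucene's SmallFloat 3-bit-mantissa encoding, as in the Lucene 4.x DefaultSimilarity that ES 1.6 builds on. The Python function names are my own, and the bit manipulation is a port of that encoding as I understand it, not code taken from Elasticsearch.

```python
import math
import struct


def float_to_byte315(f):
    """Encode a float into one byte (port of the assumed SmallFloat scheme)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> 21
    zero = (63 - 15) << 3
    if small <= zero:
        # Underflow: zero/negative map to 0, tiny positives to the smallest code.
        return 0 if bits <= 0 else 1
    if small >= zero + 0x100:
        # Overflow: clamp to the largest code.
        return 255
    return small - zero


def byte315_to_float(b):
    """Decode the one-byte norm back into a float."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def field_norm(num_terms, boost=1.0):
    """boost * 1/sqrt(num_terms), after the lossy one-byte round trip."""
    raw = boost * (1.0 / math.sqrt(num_terms))
    return byte315_to_float(float_to_byte315(raw))


if __name__ == "__main__":
    print(field_norm(2))  # "the quick", overlaps discounted -> 0.625
    print(field_norm(7))  # seven-word document, overlaps discounted -> 0.375
    print(field_norm(3))  # "the quick" with its shingle counted -> 0.5
```

Under these assumptions, 2 and 7 terms give the 0.625 and 0.375 reported in the question, while counting the shingle for "the quick" (3 tokens) would lower its norm to 0.5. Because the one-byte encoding is coarse, nearby term counts can map to the same stored value.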