elasticsearch 1.6 field norm calculation with shingle filter

Time: 2015-06-29 03:20:44

Tags: elasticsearch

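I'm trying to understand the fieldnorm calculation in elasticsearch (1.6) for documents indexed with a shingle analyzer. It does not seem to include the shingle terms. If so, is it possible to configure the calculation to include the shingled terms? Specifically, this is the analyzer I used: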

{
  "index" : {
    "analysis" : {
        "filter" : {
            "shingle_filter" : {
                "type" : "shingle",
                "max_shingle_size" : 3
            }
        },
        "analyzer" : {
            "my_analyzer" : {
                "type" : "custom",
                "tokenizer" : "standard",
                "filter" : ["word_delimiter", "lowercase", "shingle_filter"]
            }
        }  
    }
 }

}

This is the mapping used:

{
    "docs": {
        "properties": {
            "text" : {"type": "string", "analyzer" : "my_analyzer"}
        }
    }
}

I posted a few documents:

{"text" : "the"}
{"text" : "the quick"}
{"text" : "the quick brown"}
{"text" : "the quick brown fox jumps"}
...

When using the following query with the explain API,

{
    "query": {
        "match": {
            "text" : "the"
        }
    }
}

I get the following fieldnorms (other details omitted for brevity):

"_source": {
    "text": "the quick"
},
"_explanation": {
    "value": 0.625,
    "description": "fieldNorm(doc=0)"
}

"_source": {
    "text": "the quick brown fox jumps over the"
},
"_explanation": {
    "value": 0.375,
    "description": "fieldNorm(doc=0)"
}

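The values seem to indicate that ES sees 2 terms for the first document ("the quick") and 7 terms for the second document ("the quick brown fox jumps over the"), excluding the shingles. Is it possible to configure ES to calculate the field norm with the shingled terms (i.e. all terms returned by the analyzer)?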

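1 Answer:

Answer 0: (score: 1)

You need to customize the default similarity by disabling the discounting of overlap tokens (`discount_overlaps` set to `false`).

Example: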

{
  "index" : {
    "similarity" : {
        "no_overlap" : {
            "type" : "default",
            "discount_overlaps" : false
        }
    },
    "analysis" : {
        "filter" : {
            "shingle_filter" : {
                "type" : "shingle",
                "max_shingle_size" : 3
            }
        },
        "analyzer" : {
            "my_analyzer" : {
                "type" : "custom",
                "tokenizer" : "standard",
                "filter" : ["word_delimiter", "lowercase", "shingle_filter"]
            }
        }  
    }
 }
}

Mapping:

{
    "docs": {
        "properties": {
            "text" : {"type": "string", "analyzer" : "my_analyzer", "similarity" : "no_overlap"}
        }
    }
}
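
For reference, the settings and mapping above can be combined into one create-index request body. The sketch below only assembles and prints that body; it makes no HTTP call, and the `settings` / `mappings` wrapper keys follow the usual create-index layout rather than anything shown in the original post. The function name `build_index_body` is made up for illustration.

```python
import json

# Custom similarity from the answer: the "default" similarity with
# overlap discounting switched off.
similarity = {"no_overlap": {"type": "default", "discount_overlaps": False}}

# Analyzer definition copied from the question.
analysis = {
    "filter": {"shingle_filter": {"type": "shingle", "max_shingle_size": 3}},
    "analyzer": {
        "my_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["word_delimiter", "lowercase", "shingle_filter"],
        }
    },
}

# Mapping from the answer: the field points at the custom similarity by name.
mapping = {
    "docs": {
        "properties": {
            "text": {
                "type": "string",
                "analyzer": "my_analyzer",
                "similarity": "no_overlap",
            }
        }
    }
}


def build_index_body():
    """Assemble the request body (assumed layout: settings + mappings)."""
    return {
        "settings": {"index": {"similarity": similarity, "analysis": analysis}},
        "mappings": mapping,
    }


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
```

Norms are written at index time, so documents indexed before the similarity was changed need to be reindexed before explain shows different values.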

To expand further:

By default, overlaps, i.e. tokens with a position increment of 0, are ignored when computing the norm.

The example below shows the positions of the tokens generated by "my_analyzer" as described in the OP:

get <index_name>/_analyze?field=text&text=the quick

{
   "tokens": [
      {
         "token": "the",
         "start_offset": 0,
         "end_offset": 3,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "the quick",
         "start_offset": 0,
         "end_offset": 9,
         "type": "shingle",
         "position": 1
      },
      {
         "token": "quick",
         "start_offset": 4,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 2
      }
   ]
}
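
To make the overlap explicit, the position increments can be derived from the `position` values in the output above: a token that shares its position with the previous token has an increment of 0. A small sketch (the helper name `count_overlaps` is made up for illustration, and it only looks at the reported positions, not at Lucene's internal state):

```python
def count_overlaps(positions):
    """Return (length, overlaps) for a token stream given absolute positions.

    A token whose position equals the previous token's position has a
    position increment of 0 and is counted as an overlap.
    """
    overlaps = 0
    previous = None
    for pos in positions:
        if previous is not None and pos == previous:
            overlaps += 1
        previous = pos
    return len(positions), overlaps


# Positions of "the", "the quick", "quick" from the _analyze output above.
length, overlaps = count_overlaps([1, 1, 2])
print(length, overlaps)   # 3 1
print(length - overlaps)  # 2
```

With discounting enabled, the length that feeds the norm is 2 for "the quick", which matches what the question observed. With discounting disabled it is 3.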

According to the Lucene documentation, the length norm calculation for the default similarity is implemented as follows:

state.getBoost()*lengthNorm(numTerms)

where numTerms is:

if setDiscountOverlaps(boolean) is false
  FieldInvertState.getLength() 
else 
   FieldInvertState.getLength() - FieldInvertState.getNumOverlap()
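
The effect on the values reported by explain can be sketched in a few lines. This assumes a boost of 1, that `lengthNorm(numTerms)` is `1/sqrt(numTerms)`, and that the norm is stored in a single byte using a lossy float encoding. That is my reading of Lucene 4.x's default similarity and its `SmallFloat.floatToByte315` helper, so check the Lucene source for your version. The Python function names are illustrative only.

```python
import math
import struct


def float_to_byte315(f):
    """Lossy single-byte float encoding (assumed to mirror Lucene's SmallFloat)."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]
    small = bits >> (24 - 3)
    if small <= ((63 - 15) << 3):
        return 0 if bits <= 0 else 1   # underflow: zero or smallest value
    if small >= ((63 - 15) << 3) + 0x100:
        return 255                     # overflow: largest value
    return small - ((63 - 15) << 3)


def byte315_to_float(b):
    """Decode a byte produced by float_to_byte315 back into a float."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def field_norm(length, num_overlap, discount_overlaps=True, boost=1.0):
    """Norm as it would be read back after the lossy byte round trip."""
    num_terms = length - num_overlap if discount_overlaps else length
    raw = boost * (1.0 / math.sqrt(num_terms))
    return byte315_to_float(float_to_byte315(raw))


# "the quick": 3 tokens from the analyzer, 1 of them an overlap (the shingle).
print(field_norm(3, 1, discount_overlaps=True))   # 0.625
print(field_norm(3, 1, discount_overlaps=False))  # 0.5
```

The first value matches the 0.625 shown in the question. Because of the byte encoding, nearby field lengths can map to the same stored norm, so a change in numTerms does not always change the reported fieldNorm.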