Need multiple matches of a term within a text in ElasticSearch

Asked: 2014-05-29 14:14:25

Tags: filter elasticsearch

I am trying to create a filter for ElasticSearch that requires multiple matches before a result is returned. For example, take the following text:

  

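If you're uneasy at the idea of riding in a vehicle that drives itself, just wait till you see Google's new car. It has no gas pedal, no brake and no steering wheel. Google has been demonstrating its driverless technology for several years by retrofitting Toyotas, Lexuses and other cars with cameras and sensors. But now, for the first time, the company has unveiled a prototype of its own: a cute little car that looks like a cross between a VW Beetle and a golf cart.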

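If I set the minimum number of matches to 2 and search for Google, I would expect this result, because Google appears twice in the text. However, searching for Toyota with the same minimum should not return this article.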

How do I build this filter?

1 Answer:

Answer 0: (score: 1)

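This may not be exactly what you are looking for, but you can add explain to the query and then filter on the client side by the number of times the term matched. From the documentation, the query looks like this: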

GET /_search?explain 
{
   "query"   : { "match" : { "tweet" : "honeymoon" }}
}

The result looks like this:

"_explanation": { 
   "description": "weight(tweet:honeymoon in 0) [PerFieldSimilarity], result of:",
   "value":       0.076713204,
   "details": [
      {
         "description": "fieldWeight in 0, product of:",
         "value":       0.076713204,
         "details": [
            {  
               "description": "tf(freq=1.0), with freq of:",
               "value":       1,
               "details": [
                  {
                     "description": "termFreq=1.0",
                     "value":       1
                  }
               ]
            },
            { 
               "description": "idf(docFreq=1, maxDocs=1)",
               "value":       0.30685282
            },
            { 
               "description": "fieldNorm(doc=0)",
               "value":        0.25
            }
         ]
      }
   ]
}

You can then filter on the term frequency in the explanation and keep only the hits where its value is greater than 1.
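The client-side step could be sketched roughly as below. This is only an illustration, not something taken from the documentation: the helper names term_freq and filter_hits and the sample hits are made up, and it assumes the explanation marks the frequency node with a description starting with "termFreq=", as in the output above (the exact wording may differ between versions). For queries with several terms it simply takes the largest termFreq it finds.

```python
def term_freq(explanation):
    """Return the largest termFreq value found anywhere in an _explanation tree."""
    best = 0.0
    # Assumption: the frequency node is labelled "termFreq=..." as in the sample output.
    if explanation.get("description", "").startswith("termFreq="):
        best = float(explanation.get("value", 0))
    # Explanations nest through "details", so recurse into every child node.
    for child in explanation.get("details", []):
        best = max(best, term_freq(child))
    return best


def filter_hits(hits, min_matches):
    """Keep only the hits whose explanation reports termFreq >= min_matches."""
    return [h for h in hits if term_freq(h.get("_explanation", {})) >= min_matches]


# Hypothetical, heavily trimmed hits; a real response carries more fields.
hits = [
    {"_id": "1", "_explanation": {"description": "weight(...)", "details": [
        {"description": "tf(freq=2.0), with freq of:", "details": [
            {"description": "termFreq=2.0", "value": 2}]}]}},
    {"_id": "2", "_explanation": {"description": "weight(...)", "details": [
        {"description": "tf(freq=1.0), with freq of:", "details": [
            {"description": "termFreq=1.0", "value": 1}]}]}},
]

print([h["_id"] for h in filter_hits(hits, 2)])  # prints ['1']
```

Note that for a minimum of 2 matches the comparison is termFreq >= 2, which is the same as the "greater than 1" check above.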

I believe you can also do this directly with a script (no client-side filtering), since scripts can refer to the term frequency:

Term statistics:

Term statistics for a field can be accessed with a subscript operator like this: _index['FIELD']['TERM']. This will never return null, even if the term or field does not exist. If you do not need the term frequency, call _index['FIELD'].get('TERM', 0) to avoid unnecessary initialization of the frequencies. The flag only has an effect if you set index_options to docs (see mapping documentation).

_index['FIELD']['TERM'].df()
    df of term TERM in field FIELD. Will be returned, even if the term is not present in the current document. 
_index['FIELD']['TERM'].ttf()
    The sum of term frequencies of term TERM in field FIELD over all documents. Will be returned, even if the term is not present in the current document. 
_index['FIELD']['TERM'].tf()
    tf of term TERM in field FIELD. Will be 0 if the term is not present in the current document. 

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-scripting.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-advanced-scripting.html

However, I have not done this myself, and server-side scripts come with the usual security and performance concerns.
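For what it is worth, an untested sketch of what the script approach might look like follows. It only builds the request body and sends nothing. The helper name min_tf_filter and the field name "text" are made up, and the request shape (a filtered query wrapping a script filter with params) is my assumption based on the query DSL of that era, so check it against the version you run. The term has to be given in its indexed form, for example lowercased if the field uses the standard analyzer, and scripting has to be enabled on the server.

```python
import json


def min_tf_filter(field, term, min_tf):
    """Build a search body that keeps documents where `term` occurs
    at least `min_tf` times in `field` (hypothetical helper, untested)."""
    # _index['FIELD']['TERM'].tf() is the per-document term frequency
    # described in the quoted documentation.
    # Note: field and term are pasted into the script unescaped, so they
    # must not contain single quotes.
    script = "_index['%s']['%s'].tf() >= min_tf" % (field, term)
    return {
        "query": {
            "filtered": {  # assumed request shape, not verified against a server
                "query": {"match_all": {}},
                "filter": {
                    "script": {
                        "script": script,
                        "params": {"min_tf": min_tf},
                    }
                },
            }
        }
    }


# "text" is a placeholder field name; "google" is the lowercased indexed term.
body = min_tf_filter("text", "google", 2)
print(json.dumps(body, indent=2))
```

With a body like this, the Google example should match (tf of 2) while the same request for toyota should not, assuming the term is indexed as written.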