Question

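I'm trying to highlight a document based on a span_near query in Elasticsearch 2.1.1, and ES is incorrectly highlighting terms that are not actually hits, because they fall outside the span slop.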

The steps I performed are:

Create the index

curl -XPUT 'http://localhost:9200/twitter/' -d '{
    "mappings": {
        "tweet": {
            "properties": {
                "message": {
                    "type": "string", 
                    "term_vector": "with_positions_offsets", 
                    "store": true
                }
            }
        }
    }
}'

Index a document

curl -XPUT 'localhost:9200/twitter/tweet/1?refresh=true' -d '{
    "message" : "A new bonsai tree in the office. Bonsai!"
}'

Search

curl -XGET 'http://localhost:9200/twitter/tweet/_search?pretty' -d '{
    "query" : {
        "span_near" : {
            "clauses" : [
                {"span_term": {"message": "new"}}, 
                {"span_term": {"message": "bonsai"}}
            ], 
            "slop": 1, 
            "in_order": false
        }
    }, 
    "highlight": {"fields": {"message": {"type": "plain"}}}
}'

The search above returns:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.13561106,
    "hits" : [ {
      "_index" : "twitter",
      "_type" : "tweet",
      "_id" : "1",
      "_score" : 0.13561106,
      "_source":{"message" : "A new bonsai tree in the office. Bonsai!"},
      "highlight" : {
        "message" : [ "A <em>new</em> <em>bonsai</em> tree in the office. <em>Bonsai</em>!" ]
      }
    } ]
  }
}

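As you can see, it incorrectly highlights "Bonsai" at the end of the field, which is not within 1 word of "new". A few things to note:

The same set of steps produces correct highlighting against Elasticsearch 1.5.2.
There is an open bug for span_near queries with the Fast Vector Highlighter (FVH) - https://github.com/elastic/elasticsearch/issues/5496 - which is why I'm trying to use plain.
Am I missing something that is needed to highlight using a span_near query?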

Answer 1

It turns out this is a known bug in ES v2.1.1, which is fixed by this pull request:

https://github.com/elastic/elasticsearch/pull/15516

According to the labels on the PR, this bug fix will be part of v2.1.2.

Answer 2

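I went back and played with this in a test environment, and I think what's happening is that you are misunderstanding what the span_near query is doing. I'm using Sense for this, so the syntax may look a bit different, but you should be able to follow along and reproduce it.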

I first created an index with a mapping

PUT /testindex
{
   "mappings": {
      "post": {
         "properties": {
            "message": {
               "type": "string",
               "store": true,
               "analyzer": "english",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  }
               }
            }
         }
      }
   }
}

I left out your term_vector property. It made no difference in my testing, and I think it is a leftover from your attempt at vector highlighting.

I then added some data to the index

PUT /testindex/post/1
{
    "message": "Bonsai new. A new bonsai tree in the office. Bonsai!"
}

Executing your query then gave me the same results (I won't post them, since they are identical to what is listed above).

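I think the confusion comes from conflating the highlighter with the effect of span_near. The query searches for the terms new and bonsai within a slop of 1. To test this, add the following entry: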

PUT /testindex/post/2
{
    "message": "Bonsai blah blah new blah blah bonsai tree in the office Bonsai!"
}

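Running the query will not match this document, because the distance from new to bonsai is now greater than 1. Changing the slop to 5 or 6 will get you the match back.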
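The slop arithmetic in this test can be checked by hand. The following is a minimal Python sketch, not Elasticsearch's implementation: it assumes a simplified tokenizer (lowercase, split on non-alphanumerics, no stemming or stop-word removal) and treats an unordered two-term span as a match when the number of positions between the two terms is at most the slop. The helper names `tokenize`, `min_gap`, and `matches_span_near` are hypothetical.

```python
import re

def tokenize(text):
    """Simplified stand-in for an analyzer (assumption): lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def min_gap(tokens, term_a, term_b):
    """Smallest number of positions strictly between any term_a and any term_b, or None."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    gaps = [abs(a - b) - 1 for a in pos_a for b in pos_b]
    return min(gaps) if gaps else None

def matches_span_near(text, term_a, term_b, slop):
    """True if some unordered pair of the two terms is within `slop` positions."""
    gap = min_gap(tokenize(text), term_a, term_b)
    return gap is not None and gap <= slop

doc2 = "Bonsai blah blah new blah blah bonsai tree in the office Bonsai!"
for slop in (1, 2, 5):
    print(slop, matches_span_near(doc2, "new", "bonsai", slop))
# 1 False
# 2 True
# 5 True
```

Under these assumptions the closest pair in document 2 has two words between new and bonsai, so a slop of 1 fails and any slop of 2 or more, including the 5 or 6 suggested above, matches.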

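The match itself has nothing to do with highlighting. Highlighting looks at the terms independently of the span query: if a term appears in the returned document, the highlight is applied to it. Highlighting apparently went through some changes in 2.0+, since we had to do some rewrites after moving to the 2.0 engine.

From the changes I've seen, highlighting now appears to be independent of the query, as if it were applied to the response after the fact. I could be wrong about this, but it looks like it is working exactly as intended. You see Bonsai highlighted because it is one of the terms being searched for. Highlighting does not take the slop parameter or the span_near rules into account, only the fact that the two span terms exist in the result.

We read the entry you indexed as sentences, but ES strips the punctuation and treats whitespace as the delimiter. Indexing and searching what you entered produces a match because the two terms occur within a slop of 1. The result is then highlighted based on the terms searched for, not on how close they are to each other.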
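The difference between the two behaviors discussed in this thread can be illustrated with a small Python sketch. It is not the Elasticsearch highlighter: it assumes a simplified tokenizer (words are runs of letters and digits, compared in lowercase), and the helpers `span_positions` and `highlight` are hypothetical. It contrasts highlighting every occurrence of a query term with highlighting only the terms that take part in a span within the slop.

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")

def span_positions(tokens, term_a, term_b, slop):
    """Positions of term_a/term_b that belong to some unordered pair within slop."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    hit = set()
    for a in pos_a:
        for b in pos_b:
            if abs(a - b) - 1 <= slop:
                hit.update((a, b))
    return hit

def highlight(text, positions):
    """Wrap the words at the given token positions in <em> tags."""
    out, last = [], 0
    for i, m in enumerate(WORD.finditer(text)):
        if i in positions:
            out.append(text[last:m.start()])
            out.append("<em>" + m.group() + "</em>")
            last = m.end()
    out.append(text[last:])
    return "".join(out)

text = "A new bonsai tree in the office. Bonsai!"
tokens = [m.group().lower() for m in WORD.finditer(text)]

# Term-based: every occurrence of a query term, regardless of slop.
by_term = {i for i, t in enumerate(tokens) if t in ("new", "bonsai")}
# Span-aware: only occurrences that form a pair within slop 1.
by_span = span_positions(tokens, "new", "bonsai", slop=1)

print(highlight(text, by_term))
# A <em>new</em> <em>bonsai</em> tree in the office. <em>Bonsai</em>!
print(highlight(text, by_span))
# A <em>new</em> <em>bonsai</em> tree in the office. Bonsai!
```

The first line reproduces the highlight returned by 2.1.1 in the question; the second is the result the asker expected, where the trailing Bonsai is left alone because it is more than one position away from new.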

Answer 3

I ran into something very similar, compared ES 1.7 and 2.3, and wrote it up on the ES discussion board. If anyone wants to track it, it is now a GitHub issue: https://github.com/elastic/elasticsearch/issues/18035

How to highlight a span_near query in Elasticsearch?
