Question

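I have created an index in ElasticSearch with a field called name, where I store a person's full name: Name and Surname. I want to perform full-text search on that field, so I indexed it using an analyzer.

My problem is that when I search for: "John Rham Rham"

and the index also contains "John Rham Rham Luck", that value gets a higher score than "John Rham Rham". Is there any way to make the field that matches exactly score better than a field that has more values in the string?

Thanks in advance!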

Answer 1

I've put together a small example (assuming you are running ES 5.x, because scoring differs between versions):

DELETE test
PUT test
{
  "settings": {
    "similarity": {
      "my_bm25": {
        "type": "BM25",
        "b": 0
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "name": {
          "type": "text",
          "similarity": "my_bm25",
          "fields": {
            "length": {
              "type": "token_count",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}

POST test/test/1
{
  "name": "John Rham Rham"
}
POST test/test/2
{
  "name": "John Rham Rham Luck"
}
GET test/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "name": {
            "query": "John Rham Rham",
            "operator": "and"
          }
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "_score / doc['name.length'].getValue()"
          }
        }
      ]
    }
  }
}

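This code does the following:

Replaces the default BM25 implementation with a custom one that tweaks the b parameter (field-length normalization). You could also change the similarity to 'classic' to go back to TF/IDF, which does not use this BM25 normalization (it applies its own, different length norm).
Creates a sub-field of your name field (name.length) that counts the number of tokens in the name field.
Adjusts the score according to the number of tokens.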
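
The steps above can be sketched in plain Python to see why they favor the shorter name. This is only an illustration, not Elasticsearch's own code: the function names, the two-document corpus statistics, and the average field length of 3.5 are assumptions made up for the example, and the formula is the commonly documented BM25 form with k1 = 1.2.

```python
import math

def bm25_term_score(freq, doc_freq, doc_count, field_len, avg_field_len,
                    k1=1.2, b=0.75):
    """Score of one query term under the commonly documented BM25 formula."""
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # With b = 0 this factor is always 1, so field length no longer matters.
    length_norm = 1 - b + b * field_len / avg_field_len
    return idf * freq * (k1 + 1) / (freq + k1 * length_norm)

def name_score(name, query, b):
    """Return (raw score, raw score divided by token count) for one name."""
    tokens = name.lower().split()
    terms = query.lower().split()
    # Hypothetical corpus statistics: 2 documents, every query term in both.
    raw = sum(
        bm25_term_score(tokens.count(term), doc_freq=2, doc_count=2,
                        field_len=len(tokens), avg_field_len=3.5, b=b)
        for term in terms
    )
    return raw, raw / len(tokens)

QUERY = "John Rham Rham"

# b = 0, as in the custom "my_bm25" similarity
raw_short, final_short = name_score("John Rham Rham", QUERY, b=0)
raw_long, final_long = name_score("John Rham Rham Luck", QUERY, b=0)

# b = 0.75, for comparison: here the raw scores depend on field length
raw_short_default, _ = name_score("John Rham Rham", QUERY, b=0.75)
raw_long_default, _ = name_score("John Rham Rham Luck", QUERY, b=0.75)

print("raw, b=0:     ", raw_short, raw_long)
print("final, b=0:   ", final_short, final_long)
print("raw, b=0.75:  ", raw_short_default, raw_long_default)
```

In this sketch the raw scores are identical once b is 0, so the division by the token count (3 versus 4) is the only thing that separates the two names, and it puts the three-token name first.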

Running the search request above returns:

"hits": {
    "total": 2,
    "max_score": 0.3596026,
    "hits": [
      {
        "_index": "test",
        "_type": "test",
        "_id": "1",
        "_score": 0.3596026,
        "_source": {
          "name": "John Rham Rham"
        }
      },
      {
        "_index": "test",
        "_type": "test",
        "_id": "2",
        "_score": 0.26970196,
        "_source": {
          "name": "John Rham Rham Luck"
        }
      }
    ]
  }
}
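
As a quick sanity check on these numbers, the short Python sketch below multiplies each returned score by the token count of its name (3 and 4, which is what the standard analyzer would produce for these two strings). If the script simply divided one shared base score by the token count, both products should agree up to floating-point rounding. The variable names are just for illustration.

```python
import math

# Scores copied from the response above, paired with each name's token count.
hits = [
    ("John Rham Rham", 0.3596026, 3),
    ("John Rham Rham Luck", 0.26970196, 4),
]

# Undo the division performed by the script_score function.
base_scores = [score * token_count for _, score, token_count in hits]

# Compare with a loose tolerance, since the response shows rounded floats.
same_base = math.isclose(base_scores[0], base_scores[1], rel_tol=1e-6)

print(base_scores, same_base)
```

Both products land at about 1.0788, which is consistent with the two documents receiving the same underlying match score and differing only by the length division.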

I'm not sure this is the best way to do it, but it may point you in the right direction :)
