Question

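I'm trying to get a names field working properly in Elasticsearch, and I'm having trouble finding guidance. Please help me, internet!

My documents have multiple authors, and therefore a multi-valued names field. Suppose I search for paul f tompkins, and there are two documents: {"authors": ["Paul Tompkins", "Dietrich Kohl"]} and {"authors": ["Paul Wang", "Darlene Tompkins"]}.

My search retrieves both documents easily enough, but both get the same score from the authors query. I want matches on multiple terms within the same item of the authors array to boost the first document's score.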

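How do I do this? The two techniques I know of for boosting proximity are shingles (which I believe would fail, because the query would generate the shingles paul_f and f_tompkins, neither of which matches) and phrase queries with slop (which would fail because the f token is not present).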
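To make the shingle problem concrete, here is a minimal sketch in plain Python. It only imitates two-word shingling on whitespace tokens; it is not the Elasticsearch shingle filter, and the `shingles` helper and the underscore separator are assumptions chosen to match the notation in the question.

```python
def shingles(text, size=2, sep="_"):
    """Return the set of word n-grams (shingles) of `size` tokens from `text`."""
    tokens = text.lower().split()
    return {sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}

query_shingles = shingles("paul f tompkins")
name_shingles = shingles("Paul Tompkins")

print(sorted(query_shingles))           # ['f_tompkins', 'paul_f']
print(sorted(name_shingles))            # ['paul_tompkins']
print(query_shingles & name_shingles)   # set()
```

The extra f token splits the query into two shingles, and neither equals paul_tompkins, so a shingle-based boost has nothing to match on.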

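Ideally, I'd like something like a phrase query with slop combined with minimum_should_match: I give it four terms, it matches if at least two of them are present in the same array element, and each additional match within the same array element boosts the score. I can't figure out how to do that.

(Client-side logic that tries to strip the f from the query won't work for me. This is a simplified example, but assume I also want to handle queries like paul francis tompkins or paul f tompkins there will be blood.)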

Answer 1

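Both documents score the same because the authors field is an array of text values. If we change how the authors are stored, we get the desired result. To do that, make authors a nested type, which gives the following mapping: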

"mappings": {
  "_doc": {
    "properties": {
      "authors": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "text",
            "fields": {
              "raw": {
                "type": "keyword"
              }
            }
          }
        }
      }
    }
  }
}

Note: the raw subfield may be useful in other situations and has nothing to do with this solution.

Now let's index the documents as follows:

Doc 1:

{
  "authors": [
    {
      "name": "Paul Tompkins"
    },
    {
      "name": "Dietrich Kohl"
    }
  ]
}

Doc 2:

{
  "authors": [
    {
      "name": "Paul Wang"
    },
    {
      "name": "Darlene Tompkins"
    }
  ]
}

Query them as follows:

{
  "explain": true,
  "query": {
    "nested": {
      "path": "authors",
      "query": {
        "query_string": {
          "query": "paul l tompkins",
          "fields": [
            "authors.name"
          ]
        }
      }
    }
  }
}

Result:

  "hits": {
    "total": 2,
    "max_score": 1.3862944,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.3862944,
        "_source": {
          "authors": [
            {
              "name": "Paul Tompkins"
            },
            {
              "name": "Dietrich Kohl"
            }
          ]
        }
      },
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
          "authors": [
            {
              "name": "Paul Wang"
            },
            {
              "name": "Darlene Tompkins"
            }
          ]
        }
      }
    ]
  }

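Note: I also used explain: true in the query. This returns an explanation of how the score was calculated. (I haven't included the explain output above because it is long. You can try it yourself.)

Looking at the scoring mechanism, we can see the difference between querying a nested field and querying an array. Broadly speaking, because nested objects are stored as separate documents, Doc 1 scores higher because its first sub-document, i.e.: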

{
  "name": "Paul Tompkins"
}

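scores higher, since the terms paul and tompkins are both in the same sub-document.

With an array, all the names belong to the same field rather than to separate sub-documents, hence the difference.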
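The difference can be sketched in plain Python. This is only an illustration under simplifying assumptions: counting matched terms stands in for real relevance scoring, whitespace splitting stands in for the analyzer, and the helper names `flat_matches` and `best_nested_matches` are hypothetical.

```python
query_terms = {"paul", "f", "tompkins"}

docs = {
    "doc1": ["Paul Tompkins", "Dietrich Kohl"],
    "doc2": ["Paul Wang", "Darlene Tompkins"],
}

def tokens(name):
    """Lowercase a name and split it into a set of whitespace tokens."""
    return set(name.lower().split())

def flat_matches(authors):
    """Array field: all names are merged into one bag of tokens."""
    merged = set().union(*(tokens(a) for a in authors))
    return len(query_terms & merged)

def best_nested_matches(authors):
    """Nested field: each name is matched on its own; keep the best one."""
    return max(len(query_terms & tokens(a)) for a in authors)

for doc_id, authors in docs.items():
    print(doc_id, flat_matches(authors), best_nested_matches(authors))
# doc1 2 2
# doc2 2 1
```

In the flattened view both documents contain two query terms, so nothing separates them. Matched per author, only doc1 has a single entry holding both paul and tompkins.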

This way we can achieve the desired result.
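The question also asked for something like minimum_should_match applied per array element. One possible variant, not part of the answer above and not tested against a cluster here, is to wrap a match query with minimum_should_match inside the nested query and to set the nested query's score_mode. The sketch below only builds the request body in Python; the function name `nested_author_query` and its defaults are assumptions for illustration.

```python
import json

def nested_author_query(text, min_terms=2, score_mode="max"):
    """Build a nested query body that asks for `min_terms` query terms in one author entry."""
    return {
        "query": {
            "nested": {
                "path": "authors",
                # score_mode controls how matching author entries combine
                # into the parent document's score.
                "score_mode": score_mode,
                "query": {
                    "match": {
                        "authors.name": {
                            "query": text,
                            "minimum_should_match": min_terms,
                        }
                    }
                },
            }
        }
    }

body = nested_author_query("paul f tompkins")
print(json.dumps(body, indent=2))
```

With min_terms set to 2, a document in which no single author entry contains two of the query terms would be excluded rather than ranked lower, so whether this fits depends on whether Doc 2 should still appear in the results.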

Searching names: increasing the relevance of proximity matches in a multi-valued field