Question

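I'm trying to get my head around when I should be using analyzers, filters and queries. I've read through the Search in Depth articles on the elastic.co site and have a better understanding, but the examples are too naive for my use case and I'm still a bit confused.

Given documents that contain a list of ingredients, in any combination of digestive biscuits, biscuits, cheese and chocolate, I'm trying to work out the best way to analyze that data and search on it.

Here is a simple set of documents: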

[{
    "ingredients": ["cheese", "chocolate"]
}, {
    "ingredients": ["chocolate", "biscuits"]
}, {
    "ingredients": ["cheese", "biscuits"]
}, {
    "ingredients": ["chocolate", "digestive biscuits"]
}, {
    "ingredients": ["cheese", "digestive biscuits"]
}, {
    "ingredients": ["cheese", "chocolate", "biscuits"]
}, {
    "ingredients": ["cheese", "chocolate", "digestive biscuits"]
}]

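(I'm purposely not mixing biscuits and digestive biscuits here; I'll explain in a moment.)

I have a search field that lets people free-type any ingredients they like, which I currently split on spaces to get an array of terms.

I have a mapping like this: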

{
    "properties": {
        "ingredients": {
            "type": "string",
            "analyzer": "keyword"
        }
    }
}

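The problem I'm facing is that biscuits doesn't match digestive biscuits, and biscuit matches nothing.

I know I'll have to analyze the field with the snowball analyzer, but I'm not sure how to go about it.

Do I need a multi-field approach? Do I also need to combine queries with filters? The results I'd like to see are:

biscuit matches biscuits and digestive biscuits (with the latter scoring lower)
biscuits matches biscuits and digestive biscuits (with the latter scoring lower)
digestive matches digestive biscuits
digestive biscuits matches itself and biscuits (with the latter scoring lower)

Also, if any other terms are thrown in at random, how do I handle that? With a filter or a query?

I'm very confused about how to structure this correctly, both in the mapping and in the searches, so if anyone has example suggestions I'd be really grateful.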

Answer 1

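First of all, I suggest you read this: https://www.elastic.co/guide/en/elasticsearch/guide/current/stemming.html

It discusses the exact problem you are facing.

To solve it you have to use a custom analyzer, which is built from character filters, a tokenizer and token filters. An analyzer emits tokens from a blob of text.

For your specific case, here is how to create a simple custom analyzer that achieves your goal: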

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "kstem"
          ]
        }
      }
    }
  },
  "mappings": {
    "data": {
      "properties": {
        "ingredients": {
          "type": "string",
          "analyzer": "my_analyzer_custom"
        }
      }
    }
  }
}

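This analyzer splits your text with the standard tokenizer and then applies these filters:

asciifolding - folds accented characters to their ASCII equivalents (é => e)
lowercase - lowercases tokens so that search is case-insensitive
kstem - normalizes tokens to their root form (not perfect, but it does a good job). In this case it normalizes biscuits to biscuit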
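
To make the tokenizer-then-filters pipeline concrete, here is a toy Python sketch of the same three steps. It is only an illustration under loud assumptions: `fold_ascii`, `naive_stem` and `analyze` are hypothetical helper names, the tokenizer is a plain regex rather than the standard tokenizer, and `naive_stem` just strips a trailing "s", which is not the real kstem algorithm.

```python
import re
import unicodedata
from typing import List


def fold_ascii(text: str) -> str:
    """Rough stand-in for asciifolding: drop combining accents (é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token: str) -> str:
    """Toy plural stripper (assumption for illustration, NOT kstem)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text: str) -> List[str]:
    """Split into word tokens, then fold, lowercase and stem each token."""
    tokens = re.findall(r"\w+", text)
    return [naive_stem(fold_ascii(tok).lower()) for tok in tokens]


print(analyze("Digestive Biscuits"))  # ['digestive', 'biscuit']
print(analyze("biscuit"))             # ['biscuit']
```

Because both the indexed value "digestive biscuits" and the search term "biscuit" end up containing the token `biscuit`, they can now match, which the keyword analyzer in the question could not do.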

Then index your sample data:

PUT /test/data/1
{
  "ingredients": ["cheese", "chocolate"]
}
PUT /test/data/2
{
  "ingredients": ["chocolate", "biscuits"]
}
PUT /test/data/3
{
  "ingredients": ["cheese", "biscuits"]
}
PUT /test/data/4
{
  "ingredients": ["chocolate", "digestive biscuits"]
}
PUT /test/data/5
{
  "ingredients": ["cheese", "digestive biscuits"]
}
PUT /test/data/6
{
  "ingredients": ["cheese", "chocolate", "biscuits"]
}
PUT /test/data/7
{
  "ingredients": ["cheese", "chocolate", "digestive biscuits"]
}

And run this query:

GET /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.5,
      "queries": [
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits",
              "type": "phrase",
              "boost": 5
            }
          }
        },
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits",
              "operator": "and",
              "boost": 3
            }
          }
        },
        {
          "match": {
            "ingredients": {
              "query": "digestive biscuits"
            }
          }
        }
      ]
    }
  }
}

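Here I used the Dis Max Query. See the array of queries? We specify several queries there, and each matching document is scored by its best-matching subquery. From the documentation:

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.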
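
The scoring rule in that quote can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's code: `dis_max_score` is a hypothetical helper, the sub-scores below are made-up numbers, and any score normalization Lucene applies is ignored.

```python
def dis_max_score(sub_scores, tie_breaker=0.0, boost=1.0):
    """Best sub-score plus tie_breaker times the sum of the other matches.

    sub_scores holds one entry per subquery; None means "did not match".
    """
    matching = [s for s in sub_scores if s is not None]
    if not matching:
        return None  # the document matches no subquery, so it is not returned
    best = max(matching)
    others = sum(matching) - best
    return boost * (best + tie_breaker * others)


# Hypothetical sub-scores for one document: phrase, "and" match, plain match.
print(round(dis_max_score([5.0, 3.0, 1.0], tie_breaker=0.7, boost=1.5), 2))  # 11.7
# A document that only satisfies the plain match scores far lower.
print(round(dis_max_score([None, None, 1.0], tie_breaker=0.7, boost=1.5), 2))  # 1.5
```

With `tie_breaker` at 0 only the best subquery counts; at 1 the result is the plain sum of all matching subqueries.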

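So in this case I specified three queries:

A phrase match. The query must match both the terms and their positions.
A match with "operator": "and", meaning all terms must match, regardless of their order.
A simple match query, meaning that at least one of the tokens must match.

As you can see, I gave each of them a different boost value; that is how you prioritize their importance.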
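
Since the question mentions a free-text search box, here is a hedged Python sketch that builds the same request body from whatever the user typed. `build_ingredient_query` is a hypothetical helper name, and the body simply mirrors the request shown above, including the `"type": "phrase"` form used in this answer. The text is passed whole rather than pre-split on spaces, because the match query runs it through the field's analyzer.

```python
import json


def build_ingredient_query(user_input: str, field: str = "ingredients") -> dict:
    """Build the three-tier dis_max body for one free-text search string."""
    text = " ".join(user_input.split())  # collapse stray whitespace only
    if not text:
        raise ValueError("empty search text")
    return {
        "query": {
            "dis_max": {
                "tie_breaker": 0.7,
                "boost": 1.5,
                "queries": [
                    # 1. exact phrase: terms and positions must match
                    {"match": {field: {"query": text, "type": "phrase", "boost": 5}}},
                    # 2. all terms present, in any order
                    {"match": {field: {"query": text, "operator": "and", "boost": 3}}},
                    # 3. any term present
                    {"match": {field: {"query": text}}},
                ],
            }
        }
    }


print(json.dumps(build_ingredient_query("  digestive   biscuits "), indent=2))
```

The printed JSON can be sent as the body of `GET /test/_search` with any HTTP client.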

I hope this helps.

Answer 2

This is how I would approach the problem. I created the index with the following settings:

POST food_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_possessive_stemmer",
            "light_english_stemmer",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "ingredients": {
          "type": "string",
          "analyzer": "my_custom_analyzer"
        }
      }
    }
  }
}

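The lowercase filter lowercases all words, as the name suggests; this helps match Biscuit with biscuit
possessive_english removes 's from words, so that biscuit's matches biscuit
light_english stems the words. It is less aggressive and uses the kstem token filter
asciifolding handles diacritics (I don't think it is useful here, but that is up to you)

After that I inserted the documents you provided in the question. I think all you need is a simple query string query. It will satisfy all your requirements as far as document scoring is concerned.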

{
  "query": {
    "query_string": {
      "default_field": "ingredients",
      "query": "digestive biscuits"
    }
  }
}

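This gives me exactly what you asked for. Please try these settings and this query with your dataset, and let me know if you run into any problems.

I hope this helps!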

Understanding analyzers, filters and queries in Elasticsearch
