Question

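I want to build an application in which a match requires that every token of the document is contained in the query at least once!

Note that this is the reverse of the usual expectation: the documents are quite small, while the queries can be very long. For example:

Document: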

"elastic super cool".

A valid matching query would be

"I like elastic things since elasticsearch is super cool"

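I managed to get the number of matching tokens from Elasticsearch (see also https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/ttJTE52hXf8). So in the example above, 3 matches (= the document length) would mean that the query matches.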
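To make the matching rule concrete, here is a minimal plain-Java sketch of the check without synonyms. It does not use Elasticsearch at all; the class name `ReverseMatch`, the method names, and the lowercase whitespace tokenizer are assumptions made for illustration, and a real analyzer may tokenize differently.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ReverseMatch {
    // Lowercase whitespace tokenizer; a stand-in for a real analyzer.
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    // True when every distinct document token also occurs in the query.
    static boolean queryCoversDocument(String document, String query) {
        return tokens(query).containsAll(tokens(document));
    }

    public static void main(String[] args) {
        String query = "I like elastic things since elasticsearch is super cool";
        System.out.println(queryCoversDocument("elastic super cool", query)); // true
        System.out.println(queryCoversDocument("elastic super fast", query)); // false
    }
}
```

This compares distinct tokens, so it only mirrors the "matches = document length" rule for documents without repeated tokens.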

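But how can I combine this with synonyms?

Suppose the synonyms of "cool" are "nice", "good" and "great". Using a synonym token filter, I managed to add the synonyms at each position in the document.

As a result, each of the following four documents has 3 token matches for the query above: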

"elastic super nice"

"elastic nice cool"

"nice good great"

"good great cool"

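But only the first one is a valid match!

How can I avoid each synonym match being counted as a separate match, even though they all stand for the same token in the document?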
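The overcounting can be reproduced outside Elasticsearch with a small plain-Java sketch. It assumes that "cool", "nice", "good" and "great" form one synonym group that is expanded at index time, and it models a document as one set of terms per position. The class name `SynonymOvercount` and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SynonymOvercount {
    // Assumed synonym group, expanded at every position where a member occurs.
    static final Set<String> COOL_GROUP = Set.of("cool", "nice", "good", "great");

    // Index-time expansion: one set of terms per document position.
    static List<Set<String>> expand(String document) {
        List<Set<String>> positions = new ArrayList<>();
        for (String token : document.toLowerCase().trim().split("\\s+")) {
            positions.add(COOL_GROUP.contains(token) ? COOL_GROUP : Set.of(token));
        }
        return positions;
    }

    // Naive count: every position that holds any query term counts as a match.
    static int naiveMatchCount(String document, Set<String> queryTerms) {
        int matches = 0;
        for (Set<String> termsAtPosition : expand(document)) {
            if (!Collections.disjoint(termsAtPosition, queryTerms)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> query = new HashSet<>(Arrays.asList(
                "i like elastic things since elasticsearch is super cool".split(" ")));
        for (String doc : List.of("elastic super nice", "elastic nice cool",
                "nice good great", "good great cool")) {
            System.out.println(doc + " -> " + naiveMatchCount(doc, query));
        }
    }
}
```

All four documents come out with a count of 3, because the single query token "cool" is allowed to cover every position that carries one of its synonyms.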

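Any ideas on how to solve this?

I read that the percolator might solve this problem, but I am still not sure whether the percolator would handle synonyms the way I want...

Ideas?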

Answer 1

I assume that you expand the synonyms. You can use a script to count the matched positions.

Elasticsearch Google Group with a solution by Vineeth Mohan

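I adapted his script into a native script that returns a number between 0 and 1, representing the ratio of matched positions in the field. I tweaked it a bit so that each query term matches only one position.

You need a field that contains the number of positions, for example a token_count field, which actually counts the number of positions.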

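// Needs java.util.HashSet, java.util.Iterator and java.util.Set.
@Override
public Object run()
{
    // field, positionsField and terms are presumably populated from the script params.
    IndexField indexField = this.indexLookup().get(field);
    // Total number of positions in the field, read from the token_count sub-field.
    long numberOfPositions = ((ScriptDocValues.Longs) doc().get(positionsField)).getValue();
    if (numberOfPositions == 0)
    {
        // nothing to match against; also avoids dividing by zero
        return 0.0;
    }

    // positions already claimed by an earlier query term
    Set<Integer> positions = new HashSet<Integer>();
    for (String term : terms)
    {
        Iterator<TermPosition> termPos = indexField.get(term, IndexLookup.FLAG_POSITIONS | IndexLookup.FLAG_CACHE)
                .iterator();
        while (termPos.hasNext())
        {
            // add() returns false for an already claimed position; try the next occurrence
            if (positions.add(termPos.next().position))
            {
                // if the term matches multiple positions, only a new position should count
                break;
            }
        }
    }

    return positions.size() * 1.0 / numberOfPositions;
}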
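The counting logic of the script can be tried without an Elasticsearch node. The following plain-Java sketch imitates it on an in-memory stand-in for the positional index; it assumes "cool", "nice", "good" and "great" are expanded as one synonym group at index time, and the class name `MatchedPositionsRatio` and its methods are hypothetical, not part of any Elasticsearch API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MatchedPositionsRatio {
    // Assumed synonym group, expanded at every position where a member occurs.
    static final Set<String> COOL_GROUP = Set.of("cool", "nice", "good", "great");

    // In-memory stand-in for the positional index: one set of terms per position.
    static List<Set<String>> expand(String document) {
        List<Set<String>> positions = new ArrayList<>();
        for (String token : document.toLowerCase().trim().split("\\s+")) {
            positions.add(COOL_GROUP.contains(token) ? COOL_GROUP : Set.of(token));
        }
        return positions;
    }

    // Ratio of document positions claimed by query terms, one position per term at most.
    static double ratio(String document, List<String> queryTerms) {
        List<Set<String>> positions = expand(document);
        Set<Integer> claimed = new HashSet<>();
        for (String term : queryTerms) {
            for (int pos = 0; pos < positions.size(); pos++) {
                // Skip positions claimed earlier; stop at the first new one.
                if (positions.get(pos).contains(term) && claimed.add(pos)) {
                    break;
                }
            }
        }
        return claimed.size() * 1.0 / positions.size();
    }

    public static void main(String[] args) {
        List<String> query = List.of("i", "like", "elastic", "things", "since",
                "elasticsearch", "is", "super", "cool");
        for (String doc : List.of("elastic super nice", "elastic nice cool",
                "nice good great", "good great cool")) {
            System.out.println(doc + " -> " + ratio(doc, query));
        }
    }
}
```

Under these assumptions only "elastic super nice" reaches a ratio of 1.0. "elastic nice cool" gets 2/3 and the other two documents get 1/3, because the single query term "cool" can claim only one position.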

You can use it as a function_score script in a query.

{
    "function_score": {
        "query": {
            "match": {
                "message": "I like elastic things since elasticsearch is super cool"
            }
        },
        "script_score": {
            "params": {
                "terms": [
                    "I",
                    "like",
                    "elastic",
                    "things",
                    "since",
                    "elasticsearch",
                    "is",
                    "super",
                    "cool"
                ],
                "field": "message",
                "positions_field": "message.pos_count"
            },
            "lang": "native",
            "script": "matched_positions_ratio"
        },
        "boost_mode": "replace"
    }
}

Then you can set "min_score" to 1 to get only the documents in which all positions of the given field are matched.

I hope this solution is what you need.

Elasticsearch - Check if a document is contained in a query, with synonyms
