Question

我正在我的网络应用程序中实现搜索建议功能，并且一直在研究现有的技术实现。

似乎大多数主要网站（亚马逊，必应等）都通过以下方式实施模糊搜索：

Tokenize search string in to terms
processingSearchStringSet = {}
For each term
    if exact term is NOT in index
        Get possible terms (fuzzyTerms) from levenshtein(term, 1 (or 2))
        For each term in fuzzyTerms
            if term is in index
                processingSearchStringSet.intersect(stringsIndexedByTermsSet)
    else
        processingSearchStringSet.intersect(stringsIndexedByTermsSet)

然后，结果集成员可能按度量（例如：术语顺序保留，绝对术语位置，搜索流行度）进行排名，并根据此排名和预先确定的结果集大小进行保留或删除，然后再将其传递回用户

另一方面，谷歌的实施与此有很大不同。

具体而言，它允许搜索字符串的组成条款中出现1个以上的错误。错误阈值似乎取决于字符串中感兴趣的术语的位置，但它永远不会超过7。

有趣的是：

在整个范围内进行Levenstein搜索，阈值为5 术语空间，对于用户字符串中的每个术语都是疯狂的昂贵
即使＃1是完成的，它仍然不能解释缺席错误的建议

N-gram也没有看到被使用：修改一个术语，使其不包含原始术语中存在的二元组，似乎不会影响结果。

以下是一个说明我的发现的例子：

Example term: "Fiftyyyy shades of grey"

Amazon suggestions: none 
(if the error count exceeds 1 on any term, the search fails)

Bing suggestions: none
(if the error count exceeds 2 on any term, the search fails)

Google suggestions: 10 (max) 
(breaking the search would require 5 or more errors on any single term, 
or multiple errors on multiple terms)

我的问题是：什么类型的巫术在这里工作？他们只是使用具有巨大误差限额的Levenshtein搜索，还是使用我不知道的另一种技术？

Answer 1

也许你应该尝试这种方法：http://norvig.com/spell-correct.html

谷歌模糊搜索（a.k.a“建议”）：正在使用哪些技术？

1 个答案: