Question

Say I have a list of movie names with misspellings and small variations like this:

 "Pirates of the Caribbean: The Curse of the Black Pearl"
 "Pirates of the carribean"
 "Pirates of the Caribbean: Dead Man's Chest"
 "Pirates of the Caribbean trilogy"
 "Pirates of the Caribbean"
 "Pirates Of The Carribean"

How do I group or find such sets of words, preferably using Python and/or Redis?

Answer 1

Have a look at "fuzzy matching". The thread linked below mentions some great tools that calculate the similarity between strings.

I'm especially fond of the difflib module:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison

Answer 2

You might notice that similar strings have a large common substring, for example:

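"Bla bla bLa" and "Bla bla bRa" => the common substring is "Bla bla ba" (note the third word; strictly speaking this is a common subsequence, since the shared characters are not contiguous)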

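To find the common substring you can use a dynamic programming algorithm. One variation of the algorithm is the Levenshtein distance (the distance between very similar strings is small, and the distance between more different strings is larger) - http://en.wikipedia.org/wiki/Levenshtein_distance.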
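
As a minimal sketch of the dynamic programming idea (not taken from the answer; the function name `levenshtein` is just chosen for illustration), the distance can be computed row by row, keeping only the previous row of the table in memory:

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]


print(levenshtein("Caribbean", "Carribean"))  # 2
```

The comparison is case-sensitive, so lowercasing the titles first is probably a good idea for the movie list in the question.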

Also, for fast performance you can try adapting the Soundex algorithm - http://en.wikipedia.org/wiki/Soundex.

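So, after calculating the distances between all your strings, you have to cluster them. The simplest way is k-means (but it requires you to define the number of clusters). If you don't actually know the number of clusters, you have to use hierarchical clustering. Note that the number of clusters in your situation is the number of distinct movie titles + 1 (for totally misspelled strings).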
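
One way to sketch the "cluster without knowing the number of clusters" step is single-linkage grouping with a similarity cutoff: any two titles that are similar enough end up in the same group. The snippet below is only an illustration under some assumptions: it uses difflib's `SequenceMatcher.ratio()` instead of Levenshtein distance, the names `similarity` and `cluster_titles` are made up, and the threshold of 0.8 is an arbitrary starting point that would need tuning on real data.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Case-insensitive similarity in [0, 1]; 1.0 means identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_titles(titles, threshold=0.8):
    """Group titles whose pairwise similarity reaches the threshold
    (single linkage, implemented with a small union-find)."""
    parent = list(range(len(titles)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            if similarity(titles[i], titles[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, title in enumerate(titles):
        groups.setdefault(find(i), []).append(title)
    return list(groups.values())


titles = [
    "Pirates of the Caribbean",
    "The Matrix",
    "Pirates Of The Carribean",
    "Pirates of the carribean",
]
for group in cluster_titles(titles):
    print(group)
```

This compares every pair of titles, so it is quadratic in the size of the list; for a large catalogue some form of pre-bucketing (for example by a phonetic key) would likely be needed.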

Answer 3

To add another tip to Fredrik's answer, you could also take inspiration from search-engine-like code, such as this one:

import re

# BASE_DIR, remove_tags() and dofind() are defined elsewhere in the linked source.
def dosearch(terms, searchtype, case, adddir, files=None):
    found = []
    if files is not None:
        titlesrch = re.compile('<title>.*</title>')
        for file in files:
            title = ""
            # Only HTML files are searched.
            if not file.lower().endswith(("html", "htm")):
                continue
            with open(BASE_DIR + adddir + file, 'r') as f:
                filecontents = f.read()
            titletmp = titlesrch.search(filecontents)
            if titletmp is not None:
                # Skip the 7-character "<title>" and the 8-character "</title>".
                title = filecontents[titletmp.start() + 7:titletmp.end() - 8]
            filecontents = remove_tags(filecontents).strip()
            # Record the title and file name of every matching document.
            if dofind(filecontents, case, searchtype, terms) > 0:
                found.append(title)
                found.append(file)
    return found

Source and more information: http://www.zackgrossbart.com/hackito/search-engine-python/

Regards,

Max

Answer 4

I believe there are in fact two distinct problems.

The first is spelling correction. You can find one written in Python here:

http://norvig.com/spell-correct.html

The second is more functional. This is what I would do after the spelling correction: I would build a relation function.

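related(sentence1, sentence2) holds if and only if sentence1 and sentence2 have rare words in common. By rare, I mean words other than (The, what, is, etc.). You can look at a TF/IDF system to determine whether two documents are related through their words. Googling a bit, I found this: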
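
A rough sketch of such a relation function follows. It is not taken from the linked project: the stop-word list is a small illustrative assumption, and `tokens`, `related` and `idf_weights` are hypothetical names. `related` treats any shared word outside the stop-word list as "rare", while `idf_weights` shows how inverse document frequency could be used to measure rarity from the data instead of from a fixed list.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list (an assumption); a real one would be longer.
STOP_WORDS = {"the", "of", "a", "an", "and", "is", "what", "in", "to"}


def tokens(sentence):
    # Lowercased set of words; punctuation other than apostrophes is dropped.
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def related(sentence1, sentence2):
    """True if the two sentences share at least one non-stop word."""
    shared = tokens(sentence1) & tokens(sentence2)
    return bool(shared - STOP_WORDS)


def idf_weights(corpus):
    """Map each word to log(N / df), where df is the number of documents
    containing it. Words that appear everywhere get a weight of 0."""
    df = Counter(word for doc in corpus for word in tokens(doc))
    n = len(corpus)
    return {word: math.log(n / count) for word, count in df.items()}


print(related("Pirates of the Caribbean",
              "Pirates of the Caribbean: Dead Man's Chest"))  # True
print(related("The Matrix", "The Godfather"))                 # False
```

Because it matches whole words, this only works well once misspellings such as "carribean" have been corrected, which is why the spelling step comes first.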

https://code.google.com/p/tfidf/

Answer 5

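One approach would be to pre-process all the strings before you compare them: convert everything to lowercase and standardize whitespace (e.g., replace any run of whitespace with a single space). If punctuation is not important to your end goal, you can remove all punctuation characters as well.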
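
A minimal sketch of that pre-processing step, assuming ASCII punctuation as listed in `string.punctuation` (the function name `normalize` and its `strip_punctuation` flag are made up for this example):

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(title: str, strip_punctuation: bool = True) -> str:
    text = title.lower()
    if strip_punctuation:
        text = text.translate(_PUNCT_TABLE)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("  Pirates  Of The\tCarribean "))
# pirates of the carribean
```

After this step, "Pirates of the carribean" and "Pirates Of The Carribean" from the question become identical strings, so only genuine spelling differences remain for the distance measure to handle.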

Levenshtein distance is commonly used to determine the similarity of strings; this should help you group strings that differ only by small spelling errors.

What is a good strategy to group similar words?
