Question

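I have a long (> 1000 items) list of words, from which I would like to remove words that are "too similar" to other words, until the remaining words are all "significantly different". For example, so that no two words are within an edit distance D of each other.

I do not need a unique solution, and it does not have to be optimal, but it should be reasonably fast (in Python) and should not discard too many entries.

How can I achieve this? Thanks.

Edit: to be clear, I can google for a Python routine that measures edit distance. The problem is how to do this efficiently and, perhaps, in some way that finds a "natural" value of D. Maybe by building some kind of trie from all the words and then pruning it?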

Answer 1

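You can use a BK-tree, and before each item is added, check that it is not within distance D of any other item (thanks to @DietrichEpp in the comments for this idea).

You can use this recipe for a BK-tree (though any similar recipe is easily modified). Just make two changes. Change the line: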

def __init__(self, items, distance, usegc=False):

to

def __init__(self, items, distance, threshold=0, usegc=False):

and change the line

        if el not in self.nodes: # do not add duplicates

to

        if (el not in self.nodes and
            (threshold is None or len(self.find(el, threshold)) == 0)):

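This ensures that an item is only added if it is neither a duplicate nor within the threshold distance of an item already in the tree. Then, the code to remove near-duplicates from a list is simply: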

from Levenshtein import distance
from bktree import BKtree

def remove_duplicates(lst, threshold):
    """Keep only words with no already-kept word within `threshold` edits."""
    tr = BKtree(iter(lst), distance, threshold)
    return list(tr.nodes.keys())

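Note that this relies on the python-Levenshtein package for its distance function, which is much faster than the one provided by the BK-tree recipe. python-Levenshtein has C-compiled components, but it is worth the installation.

Finally, I set up a performance test with an increasing number of words (sampled randomly from /usr/share/dict/words) and plotted the time each run took: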

import random
import time
from Levenshtein import distance
from bktree import BKtree

with open("/usr/share/dict/words") as inf:
    # Strip only the trailing newline, so a final line without one stays intact.
    word_list = [line.rstrip("\n") for line in inf]

def remove_duplicates(lst, threshold):
    tr = BKtree(iter(lst), distance, threshold)
    return list(tr.nodes.keys())

def time_remove_duplicates(n, threshold):
    """Test using n words"""
    nwords = random.sample(word_list, n)
    t = time.time()
    newlst = remove_duplicates(nwords, threshold)
    return len(newlst), time.time() - t

ns = range(1000, 16000, 2000)
results = [time_remove_duplicates(n, 3) for n in ns]
lengths, timings = zip(*results)

from matplotlib import pyplot as plt

plt.plot(ns, timings)
plt.xlabel("Number of strings")
plt.ylabel("Time (s)")
plt.savefig("number_vs_time.pdf")

[Figure: running time in seconds plotted against the number of strings]

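Without confirming it mathematically, I don't think it is quadratic. I think it may actually be n log n, which would make sense if inserting into a BK-tree is a logarithmic-time operation. Most notably, it runs quite quickly with fewer than 5000 strings, which is hopefully what the OP is after (and it is still reasonable at 15000, which a traditional for-loop solution would not be).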
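One rough way to check the growth rate without doing the maths is to fit a straight line to log(time) against log(n): a slope near 2 suggests quadratic behaviour, while a slope only slightly above 1 is consistent with n log n. The sketch below is only an illustration; `growth_exponent` is a hypothetical helper name, not part of the recipe or of any library, and with noisy timings the estimate is only indicative.

```python
import math

def growth_exponent(sizes, timings):
    """Least-squares slope of log(time) against log(size).

    Hypothetical helper: a slope near 2 points to quadratic growth,
    a slope a little above 1 is what n log n looks like on this scale.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in timings]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope: covariance(x, y) / variance(x).
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator
```

With the variables from the test above, this would be called as `growth_exponent(list(ns), timings)`.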

Answer 2

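Tries will not be helpful, and neither will hash maps. They are simply not useful for spatial, high-dimensional problems like this one.

But the real problem here is that the requirement "efficient" is poorly specified. How fast is "efficient"?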

import Levenshtein

def simple(corpus, distance):
    words = []
    while corpus:
        center = corpus[0]
        words.append(center)
        corpus = [word for word in corpus
                  if Levenshtein.distance(center, word) >= distance]
    return words

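I selected 10,000 words uniformly at random from the "American English" dictionary on my hard disk and looked for a set with a minimum distance of 5, which yielded about 2,000 entries.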

real    0m2.558s
user    0m2.404s
sys     0m0.012s

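So the question is, "How efficient is efficient enough?" Since you did not specify your requirements, it is hard for me to know whether this algorithm works for you.

The rabbit hole

If you want something faster, here is how I would do it.

Create a VP tree, BK tree, or other suitable spatial index. For each word in the corpus, insert the word into the tree only if it is at least the required minimum distance from every word already in the index. Spatial indexes are designed specifically to support this kind of query.

At the end, you will have a tree whose nodes are all at least the desired minimum distance apart.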
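The steps above can be sketched with a small BK-tree. This is a minimal illustration under stated assumptions rather than a tuned implementation: `edit_distance`, `BKTree`, `has_neighbor` and `filter_distant` are names made up for this example, and the pure-Python distance function is there only to keep the sketch self-contained (in practice `Levenshtein.distance` would be much faster).

```python
def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance (pure Python, slow)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


class BKTree:
    """Minimal BK-tree; each node is a (word, {distance: child}) pair."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def has_neighbor(self, word, radius):
        """Return True if some stored word is within `radius` of `word`."""
        if self.root is None:
            return False
        stack = [self.root]
        while stack:
            node_word, children = stack.pop()
            d = self.distance(word, node_word)
            if d <= radius:
                return True
            # Triangle inequality: a match can only sit in a subtree whose
            # edge label lies in [d - radius, d + radius].
            for label, child in children.items():
                if d - radius <= label <= d + radius:
                    stack.append(child)
        return False

    def add(self, word):
        """Insert `word`, following edges labelled with its distance."""
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            node_word, children = node
            d = self.distance(word, node_word)
            if d == 0:
                return  # exact duplicate, already stored
            if d in children:
                node = children[d]
            else:
                children[d] = (word, {})
                return


def filter_distant(words, min_distance, distance=edit_distance):
    """Greedily keep words that are >= min_distance from every kept word."""
    tree = BKTree(distance)
    kept = []
    for word in words:
        if not tree.has_neighbor(word, min_distance - 1):
            tree.add(word)
            kept.append(word)
    return kept
```

For example, `filter_distant(["book", "books", "cake", "boo", "cape", "cart", "rocket"], 2)` keeps "book", "cake", "cart" and "rocket". Like the simple loop earlier in this answer, the result depends on the order of the input, so it is one valid answer rather than the largest possible set.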

Answer 3

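Your trie idea is definitely an interesting one. This page has a nice setup for fast edit-distance calculations, and it would certainly be efficient if you needed to scale your word list to millions of entries rather than a thousand, which is fairly small in the corpus linguistics business.

Good luck, this sounds like a fun problem to work on!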

Generating a list of different (distant, by edit distance) words by filtering
