Clustering words with Levenshtein distance

Date: 2016-05-24 06:30:54

Tags: python loops text cluster-analysis levenshtein-distance

I want to use Levenshtein distance to split my list of words into clusters.

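import pandas as pd
from collections import defaultdict
from Levenshtein import distance  # assumed source of distance(); the post does not show its import

data = pd.read_csv("data.csv")
Target_Column = data["words"]
Target = Target_Column.tolist()
clusters = defaultdict(list)
threshold = 5
numb = range(len(Target))

# Compare every pair of words once; record each word as a neighbor of the other
for i in numb:
    for j in range(i+1, len(numb)):
        if distance(Target[i],Target[j]) <= threshold:
            clusters[i].append(Target[j])
            clusters[j].append(Target[i])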

But when I run this loop over the list, some clusters are repeated. Please help me solve this.

1 answer:

Answer 0: (score: 0)

If you only have strings, why not use a set?

Target = set(Target_Column.tolist())

You can also make a set the default value of the mapping:

clusters = defaultdict(set)

This does require changing list.append to set.add inside the loop.
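Putting those two suggestions together, the loop from the question might look like the sketch below. The word list and the length-based `distance` are stand-ins chosen for illustration and are not the question's data; any Levenshtein function with the same two-argument signature could be swapped in. Unlike the question's code, this sketch keys the mapping by word rather than by index.

```python
from collections import defaultdict

# Stand-in for a real Levenshtein function (assumption for illustration only).
def distance(a, b):
    return abs(len(a) - len(b))

# Hypothetical sample data with a deliberate duplicate ("abc").
target = ["abc", "def", "abcd", "abc", "abcdefghijk"]
threshold = 3

words = sorted(set(target))      # drop duplicate words, keep a stable order
clusters = defaultdict(set)      # word -> set of nearby words

for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        if distance(w1, w2) <= threshold:
            clusters[w1].add(w2)  # set.add instead of list.append
            clusters[w2].add(w1)

for word in sorted(clusters):
    print(word, ":", ", ".join(sorted(clusters[word])))
```

Because both the input and the values are sets, a word that appears twice in the data can no longer show up twice in a cluster. Words with no neighbor within the threshold never become keys here.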

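That said, there are more Pythonic alternatives to your code.

I would probably build the mapping from each word to its set of connected words on the fly.

The following example assumes that words is a set of all the words: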

clusters = {w1: set(w2 for w2 in words if distance(w1, w2) <= threshold) for w1 in words}
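For a version that runs without any third-party package, here is a minimal sketch that pairs the same comprehension with a plain dynamic-programming edit distance. The function name `levenshtein` and the sample words are assumptions made for this illustration; a library implementation would normally be faster.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

# Hypothetical sample words, not taken from the question's data.
words = {"kitten", "sitten", "sitting", "mitten", "python"}
threshold = 2

# Map every word to the set of words within the threshold (itself included).
clusters = {w1: {w2 for w2 in words if levenshtein(w1, w2) <= threshold}
            for w1 in words}

for word in sorted(clusters):
    print(word, ":", ", ".join(sorted(clusters[word])))
```

Note that the neighbor sets can overlap without being equal: "sitten" is within the threshold of both "kitten" and "sitting", yet those two are 3 edits apart, so this mapping describes neighborhoods rather than a strict partition.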

Example:

>>> distance = lambda x, y: abs(len(x) - len(y))  # stand-in for a real Levenshtein function
>>> words = set("abc def abcd abcdefghijk abcdefghijklmnopqrstuv".split())
>>> threshold = 3
>>> clusters = {w1: set(w2 for w2 in words if distance(w1, w2) <= threshold) for w1 in words}
>>> for cluster, values in clusters.items():
...     print(cluster, ": ", ", ".join(values))
...
abcd :  abcd, abc, def
abc :  abcd, abc, def
abcdefghijk :  abcdefghijk
abcdefghijklmnopqrstuv :  abcdefghijklmnopqrstuv
def :  abcd, abc, def

Raising the threshold gives one big "cluster" that contains all of the words:

>>> threshold = 100
>>> clusters = {w1: set(w2 for w2 in words if distance(w1, w2) <= threshold) for w1 in words}
>>> for cluster, values in clusters.items():
...     print(cluster, ": ", ", ".join(values))
...
abcd :  abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
abc :  abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
abcdefghijk :  abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
abcdefghijklmnopqrstuv :  abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
def :  abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
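The mapping still lists the same group once per member. If the goal is to see each distinct group only once, one possible approach (a sketch, not something the question specifies) is to collect the neighbor sets as frozensets so that identical ones collapse. The name `unique_clusters` is introduced here for illustration; the words and the stand-in `distance` are the ones from the example above.

```python
# Stand-in distance and sample words from the example above.
def distance(x, y):
    return abs(len(x) - len(y))

words = set("abc def abcd abcdefghijk abcdefghijklmnopqrstuv".split())
threshold = 3

clusters = {w1: {w2 for w2 in words if distance(w1, w2) <= threshold}
            for w1 in words}

# frozenset is hashable, so identical neighbor sets collapse into one entry.
unique_clusters = {frozenset(values) for values in clusters.values()}

# Sort only to make the printed order deterministic.
for cluster in sorted(unique_clusters, key=sorted):
    print(", ".join(sorted(cluster)))
```

This only removes groups that are exactly identical. Neighbor sets that overlap but differ stay separate, so with a real edit distance some words may still appear in more than one group.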