I want to split my word list into clusters using Levenshtein distance.
import pandas as pd
from collections import defaultdict
from Levenshtein import distance  # pip install python-Levenshtein

data = pd.read_csv("data.csv")
Target_Column = data["words"]
Target = Target_Column.tolist()

clusters = defaultdict(list)
threshold = 5
numb = range(len(Target))
for i in numb:
    for j in range(i + 1, len(Target)):  # len(Target) is clearer than len(numb), though equivalent
        if distance(Target[i], Target[j]) <= threshold:
            clusters[i].append(Target[j])
            clusters[j].append(Target[i])
But when I run this over the list, some clusters come out duplicated. Please help me fix this.
Answer 0 (score: 0)
If you only have strings, why not use a set?
Target = set(Target_Column.tolist())
You can also use a defaultdict of sets for the mapping:
clusters = defaultdict(set)
But this requires changing list.append to set.add inside the loop.
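A minimal sketch of that variant. The toy length-difference distance and the sample word list below are my illustrations, not from the answer; in practice distance would be real Levenshtein distance (e.g. from the python-Levenshtein package):

```python
from collections import defaultdict

# Toy stand-in for Levenshtein distance, matching the answer's later examples.
def distance(a, b):
    return abs(len(a) - len(b))

words = ["abc", "def", "abcd"]  # hypothetical sample data
threshold = 3

clusters = defaultdict(set)  # set values: duplicate members are ignored automatically
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        if distance(w1, w2) <= threshold:
            clusters[w1].add(w2)  # set.add instead of list.append
            clusters[w2].add(w1)
```

With set values, appending the same neighbor twice is harmless, which removes one source of duplication.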
However, there are more Pythonic alternatives to your code. I would probably generate the mapping from each word to its connected set on the fly.
The following example assumes words is a set of all words:
clusters = {w1: set(w2 for w2 in words if distance(w1, w2) <= threshold) for w1 in words}
Example:
>>> distance = lambda x, y: abs(len(x) - len(y))
>>> words = set("abc def abcd abcdefghijk abcdefghijklmnopqrstuv".split())
>>> threshold = 3
>>> clusters = {w1: set(w2 for w2 in words if distance(w1, w2) <= threshold) for w1 in words}
>>> for cluster, values in clusters.items():
...     print(cluster, ":", ", ".join(values))
...
abcd : abcd, abc, def
abc : abcd, abc, def
abcdefghijk : abcdefghijk
abcdefghijklmnopqrstuv : abcdefghijklmnopqrstuv
def : abcd, abc, def
Increasing the threshold, we get one big "cluster" containing all the words:
>>> threshold = 100
>>> clusters = {w1: set(w2 for w2 in words if distance(w1, w2) <= threshold) for w1 in words}
>>> for cluster, values in clusters.items():
...     print(cluster, ":", ", ".join(values))
...
abcd : abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
abc : abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
abcdefghijk : abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
abcdefghijklmnopqrstuv : abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
def : abcd, abc, abcdefghijk, abcdefghijklmnopqrstuv, def
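Regarding the duplicated clusters mentioned in the question: since the mapping is symmetric, several keys end up with the same membership. One way (my addition, not part of the answer above) to collapse such duplicates afterwards is to hash each cluster's membership as a frozenset. A sketch with hypothetical data:

```python
# Hypothetical clusters as produced by the symmetric construction above:
# three keys share one membership set, one word is isolated.
clusters = {
    "abc": {"abc", "def", "abcd"},
    "def": {"abc", "def", "abcd"},
    "abcd": {"abc", "def", "abcd"},
    "xyz": {"xyz"},
}

# frozenset is hashable, so identical memberships collapse into one entry.
unique_clusters = {frozenset(members) for members in clusters.values()}
```

Note this only deduplicates identical membership sets; overlapping-but-unequal clusters would need a connected-components pass instead.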