How to compute a Levenshtein distance matrix for strings in Python
str1 str2 str3 str4 ... strn
str1 0.8 0.4 0.6 0.1 ... 0.2
str2 0.4 0.7 0.5 0.1 ... 0.1
str3 0.6 0.5 0.6 0.1 ... 0.1
str4 0.1 0.1 0.1 0.5 ... 0.6
. . . . . ... .
. . . . . ... .
. . . . . ... .
strn 0.2 0.1 0.1 0.6 ... 0.7
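With the distance function we can compute the distance between two words. But here I have a list of n strings. I want to compute the distance matrix for the whole list, and after that I want to cluster the words.
Answer 0 (score: 1)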
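Just use the version of pdist that accepts a custom metric.
Y = pdist(X, levenshtein)
For levenshtein, you can use the rosettacode implementation suggested by Tanu. Note that pdist expects X to be a 2-D array, so a plain list of strings has to be wrapped (for example, one entry per row) before it is passed in.
If you want a full square matrix, just use squareform on the result:
Y = scipy.spatial.distance.squareform(Y)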
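To make the pdist suggestion concrete, here is a minimal sketch. It assumes a plain dynamic-programming `levenshtein` function as a stand-in for the rosettacode implementation mentioned above, and it passes row indices to pdist and looks the strings up inside the metric; `string_distance_matrix` is a hypothetical helper name, not part of SciPy.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def levenshtein(a, b):
    """Edit distance via dynamic programming (illustrative stand-in)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        previous = current
    return previous[-1]


def string_distance_matrix(strings):
    """Hypothetical helper: square Levenshtein matrix for a list of strings."""
    # pdist needs a 2-D array, so hand it one index per row and
    # resolve each index back to its string inside the metric.
    indices = np.arange(len(strings)).reshape(-1, 1)
    condensed = pdist(
        indices,
        lambda u, v: levenshtein(strings[int(u[0])], strings[int(v[0])]),
    )
    return squareform(condensed)


words = ["Tree", "Trip", "Treasure", "Nothingtodo"]
matrix = string_distance_matrix(words)
print(matrix)
```

pdist only evaluates each unordered pair once, so this does about half the work of a full double loop; squareform then fills in the symmetric half and the zero diagonal.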
Answer 1 (score: 1)
Here is my code
import numpy as np
from Levenshtein import distance

Target = ['Tree', 'Trip', 'Treasure', 'Nothingtodo']
List1 = Target
List2 = Target
# One row per string in List1, one column per string in List2.
# Use the built-in int: the np.int alias was removed in NumPy 1.24.
Matrix = np.zeros((len(List1), len(List2)), dtype=int)
for i in range(len(List1)):
    for j in range(len(List2)):
        Matrix[i, j] = distance(List1[i], List2[j])
print(Matrix)
[[ 0 2 4 11]
[ 2 0 6 10]
[ 4 6 0 11]
[11 10 11 0]]
Answer 2 (score: 0)
You can do something like this
from time import time

import numpy as np
from Levenshtein import distance


def get_distance_matrix(str_list):
    """Construct a Levenshtein distance matrix for a list of strings."""
    n = len(str_list)
    dist_matrix = np.zeros(shape=(n, n))
    t0 = time()
    print("Starting to build distance matrix. This will iterate from 0 till", n)
    # Fill the upper triangle only; the distance is symmetric.
    for i in range(n):
        print(i)
        for j in range(i + 1, n):
            dist_matrix[i][j] = distance(str_list[i], str_list[j])
    # Mirror the upper triangle into the lower one; the diagonal stays 0.
    for i in range(n):
        for j in range(i):
            dist_matrix[i][j] = dist_matrix[j][i]
    t1 = time()
    print("took", (t1 - t0), "seconds")
    return dist_matrix


str_list = ["analyze", "analyse", "analysis", "analyst"]
get_distance_matrix(str_list)
Starting to build distance matrix. This will iterate from 0 till 4
0
1
2
3
took 0.000197887420654 seconds
>>> array([[ 0., 1., 3., 2.],
[ 1., 0., 2., 1.],
[ 3., 2., 0., 2.],
[ 2., 1., 2., 0.]])
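Since the question asks about clustering the words afterwards, the following is a rough sketch of one way to feed such a matrix into SciPy's hierarchical clustering. The matrix is copied from the output above; the choice of average linkage and the cut-off of 2.0 are assumptions made for illustration, not something the answers prescribe.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

words = ["analyze", "analyse", "analysis", "analyst"]
# Distance matrix copied from the output printed above.
dist_matrix = np.array([
    [0., 1., 3., 2.],
    [1., 0., 2., 1.],
    [3., 2., 0., 2.],
    [2., 1., 2., 0.],
])

# linkage expects the condensed (upper-triangle) form of the matrix.
condensed = squareform(dist_matrix)
tree = linkage(condensed, method="average")  # assumed linkage method

# Cut the tree at an assumed distance threshold of 2.0.
labels = fcluster(tree, t=2.0, criterion="distance")
for word, label in zip(words, labels):
    print(label, word)
```

The threshold controls how coarse the grouping is, so it usually needs tuning against the typical string lengths in the list.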