String distance matrix in Python

Time: 2016-05-25 06:05:05

Tags: python string machine-learning text-mining levenshtein-distance

How can I compute a Levenshtein distance matrix for a list of strings in Python?

              str1    str2    str3    str4    ...     strn
      str1    0.8     0.4     0.6     0.1     ...     0.2
      str2    0.4     0.7     0.5     0.1     ...     0.1
      str3    0.6     0.5     0.6     0.1     ...     0.1
      str4    0.1     0.1     0.1     0.5     ...     0.6
      .       .       .       .       .       ...     .
      .       .       .       .       .       ...     .
      .       .       .       .       .       ...     .
      strn    0.2     0.1     0.1     0.6     ...     0.7

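Using the distance function we can compute the distance between 2 words. But here I have a list containing n strings. I want to compute the distance matrix first and then cluster the words.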

3 answers:

Answer 0 (score: 1)

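Just use the form of pdist that accepts a custom metric.

Y = pdist(X, levenshtein)

For levenshtein, you can use the Rosetta Code implementation suggested by Tanu.

If you want a full square matrix, just call squareform on the result.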

Y = scipy.spatial.distance.squareform(Y)
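One practical detail: as far as I know, pdist expects a numeric 2-D array, so a list of strings cannot be passed as X directly. A common workaround is to pass the row indices and look the strings up inside the metric. The sketch below assumes that approach; the levenshtein helper is a plain two-row dynamic-programming version written for illustration (not the Rosetta Code listing itself), and the sample strings are borrowed from another answer in this thread.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def levenshtein(a, b):
    """Edit distance between strings a and b (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


strings = ["Tree", "Trip", "Treasure", "Nothingtodo"]

# pdist wants numbers, so feed it one index per row and
# resolve each index back to its string inside the metric.
indices = np.arange(len(strings)).reshape(-1, 1)
condensed = pdist(
    indices,
    lambda u, v: levenshtein(strings[int(u[0])], strings[int(v[0])]),
)

# condensed holds n*(n-1)/2 values; squareform expands it to n x n.
square = squareform(condensed)
print(square)
```

The condensed form is what SciPy's hierarchical clustering functions take as input, so it may be worth keeping both arrays around.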

Answer 1 (score: 1)

Here is my code

from Levenshtein import distance
import numpy as np

Target = ['Tree','Trip','Treasure','Nothingtodo']

List1 = Target
List2 = Target

# One row per string in List1, one column per string in List2
Matrix = np.zeros((len(List1),len(List2)),dtype=int)

for i in range(0,len(List1)):
  for j in range(0,len(List2)):
      Matrix[i,j] = distance(List1[i],List2[j])

print(Matrix)

[[ 0  2  4 11]
 [ 2  0  6 10]
 [ 4  6  0 11]
 [11 10 11  0]]
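The table in the question shows values between 0 and 1, while Levenshtein distance is an integer count of edits. If a scaled value is wanted, one common option (an assumption on my part, the answer does not say this) is to divide each entry by the length of the longer string in the pair. A minimal sketch, reusing the matrix printed above as input:

```python
import numpy as np

strings = ["Tree", "Trip", "Treasure", "Nothingtodo"]

# Raw edit distances, copied from the output printed above.
raw = np.array([
    [0, 2, 4, 11],
    [2, 0, 6, 10],
    [4, 6, 0, 11],
    [11, 10, 11, 0],
])

# longest[i, j] is the length of the longer of strings[i] and strings[j].
lengths = np.array([len(s) for s in strings])
longest = np.maximum.outer(lengths, lengths)

# Each entry now lies in [0, 1]; 0 means identical, 1 means nothing in common.
normalized = raw / longest
print(normalized.round(2))
```

This divides by zero if the list contains an empty string paired with itself, so that case would need separate handling.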

Answer 2 (score: 0)

You can do something like this

from Levenshtein import distance
import numpy as np
from time import time

def get_distance_matrix(str_list):
    """ Construct a levenshtein distance matrix for a list of strings"""
    dist_matrix = np.zeros(shape=(len(str_list), len(str_list)))
    t0 = time()
    print("Starting to build distance matrix. This will iterate from 0 till ", len(str_list))
    # Fill the upper triangle only; the distance is symmetric
    for i in range(0, len(str_list)):
        print(i)
        for j in range(i+1, len(str_list)):
            dist_matrix[i][j] = distance(str_list[i], str_list[j])
    # Mirror the upper triangle into the lower triangle
    for i in range(0, len(str_list)):
        for j in range(0, len(str_list)):
            if i == j:
                dist_matrix[i][j] = 0
            elif i > j:
                dist_matrix[i][j] = dist_matrix[j][i]
    t1 = time()
    print("took", (t1-t0), "seconds")
    return dist_matrix

str_list = ["analyze", "analyse", "analysis", "analyst"]
get_distance_matrix(str_list)

Starting to build distance matrix. This will iterate from 0 till  4
0
1
2
3
took 0.000197887420654 seconds
>>> array([[ 0.,  1.,  3.,  2.],
   [ 1.,  0.,  2.,  1.],
   [ 3.,  2.,  0.,  2.],
   [ 2.,  1.,  2.,  0.]])
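Since the question's end goal is clustering the words, here is a rough sketch of one way to feed such a matrix into SciPy's hierarchical clustering. The choice of average linkage and of two clusters is an arbitrary assumption made for illustration, not something taken from the answers; the matrix is the one returned above.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

str_list = ["analyze", "analyse", "analysis", "analyst"]

# Distance matrix copied from the output shown above.
dist_matrix = np.array([
    [0., 1., 3., 2.],
    [1., 0., 2., 1.],
    [3., 2., 0., 2.],
    [2., 1., 2., 0.],
])

# linkage takes the condensed (upper-triangle) form of the matrix.
condensed = squareform(dist_matrix)
tree = linkage(condensed, method="average")

# Cut the tree so that at most two clusters remain.
labels = fcluster(tree, t=2, criterion="maxclust")

for word, label in zip(str_list, labels):
    print(word, label)
```

Other linkage methods or a distance threshold (criterion="distance") may suit real data better; that depends on how tight the word groups should be.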