How do I calculate the Levenshtein ratio/distance between the rows of a column in Python?

Time: 2017-11-07 07:36:22

Tags: python pandas dataframe levenshtein-distance

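My data frame has a single column with 1000 rows. I need to compare every row with every other row and find the Levenshtein distance for each pair. How can I calculate the ratio or the distance in Python?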

I have a data frame like this:

  #Df 
  StepDescription
  click confirm button when done
  you have logged on
  please log in to proceed
  click on confirm button
  Dolb was released successfully
  Enter your details
  validate the statement
  Aval was released sucessfully

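How can I calculate the Levenshtein ratio for all of these?

I have written the iteration loop below, but I do not know how to continue after iterating.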

  import Levenshtein
  import pandas as pd
  data_dist = pd.read_csv(r'path\Data_TestDescription.csv')  # raw string keeps the backslash literal
  df = pd.DataFrame(data_dist)
  for index, row in df.iterrows():
      pass  # unsure how to continue from here

2 answers:

Answer 0: (score: 0)

Since the comments asked for a percentage, I will keep the accepted answer as it is and only add a new part at the end:

import numpy as np
import pandas as pd
from Levenshtein import distance
from itertools import product

#df = ...

dist = [distance(*x) for x in product(df.StepDescription, repeat=2)]

dist_df = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]))
dist_df

    0   1   2   3   4   5   6   7
0   0  23  23  13  29  25  25  28
1  23   0  18  18  23  18  18  23
2  23  18   0  20  25  21  19  24
3  13  18  20   0  27  19  21  26
4  29  23  25  27   0  26  23   5
5  25  18  21  19  26   0  19  25
6  25  18  19  21  23  19   0  21
7  28  23  24  26   5  25  21   0

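# Express each distance relative to the smallest non-zero distance (which becomes 100)
dist_df_percentage = dist_df * 100 // min(x for x in dist if x > 0)
dist_df_percentage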

     0    1    2    3    4    5    6    7
0    0  460  460  260  580  500  500  560
1  460    0  360  360  460  360  360  460
2  460  360    0  400  500  420  380  480
3  260  360  400    0  540  380  420  520
4  580  460  500  540    0  520  460  100
5  500  360  420  380  520    0  380  500
6  500  360  380  420  460  380    0  420
7  560  460  480  520  100  500  420    0
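
Note that the percentages above are scaled against the smallest non-zero distance, so they grow past 100 and a larger number means less similar. If a similarity bounded between 0 and 100 is wanted instead, here is a minimal sketch of one common normalisation, assuming 100 * (1 - distance / length of the longer string). The helper names `levenshtein` and `similarity_percent` are hypothetical and written in pure Python so the example runs without installing anything; they are not part of the `Levenshtein` package, whose own `ratio` function may normalise differently and give different numbers.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalisation: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


# Sample rows copied from the question (the misspelling in the last row is in the data).
steps = [
    "click confirm button when done",
    "click on confirm button",
    "Dolb was released successfully",
    "Aval was released sucessfully",
]

# Full pairwise matrix; the diagonal is 100 because each row matches itself.
matrix = [[round(similarity_percent(a, b), 1) for b in steps] for a in steps]
for row in matrix:
    print(row)
```

With this scaling, identical rows score 100 and the score falls as the rows diverge, which is usually what people mean by a ratio.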

Answer 1: (score: 0)

In the end, after trying a lot of examples, I used fuzz.ratio from fuzzywuzzy to get the ratio as a percentage:

from itertools import product
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

# fuzz.ratio returns an integer similarity score from 0 to 100
dist = np.empty(df.shape[0]**2, dtype=int)
for i, x in enumerate(product(df.StepDescription, repeat=2)):
    dist[i] = fuzz.ratio(*x)
# One row and one column per step description
dist_df = pd.DataFrame(dist.reshape(-1, df.shape[0]))
dist_df.to_csv('FuzzyRatio.csv', sep='\t')
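
If installing fuzzywuzzy is not an option, a similar 0 to 100 score can be sketched with the standard library alone. The example below assumes `difflib.SequenceMatcher` is an acceptable stand-in; it should give similar, though not necessarily identical, numbers to `fuzz.ratio`. The helper name `ratio_percent` and the `steps` list are hypothetical, and only the pairs above the diagonal are computed, since with 1000 rows the full product would do roughly twice the work.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio_percent(a: str, b: str) -> int:
    """Similarity from 0 to 100 based on difflib's matching-blocks ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Sample rows copied from the question.
steps = [
    "click confirm button when done",
    "you have logged on",
    "please log in to proceed",
    "click on confirm button",
]

n = len(steps)
# Start from 100 everywhere so the diagonal (a row compared with itself) is already set.
scores = [[100] * n for _ in range(n)]
for i, j in combinations(range(n), 2):
    # Assumption: the score is mirrored across the diagonal. SequenceMatcher does not
    # strictly guarantee ratio(a, b) == ratio(b, a), so this is an approximation.
    scores[i][j] = scores[j][i] = ratio_percent(steps[i], steps[j])

for row in scores:
    print(row)
```

The nested list can be passed to `pd.DataFrame(scores)` and written out with `to_csv` exactly as in the answer above.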