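My data frame has a single column with 1000 rows. I need to compare every row with every other row and find the Levenshtein distance between them. How can I calculate the ratio or distance in Python?
I have a data frame like the following: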
#Df
StepDescription
click confirm button when done
you have logged on
please log in to proceed
click on confirm button
Dolb was released successfully
Enter your details
validate the statement
Aval was released sucessfully
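How can I compute the Levenshtein ratio for all of these?
Code: I have written the loop that iterates over the rows, but I don't know how to continue after that.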
import Levenshtein
import pandas as pd
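# Raw string so the backslash in the Windows path is not read as an escape;
# read_csv already returns a DataFrame, so no extra pd.DataFrame() call is needed
df = pd.read_csv(r'path\Data_TestDescription.csv')
for index, row in df.iterrows():
    pass  # not sure how to continue from here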
Answer 0: (score: 0)
As requested in the comments, a percentage would be ideal, so I'll keep the accepted answer and only add the new part:
import numpy as np
import pandas as pd
from Levenshtein import distance
from itertools import product
#df = ...
dist = [distance(*x) for x in product(df.StepDescription, repeat=2)]
dist_df = pd.DataFrame(np.array(dist).reshape(df.shape[0], df.shape[0]))
dist_df
0 1 2 3 4 5 6 7
0 0 23 23 13 29 25 25 28
1 23 0 18 18 23 18 18 23
2 23 18 0 20 25 21 19 24
3 13 18 20 0 27 19 21 26
4 29 23 25 27 0 26 23 5
5 25 18 21 19 26 0 19 25
6 25 18 19 21 23 19 0 21
7 28 23 24 26 5 25 21 0
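In case it helps to see what `distance` is counting, here is a minimal pure-Python sketch of the classic dynamic-programming Levenshtein distance, applied to the eight sample rows. The function name `levenshtein` and the list `steps` are my own placeholders, not part of the Levenshtein package, and this version will be much slower than the library on 1000 rows.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


# The sample rows from the question (the typo in the last row is in the data)
steps = [
    "click confirm button when done",
    "you have logged on",
    "please log in to proceed",
    "click on confirm button",
    "Dolb was released successfully",
    "Enter your details",
    "validate the statement",
    "Aval was released sucessfully",
]

# Full pairwise matrix, same layout as dist_df
matrix = [[levenshtein(x, y) for y in steps] for x in steps]
```

The diagonal is zero because every string is at distance 0 from itself, and the matrix is symmetric because the distance does not depend on argument order.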
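# Each distance relative to the smallest non-zero distance, times 100
# (multiply before the integer division, otherwise the values are truncated)
dist_df_percentage = dist_df * 100 // min(x for x in dist if x > 0)
dist_df_percentage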
0 1 2 3 4 5 6 7
0 0 460 460 260 580 500 500 560
1 460 0 360 360 460 360 360 460
2 460 360 0 400 500 420 380 480
3 260 360 400 0 540 380 420 520
4 580 460 500 540 0 520 460 100
5 500 360 420 380 520 0 380 500
6 500 360 380 420 460 380 0 420
7 560 460 480 520 100 500 420 0
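Note that the figures above are scaled by the smallest non-zero distance, so they go well past 100. If what you want is a similarity between 0 and 100, one common option is to divide the distance by the length of the longer string. The sketch below assumes that normalization; `similarity_percent` is a name I made up, and ratio functions in libraries may normalize differently, so the numbers will not necessarily match theirs.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are treated as identical
    return 100.0 * (1 - levenshtein(a, b) / longest)


dolb = "Dolb was released successfully"
aval = "Aval was released sucessfully"
print(similarity_percent(dolb, aval))
```

For rows 4 and 7 the distance is 5 and the longer string has 30 characters, so this comes out at roughly 83%.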
Answer 1: (score: 0)
In the end, after going through a lot of examples, I tried fuzz.ratio to get a proper ratio or percentage:
from itertools import product
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
#df = ...
dist = np.empty(df.shape[0]**2, dtype=int)
# fuzz.ratio returns an integer similarity score from 0 to 100
for i, x in enumerate(product(df.StepDescription, repeat=2)):
    dist[i] = fuzz.ratio(*x)
dist_df = pd.DataFrame(dist.reshape(-1, df.shape[0]))
# to_csv returns None when given a path, so there is nothing to assign
dist_df.to_csv('FuzzyRatio.csv', sep='\t')
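One thing worth considering with 1000 rows: `product(..., repeat=2)` scores all 1,000,000 ordered pairs, although the result is meant to be a symmetric matrix. A rough sketch of scoring only the 499,500 unordered pairs and mirroring them is below. To keep it dependency-free I use `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, which is an assumption on my part; `ratio_percent`, `steps` and `scores` are placeholder names.

```python
from difflib import SequenceMatcher
from itertools import combinations


def ratio_percent(a: str, b: str) -> int:
    """Similarity from 0 to 100 based on difflib's matching-blocks ratio."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


steps = [
    "click confirm button when done",
    "you have logged on",
    "please log in to proceed",
    "click on confirm button",
    "Dolb was released successfully",
    "Enter your details",
    "validate the statement",
    "Aval was released sucessfully",
]

n = len(steps)
# Start with 100 everywhere: the diagonal compares each string with itself
scores = [[100] * n for _ in range(n)]
# Score each unordered pair once, then mirror it across the diagonal
for i, j in combinations(range(n), 2):
    scores[i][j] = scores[j][i] = ratio_percent(steps[i], steps[j])
```

With pandas available, `pd.DataFrame(scores)` should give the same layout as `dist_df`, and swapping `fuzz.ratio` back in for `ratio_percent` would only change the scoring function, not the loop.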