I have some data that contains spelling errors. I am correcting them and scoring how close each spelling is, using the following code:
import pandas as pd
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
Q = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}
df_Q = pd.DataFrame(Q)
# Define the function that Corrects & Scores the Spelling
def Spelling(ask):
    a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)
    # List comprehension for all values of a
    b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
    return pd.Series(a + b)
# Apply the function that Corrects & Scores the Spelling
df_A = df_Q['one'].apply(Spelling)
# Get the column names on the A dataframe
c = len(df_A.columns) // 2
df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
['Score_{}'.format(y) for y in range(c)]
# Join the Q & A dataframes
df_QA = df_Q.join(df_A)
This gives the result:
df_QA
one two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4 \
a potat0 po1ato potato tomato pear apple squash
b toma3o 2omato tomato potato pear apple squash
c s5uash squ0sh squash pear apple tomato potato
d ap8le 2pple apple pear tomato squash potato
e pea7 p3ar pear potato apple tomato squash
Score_0 Score_1 Score_2 Score_3 Score_4
a 0.833333 0.500000 0.400000 0.181818 0.166667
b 0.833333 0.333333 0.200000 0.181818 0.166667
c 0.833333 0.200000 0.181818 0.166667 0.166667
d 0.800000 0.222222 0.181818 0.181818 0.181818
e 0.750000 0.400000 0.444444 0.200000 0.200000
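For row "e", "potato" appears in Spelling_1 and "apple" in Spelling_2, yet "apple" scores higher than "potato" (0.444 vs. 0.400). This is wrong for my application.
How can I make the higher-scoring results always appear on the left?
Edit 1: I tried a simpler piece of code: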
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
Q = "pea7"
A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)
and got the same result:
A: ['pear', 'potato', 'apple', 'tomato', 'squash']
I also tried a simpler piece of scoring code:
import difflib
S1 = difflib.SequenceMatcher(None, "pea7", "potato")
R1 = S1.ratio()
S2 = difflib.SequenceMatcher(None, "pea7", "apple")
R2 = S2.ratio()
and I got the same result:
R1: 0.4
R2: 0.444
Edit 2: I tried using fuzzywuzzy. I got the same result, because fuzzywuzzy relies on difflib:
from fuzzywuzzy import fuzz
R1 = fuzz.ratio("pea7", "potato")
R2 = fuzz.ratio("pea7", "apple")
Answer 0 (score: 0)
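SequenceMatcher computes the ratio correctly, using the method described by Ratcliff and Metzener in 1988. That is, given the number of characters in common (CC) and the total number of characters in the two strings (CT):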
ratio = 2 * CC / CT
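As a quick check of that formula against the scores in the question, here is a minimal sketch that recomputes the ratio by hand from the blocks SequenceMatcher reports. The helper name `ratio_by_hand` is made up for illustration and is not part of difflib.

```python
import difflib

def ratio_by_hand(a, b):
    """Recompute 2 * CC / CT from SequenceMatcher's matching blocks.

    Returns (hand-computed ratio, SequenceMatcher.ratio()) for comparison.
    """
    matcher = difflib.SequenceMatcher(None, a, b)
    # Each block is a (a, b, size) named tuple; the last one is a
    # zero-length sentinel, so summing the sizes gives CC.
    cc = sum(block.size for block in matcher.get_matching_blocks())
    ct = len(a) + len(b)
    return 2 * cc / ct, matcher.ratio()

potato_by_hand, potato_builtin = ratio_by_hand("pea7", "potato")  # CC=2, CT=10
apple_by_hand, apple_builtin = ratio_by_hand("pea7", "apple")     # CC=2, CT=9

print(potato_by_hand, potato_builtin)
print(apple_by_hand, apple_builtin)
```

Both pairs agree with the 0.4 and 0.444 reported in the question, so the scoring step itself behaves as documented.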
So it looks like the problem lies in get_close_matches.
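One likely explanation is argument order. The difflib documentation cautions that the result of ratio() may depend on the order of the arguments, and, as far as I can tell from CPython's difflib source, get_close_matches scores each candidate with the candidate as the first sequence and the query as the second, which is the reverse of the SequenceMatcher(None, ask, x) call in the question. The sketch below shows the asymmetry for "pea7" and "apple", then ranks candidates with the same call that produces the reported score, so the order and the scores cannot disagree. The helper `ranked_matches` is a hypothetical name, not a difflib function.

```python
import difflib

candidates = ["potato", "tomato", "squash", "apple", "pear"]

def ranked_matches(word, possibilities, n=5, cutoff=0.1):
    """Return up to n (candidate, score) pairs, best score first.

    Scores use SequenceMatcher(None, word, candidate), i.e. the same
    argument order as the scoring code in the question.
    """
    scored = [
        (cand, difflib.SequenceMatcher(None, word, cand).ratio())
        for cand in possibilities
    ]
    kept = [pair for pair in scored if pair[1] >= cutoff]
    # Stable sort: candidates with equal scores keep their input order.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n]

# ratio() is not symmetric in its arguments:
forward = difflib.SequenceMatcher(None, "pea7", "apple").ratio()
backward = difflib.SequenceMatcher(None, "apple", "pea7").ratio()
print(forward, backward)

ranking = ranked_matches("pea7", candidates)
for cand, score in ranking:
    print(cand, round(score, 3))
```

In the Spelling function from the question, swapping the get_close_matches call for a helper like this, and building the pd.Series from its names and scores, should keep the higher score on the left. Alternatively, keeping get_close_matches and computing the scores as SequenceMatcher(None, x, ask) would make the scores follow its ordering, at the cost of different numbers (for example a lower score for "apple").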