I have some data that contains spelling errors. I am correcting them and scoring how close each spelling is with the following code:
import pandas as pd
import difflib
Li_A = ["potato", "tomato", "squash", "apple", "pear"]
B = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}
df_B = pd.DataFrame(B)
# Define the function that corrects the spelling:
def Spelling(ask):
    return difflib.get_close_matches(ask, Li_A, n=3, cutoff=0.5)[0]
df_B['Correct one'] = df_B['one'].apply(Spelling)
# Define the function that Scores the spelling:
def Spell_Score(row):
    return difflib.SequenceMatcher(None, row['one'], row['Correct one']).ratio()
df_B['Score'] = df_B.apply(Spell_Score, axis=1)
This outputs the corrected spelling and its score:
df_B
      one     two Correct one     Score
a  potat0  po1ato      potato  0.833333
b  toma3o  2omato      tomato  0.833333
c  s5uash  squ0sh      squash  0.833333
d   ap8le   2pple       apple  0.800000
e    pea7    p3ar        pear  0.750000
How can I add columns for the second and third highest-scoring matches and their scores?
Answer 0 (score: 1)
It is simpler to use a single function that returns all the matches and their scores as a Series, and then join the result back to the original DataFrame:
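def Spelling(ask):
    # up to three candidates from Li_A, best match first
    a = difflib.get_close_matches(ask, Li_A, n=3, cutoff=0.5)
    # list comprehension: score every candidate in a against the input
    b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
    # candidates first, then their scores
    return pd.Series(a + b)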
df = df_B['one'].apply(Spelling)
#get correct columns names
a = len(df.columns) // 2
df.columns = ['Spelling_{}'.format(x) for x in range(a)] + \
['Spell_Score{}'.format(y) for y in range(a)]
df1 = df_B.join(df)
print(df1)
      one     two Spelling_0 Spelling_1 Spell_Score0 Spell_Score1
a  potat0  po1ato     potato     tomato     0.833333          0.5
b  toma3o  2omato     tomato   0.833333          NaN          NaN
c  s5uash  squ0sh     squash   0.833333          NaN          NaN
d   ap8le   2pple      apple        0.8          NaN          NaN
e    pea7    p3ar       pear       0.75          NaN          NaN
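Note that in the output above, rows with only one match show their score under Spelling_1. The list a + b is shorter for those rows, so the values shift left when the Series are combined. One possible way around this is to pad every result to a fixed length and label the values explicitly. The following is only a sketch under assumptions: the names spelling_padded and N_MATCHES are hypothetical and not part of the original answer, and the data is rebuilt here so the snippet runs on its own.

```python
import difflib

import pandas as pd

Li_A = ["potato", "tomato", "squash", "apple", "pear"]
df_B = pd.DataFrame(
    {
        "one": ["potat0", "toma3o", "s5uash", "ap8le", "pea7"],
        "two": ["po1ato", "2omato", "squ0sh", "2pple", "p3ar"],
    },
    index=["a", "b", "c", "d", "e"],
)

N_MATCHES = 3  # assumption: report up to three candidates per word


def spelling_padded(ask, n=N_MATCHES):
    """Return n candidate spellings followed by n scores, padded when fewer match."""
    matches = difflib.get_close_matches(ask, Li_A, n=n, cutoff=0.5)
    scores = [difflib.SequenceMatcher(None, ask, m).ratio() for m in matches]
    pad = n - len(matches)  # number of empty slots to fill
    labels = (["Spelling_{}".format(i) for i in range(n)]
              + ["Spell_Score{}".format(i) for i in range(n)])
    values = matches + [None] * pad + scores + [float("nan")] * pad
    # every row gets the same labels, so values cannot shift between columns
    return pd.Series(values, index=labels)


df1 = df_B.join(df_B["one"].apply(spelling_padded))

# the score columns come back as object dtype; make them numeric
score_cols = [c for c in df1.columns if c.startswith("Spell_Score")]
df1[score_cols] = df1[score_cols].astype(float)
print(df1)
```

Because every returned Series has the same index, words with no candidate above the cutoff simply produce a row of empty values instead of raising an error.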