Question

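I have some data containing spelling errors. I am correcting them and scoring how close each spelling is to its correction with the following code: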

import pandas as pd
import difflib

Li_A = ["potato", "tomato", "squash", "apple", "pear"]

B    = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
        'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}

df_B = pd.DataFrame(B)

# Define the function that corrects the spelling:
def Spelling(ask):
    return difflib.get_close_matches(ask, Li_A, n=3, cutoff=0.5)[0]

df_B['Correct one'] = df_B['one'].apply(Spelling)

# Define the function that scores the spelling:
def Spell_Score(row):
    return difflib.SequenceMatcher(None, row['one'], row['Correct one']).ratio()

df_B['Score'] = df_B.apply(Spell_Score, axis=1)

This outputs the correct spelling and the score:

df_B
       one     two Correct one     Score
 a  potat0  po1ato      potato  0.833333
 b  toma3o  2omato      tomato  0.833333
 c  s5uash  squ0sh      squash  0.833333
 d   ap8le   2pple       apple  0.800000
 e    pea7    p3ar        pear  0.750000

How can I add columns that hold the second- and third-highest-scoring matches and their scores?

Answer 1

It is simpler to use a single function that returns all matches and scores as a Series, and then join the result back to the original DataFrame:

def Spelling(ask):
    a = difflib.get_close_matches(ask, Li_A, n=3, cutoff=0.5)
    # list comprehension: score every match in a
    b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
    # label the values so rows with fewer matches still line up by column
    names = ['Spelling_{}'.format(x) for x in range(len(a))] + \
            ['Spell_Score{}'.format(y) for y in range(len(a))]
    return pd.Series(a + b, index=names)

# rows with fewer than three matches get NaN in the missing columns
df = df_B['one'].apply(Spelling)

df1 = df_B.join(df)
print(df1)
      one     two Spelling_0 Spelling_1  Spell_Score0  Spell_Score1
a  potat0  po1ato     potato     tomato      0.833333           0.5
b  toma3o  2omato     tomato        NaN      0.833333           NaN
c  s5uash  squ0sh     squash        NaN      0.833333           NaN
d   ap8le   2pple      apple        NaN      0.800000           NaN
e    pea7    p3ar       pear        NaN      0.750000           NaN
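
With cutoff=0.5 most rows above have only one match, so the second and third columns stay mostly empty. If dropping the cutoff is acceptable, one option is to score every candidate and keep the n best, so that each row always gets n spelling/score pairs. The following is only a sketch under that assumption; the helper name top_matches and its parameters are made up for illustration and are not part of difflib or pandas.

```python
import difflib
import pandas as pd

Li_A = ["potato", "tomato", "squash", "apple", "pear"]

def top_matches(ask, candidates=Li_A, n=3):
    """Hypothetical helper: the n best candidates and their scores as a labelled Series."""
    # Score every candidate (no cutoff), then sort by score, highest first.
    # sorted() is stable, so ties keep their order from the candidate list.
    scored = sorted(
        ((difflib.SequenceMatcher(None, ask, c).ratio(), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )[:n]
    out = {}
    for i, (score, word) in enumerate(scored):
        out['Spelling_{}'.format(i)] = word
        out['Spell_Score{}'.format(i)] = score
    return pd.Series(out)

df_B = pd.DataFrame(
    {'one': ["potat0", "toma3o", "s5uash", "ap8le", "pea7"]},
    index=['a', 'b', 'c', 'd', 'e'],
)

# One labelled Series per row becomes one set of columns per rank.
df_top = df_B.join(df_B['one'].apply(top_matches))
print(df_top)
```

The lower-ranked matches can score very low (they are just the least-bad candidates), so filter on the score columns afterwards if only plausible corrections are wanted.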

