Multiple spelling results in a dataframe

Time: 2018-02-20 15:19:20

Tags: python dataframe difflib spelling fuzzywuzzy

I have some data that contains spelling errors. I am correcting them and scoring how close each spelling is using the following code:

 import pandas as pd
 import difflib

 Li_A = ["potato", "tomato", "squash", "apple", "pear"]

 Q    = {'one' : pd.Series(["potat0", "toma3o", "s5uash", "ap8le", "pea7"], index=['a', 'b', 'c', 'd', 'e']),
         'two' : pd.Series(["po1ato", "2omato", "squ0sh", "2pple", "p3ar"], index=['a', 'b', 'c', 'd', 'e'])}

 df_Q = pd.DataFrame(Q)

 # Define the function that Corrects & Scores the Spelling
 def Spelling(ask):
     a = difflib.get_close_matches(ask, Li_A, n=5, cutoff=0.1)

     # List comprehension for all values of a
     b = [difflib.SequenceMatcher(None, ask, x).ratio() for x in a]
     return pd.Series(a + b)

 # Apply the function that Corrects & Scores the Spelling
 df_A = df_Q['one'].apply(Spelling)

 # Get the column names on the A dataframe
 c = len(df_A.columns) // 2
 df_A.columns = ['Spelling_{}'.format(x) for x in range(c)] + \
                ['Score_{}'.format(y)    for y in range(c)]

 # Join the Q & A dataframes
 df_QA = df_Q.join(df_A)

This gives the result:

 df_QA
       one     two Spelling_0 Spelling_1 Spelling_2 Spelling_3 Spelling_4  \
 a  potat0  po1ato     potato     tomato       pear      apple     squash   
 b  toma3o  2omato     tomato     potato       pear      apple     squash   
 c  s5uash  squ0sh     squash       pear      apple     tomato     potato   
 d   ap8le   2pple      apple       pear     tomato     squash     potato   
 e    pea7    p3ar       pear     potato      apple     tomato     squash   

     Score_0   Score_1   Score_2   Score_3   Score_4  
 a  0.833333  0.500000  0.400000  0.181818  0.166667  
 b  0.833333  0.333333  0.200000  0.181818  0.166667  
 c  0.833333  0.200000  0.181818  0.166667  0.166667  
 d  0.800000  0.222222  0.181818  0.181818  0.181818  
 e  0.750000  0.400000  0.444444  0.200000  0.200000  

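For row "e", "potato" is in Spelling_1 and "apple" is in Spelling_2, yet apple scores higher than potato (0.444444 vs. 0.400000). This is wrong for my application.

How can I make the higher-scoring results always appear further to the left?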

Edit 1: I tried simpler code:

 import difflib
 Li_A = ["potato", "tomato", "squash", "apple", "pear"]
 Q    = "pea7"
 A = difflib.get_close_matches(Q, Li_A, n=5, cutoff=0.1)

and got the same result:

 A: ['pear', 'potato', 'apple', 'tomato', 'squash']

I also tried simpler scoring code:

 import difflib
 S1 = difflib.SequenceMatcher(None, "pea7", "potato")
 R1 = S1.ratio()
 S2 = difflib.SequenceMatcher(None, "pea7", "apple")
 R2 = S2.ratio()

and I got the same result:

 R1: 0.4
 R2: 0.444

Edit 2: I tried using fuzzywuzzy. I got the same result, because fuzzywuzzy relies on difflib:

 from fuzzywuzzy import fuzz
 R1 = fuzz.ratio("pea7", "potato")
 R2 = fuzz.ratio("pea7", "apple")

1 Answer:

Answer 0: (score: 0)

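SequenceMatcher computes the ratio correctly, using the method described by Ratcliff and Metzener in 1988. That is, for the number of characters in common (CC) and the total number of characters in both strings (CT):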

 ratio = 2 * CC / CT
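
As a quick check of this formula, the following minimal sketch recomputes the ratio from the matching blocks that SequenceMatcher reports. The helper name ratio_by_hand is hypothetical and used only for illustration; CC is assumed to be the summed size of the matching blocks.

```python
import difflib

def ratio_by_hand(a, b):
    """Recompute 2 * CC / CT from SequenceMatcher's matching blocks."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # CC: characters covered by matching blocks (the final block is a size-0 sentinel).
    cc = sum(block.size for block in matcher.get_matching_blocks())
    # CT: total number of characters in both strings.
    ct = len(a) + len(b)
    return 2.0 * cc / ct if ct else 1.0

for candidate in ("potato", "apple"):
    by_hand = ratio_by_hand("pea7", candidate)
    library = difflib.SequenceMatcher(None, "pea7", candidate).ratio()
    print(candidate, round(by_hand, 3), round(library, 3))
```

For "pea7" this reproduces the 0.4 (potato) and 0.444 (apple) reported in the question, so the scores themselves are consistent with the formula.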

So it looks like the problem is in get_close_matches.
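
One likely explanation, offered as a sketch rather than a definitive diagnosis: the difflib documentation cautions that the result of ratio() may depend on the order of the arguments, and CPython's get_close_matches appears to score each candidate with the candidate as the first sequence and the query as the second, which is the reverse of the order used in Spelling. The example below compares both orders and then ranks candidates by the same score that is displayed. The helper names score and ranked_matches are hypothetical.

```python
import difflib

Li_A = ["potato", "tomato", "squash", "apple", "pear"]

def score(ask, candidate):
    # Same argument order as the Spelling function in the question.
    return difflib.SequenceMatcher(None, ask, candidate).ratio()

def ranked_matches(ask, choices, n=5, cutoff=0.1):
    """Return up to n (candidate, score) pairs, best score first."""
    scored = [(x, score(ask, x)) for x in choices]
    kept = [(x, s) for x, s in scored if s >= cutoff]
    # Sort on the score that will be shown; ties keep their order in `choices`.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n]

# ratio() is not symmetric: swapping the arguments can change the score.
for word in ("potato", "apple"):
    forward = score("pea7", word)
    swapped = difflib.SequenceMatcher(None, word, "pea7").ratio()
    print(word, round(forward, 3), round(swapped, 3))

print([x for x, _ in ranked_matches("pea7", Li_A)])
```

Inside Spelling, the two lists could then be built from one call, for example pairs = ranked_matches(ask, Li_A), a = [x for x, _ in pairs] and b = [s for _, s in pairs], so that Spelling_k and Score_k always come from the same ranking.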