Question

我有一个示例数据框如下：

df = pd.DataFrame({
'notes': pd.Series(['speling', 'korrecter']), 
'name': pd.Series(['Walter White', 'Walter White']), 
})

  name                notes
0  Walter White     This speling is incorrect
1  Walter White     Corrector should correct korrecter

我想调整Peter Norvig的拼写检查程序here。然后，我想通过遍历行中的每个单词将此函数应用于每一行。我想知道如何在Python Pandas上下文中完成这个工作？

我希望输出为：

    name                notes
0  Walter White     This spelling is incorrect
1  Walter White     Corrector should correct corrector

感谢任何输入。谢谢！

Answer 1

您可以使用str.split尝试此解决方案，但我认为大df的效果可能有问题：

import pandas as pd
import numpy as np

df = pd.DataFrame({
'notes': pd.Series(['This speling is incorrect', 'Corrector should correct korrecter one']), 
'name': pd.Series(['Walter White', 'Walter White']), 
})
print df
           name                                   notes
0  Walter White               This speling is incorrect
1  Walter White  Corrector should correct korrecter one    

#simulate function correct
def correct(x):
    return x + '888'

#split column notes and apply correct
df1 = df.notes.str.split(expand=True).apply(correct)
print df1
              0           1           2             3       4
0       This888  speling888       is888  incorrect888     NaN
1  Corrector888   should888  correct888  korrecter888  one888

#remove NaN and concanecate all words together
df['notes'] = df1.fillna('').apply(lambda row: ' '.join(row), axis=1)
print df
           name                                              notes
0  Walter White             This888 speling888 is888 incorrect888 
1  Walter White  Corrector888 should888 correct888 korrecter888...

Answer 2

我使用了您发布的链接中的代码，以使其正常运行。用这个作为灵感。

import re, collections
import pandas as pd

# This code comes from the link you have posted
def words(text): return re.findall('[a-z]+', text.lower()) 

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

NWORDS = train(words(file('big.txt').read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

# This is your code
df = pd.DataFrame({
'notes': pd.Series(['speling', 'korrecter']), 
'name': pd.Series(['Walter White', 'Walter White']), 
})

# Spellchecking can be optimized, of course and not hardcoded
for i, row in df.iterrows():
    df.set_value(i,'notes',correct(row['notes']))

将函数应用于pandas dataframe列中每一行的每个单词

2 个答案: