Question

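I need to write a function that takes a DataFrame-like object named new_data (with a column containing text) and compares its words against my reference. My reference, ref_data, has two columns: the first holds misspelled words (in the same form as they appear in new_data) and the second holds their corrected versions.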

Put simply, I need to compare each word of new_data with column 1 of ref_data and, if it matches, return the word from column 2 that corresponds to it.

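For example, if a word in new_data matches the word in row 3 of ref_data, the word in column 2 of row 3 should replace it. I can provide more clarification if needed. Here is what I tried: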

x = [line for line in ref_data['word']]   # x is a list of all incorrect words
y = [line for line in ref_data['final']]  # y is a list of all correct words
def replace_words(x):  # function
    for line in x:  # iterate over lines in list
        for word in line.split():  # iterate over words in line
            if word == x:  # I don't know the syntax for this comparison; the problem is here
                return (word = y)  # I need to return the y at the same index

Answer 1

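The replace method is well suited to this. Instead of storing the incorrect/correct mapping in two columns of a DataFrame, use a Series whose index holds the incorrect spellings and whose values hold the corrections.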

from pandas import Series
# index = misspelled words, values = their corrected spellings
corrections = Series(correct_spellings, index=incorrect_spellings)
new_data_corrected = new_data.replace(corrections)
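
As a minimal sketch of how this could be wired to the question's ref_data, assuming its columns are named word and final as in the attempted code: the sample words below are made up for illustration, and new_data is assumed to be a Series with one word per entry.

```python
import pandas as pd

# Hypothetical sample data; only the column names "word" and "final"
# come from the question.
ref_data = pd.DataFrame({
    "word": ["teh", "recieve"],    # misspelled forms
    "final": ["the", "receive"],   # corrected forms
})
new_data = pd.Series(["teh", "cat", "recieve"])

# Build the mapping Series: index = misspellings, values = corrections.
corrections = pd.Series(ref_data["final"].to_numpy(), index=ref_data["word"])

# Entries that equal a key in the index are replaced; others are left alone.
new_data_corrected = new_data.replace(corrections)
print(new_data_corrected.tolist())
```

Passing the values through to_numpy() avoids index alignment between the two columns, so each correction stays paired with its own misspelling.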

Here is a simple example. I use single letters; of course, it works the same way with words.

In [10]: new_data
Out[10]: 
0    a
1    b
2    c
dtype: object

In [11]: corrections
Out[11]: 
c    C
b    B
dtype: object

In [12]: new_data.replace(corrections)
Out[12]: 
0    a
1    B
2    C
dtype: object
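
Note that replace, used this way, matches whole entries. If each entry holds several words, as the line.split() in the question suggests, one possible approach is to correct word by word with a plain dictionary lookup. This is only a sketch under that assumption; correct_line and the sample data are hypothetical names introduced for illustration.

```python
# Hypothetical mapping from misspelled words to their corrections.
corrections = {"teh": "the", "recieve": "receive"}

def correct_line(line, mapping):
    """Replace every whitespace-separated word found in mapping."""
    # dict.get returns the word itself when there is no correction for it.
    return " ".join(mapping.get(word, word) for word in line.split())

lines = ["teh cat", "i recieve mail"]
corrected = [correct_line(line, corrections) for line in lines]
print(corrected)
```

For a pandas text column, the same function could be passed to Series.apply, for example new_data.apply(lambda line: correct_line(line, corrections)). Rejoining with a single space does not preserve the original spacing between words.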
