Question

Version 1

import string, pandas as pd
def correct_contraction1(x, dic):
    for word in dic.keys():
        if word in x:
            x = x.replace(word, " " + dic[word]+ " ")
    return x

Version 2

import string, pandas as pd
def correct_contraction2(x, dic):
    for word in dic.keys():
        if " " + word + " " in x:
            x = x.replace(" " + word + " ", " " + dic[word]+ " ")
    return x

How I use them:

train['comment_text'] = train['comment_text'].apply(correct_contraction1,args=(contraction_mapping,))
#3 mins 40 sec without that space thing (version1)

train['comment_text'] = train['comment_text'].apply(correct_contraction2,args=(contraction_mapping,))
#5 mins 56 sec with that space thing (version2)

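Can someone explain why the speed difference is so large, which seems unlikely, and secondly, whether there are better or hidden pandas tricks to optimize this further? (The code has been tested many times on Kaggle kernels.)

train is a DataFrame with 2 million rows, exactly the same in both cases.
contraction_mapping is a dictionary mapping... (identical in both cases).
Hopefully this is the latest pandas.

Thanks a lot!

Edit-1: The data is from a Kaggle competition, and version 1 is the faster one!

Edit-2: Thanks a lot, you guys rock (I wish I could accept all the answers!)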

Answer 1

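Sorry, I can't explain the difference, but the current approach can easily be improved in any case. It is slow because you scan every sentence multiple times (once per word). You even check each word twice: first to see whether it is present, then to replace it. You could simply replace.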
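The redundant membership test can be seen in isolation: str.replace already returns the text unchanged when the substring is absent, so the `if word in x` check only adds a second scan whenever the word is present. A minimal sketch follows; the name replace_only and the sample mapping are made up for illustration, and it still loops once per word, so it only removes the double check, not the repeated scanning.

```python
def replace_only(text, mapping):
    """Same result as correct_contraction1, without the extra `in` scan."""
    for word, expansion in mapping.items():
        # str.replace returns the text unchanged when `word` does not occur
        text = text.replace(word, " " + expansion + " ")
    return text


# Hypothetical two-entry mapping, standing in for contraction_mapping
sample_mapping = {"can't": "can not", "won't": "will not"}
result = replace_only("I can't go", sample_mapping)
```

Whether dropping the check is measurably faster on real data is something to time rather than assume.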

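This is a key lesson in text replacement, whether you use regex, plain string replacement, or develop your own algorithm: try to go over the text only once, no matter how many words you want to replace. Regex goes a long way, but depending on the implementation it has to go back several characters when it does not find a match. For those interested: look up the trie data structure.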
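To make the trie idea concrete, here is a small pure-Python sketch, not the code of any particular library. The names build_trie and replace_single_pass are invented for this example. It ignores word boundaries and case, and it restarts from the next character after a failed partial match, so it is a plain trie walk rather than full Aho-Corasick.

```python
def build_trie(mapping):
    """Build a nested-dict trie; the key None marks the end of a keyword."""
    root = {}
    for keyword, replacement in mapping.items():
        node = root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node[None] = replacement
    return root


def replace_single_pass(text, trie):
    """Walk the text left to right, replacing the longest keyword at each position."""
    out = []
    i, n = 0, len(text)
    while i < n:
        node, j = trie, i
        match_end, match_repl = -1, None
        # Follow the trie as far as the text allows, remembering the longest hit
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                match_end, match_repl = j, node[None]
        if match_repl is not None:
            out.append(match_repl)
            i = match_end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


trie = build_trie({"can't": "can not", "it's": "it is"})
result = replace_single_pass("it's fine, I can't", trie)
```

The cost of the scan depends on the length of the text and of the partial matches, not on the number of keywords, which is why this approach scales to thousands of words.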

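For example, try an implementation of fast text search (Aho-Corasick). I am developing a library for this, but until then you can use flashtext (which works a bit differently):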

import flashtext
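# already considers word boundaries, so no need for " " + word + " "
fl = flashtext.KeywordProcessor()
# add_keywords_from_dict expects {replacement: [keywords]}, so add each pair instead
for word, replacement in dic.items():
    fl.add_keyword(word, replacement)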

train['comment_text'] = train['comment_text'].apply(fl.replace_keywords)

If you have many words to replace, this will be orders of magnitude faster.

For comparison, on the first data I could find:

Words to replace: 8520
Sentences to replace in: 11230
Replacements made using flashtext: 1706
Replacements made using correct_contraction1: 25 

flashtext: (considers word boundaries and ignores case)
39 ms ± 355 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

correct_contraction1: (does not consider case nor words at end of line)
11.9 s ± 194 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

<unannounced>
30 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So we are talking about a 300x speedup. That doesn't happen every day ;-)

For reference, the regex approach added by Jon Clements:

pandas.str.replace + regex (1733 replacements)
3.02 s ± 82.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

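My new library shaves off a further 30% in my test. I have also seen a 2-3x improvement over flashtext, but more importantly it gives you, the user, more control. It is fully functional; it just needs cleaning up and more documentation.

I'll update the answer when it's ready!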

Answer 2

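You are better off using pandas' Series.str.replace here, giving it a compiled regex built from the contents of your lookup table. The string replacement then operates on the Series faster than applying a function, and you no longer scan each string anywhere near as many times as before. Hopefully that cuts your time from minutes to seconds.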

import re
import pandas as pd

corrections = {
    "it's": "it is",
    "can't": "can not",
    "won't": "will not",
    "haven't": "have not"
}

sample = pd.Series([
    "Stays the same",
    "it's horrible!",
    "I hope I haven't got this wrong as that won't do",
    "Cabbage"
])

Then build your regex so that it looks for every key in the dictionary, case-insensitively and respecting word boundaries:

rx = re.compile(r'(?i)\b({})\b'.format('|'.join(re.escape(c) for c in corrections)))

Then apply str.replace to your column (i.e. change sample to train['comment_text']), passing the regex and a function that takes the match and returns the value for the key that was found:

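# regex=True is required on pandas 2.0+, where str.replace defaults to literal matching
corrected = sample.str.replace(rx, lambda m: corrections.get(m.group().lower()), regex=True)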

corrected is then a Series containing:

['Stays the same',
 'it is horrible!',
 'I hope I have not got this wrong as that will not do',
 'Cabbage']

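Note the casing: an input such as It's is matched case-insensitively and comes out as it is. There are various ways to preserve case, but that is probably not important here and is a different question entirely.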
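If preserving case does matter, one possible approach (a sketch only, not part of the original answer) is to copy a leading capital from the match onto the replacement. The helper name expand_keep_case is hypothetical, and the sketch handles only the first letter, not all-caps input.

```python
import re

corrections = {"it's": "it is", "won't": "will not"}
rx = re.compile(r'(?i)\b({})\b'.format('|'.join(re.escape(c) for c in corrections)))


def expand_keep_case(match):
    """Look up the lower-cased match, then copy a leading capital back."""
    found = match.group()
    replacement = corrections[found.lower()]
    if found[0].isupper():
        replacement = replacement[0].upper() + replacement[1:]
    return replacement


result = rx.sub(expand_keep_case, "It's fine, it's ok, but it Won't last")
```

The same function can be passed as the repl argument of Series.str.replace in place of the lambda.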

Answer 3

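The second version has to perform the concatenation " " + word + " " on every loop iteration, and when a match is found it performs it a second time for the replacement. That makes it slower.

You can't avoid the first concatenation (unless you modify dic so that the keys already have spaces around them). But you can avoid the second one by saving the result in a variable the first time. It will still be slower than the first version, but not by as much.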

def correct_contraction2(x, dic):
    for word in dic.keys():
        spaceword = " " + word + " "
        if spaceword in x:
            x = x.replace(spaceword, " " + dic[word]+ " ")
    return x
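The other option mentioned above, modifying the dictionary so the keys already carry their spaces, could look roughly like this. The names pad_mapping and correct_contraction_padded are made up for the sketch; the padding is done once, before the apply, so no concatenation happens per row.

```python
def pad_mapping(dic):
    """Surround every key and value with single spaces, once, up front."""
    return {" " + word + " ": " " + expansion + " " for word, expansion in dic.items()}


def correct_contraction_padded(x, padded):
    # `padded` already carries the spaces, so nothing is concatenated here
    for spaceword, replacement in padded.items():
        if spaceword in x:
            x = x.replace(spaceword, replacement)
    return x


padded_mapping = pad_mapping({"can't": "can not"})
result = correct_contraction_padded("I can't go", padded_mapping)
```

It would be used the same way as before, passing padded_mapping through args in the apply call. It shares version 2's limitation with words at the start or end of a line.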

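Note that the second version may not work correctly in all cases. If a word is at the beginning or end of a line, it will not be surrounded by spaces. It would be better to use a regex with \b to match word boundaries.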
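A minimal sketch of the \b variant, to show the behaviour at line edges. The function name correct_contraction_rx is hypothetical. This version is about correctness rather than speed, since it still loops over every key, and it assumes each key starts and ends with a word character.

```python
import re


def correct_contraction_rx(x, dic):
    """Replace whole-word occurrences of each key, including at line edges."""
    for word, expansion in dic.items():
        pattern = r'\b' + re.escape(word) + r'\b'
        # A callable keeps backslashes in the replacement text literal
        x = re.sub(pattern, lambda m: expansion, x)
    return x


mapping = {"can't": "can not"}
at_start = correct_contraction_rx("can't stop", mapping)
before_punctuation = correct_contraction_rx("I can't.", mapping)
inside_other_word = correct_contraction_rx("I scan't", mapping)
```

The first two inputs are expanded even though the key is not surrounded by spaces, while the third is left alone because there is no word boundary before the c.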

Why is the speed difference between these two variants so large?