Question

I have 10M texts (they fit in RAM) and a Python dictionary of the form:

"old substring":"new substring"

The dictionary contains ~15k substrings.

I am looking for the fastest way to replace each text using the dict (find every "old substring" in each text and replace it with the "new substring").

The source texts are in a pandas dataframe. So far I have tried these approaches:

1) Replace in a loop with reduce and str.replace (~120 rows/sec)

from functools import reduce  # reduce is not a builtin in Python 3

replaced = []
for row in df.itertuples():
    replaced.append(reduce(lambda x, y: x.replace(y, mapping[y]), mapping, row[1]))

2) Use a simple replace function in a loop ("mapping" is the 15k dict) (~160 rows/sec):

def string_replace(text):
    for key in mapping:
        text = text.replace(key, mapping[key])
    return text

from tqdm import tqdm  # progress bar

replaced = []
for row in tqdm(df.itertuples()):
    replaced.append(string_replace(row[1]))

Also, .iterrows() is 20% slower than .itertuples().

3) Use apply on the Series (also ~160 rows/sec):

replaced = df['text'].apply(string_replace)

At these speeds, processing the whole dataset takes hours.

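Does anyone have experience with this kind of mass substring replacement? Is it possible to speed it up? The solution can be tricky or ugly, but it has to be as fast as possible, and it does not have to use pandas.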

Thanks.

UPDATE: Toy data to check the idea:

df = pd.DataFrame({ "old":
                    ["first text to replace",
                   "second text to replace"]
                    })

mapping = {"first text": "FT", 
           "replace": "rep",
           "second": '2nd'}

Expected result:

                      old         replaced
0   first text to replace        FT to rep
1  second text to replace  2nd text to rep

Answer 1

I think you are looking for regex replacement with df.replace.

If you have a dictionary, pass it as the argument.

d = {'old substring':'new substring','anohter':'another'}

For the whole dataframe:

df.replace(d,regex=True)

For a Series:

df[columns].replace(d,regex=True)

Example:

df = pd.DataFrame({ "old":
                ["first text to replace",
               "second text to replace"]
                })

mapping = {"first text": "FT", 
       "replace": "rep",
       "second": '2nd'}

df['replaced'] = df['old'].replace(mapping,regex=True)
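
A related variant, shown here only as a sketch and not as what this answer itself does: with regex=True the dictionary keys are interpreted as regular expressions, so keys containing characters such as "." or "(" would need escaping. The example below uses the standard re module to build one escaped alternation and replace in a single pass per text. The names pattern and replace_all are made up for illustration, and no speed claim is implied; it would have to be measured on the real data.

```python
import re

mapping = {"first text": "FT",
           "replace": "rep",
           "second": "2nd"}

# Escape each key so regex metacharacters are matched literally, and put
# longer keys first so a short key cannot shadow a longer one that starts
# at the same position.
keys = sorted(mapping, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(k) for k in keys))

def replace_all(text):
    # One scan over the text; each match is looked up in the dict.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

texts = ["first text to replace", "second text to replace"]
replaced = [replace_all(t) for t in texts]
print(replaced)
```

Unlike the chained str.replace loop in the question, this does not re-scan text that has already been replaced, so the two can give different results when one replacement produces another key.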

Answer 2

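One solution is to convert the dictionary into a trie and write the code so that it makes only a single pass over the text being modified.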

Basically, you step through the text and the trie one character at a time, and as soon as a match is found, you replace it.

Of course, if you also need to apply replacements to text that has already been replaced, this becomes harder.
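
The idea above could be sketched roughly as follows. This is a minimal pure-Python illustration under my own assumptions: the trie is a nested dict, the empty-string key marks the end of a dictionary entry, the longest match at each position wins, and already-replaced text is not scanned again. The names build_trie and trie_replace are hypothetical.

```python
def build_trie(mapping):
    # Nested dicts keyed by character; the "" key stores the replacement
    # for a dictionary entry that ends at this node.
    root = {}
    for old, new in mapping.items():
        node = root
        for ch in old:
            node = node.setdefault(ch, {})
        node[""] = new
    return root

def trie_replace(text, root):
    out = []
    i, n = 0, len(text)
    while i < n:
        node, j = root, i
        best_end, best_new = -1, None
        # Walk the trie as far as the text allows, remembering the
        # longest entry that matched starting at position i.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if "" in node:
                best_end, best_new = j, node[""]
        if best_new is not None:
            out.append(best_new)   # emit the replacement, skip the match
            i = best_end
        else:
            out.append(text[i])    # no entry starts here, copy one char
            i += 1
    return "".join(out)

mapping = {"first text": "FT", "replace": "rep", "second": "2nd"}
root = build_trie(mapping)
print([trie_replace(t, root) for t in
       ["first text to replace", "second text to replace"]])
```

The trie is built once and reused for every text, so each text is handled in one left-to-right pass with no separate scan per dictionary key.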

Answer 3

I ran into this problem again and found a fantastic library called flashtext.

The speedup on 10M records with a 15k vocabulary is about x100 (really one hundred times faster than regexp or the other approaches from my first post)!

Very easy to use:

df = pd.DataFrame({ "old":
                    ["first text to replace",
                   "second text to replace"]
                    })

mapping = {"first text": "FT", 
           "replace": "rep",
           "second": '2nd'}

import flashtext
processor = flashtext.KeywordProcessor()

for k, v in mapping.items():
    processor.add_keyword(k, v)

print(list(map(processor.replace_keywords, df["old"])))

Result:

['FT to rep', '2nd text to rep']

If needed, the processor.non_word_boundaries attribute also allows flexible adaptation to different languages.

The trie-based search used here gives an amazing speedup.

Fastest way to replace substrings with a dictionary (on a large dataset)
