Question

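I have a large 100 MB file on which I want to perform about 5000 string replacements. What is the most efficient way to do this?

Is there no better way than reading the file line by line and performing 5000 replacements on each line?

I also tried reading the whole file as a string with the .read method and performing the 5000 replacements on that string, but this was even slower because it creates 5000 copies of the entire file.

The script must work with Python 2.6

and run on Windows.

Thanks in advance.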

Answer 1

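Try the following, in this order, until it is fast enough.

Read the file into one big string and perform each replacement in turn, overwriting the same variable.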

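# 'r+' opens for reading and writing without truncating ('w' would empty the file first)
with open(..., 'r+') as f:
    s = f.read()
    for src, dest in replacements:
        s = s.replace(src, dest)
    f.seek(0)
    f.write(s)
    f.truncate()  # drop leftover bytes if the result is shorter than the original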

Memory-map the file and write a custom function that performs the replacements.
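One possible shape for the memory-mapped approach is sketched below. It is only an illustration, not the answerer's code: the helper name replace_with_mmap is hypothetical, it assumes the replacement table maps byte strings to byte strings, and it writes to a second file rather than editing in place, because replacements of different lengths cannot be patched into a fixed-size mapping.

```python
import mmap
import re

def replace_with_mmap(src_path, dst_path, replacements):
    """Copy src_path to dst_path, applying every replacement in one pass.

    Hypothetical helper: `replacements` maps byte strings to byte strings.
    """
    # Longest keys first, so a short key cannot shadow a longer one.
    keys = sorted(replacements, key=len, reverse=True)
    pat = re.compile(b'|'.join(re.escape(k) for k in keys))
    src = open(src_path, 'rb')
    try:
        # Length 0 maps the whole file (an empty file cannot be mapped).
        mm = mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            dst = open(dst_path, 'wb')
            try:
                pos = 0
                for m in pat.finditer(mm):
                    dst.write(mm[pos:m.start()])         # unchanged bytes before the match
                    dst.write(replacements[m.group(0)])  # the substituted text
                    pos = m.end()
                dst.write(mm[pos:])                      # unchanged tail of the file
            finally:
                dst.close()
        finally:
            mm.close()
    finally:
        src.close()
```

The file is opened in binary mode, so line endings on Windows pass through untouched. Whether this beats a plain read() on a 100 MB file would need to be measured.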

Answer 2

I suggest that instead of doing 5000 searches, you do a single search for 5000 items:

import re

replacements = {
    "Abc-2454": "Gb-43",
    "This": "that",
    "you": "me"
}

pat = re.compile('(' + '|'.join(re.escape(key) for key in replacements.iterkeys()) + ')')
repl = lambda match: replacements[match.group(0)]

You can now apply re.sub to the whole file

with open("input.txt") as inf:
    s = inf.read()

s = pat.sub(repl, s)

with open("result.txt", "w") as outf:
    outf.write(s)

or line by line,

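# Python 2.6 does not accept several context managers in one with statement, so nest them
with open("input.txt") as inf:
    with open("result.txt", "w") as outf:
        outf.writelines(pat.sub(repl, line) for line in inf)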
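One caveat with an alternation pattern like this: the regex engine tries the alternatives from left to right, so if one key is a prefix of another, the shorter key can win depending on dictionary order. A minimal sketch of one way around that is to sort the keys longest first. The helper name build_replacer and the sample keys are made up for illustration.

```python
import re

def build_replacer(replacements):
    """Return a function that applies all replacements in a single pass.

    Hypothetical helper, not part of the answer above.
    """
    # Longest keys first, so "Abc-2454" is tried before its prefix "Abc".
    keys = sorted(replacements, key=len, reverse=True)
    pat = re.compile('|'.join(re.escape(k) for k in keys))
    return lambda text: pat.sub(lambda m: replacements[m.group(0)], text)

replace_all = build_replacer({"Abc": "X", "Abc-2454": "Gb-43"})
print(replace_all("Abc-2454 and Abc"))  # prints: Gb-43 and X
```

Note that a single pass also differs from 5000 sequential str.replace calls in that the output of one replacement is never rescanned by another.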

Answer 3

You should read the text with open() and read(), then use (compiled) regular expressions to do the string replacements. A short example:

import re

# read data
f = open("file.txt", "r")
txt = f.read()
f.close()

# list of patterns and what to replace them with
xs = [("foo","bar"), ("baz","foo")]

# do replacements
for (x,y) in xs:
    regexp = re.compile(x)
    txt = regexp.sub(y, txt)

# write back data
f = open("file.txt", "w")
f.write(txt)
f.close()
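Because the loop above compiles each search string as a regular expression, any metacharacters in it (such as . or parentheses) keep their special meaning. If the search strings are meant literally, which is an assumption about the asker's data, passing them through re.escape first is one way to handle it. A small sketch with made-up sample data:

```python
import re

xs = [("a.c", "X")]   # sample pair: the dot is meant literally
txt = "a.c abc"

unescaped = txt
escaped = txt
for x, y in xs:
    # As a raw pattern, "." matches any character, so "abc" is replaced too.
    unescaped = re.compile(x).sub(y, unescaped)
    # With re.escape, only the literal text "a.c" is replaced.
    escaped = re.compile(re.escape(x)).sub(y, escaped)

print(unescaped)  # prints: X X
print(escaped)    # prints: X abc
```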

Multiple string replacements on a 100 MB file in Python 2.6

3 answers: