Question

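I have several 1 GB+ text files of URLs. I'm trying to use Python to do a find-and-replace that quickly strips out the URL part.

Because the files are large, I don't want to load them into memory.

My code works on a small 50-line test file, but when I run it on a large text file, it actually makes the file bigger.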

import re
import sys

def ProcessLargeTextFile():
    with open("C:\\Users\\Combined files\\test2.txt", "r") as r, open("C:\\Users\\Combined files\\output.txt", "w") as w:
        for line in r:
            line = line.replace('https://twitter.com/', '')
            w.write(line)
    return

ProcessLargeTextFile()
print("Finished")

I tested my code on a small file, and the result was the Twitter usernames (as desired):

USERNAME_1

USERNAME_2

username_3

whereas the large file results in

https://twitter.com/username_1਍ഀ

https://twitter.com/username_2਍ഀ

https://twitter.com/username_3਍ഀ

Answer 1

This is a file encoding problem. This works:

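import re

def main():
    # The input is UTF-16, so decode it explicitly and write the result as UTF-8.
    # Mode "a" appends: delete any old output.txt before re-running.
    with open("1-10_no_dups_split_2.txt", "r", encoding="UTF-16") as inputfile, \
         open("output.txt", "a", encoding="UTF-8") as outputfile:
        for line in inputfile:
            # Strip the URL prefix at the start of the line (dot escaped to match literally).
            line = re.sub(r"^https://twitter\.com/", "", line)
            outputfile.write(line)

main()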

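The trick is to specify UTF-16 when reading and then write the output as UTF-8. And voilà, the weird stuff goes away :) I do a lot of work moving text files around with Python. There are plenty of encoding settings you can use, for example to replace certain characters automatically. If you run into something odd, just read up on the open() function, or come back here :).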
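If you are not sure in advance which of your files are UTF-16, one option is to look at the byte-order mark (BOM) at the start of the file before choosing an encoding. The following is only a minimal sketch under that assumption: sniff_encoding and strip_prefix are hypothetical helper names, not part of the original answer, and a file without a BOM simply falls back to the default you pass in.

```python
import codecs

def sniff_encoding(path, default="utf-8"):
    """Guess the text encoding from the BOM (hypothetical helper; falls back to `default`)."""
    with open(path, "rb") as f:
        head = f.read(4)
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default

def strip_prefix(src, dst, prefix="https://twitter.com/"):
    """Stream src line by line, drop a leading prefix, and write UTF-8 to dst."""
    encoding = sniff_encoding(src)
    # newline="" keeps the original line endings untouched on both sides.
    with open(src, "r", encoding=encoding, newline="") as r, \
         open(dst, "w", encoding="utf-8", newline="") as w:
        for line in r:
            if line.startswith(prefix):
                line = line[len(prefix):]
            w.write(line)
```

Because the file is still iterated line by line, only one line at a time is held in memory, regardless of the file size.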

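From a quick look at your results, you may want some regular expressions so you can also catch https://mobile.twitter.com/ and other variants, but that's another story. Good luck!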
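As a rough illustration of that idea, here is one possible pattern. It assumes the only variants you care about are http/https and an optional www. or mobile. subdomain; URL_PREFIX and extract_username are names made up for this sketch, so adjust the pattern to whatever actually appears in your files.

```python
import re

# Assumed variants: http or https, optionally prefixed with "www." or "mobile.".
URL_PREFIX = re.compile(r"^https?://(?:(?:www|mobile)\.)?twitter\.com/", re.IGNORECASE)

def extract_username(line):
    """Remove a leading Twitter URL prefix; other lines are returned unchanged."""
    return URL_PREFIX.sub("", line, count=1)
```

Inside the read loop you would call extract_username(line) in place of the plain replace() or re.sub() call.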

Answer 2

You can use the buffering parameter of the open() function. Here is the code:

import re
import sys

def ProcessLargeTextFile():
    with open("C:\\Users\\Combined files\\test2.txt", "r", buffering=200000000) as r, open("C:\\Users\\Combined files\\output.txt", "w") as w:
        for line in r:
            line = line.replace('https://twitter.com/', '')
            w.write(line)
    return

ProcessLargeTextFile()
print("Finished")

So the file is read through a buffer of 200,000,000 bytes (about 200 MB) at a time.

Processing large .txt files in Python only works for small files
