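I have a file of nearly 100,000 lines. I want to run a cleaning process on it (lowercasing, removing stop words, etc.), but it takes a long time.
A sample run of the script on 10,000 lines takes 15 minutes. For the whole file I expected about 150 minutes, but it takes 5 hours.
At the start, the file is opened for reading: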
fileinput = open('tweets.txt', 'r')
lines = fileinput.read().lower()  # lowercases the text, but loads the whole file into memory
for line in fileinput:
    lines = line.lower()
Question: is there a way to read the first 10,000 lines, run the cleaning process on them, then read the next block of lines, and so on?
Answer 0 (score: 2)
I strongly recommend working line by line rather than reading the whole file at once (in other words, don't use .read()).
with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        line = line.lower()
        # ... do something with line ...
        # (for example, write the line to a new file, or print it)
This will automatically take advantage of Python's built-in buffering capabilities.
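As a rough illustration of the line-by-line approach applied to the cleaning steps the question mentions, here is a minimal sketch. The stop-word list, `clean_line`, and `clean_stream` are hypothetical names made up for this example, and in-memory `io.StringIO` objects stand in for the real input and output files.

```python
import io

# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "and"}


def clean_line(line, stop_words=STOP_WORDS):
    """Lowercase a line and drop stop words (set membership tests are cheap)."""
    words = line.lower().split()
    return " ".join(word for word in words if word not in stop_words)


def clean_stream(src, sink, stop_words=STOP_WORDS):
    """Clean src line by line, writing each result to sink; return the line count."""
    count = 0
    for line in src:
        sink.write(clean_line(line, stop_words) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # In-memory stand-ins for open('tweets.txt', 'r') and an output file.
    src = io.StringIO("The cat IS on a mat\nTime to GO\n")
    sink = io.StringIO()
    clean_stream(src, sink)
    print(sink.getvalue(), end="")
```

Only one line is held at a time, so memory use should stay roughly constant regardless of file size. Keeping the stop words in a set rather than a list is a design choice that avoids scanning the whole list for every word.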
Answer 1 (score: 1)
Try processing the file one line at a time:
lowered = []
with open('tweets.txt', 'r') as handle:
    for line in handle:
        # keep accumulating the results ...
        lowered.append(line.lower())
        # ... or just dump them to stdout right away
        print(line.lower())

for line in lowered:
    # print or write to file or whatever you require
    pass
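This way you reduce the memory overhead, which for large files can lead to swapping and hurt performance.
Here are some benchmarks for a file with about 1M lines: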
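If processing in blocks of a fixed number of lines is really wanted, as the question asks, one possible sketch uses `itertools.islice` to pull a bounded batch from the open file on each pass. The helper name `read_in_chunks` is made up for this example, and `io.StringIO` stands in for the real file.

```python
import io
from itertools import islice


def read_in_chunks(handle, chunk_size=10000):
    """Yield lists of up to chunk_size lines until the file is exhausted."""
    while True:
        chunk = list(islice(handle, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Seven fake lines stand in for open('tweets.txt', 'r').
    handle = io.StringIO("".join("Tweet %d\n" % i for i in range(7)))
    for chunk in read_in_chunks(handle, chunk_size=3):
        lowered = [line.lower() for line in chunk]
        print(len(lowered), lowered[0].strip())
```

At most `chunk_size` lines are in memory at once. For a simple per-line transformation such as lowercasing, plain line-by-line iteration is likely just as good; batching mainly helps when a later step works on groups of lines.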
# (1) real 0.223 user 0.195 sys 0.026 pcpu 98.71
with open('medium.txt') as handle:
    for line in handle:
        pass

# (2) real 0.295 user 0.262 sys 0.025 pcpu 97.21
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        pass
    print(i) # 1031124

# (3) real 21.561 user 5.072 sys 3.530 pcpu 39.89
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        print(line.lower())

# (4) real 1.702 user 1.605 sys 0.089 pcpu 99.50
lowered = []
with open('medium.txt') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())

# (5) real 2.307 user 1.983 sys 0.159 pcpu 92.89
lowered = []
with open('medium.txt', 'r') as handle:
    for i, line in enumerate(handle):
        lowered.append(line.lower())
with open('lowered.txt', 'w') as handle:
    for line in lowered:
        handle.write(line)
You can also iterate over both files at the same time:
# (6) real 1.944 user 1.666 sys 0.115 pcpu 91.59
with open('medium.txt', 'r') as src, open('lowered.txt', 'w') as sink:
    for i, line in enumerate(src):
        sink.write(line.lower())
The results as a table, sorted by "real" (wall-clock) time in seconds:
# case  description               real (s)
# (1)   noop                         0.223
# (2)   w/ enumerate                 0.295
# (4)   list buffer                  1.702
# (6)   on-the-fly                   1.944
# (5)   r -> list buffer -> w        2.307
# (3)   stdout print                21.561
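The answer does not say how these timings were taken. As a rough way to try a similar measurement from inside Python, the sketch below times the on-the-fly variant (6) with `time.perf_counter`. The helper names, the temporary file, and its contents are assumptions made for this example, so absolute numbers will differ from the table.

```python
import os
import tempfile
import time


def lower_on_the_fly(src_path, sink_path):
    """Variant (6): lowercase each line while copying it to the sink file."""
    with open(src_path, 'r') as src, open(sink_path, 'w') as sink:
        for line in src:
            sink.write(line.lower())


def timed(func, *args):
    """Return the wall-clock seconds taken by func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, 'sample.txt')    # hypothetical test file
        sink_path = os.path.join(tmp, 'lowered.txt')
        with open(src_path, 'w') as handle:
            handle.writelines('Some MIXED case line %d\n' % i for i in range(100000))
        print('on-the-fly: %.3f s' % timed(lower_on_the_fly, src_path, sink_path))
```

A single run is noisy; repeating the measurement a few times and taking the minimum usually gives a steadier figure.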
Answer 2 (score: 0)
Change your script as follows:
with open('tweets.txt', 'r') as fileinput:
    for line in fileinput:
        # do what you need to do with each line
        line = line.lower()
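So basically, don't use read() to load the whole file into memory; just iterate over the lines of the open file. When you read a huge file into memory, your process may grow to the point where the system has to swap parts of it out, which makes it very slow.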
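To see the memory difference on a small scale, the following sketch compares the peak memory traced by the standard `tracemalloc` module for the two approaches. The helper names and the generated sample file are assumptions for this example. `tracemalloc` only tracks allocations made through Python's allocators, so treat the figures as indicative rather than as the full process footprint.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, path):
    """Return the peak traced Python memory (bytes) while running func(path)."""
    tracemalloc.start()
    func(path)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


def lower_with_read(path):
    with open(path, 'r') as handle:
        handle.read().lower()      # whole file in memory, plus a lowered copy


def lower_by_line(path):
    with open(path, 'r') as handle:
        for line in handle:
            line.lower()           # only a small buffer and one line at a time


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'sample.txt')  # hypothetical test file
        with open(path, 'w') as handle:
            handle.writelines('Some MIXED case line %d\n' % i for i in range(200000))
        print('read():      ', peak_bytes(lower_with_read, path))
        print('line by line:', peak_bytes(lower_by_line, path))
```

The peak for the read() variant should grow with the file size, while the line-by-line peak should stay roughly flat.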