I need to read a large data file (~200 GB) line by line with a Python script.
I have tried line-by-line approaches, but they use a lot of memory. I would like to be able to read the file chunk by chunk.
Is there a better way to load a large file line by line, say,
a) by explicitly capping the maximum number of lines the file can have loaded in memory at any one time? Or b) by loading it in chunks of a given size (e.g. 1024 bytes), provided the last line of each chunk loads completely, without being truncated?
Answer 0 (score: 2)
Rather than reading everything at once, try reading line by line:
with open("myFile.txt") as f:
for line in f:
#Do stuff with your line
Or, if you want to read N lines at a time:
with open("myFile.txt") as myfile:
head = [next(myfile) for x in xrange(N)]
print head
To handle the StopIteration raised when you hit the end of the file, a simple try/except will do (although there are plenty of ways):
head = []
try:
    for x in range(N):  # use xrange on Python 2
        head.append(next(myfile))
except StopIteration:
    pass  # fewer than N lines were left; head holds whatever was read
Or you can handle however many lines remain in any way you like.
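As an aside (my addition, not part of the original answer), the standard library's itertools.islice does the same batching without any manual StopIteration handling; a minimal sketch, reusing the myFile.txt name from above with an arbitrary batch size:

from itertools import islice

N = 1000  # batch size, arbitrary for illustration
with open("myFile.txt") as myfile:
    while True:
        batch = list(islice(myfile, N))  # up to N lines; empty list at end of file
        if not batch:
            break
        print(len(batch))  # do stuff with the batch of lines

islice simply stops when the underlying iterator runs out, so the empty-list check replaces the try/except.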
Answer 1 (score: 0)
To iterate over the lines of a file, don't use readlines, which reads the entire file into a list in memory. Instead, iterate over the file itself (you will find versions that use xreadlines, which is deprecated and just returns the file object itself) or:
with open(the_path, 'r') as the_file:
    for line in the_file:
        # Do stuff with the line
To read multiple lines at a time, you can use next on the file (it is an iterator), but you need to catch StopIteration, which indicates that there is no data left:
with open(the_path, 'r') as the_file:
    while True:
        the_lines = []
        done = False
        for i in range(number_of_lines):  # Use xrange on Python 2
            try:
                the_lines.append(next(the_file))
            except StopIteration:
                done = True  # Reached end of file
                break
        # Do stuff with the lines
        if done:
            break  # No data left
Of course, you can also load the file in chunks of a specified number of bytes:
with open(the_path, 'r') as the_file:
    while True:
        data = the_file.read(the_byte_count)
        if len(data) == 0:
            # All data is gone
            break
        # Do stuff with the data chunk
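That chunked read can truncate the last line mid-way, which is exactly what option (b) in the question wants to avoid. A minimal sketch (my addition, not from the original answer) that carries the trailing partial line over into the next chunk, reusing the the_path placeholder and the 1024-byte size from the question, and assuming newline-delimited text:

with open(the_path, 'r') as the_file:
    leftover = ''  # partial line carried over from the previous chunk
    while True:
        chunk = the_file.read(1024)
        if not chunk:
            break
        lines = (leftover + chunk).split('\n')
        leftover = lines.pop()  # the last piece may be a truncated line
        for line in lines:
            print(line)  # do stuff with each complete line
    if leftover:
        print(leftover)  # final line of the file (it had no trailing newline)

This keeps at most one chunk plus one partial line in memory at any time.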