I need to read a large data file (~200 GB) line by line with a Python script.
I have tried line-by-line approaches, but they use a lot of memory. I would like to be able to read the file chunk by chunk.
Is there a better way to load a large file line by line, say,
a) by explicitly capping the maximum number of lines the file can have loaded in memory at any one time? Or b) by loading it in chunks of a given size (e.g. 1024 bytes), provided the last line of each chunk loads completely, without being truncated?
Answer 0 (score: 2)
Rather than reading everything at once, try reading line by line:
with open("myFile.txt") as f:
for line in f:
#Do stuff with your line
Or, if you want to read N lines at a time:
with open("myFile.txt") as myfile:
head = [next(myfile) for x in xrange(N)]
print head
To handle the StopIteration raised when you hit the end of the file, a simple try/except will do (although there are plenty of ways):
head = []
try:
    for x in range(N):  # use xrange on Python 2
        head.append(next(myfile))
except StopIteration:
    pass  # fewer than N lines were left; head holds whatever was read
Or you can handle however many lines remain in any way you like.
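As an aside (my addition, not part of the original answer), the standard library's itertools.islice does the same batching without any manual StopIteration handling; a minimal sketch, reusing the myFile.txt name from above with an arbitrary batch size:

from itertools import islice

N = 1000  # batch size, arbitrary for illustration
with open("myFile.txt") as myfile:
    while True:
        batch = list(islice(myfile, N))  # up to N lines; empty list at end of file
        if not batch:
            break
        print(len(batch))  # do stuff with the batch of lines

islice simply stops when the underlying iterator runs out, so the empty-list check replaces the try/except.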
Answer 1 (score: 0)
To iterate over the lines of a file, don't use readlines, which reads the entire file into a list in memory. Instead, iterate over the file itself (you will find versions that use xreadlines, which is deprecated and just returns the file object itself) or:
with open(the_path, 'r') as the_file:
    for line in the_file:
        # Do stuff with the line
To read multiple lines at a time, you can use next on the file (it is an iterator), but you need to catch StopIteration, which indicates that there is no data left:
with open(the_path, 'r') as the_file:
    while True:
        the_lines = []
        done = False
        for i in range(number_of_lines):  # Use xrange on Python 2
            try:
                the_lines.append(next(the_file))
            except StopIteration:
                done = True  # Reached end of file
                break
        # Do stuff with the lines
        if done:
            break  # No data left
Of course, you can also load the file in chunks of a specified number of bytes:
with open(the_path, 'r') as the_file:
    while True:
        data = the_file.read(the_byte_count)
        if len(data) == 0:
            # All data is gone
            break
        # Do stuff with the data chunk
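That chunked read can truncate the last line mid-way, which is exactly what option (b) in the question wants to avoid. A minimal sketch (my addition, not from the original answer) that carries the trailing partial line over into the next chunk, reusing the the_path placeholder and the 1024-byte size from the question, and assuming newline-delimited text:

with open(the_path, 'r') as the_file:
    leftover = ''  # partial line carried over from the previous chunk
    while True:
        chunk = the_file.read(1024)
        if not chunk:
            break
        lines = (leftover + chunk).split('\n')
        leftover = lines.pop()  # the last piece may be a truncated line
        for line in lines:
            print(line)  # do stuff with each complete line
    if leftover:
        print(leftover)  # final line of the file (it had no trailing newline)

This keeps at most one chunk plus one partial line in memory at any time.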