Question

Happy holidays, everyone!

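I have to process large CSV files (around 5 GB each) on a modest laptop, so I'm learning to read files in chunks (I'm a complete noob), specifically with Python 2.7. I found this very nice example: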

# chunked file reading
from __future__ import division
import os

def get_chunks(file_size):
    chunk_start = 0
    chunk_size = 0x20000  # 131072 bytes, default max ssl buffer size
    while chunk_start + chunk_size < file_size:
        yield(chunk_start, chunk_size)
        chunk_start += chunk_size

    final_chunk_size = file_size - chunk_start
    yield(chunk_start, final_chunk_size)

def read_file_chunked(file_path):
    with open(file_path, 'rb') as file_:  # binary mode, so read(n) returns raw bytes
        file_size = os.path.getsize(file_path)

        print('File size: {}'.format(file_size))

        progress = 0

        for chunk_start, chunk_size in get_chunks(file_size):

            file_chunk = file_.read(chunk_size)

            # do something with the chunk, encrypt it, write to another file...

            progress += len(file_chunk)
            print('{0} of {1} bytes read ({2}%)'.format(
                progress, file_size, int(progress / file_size * 100))
            )

if __name__ == '__main__':
    read_file_chunked('some-file.gif')

(Source: https://gist.github.com/richardasaurus/21d4b970a202d2fffa9c)

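But it's still not quite clear to me. For example, suppose I've written some code and want to test it on a small fraction of my dataset, just to check that it runs properly. How can I read only the first 10% of my CSV file and run my code on that chunk without having to store the rest of the dataset in memory? I'd appreciate any hint; some reading or external references would also be welcome, as long as they relate to chunking files with Python. Thanks!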

Answer 1

Consider a CSV file that, when opened in Notepad or any plain text editor, looks like this:

CU-C2376;Airbus A380;50.00;259.00
J2-THZ;Boeing 737;233.00;213.00
SU-XBG;Embraer ERJ-195;356.00;189.00
TI-GGH;Boeing 737;39.00;277.00
HK-6754J;Airbus A380;92.00;93.00
6Y-VBU;Embraer ERJ-195;215.00;340.00
9N-ABU;Embraer ERJ-195;151.00;66.00
YV-HUI;Airbus A380;337.00;77.00

If you look closely, each line corresponds to one row, and the values within a row are separated by ";".
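As a rough illustration of that layout, the standard `csv` module can split each line on ";" and `itertools.islice` can stop after a fixed number of rows. The helper name `parse_first_rows` and the in-memory `sample` list are assumptions made for this sketch; an open file object could be passed in place of the list.

```python
import csv
from itertools import islice

def parse_first_rows(lines, n, delimiter=';'):
    """Parse at most n rows from an iterable of text lines (e.g. an open file).

    Hypothetical helper: csv.reader is lazy, so only the first n lines
    are ever pulled from the underlying iterable.
    """
    return list(islice(csv.reader(lines, delimiter=delimiter), n))

# A few rows copied from the sample data above
sample = [
    'CU-C2376;Airbus A380;50.00;259.00\n',
    'J2-THZ;Boeing 737;233.00;213.00\n',
    'SU-XBG;Embraer ERJ-195;356.00;189.00\n',
    'TI-GGH;Boeing 737;39.00;277.00\n',
]

rows = parse_first_rows(sample, 3)
for row in rows:
    print(row)
```

Each row comes back as a list of strings, so numeric columns still need an explicit `float(...)` conversion.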

Suppose I want to read only the first three rows:

with open('data.csv') as f:
    lines = []
    for i in range(3):
        lines.append(f.readline())
    # Do some stuff with the first three lines

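This is a better way to read the file than taking fixed-size chunks: say the file is 10 MB and you read the first 3 MB; the last byte you read may fall in the middle of a row, so the final record would be incomplete.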
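To get roughly the first 10% of a file while still stopping on a line boundary, one option is to read line by line and count bytes against the file size. This is only a sketch: the function name `iter_first_fraction` is made up here, and it assumes the file is UTF-8 encoded and has no header row to skip.

```python
import os

def iter_first_fraction(path, fraction=0.1):
    """Yield whole lines from the start of `path` until roughly
    `fraction` of the file's bytes have been consumed.

    Hypothetical helper; assumes UTF-8 text. Only one line is held
    in memory at a time, so the rest of the file is never loaded.
    """
    limit = os.path.getsize(path) * fraction
    bytes_read = 0
    with open(path, 'rb') as f:  # binary mode so len(line) counts bytes
        for line in f:
            if bytes_read >= limit:
                break
            bytes_read += len(line)
            yield line.decode('utf-8').rstrip('\r\n')

# Example usage (assumes a file named data.csv exists):
# for line in iter_first_fraction('data.csv', 0.1):
#     fields = line.split(';')
```

Because the check happens before each line is yielded, the result can overshoot the requested fraction by at most one line, but it never ends in the middle of a row.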

Alternatively, you can use a library such as pandas.
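With pandas, a minimal sketch might look like the following. `read_csv` accepts `nrows` to load only the first rows and `chunksize` to iterate over the file piece by piece. The helper names `head_with_pandas` and `iter_chunks_with_pandas` are invented for this example, and `header=None` assumes the file has no header row, as in the sample data above.

```python
import io
import pandas as pd

# Sample rows from above, held in memory so the sketch is self-contained
SAMPLE = u"""CU-C2376;Airbus A380;50.00;259.00
J2-THZ;Boeing 737;233.00;213.00
SU-XBG;Embraer ERJ-195;356.00;189.00
TI-GGH;Boeing 737;39.00;277.00
HK-6754J;Airbus A380;92.00;93.00
6Y-VBU;Embraer ERJ-195;215.00;340.00
9N-ABU;Embraer ERJ-195;151.00;66.00
YV-HUI;Airbus A380;337.00;77.00
"""

def head_with_pandas(source, n):
    """Load only the first n rows (hypothetical helper)."""
    return pd.read_csv(source, sep=';', header=None, nrows=n)

def iter_chunks_with_pandas(source, rows_per_chunk):
    """Return an iterator of DataFrames, each holding at most
    rows_per_chunk rows (hypothetical helper)."""
    return pd.read_csv(source, sep=';', header=None, chunksize=rows_per_chunk)

first_three = head_with_pandas(io.StringIO(SAMPLE), 3)
print(first_three)

for chunk in iter_chunks_with_pandas(io.StringIO(SAMPLE), 3):
    print(len(chunk))  # process one chunk at a time here
```

For a real file, a path such as 'data.csv' would replace the `io.StringIO(SAMPLE)` argument; only the requested rows or the current chunk are held in memory.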
