I'm writing a Python script that reads a big file (~5 GB) line by line, makes some modifications to each line, and then writes the result to another file.
When I use file.readlines() to read the input file, my disk usage reaches ~90% and the disk speed goes above 100 Mbps (I know this method shouldn't be used for large files).
I haven't measured the program's execution time for that case because my system becomes unresponsive (the memory fills up).
When I use an iterator like the one below (and this is what I'm actually using in my code):
with open('file.csv', 'r') as inFile:
    for line in inFile:
my disk usage stays below 10%, the speed stays below 5 Mbps, and the program takes ~20 minutes to finish for the 5 GB file. Wouldn't this time be lower if my disk usage were higher?
Also, does it really take ~20 minutes to read a 5 GB file, process it line by line making some modifications on each line, and finally write it to a new file, or am I doing something wrong?
What I can't figure out is why the program doesn't use my system to its full potential when performing the I/O operations. If it did, my disk usage should have been higher, right?
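For reference, a minimal sketch of the pattern I'm describing (the modify() step and the output file name are just placeholders):

def modify(line):
    # placeholder for the per-line modification
    return line

with open('file.csv', 'r') as inFile, open('out.csv', 'w') as outFile:
    for line in inFile:
        outFile.write(modify(line))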
Answer 0 (score: 1)
I don't think your main problem is reading the file, since you are using open(); instead, I would check what you are doing here:
    make some modifications in each of the lines, and then write it to another file.
So try reading the file without making any modifications or writing them to the other file, to find out what your system needs just to read it.
Here is how I tested that in my environment. First, I created a 1.2 GB file:
timeout 5 yes "Ergnomic systems for c@ts that works too much" >> foo
I didn't use dd or truncate, because those would lead to memory errors when reading the file.
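If you prefer to create the test file from Python instead, a quick sketch (the line text and repetition count are arbitrary, chosen to reach roughly 1.2 GB):

line = 'Ergnomic systems for c@ts that works too much\n'   # ~46 bytes per line
with open('foo', 'w') as f:
    for _ in xrange(26000000):   # ~26M lines * 46 bytes ~= 1.2 GB; use range() on Python 3
        f.write(line)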
Now for some I/O tests reading that file, which is an already-optimized operation, as @Serge Ballesta mentioned:
#!/usr/bin/python
with open('foo') as infile:
    for line in infile:
        pass
print 'file readed'
$ time python io_test.py
file readed
real 0m2.647s
user 0m2.343s
sys 0m0.327s
Changing the buffering option of open():
# -------------------------------------- NO BUFFERING
with open('foo', 'r', 0) as infile:
    for line in infile:
        pass
print 'file readed'
$ time python io_test.py
file readed
real 0m2.787s
user 0m2.406s
sys 0m0.374s
# -------------------------------------- ONE LINE BUFFERED
with open('foo', 'r', 1) as infile:
    for line in infile:
        pass
print 'file readed'
$ time python io_test.py
file readed
real 0m4.331s
user 0m2.468s
sys 0m1.811s
# -------------------------------------- ~700 MB BUFFER
with open('foo', 'r', 700000000) as infile:
    for line in infile:
        pass
print 'file readed'
$ time python io_test.py
file readed
real 0m3.137s
user 0m2.311s
sys 0m0.827s
Why you shouldn't use readlines:
with open('foo') as f:
    lines = f.readlines()
    for line in lines:
        pass
$ time python io_test.py
real 0m6.428s
user 0m3.858s
sys 0m2.499s
Answer 1 (score: 1)
Reading a file line by line in Python is already an optimized operation: Python loads an internal buffer from disk and hands lines out to the caller, which means the line splitting is already done in memory by the Python interpreter.
In general, processing can be disk-IO bound (when disk access is the limiting factor), memory bound, or processor bound. If a network is involved, it can be network-IO bound or remote-server bound; it always depends on what the limiting factor is. When processing a file line by line, the process is unlikely to be memory bound. To determine whether disk IO is the limiting part, you could try to simply copy the file with the system copy utility. If that takes about 20 minutes, the process is disk-IO bound; if it is much faster, the modifications on the lines cannot be neglected.
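If you want to run that comparison from Python itself, here is a rough sketch (file names are placeholders) that times a raw copy against a plain line-by-line pass; note that the copy warms the OS page cache, so for a fair comparison run each test on a cold cache or a fresh file:

import shutil
import time

# time a raw copy: essentially pure disk IO
start = time.time()
shutil.copyfile('file.csv', 'file_copy.csv')
copy_time = time.time() - start

# time a line-by-line pass with no processing
start = time.time()
with open('file.csv') as f:
    for line in f:
        pass
read_time = time.time() - start

print('copy: %.1fs  read: %.1fs' % (copy_time, read_time))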
In any case, loading a whole big file in memory is always a bad idea...
Answer 2 (score: 0)
It simply depends on the size of the buffer you use for reading the file.
Let's look at an example:
You have a file which contains 20 characters.
Your buffer size is 2 characters.
Then you have to use at least 10 system calls to read the entire file.
A system call is a very expensive operation, because the kernel has to switch the execution context.
If you have a buffer that is 20 characters in size, you need just 1 system call, and therefore only one kernel trap is necessary.
I assume that the first approach (readlines) simply uses a bigger buffer internally.
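To make that concrete, here is a rough sketch (the file name and buffer sizes are made up) that reads a file in fixed-size chunks and counts the underlying read() system calls:

import os

def count_reads(path, bufsize):
    # read the whole file bufsize bytes at a time, counting the read() calls
    fd = os.open(path, os.O_RDONLY)
    calls = 0
    try:
        while True:
            chunk = os.read(fd, bufsize)
            calls += 1
            if not chunk:
                break
    finally:
        os.close(fd)
    return calls

# For a 20-byte file, count_reads('foo', 2) needs about 11 calls
# (10 for the data plus 1 to see EOF), while count_reads('foo', 20) needs only 2.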
Answer 3 (score: 0)
You need RAM not only for the file itself, but also for the input and output buffers and a second copy of the modified file; that can easily overwhelm your RAM. If you don't want to read, modify and write each single line inside the for loop, you may want to group some of the lines together, as sketched below. That probably makes reading/writing faster, at the cost of a bit more algorithmic overhead. At the end of the day I would stick with the line-by-line approach. HTH! LUI
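A rough sketch of that grouping idea (the batch size, file names and modify() step are arbitrary):

from itertools import islice

def modify(line):
    # placeholder for the per-line modification
    return line

BATCH = 10000   # arbitrary number of lines per batch

with open('file.csv') as infile, open('out.csv', 'w') as outfile:
    while True:
        batch = list(islice(infile, BATCH))
        if not batch:
            break
        outfile.writelines(modify(line) for line in batch)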