Question

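My CSV files can have up to 10M+ rows. I'm trying to get the total row count of each file so that I can split its processing across multiple processes. To do that, I give each child process a start row and an end row to handle. For a 2 GB file this cut my processing time from 180 seconds to 110 seconds. However, the approach needs to know the row count, and getting the exact count takes about 30 seconds. That time feels wasted when an approximation would do: if the last process has to read an extra hundred thousand rows or so, that only adds a couple of seconds, compared with the 30 seconds needed for an exact count.

How can I get an approximate row count for a file? I'd like the estimate to be within 1 million rows (preferably within a few hundred thousand). Is something like this possible?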

Answer 1

This will be very inaccurate, but it takes the size of one line and divides the file size by it.

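import os

# Read the first line as raw bytes so that its length includes
# the delimiters, any quoting, and the line terminator.
with open("example.csv", "rb") as f:
    row1 = f.readline()

# Size of the first line in bytes. (sys.getsizeof(len(...)) would return
# the size of a Python int object, not the length of the line.)
_Size = len(row1)

print("Size of Line 1 >", _Size)
print("Size of File   >", os.path.getsize("example.csv"))
print("Approx Lines   >", os.path.getsize("example.csv") / _Size)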

(Edit) If you change the last line to math.floor(os.path.getsize("example.csv") / _Size), it is actually quite accurate.
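A single line may not be representative of the whole file, so one possible variation is to sample a larger block from the start of the file, count the line breaks in it, and extrapolate. The following is only a sketch of that idea: the function name estimate_line_count and the sample_bytes parameter are made up for illustration, and it assumes \n line endings and rows of broadly similar length throughout the file.

```python
import os

def estimate_line_count(path, sample_bytes=1024 * 1024):
    """Estimate the number of lines in `path` from a sample of its first bytes."""
    file_size = os.path.getsize(path)
    if file_size == 0:
        return 0

    # Binary mode: we only care about byte counts, not decoded text.
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)

    newlines = sample.count(b"\n")
    if newlines == 0:
        # No line break in the sample: treat the file as a single line.
        return 1

    # Lines per byte in the sample, scaled up to the full file size.
    return round(file_size * newlines / len(sample))
```

A larger sample_bytes costs more read time but averages over more rows. If the sample covers the whole file, the result is simply the number of line breaks in it.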

Answer 2

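I suggest splitting the file into chunks of similar size before parsing.

The sample code below splits data.csv into 4 chunks of roughly equal size by seeking to an approximate offset and then searching for the next line break. It then calls launch_worker() for each chunk, passing the start offset and the length of the data that worker should process.

Ideally, you would use a subprocess for each worker.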

import os

n_workers = 4

# open the data file in binary mode, and find out how long it is
f = open('data.csv', 'rb')
length_total = f.seek(0, os.SEEK_END)

# split the file evenly among n workers (integer division avoids float rounding)
length_worker = length_total // n_workers

prev_worker_end = 0

for i in range(n_workers):
    # seek to the next worker's approximate start
    file_pos = f.seek(prev_worker_end + length_worker, os.SEEK_SET)

    # see if we tried to seek past the end of the file... the last worker probably will
    if file_pos >= length_total:                                            # <-- (3)
        # ... if so, this worker's chunk extends to the end of the file
        this_worker_end = length_total

    else:
        # ... otherwise, look for the next line break
        buf = f.read(256)                                                   # <-- (1)
        next_line_end = buf.index(b'\n')                                    # <-- (2)

        this_worker_end = file_pos + next_line_end

    # calculate how long this worker's chunk is
    this_worker_length = this_worker_end - prev_worker_end
    if this_worker_length > 0:
        # if there is any data in the chunk, then try to launch a worker
        launch_worker(prev_worker_end, this_worker_length)

    # remember where the last worker got to in the file
    prev_worker_end = this_worker_end + 1

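Some notes on the points marked in the code:

(1) You need to make sure that read() consumes at least one whole line. Alternatively, if you don't know in advance how long a line can be, you can loop over multiple read() calls.
(2) This assumes \n line endings... you may need to adapt it for your data.
(3) The last worker will get slightly less data to process than the other workers... this is because we always search forward to the next line break. The more workers you have, the less data the final worker gets. It isn't significant (about 200-500 bytes in my tests).

Make sure you always use binary mode, because text mode can give you unreliable seek() / read() behaviour.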
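The looping read mentioned for marker (1) could look roughly like the sketch below. The helper name find_next_newline and its block_size parameter are assumptions made for illustration. It expects a file object opened in binary mode and a start offset that lies within the file, and it still assumes \n line endings.

```python
def find_next_newline(f, start, block_size=4096):
    """Return the offset of the first newline byte at or after `start`.

    If no newline is found, return the offset at which the file ended.
    `f` must be opened in binary mode.
    """
    f.seek(start)
    pos = start
    while True:
        buf = f.read(block_size)
        if not buf:
            # Reached end of file without finding a line break.
            return pos
        idx = buf.find(b"\n")
        if idx != -1:
            return pos + idx
        # No line break in this block: keep scanning from the next block.
        pos += len(buf)
```

With such a helper, the lines marked (1) and (2) could be replaced by this_worker_end = find_next_newline(f, file_pos), which also avoids the ValueError that buf.index() raises when a line is longer than the 256-byte buffer.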

An example launch_worker() looks like this:

def launch_worker(offset, length):
    print('Starting a worker... using chunk %d - %d (%d bytes)...' 
           % ( offset, offset + length, length ))

    with open('data.csv', 'rb') as f:
        f.seek(offset, os.SEEK_SET)
        worker_buf = f.read(length)

    lines = worker_buf.split(b'\n')

    print('First Line:')
    print('\t' + str(lines[0]))
    print('Last Line:')
    print('\t' + str(lines[-1]))
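For actual CSV data, a worker would probably hand its chunk to the csv module rather than just printing the first and last lines. Below is a minimal sketch under some assumptions: parse_chunk is a hypothetical name, the chunk starts and ends on line boundaries (as the splitting code arranges), the data is UTF-8, and no quoted field contains an embedded line break (otherwise a chunk boundary could fall inside a row).

```python
import csv
import io

def parse_chunk(path, offset, length, encoding="utf-8"):
    """Read `length` bytes of `path` starting at `offset` and parse them as CSV rows."""
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(length)

    # Decode only this worker's bytes; newline="" lets csv handle line endings.
    text = chunk.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))
```

Each (offset, length) pair could then be passed to a separate process, for example through multiprocessing.Pool. Note that only the first chunk contains the header row, if the file has one.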

How to get an approximate line count for a large file
