Question

I have a decent-sized .tsv file containing documents in the following format:

ID  DocType NormalizedName  DisplayName Year    Description
12648   Book    a fancy title   A FaNcY-Title   2005    This is a short description of the book
1867453 Essay   on the history of humans    On the history of humans    2016    This is another short description, this time of the essay
...

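A stripped-down version of this file is about 67 GB in size, or about 22 GB compressed.

I would like to sort the rows of the file by ID (about 300 million rows) in increasing order. Each row's ID is unique and lies in the range 1-2147483647 (the positive part of long); there may be gaps.

Unfortunately, I have at most 8 GB of memory available, so I cannot load the whole file at once.

What is the most time-efficient way to sort this list and write it back to disk?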

Answer 1

I made a proof of concept using heapq.merge:

Step 1: generate a test file

Generate a test file containing 300 million rows:

from random import randint
row = '{} Essay   on the history of humans    On the history of humans    2016    This is another short description, this time of the essay\n'
with open('large_file.tsv', 'w') as f_out:
    for i in range(300_000_000):
        f_out.write(row.format(randint(1, 2147483647)))

Step 2: split into chunks and sort each chunk

Each chunk has 1 million rows:

import glob

path = "chunk_*.tsv"

chunksize = 1_000_000
fid = 1
lines = []

with open('large_file.tsv', 'r') as f_in:
    f_out = open('chunk_{}.tsv'.format(fid), 'w')
    for line_num, line in enumerate(f_in, 1):
        lines.append(line)
        if not line_num % chunksize:
            lines = sorted(lines, key=lambda k: int(k.split()[0]))
            f_out.writelines(lines)

            print('splitting', fid)
            f_out.close()
            lines = []
            fid += 1
            f_out = open('chunk_{}.tsv'.format(fid), 'w')

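    # last chunk (holds fewer than chunksize lines, possibly none)
    if lines:
        print('splitting', fid)
        lines = sorted(lines, key=lambda k: int(k.split()[0]))
        f_out.writelines(lines)
        lines = []
    # close the last chunk file even if it stayed empty
    f_out.close()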
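
As a rough sanity check on the chunk size, the figures from the question (67 GB, about 300 million rows) can be turned into an estimate of how much raw text one chunk holds. This is only a back-of-the-envelope sketch: it assumes decimal gigabytes and rows of roughly equal length, and it ignores the per-object overhead Python adds to every string.

```python
# Figures taken from the question; decimal GB is an assumption.
file_size_bytes = 67 * 10**9
total_rows = 300_000_000
chunksize = 1_000_000  # same value as in the splitting script

# Average raw size of one row, in bytes.
avg_row_bytes = file_size_bytes / total_rows

# Raw text held in memory for one chunk (excluding Python object overhead).
chunk_bytes = avg_row_bytes * chunksize

# Number of chunk files, rounded up.
num_chunks = -(-total_rows // chunksize)

print('average row size: {:.0f} bytes'.format(avg_row_bytes))
print('raw text per chunk: {:.0f} MB'.format(chunk_bytes / 10**6))
print('number of chunks:', num_chunks)
```

Under these assumptions a chunk is a few hundred megabytes of text, so larger chunks would likely still fit in 8 GB and would reduce the number of files that have to be merged later.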

Step 3: merge the chunks

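from heapq import merge

# open every sorted chunk; merge() reads them lazily, line by line
chunks = []
for filename in glob.glob(path):
    chunks.append(open(filename, 'r'))

with open('sorted.tsv', 'w') as f_out:
    f_out.writelines(merge(*chunks, key=lambda k: int(k.split()[0])))

# close the chunk files once the merge has finished
for chunk in chunks:
    chunk.close()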
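
To make the merge step easier to follow, here is a minimal in-memory sketch of what heapq.merge does with inputs that are already sorted. The two lists stand in for chunk files, and the helper name line_id is made up for this illustration.

```python
from heapq import merge

# Two already-sorted "chunks"; the ID is the first field of each line.
chunk_a = ["3\tEssay\n", "10\tBook\n", "25\tBook\n"]
chunk_b = ["4\tBook\n", "7\tEssay\n", "100\tEssay\n"]

def line_id(line):
    """Return the numeric ID stored in the first column of a line."""
    return int(line.split()[0])

# merge() returns a lazy iterator that keeps only one pending line per
# input, which is why it works for chunks that do not fit in memory.
merged = list(merge(chunk_a, chunk_b, key=line_id))
merged_ids = [line_id(line) for line in merged]
print(merged_ids)  # [3, 4, 7, 10, 25, 100]
```

The numeric key matters: comparing the lines as plain strings would place "10" before "3".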

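Timing:

(My machine runs Ubuntu Linux 18.04 with an AMD 2400G and a cheap WD Green SSD.)

Step 2 (splitting and sorting the chunks) took ~12 minutes.

Step 3 (merging the chunks) took ~10 minutes.

I expect these values to be lower on a machine with a better disk (NVMe?) and CPU.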
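
Before committing to a run of this length, it may be worth checking the whole split, sort, and merge pipeline on a tiny input. The sketch below is one possible way to do that, not the script that was timed: the function external_sort, its parameters, and the file names are all hypothetical, and it uses itertools.islice and contextlib.ExitStack instead of the manual bookkeeping in the steps above.

```python
import os
import tempfile
from contextlib import ExitStack
from heapq import merge
from itertools import islice

def line_id(line):
    """Return the numeric ID stored in the first column of a line."""
    return int(line.split()[0])

def external_sort(in_path, out_path, chunksize, tmp_dir):
    """Sort the lines of in_path by ID via sorted chunk files in tmp_dir."""
    chunk_paths = []
    with open(in_path) as f_in:
        while True:
            # Read at most `chunksize` lines; an empty list means EOF.
            chunk = list(islice(f_in, chunksize))
            if not chunk:
                break
            chunk.sort(key=line_id)
            name = 'chunk_{}.tsv'.format(len(chunk_paths))
            chunk_path = os.path.join(tmp_dir, name)
            with open(chunk_path, 'w') as f_chunk:
                f_chunk.writelines(chunk)
            chunk_paths.append(chunk_path)

    # ExitStack closes every chunk file when the merge is done.
    with ExitStack() as stack, open(out_path, 'w') as f_out:
        files = [stack.enter_context(open(p)) for p in chunk_paths]
        f_out.writelines(merge(*files, key=line_id))

# Tiny demo: seven rows, chunks of three lines each.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.tsv')
    dst = os.path.join(tmp, 'out.tsv')
    with open(src, 'w') as f:
        for i in [42, 7, 19, 3, 88, 1, 64]:
            f.write('{}\tEssay\trow {}\n'.format(i, i))
    external_sort(src, dst, chunksize=3, tmp_dir=tmp)
    with open(dst) as f:
        sorted_ids = [line_id(line) for line in f]

print(sorted_ids)  # [1, 3, 7, 19, 42, 64, 88]
```

Because the chunk is read before any file is created, this variant never leaves an empty chunk file behind when the row count is an exact multiple of the chunk size.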

Python: sorting a large list that does not fit in memory
