Question

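I have several hundred GB of data in binary files. I want to take a random sample of it by reading a few consecutive records at random positions, many times over.

The data is stored in many files. The main files do not store the records in any particular order, so each one has a sorted index file. My current code looks something like this, except that there are many files: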

import random
import struct

index = open("foo.index", 'rb')
data = open("foo", 'rb')
index_offset_format = 'Q'
index_offset_size = struct.calcsize(index_offset_format)
record_set = []
for _ in range(n_batches):
    # Read `batch_size` offsets from the index - these are consecutive,
    # so they can be read in one operation
    index_offset_start = random.randint(0, N_RECORDS - batch_size)
    # Seek in bytes: entry number times the size of one index entry
    index.seek(index_offset_start * index_offset_size)
    data_offsets = struct.iter_unpack(
        index_offset_format,
        index.read(index_offset_size * batch_size))

    # Read actual records from data file. These are not consecutive
    records = []
    # iter_unpack yields 1-tuples, so unpack each one to get the offset
    for (offset,) in data_offsets:
        data.seek(offset)
        records.append(data.read(RECORD_SIZE))
    record_set.append(records)

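Other things are then done with the records. From profiling, I can see that the program is heavily IO-bound, with most of the time spent in index.read and data.read. I suspect this is because read is blocking: the interpreter waits for the OS to read the data from disk before it requests the next random chunk, so the OS has no opportunity to optimize the disk access pattern. So: is there some file API that I can pass a batch of instructions to? Something like: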

def read_many(file, offsets, lengths):
    '''
    @param file: the file to read from
    @param offsets: the offsets to seek to
    @param lengths: the lengths of data to read
    @return an iterable over the file contents at the requested offsets
    '''

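Alternatively, would it be enough to open several file objects and request multiple reads using multithreading? Or would the GIL prevent that from being useful?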

Answer 1

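Since the process is IO-bound, the limit on read throughput is set by the operating system's disk I/O scheduler and by the disk's cache.

Real per-core parallelism is quite easy to get with multiprocessing.Pool.imap_unordered():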

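from multiprocessing import Pool

def pmap(fun, tasks):
    # Yield each result as soon as any worker finishes it
    with Pool() as pool:
        yield from pool.imap_unordered(fun, tasks)

# process_one_file and filenames must be supplied by the caller
for record_set in pmap(process_one_file, filenames):
    print(record_set)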

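Having several files open at the same time, and possibly a read() in flight on each core, should let the disk scheduler work out a better schedule than the serial one imposed by the list of filenames.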
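The answer does not define process_one_file. As a rough sketch, it could wrap the loop from the question so that each worker handles one data file and its index. The constants RECORD_SIZE, BATCH_SIZE and N_BATCHES are placeholder values (the question does not give them), and the "<name>.index" naming and the 8-byte 'Q' entries are assumed from the question's snippet.

```python
import os
import random
import struct

# Placeholder values; the question does not state the real ones.
RECORD_SIZE = 16
BATCH_SIZE = 4
N_BATCHES = 3
OFFSET = struct.Struct("Q")  # one unsigned 8-byte offset per index entry

def process_one_file(filename):
    """Read N_BATCHES random batches of records from one data file."""
    index_name = filename + ".index"  # assumed naming, as in the question
    # Assumes the index holds at least BATCH_SIZE entries.
    n_records = os.path.getsize(index_name) // OFFSET.size
    record_set = []
    with open(index_name, "rb") as index, open(filename, "rb") as data:
        for _ in range(N_BATCHES):
            # Consecutive index entries can be read in one operation.
            start = random.randint(0, n_records - BATCH_SIZE)
            index.seek(start * OFFSET.size)
            raw = index.read(OFFSET.size * BATCH_SIZE)
            # The records themselves are scattered across the data file.
            records = []
            for (offset,) in OFFSET.iter_unpack(raw):
                data.seek(offset)
                records.append(data.read(RECORD_SIZE))
            record_set.append(records)
    return record_set
```

Because the function takes a single filename and returns plain lists of bytes, it can be handed to a process pool as-is: both the argument and the result are picklable.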

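The beauty of imap_unordered() is that it decouples the post-processing from which task finished before which, and from how or why (the order may differ from one run to the next).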

As mentioned in the comments, the GIL is only involved while Python code is executing, which is not the case while a program is blocked on I/O.
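Since blocking reads release the GIL, the read_many interface from the question can also be approximated with threads. The following is only a sketch, not an existing batch-read API: it takes a path instead of a file object, max_workers is an arbitrary illustrative parameter, and it relies on os.pread, which is available on Unix-like systems only.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_many(path, offsets, lengths, max_workers=8):
    """Return the bytes at each (offset, length) pair, in request order."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            # os.pread takes an explicit offset, so the threads can share
            # one descriptor without racing on a common file position.
            return list(executor.map(
                lambda pair: os.pread(fd, pair[1], pair[0]),
                zip(offsets, lengths)))
    finally:
        os.close(fd)
```

With several reads outstanding at once, the operating system has more than one request to reorder. Whether that helps in practice depends on the storage device, so it is worth measuring against the process-based version.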
