h5py in-memory files and multiprocessing error

Asked: 2013-02-27 16:26:01

Tags: python multiprocessing hdf5 in-memory h5py

This one's for the HDF5/multiprocessing gurus out there... First off, I know that the python h5py and multiprocessing modules don't necessarily like each other, but I'm getting an error that I just can't figure out. The script I'm working on creates a temporary in-memory hdf5 file, stores data from a (pickled) input file in that in-memory file, and then uses a multiprocessing pool to perform (read-only) operations on the data from the temporary HDF5 file.
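For context, the in-memory part of that workflow can be sketched like this (the dataset name and values are made up for illustration, and an explicit mode argument is passed, which recent h5py versions require but the 2013-era code below omits):

```python
import os
import numpy as np
import h5py

# Temporary in-memory HDF5 file: with driver='core' and
# backing_store=False, nothing is ever flushed to disk and
# the filename acts only as an identifier.
hdf = h5py.File('temp_demo.hdf', 'w', driver='core', backing_store=False)
hdf['data'] = np.arange(12.0).reshape(4, 3)   # stand-in for the pickled input

# read-only work on the in-memory data
row_sums = [float(hdf['data'][i, :].sum()) for i in range(4)]
hdf.close()

print(row_sums)                          # [3.0, 12.0, 21.0, 30.0]
print(os.path.exists('temp_demo.hdf'))   # False - the file never hit disk
```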

I've been able to isolate the code that produces the error, so here's a simplified snippet of it. When I create an in-memory hdf5 file in a generator function, and then use that generator to produce arguments for a multiprocessing pool, I get a series of HDF5 errors. Here's some code that is about as simple as I could make it while still reproducing the error:

import h5py
from multiprocessing import Pool
from itertools import imap

useMP = True

def doNothing(arg):
    print "Do nothing with %s"%arg

def myGenerator():
    print "Create hdf5 in-memory file..."
    hdfFile = h5py.File('test.hdf',driver='core',backing_store=False)
    print "Finished creating hdf5 in-memory file."
    yield 3.14159
    '''
    # uncommenting this section will result in yet another HDF5 error.
    print "Closing hdf5 in-memory file..."
    hdfFile.close()
    print "Finished closing hdf5 in-memory file."
    '''

if useMP:
    pool = Pool(1)
    mapFunc = pool.imap
else:
    mapFunc = imap

data = [d for d in mapFunc(doNothing,myGenerator())]

When I use itertools.imap (by setting useMP = False), I get the following output, as expected:

Create hdf5 in-memory file...
Finished creating hdf5 in-memory file.
Do nothing with 3.14159

But when I use Pool.imap, even though the pool is created with only a single worker process, I get this output:

Create hdf5 in-memory file...
HDF5-DIAG: Error detected in HDF5 (1.8.9) thread 139951680009984:
  #000: ../../../src/H5F.c line 1538 in H5Fopen(): unable to open file
    major: File accessability
    minor: Unable to open file
  #001: ../../../src/H5F.c line 1227 in H5F_open(): unable to open file: time = Wed Feb 27 09:32:32 2013
, name = 'test.hdf', tent_flags = 1
    major: File accessability
    minor: Unable to open file
  #002: ../../../src/H5FD.c line 1101 in H5FD_open(): open failed
    major: Virtual File Layer
    minor: Unable to initialize object
  #003: ../../../src/H5FDcore.c line 464 in H5FD_core_open(): unable to open file
    major: File accessability
    minor: Unable to open file
Finished creating hdf5 in-memory file.
Do nothing with 3.14159

The odd thing is that this error doesn't crash the program. The script I'm writing that produces this error actually works as I expect it to - but it prints the error messages above for every in-memory file it creates. There are no errors when I use itertools.imap, no errors when I read an existing HDF5 file, only the combination of multiprocessing and in-memory HDF5 files.

h5py version 2.1.1

hdf5 version 1.8.9

Python version 2.7.3

1 Answer:

Answer 0 (score: 2)

After digging through some of the h5py files, I've found a partial answer, though it is still incomplete. The h5py.File class is defined in h5py/_hl/files.py. The error occurs during the creation of the File object, in the call to make_fid():

def make_fid(name, mode, userblock_size, fapl):
    """ Get a new FileID by opening or creating a file.
    Also validates mode argument."""
    ...
    elif mode == 'a' or mode is None:
        try:
            fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
            try:
                existing_fcpl = fid.get_create_plist()
                if userblock_size is not None and existing_fcpl.get_userblock() != userblock_size:
                    raise ValueError("Requested userblock size (%d) does not match that of existing file (%d)" % (userblock_size, existing_fcpl.get_userblock()))
            except:
                fid.close()
                raise
        except IOError:
            fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)

If the file exists, it is opened in read/write mode. If it doesn't exist (which is the case for in-memory files), the h5f.open() call raises an exception. It is that call to h5f.open that triggers the H5Fopen() error message.
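That fallback can be reproduced directly with h5py's low-level API. In this sketch (the filename is arbitrary and assumed not to exist on disk), the read/write open attempt fails for a core-driver file with no backing store, and control falls through to h5f.create, just as in make_fid:

```python
from h5py import h5f, h5p

# build a file-access property list for the 'core' (in-memory) driver
fapl = h5p.create(h5p.FILE_ACCESS)
fapl.set_fapl_core(backing_store=False)

name = b'demo_fallback.hdf'   # assumed not to exist on disk
try:
    # what make_fid tries first in 'a' (append) mode
    fid = h5f.open(name, h5f.ACC_RDWR, fapl=fapl)
    outcome = 'opened'
except (IOError, OSError):
    # ...and what it falls back to when that open fails
    fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl)
    outcome = 'created'

print(outcome)   # 'created'
fid.close()
```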

The question that remains is why the error is only printed when multiprocessing is used. To begin with, I had assumed the main thread was calling the generator function that creates the HDF5 file. Well, that's not the case. The multiprocessing pool actually creates a new thread to handle the tasks for imap and imap_unordered, but not for map/map_async. Replace pool.imap with pool.map, and the generator function is called from the main thread, and no error messages are printed. So why does creating the HDF5 file in a separate thread raise the error?
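That task-handler thread is easy to observe without h5py at all. This sketch (the helper names are mine) records which thread actually consumes the input iterable in each case:

```python
import threading
from multiprocessing import Pool

def identity(x):
    return x

def recording_gen(seen):
    # record the name of the thread that iterates this generator
    seen.append(threading.current_thread().name)
    yield 1

pool = Pool(1)

# pool.map materializes the iterable in the calling (main) thread...
map_seen = []
pool.map(identity, recording_gen(map_seen))

# ...while pool.imap hands it to a background task-handler thread
imap_seen = []
list(pool.imap(identity, recording_gen(imap_seen)))

pool.close()
pool.join()

print(map_seen[0])                    # MainThread
print(imap_seen[0] != 'MainThread')   # True
```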

-------- Update --------

Apparently, h5py automatically silences HDF5 error messages in the main thread, because h5py handles the errors itself. However, it does not (yet) automatically silence errors in child threads. The solution: h5py._errors.silence_errors(). This function disables automatic HDF5 error printing in the current thread. See h5py issue 206. This code silences the HDF5 errors:

import h5py
from multiprocessing import Pool
from itertools import imap
import threading

useMP = True

def doNothing(arg):
    print "Do nothing with %s"%arg

def myGenerator():
    h5py._errors.silence_errors()
    print "Create hdf5 in-memory file..."
    hdfFile = h5py.File('test.hdf',driver='core',backing_store=False)
    print "Finished creating hdf5 in-memory file."
    yield 3.14159

if useMP:
    pool = Pool(1)
    mapFunc = pool.imap
else:
    mapFunc = imap

data = [d for d in mapFunc(doNothing,myGenerator())]
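The same per-thread silencing can be exercised on an ordinary hand-spawned thread, which is all Pool.imap's task handler really is. In this sketch (the file path and helper function are made up), opening a nonexistent file still raises an exception in Python, but the thread prints no HDF5 diagnostic because silence_errors() was called in that thread:

```python
import threading
import h5py

def open_missing(results):
    # disable automatic HDF5 error printing for *this* thread only
    h5py._errors.silence_errors()
    try:
        h5py.File('/no/such/file.hdf', 'r')
    except (IOError, OSError):
        results.append('raised, silently')

results = []
t = threading.Thread(target=open_missing, args=(results,))
t.start()
t.join()
print(results)   # ['raised, silently']
```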