I want to generate large DataFrames on multiple MPI processes and store them in sub-groups of a single HDF5 file, with each rank writing to its own sub-group.
Here is what I tried:
import numpy as np
import pandas as pd
from mpi4py import MPI

def main():
    comm = MPI.COMM_WORLD
    rank = comm.rank
    # 4 MPI processes for 4 number attributes (one per rank).
    numbers = np.arange(4)
    # Each MPI process is responsible for its own DataFrame.
    numbergroup = "data/number/%d" % rank
    # Create a random multi-dimensional DataFrame on each process.
    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
    mindex = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
    df = pd.DataFrame(np.random.randn(8, 4), index=mindex)
    # Store the DataFrame in a separate sub-group per process.
    pdStore = pd.HDFStore("df.h5")
    pdStore[numbergroup] = df
    pdStore.close()

if __name__ == "__main__":
    main()
and run it with:
mpirun -np 4 python pandas-multidim-hdf5-mpi.py
It fails with:
Closing remaining open files: df.h5 ... Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/tables/file.py", line 516, in _close_nodes
    node._g_close()
  File "/usr/lib/python3.7/site-packages/tables/group.py", line 910, in _g_close
    self._g_close_group()
  File "tables/hdf5extension.pyx", line 1090, in tables.hdf5extension.Group._g_close_group
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5G.c", line 723, in H5Gclose
    not a group

Problems closing the Group data
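My guess is that the crash comes from all four ranks opening df.h5 at the same time: as far as I know, neither pandas nor PyTables is built against parallel HDF5, so simultaneous writers clash. The sketch below serializes the writes by passing a token from rank to rank; the token-passing scheme is my own workaround idea, not something from the pandas documentation:

import numpy as np
import pandas as pd
from mpi4py import MPI

def main():
    comm = MPI.COMM_WORLD
    rank = comm.rank
    iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
    mindex = pd.MultiIndex.from_product(iterables, names=['first', 'second'])
    df = pd.DataFrame(np.random.randn(8, 4), index=mindex)
    # Wait for the previous rank to finish writing before opening the file.
    if rank > 0:
        comm.recv(source=rank - 1, tag=0)
    with pd.HDFStore("df.h5", mode="a") as store:
        store["data/number/%d" % rank] = df
    # Hand the token to the next rank so the writes never overlap.
    if rank < comm.size - 1:
        comm.send(None, dest=rank + 1, tag=0)

if __name__ == "__main__":
    main()

This avoids concurrent access, but it also serializes all of the I/O, which defeats the point of using MPI, so I'd still like a proper solution.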
I'm aware of the h5py module, but I want an easy way to store and read multi-dimensional DataFrames. Is this possible through pandas' HDF5 interface?
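For reference, this is the read side I'd like to keep working; the group path matches what the script above writes, assuming the write succeeds:

import pandas as pd

with pd.HDFStore("df.h5", mode="r") as store:
    df0 = store["data/number/0"]  # same sub-group path written by rank 0
    print(df0.index.names)        # the MultiIndex ['first', 'second'] survives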