使用两个轴MultiIndexed将Pandas DataFrame存储到HDF5

时间:2016-03-28 07:10:12

标签: python pandas

我认为应该可以存储带有MultiIndexed轴的DataFrame。但是,我收到以下错误:

In [1]: index = pd.MultiIndex.from_product([['Foo', 'Bar'],['One','Two','Three']])
        column = pd.MultiIndex.from_product([['foo', 'bar'],['one','two','three']])
        df = pd.DataFrame(np.random.rand(6,6), index=index, columns=column)
        df
Out[1]:                 foo                                    bar
                        one          two          three        one          two          three
        Foo     One     0.605352     0.882382     0.472946     0.615619     0.108022     0.389674
                Two     0.746384     0.594509     0.556881     0.457000     0.529793     0.929574
                Three   0.270978     0.956778     0.515201     0.626850     0.852708     0.861962
        Bar     One     0.219994     0.648191     0.677824     0.408439     0.079326     0.414059
                Two     0.186167     0.767103     0.880667     0.205253     0.647471     0.449379
                Three   0.353171     0.249900     0.723791     0.458349     0.977604     0.691188

In [2]: with pd.HDFStore('test.h5', 'w') as store:
            store.append('output', df)
Out[2]: ---------------------------------------------------------------------------
        ValueError                                Traceback (most recent call last)
        <ipython-input-55-87e96c141a7f> in <module>()
              1 with pd.HDFStore('test.h5', 'w') as store:
        ----> 2     store.append('output', df)

        /home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in append(self, key, value, format, append, columns, dropna, **kwargs)
            917         kwargs = self._validate_format(format, kwargs)
            918         self._write_to_group(key, value, append=append, dropna=dropna,
        --> 919                              **kwargs)
            920 
            921     def append_to_multiple(self, d, value, selector, data_columns=None,

        /home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, index, append, complib, encoding, **kwargs)
           1262 
           1263         # write the object
        -> 1264         s.write(obj=value, append=append, complib=complib, **kwargs)
           1265 
           1266         if s.is_table and index:

        /home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in write(self, obj, data_columns, **kwargs)
           4195                 data_columns.insert(0, n)
           4196         return super(AppendableMultiFrameTable, self).write(
        -> 4197             obj=obj, data_columns=data_columns, **kwargs)
           4198 
           4199     def read(self, **kwargs):

        /home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, **kwargs)
           3785         self.create_axes(axes=axes, obj=obj, validate=append,
           3786                          min_itemsize=min_itemsize,
        -> 3787                          **kwargs)
           3788 
           3789         for a in self.axes:

        /home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize, **kwargs)
           3383             axis, axis_labels = self.non_index_axes[0]
           3384             data_columns = self.validate_data_columns(
        -> 3385                 data_columns, min_itemsize)
           3386             if len(data_columns):
           3387                 mgr = block_obj.reindex_axis(

        /home/kartik/miniconda3/lib/python3.5/site-packages/pandas/io/pytables.py in validate_data_columns(self, data_columns, min_itemsize)
           3246         if info.get('type') == 'MultiIndex' and data_columns:
           3247             raise ValueError("cannot use a multi-index on axis [{0}] with "
        -> 3248                              "data_columns {1}".format(axis, data_columns))
           3249 
           3250         # evaluate the passed data_columns, True == use all columns

        ValueError: cannot use a multi-index on axis [1] with data_columns ['level_1', 'level_0']

我最有意义的是存储这样的数据。主要是因为我的需求会有很大差异。对于某些应用程序,我将需要所有行和所有列。对于其他许多人来说,我只需要所有行和一个父列:说我需要foo下的所有行。我可能还只需要一个父行和一个父列:Foo, foo

我当然需要所有辅助行和列。

在我的例子中,主行索引是States,辅助行索引是传感器名称,主列索引是感测到的不同内容,辅助列索引是传感器输出的统计信息。因此,很容易看出我可能只需要一种类型的感知数据用于所有状态,或仅用于一种状态,或者我可能需要从一个州或所有州感知到的所有事物。

我正在寻找修复错误或更好的方法来存储数据。

2 个答案:

答案 0 :(得分:1)

如果您在存储format='fixed'时强制执行df,则可以保留您的MultiIndex:

with pd.HDFStore('test.h5', 'w') as store:
    store.put('output', df, format='fixed')
    print store['output']

              foo                     bar                
              one     two   three     one     two   three
Foo One    0.9626  0.9761  0.4385  0.2976  0.0882  0.7589
    Two    0.7842  0.7563  0.4796  0.5664  0.1511  0.9345
    Three  0.3364  0.4271  0.4107  0.9009  0.5207  0.4082
Bar One    0.9892  0.4595  0.1485  0.1456  0.9935  0.1386
    Two    0.3187  0.7908  0.2947  0.7354  0.5759  0.9102
    Three  0.0499  0.1865  0.8113  0.4815  0.1427  0.3322

但是你会失去一些功能(例如,使用.append()方法)。根据您的需要,这可能是也可能不是问题。

答案 1 :(得分:0)

我最终堆叠了DataFrame一次,以便在存储之前删除列中的MultiIndex:

with pd.HDFStore('test.h5', 'w') as store:
    store.append('output', df.stack().astype(float))

这不是理想的,但HDF5表的querying功能可以补偿它。