Question

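I have a pandas.DataFrame to which I added some meta information as an attribute. I would like to save and restore the df together with it, but the attribute is lost during saving: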

import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

df.my_attribute = 'can I recover this attribute after saving?'
df.to_pickle('test.pck')
new_df = pd.read_pickle('test.pck')
new_df.my_attribute

# AttributeError: 'DataFrame' object has no attribute 'my_attribute'

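Other file formats seem even worse: csv and json discard type, index, or column information if you are not careful. Maybe create a new class that extends DataFrame? Open to ideas.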

Answer 1

There is no universal standard here, or anything close to one, but there are a few options.

1) General recommendation: I would not use pickle for anything but the shortest-term serialization (say, < 1 day).

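2) Arbitrary metadata can be packed into two of the binary formats pandas supports, msgpack and HDF5, albeit in an ad-hoc manner. You could also do this with CSV and the like, but it becomes even more ad hoc.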

# msgpack (note: to_msgpack/read_msgpack were deprecated in pandas 0.25 and removed in 1.0)
data = {'df': df, 'my_attribute': df.my_attribute}
pd.to_msgpack('tmp.msg', data)
pd.read_msgpack('tmp.msg')['my_attribute']
# Out[70]: 'can I recover this attribute after saving?'
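Because the msgpack functions are no longer available in pandas 1.0 and later, the same "bundle everything in a dict" pattern can be sketched with the standard-library pickle module instead. This is only an illustration under the short-term caveat from point 1; the toy frame and the file name bundle.pkl are assumptions made for the example, not part of the original answer.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small stand-in frame (hypothetical data, used only for this sketch).
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
my_attribute = "can I recover this attribute after saving?"

# Bundle the frame and its metadata in one plain dict, as in the msgpack snippet.
bundle = {"df": df, "my_attribute": my_attribute}

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "bundle.pkl")
with open(path, "wb") as fh:
    pickle.dump(bundle, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

restored_df = restored["df"]
restored_attribute = restored["my_attribute"]
```

The metadata travels next to the frame rather than on it, so nothing depends on how pandas pickles instance attributes.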

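# hdf (requires PyTables)
# write the frame, then attach the metadata to its storer node
with pd.HDFStore('tmp.h5') as store:
    store.put('df', df)
    store.get_storer('df').attrs.my_attribute = df.my_attribute
# read the frame back and restore the attribute from the storer node
with pd.HDFStore('tmp.h5') as store:
    df = store.get('df')
    df.my_attribute = store.get_storer('df').attrs.my_attribute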

df.my_attribute
# Out[79]: 'can I recover this attribute after saving?'
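For the CSV route mentioned in point 2, one possible ad-hoc arrangement is a JSON "sidecar" file that stores the metadata and the column dtypes next to the CSV. The sketch below is not a pandas feature: the helpers save_with_metadata and load_with_metadata and the .meta.json suffix are hypothetical names, and it assumes simple numeric or string columns whose dtype names read_csv accepts directly.

```python
import json
import os
import tempfile

import pandas as pd


def save_with_metadata(df, metadata, csv_path):
    """Write df to CSV and its metadata plus dtypes to a JSON sidecar (hypothetical helper)."""
    df.to_csv(csv_path, index=True)
    sidecar = {
        "metadata": metadata,
        # Record dtypes by name so they can be re-applied on load.
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }
    with open(csv_path + ".meta.json", "w") as fh:
        json.dump(sidecar, fh)


def load_with_metadata(csv_path):
    """Read the CSV back using the recorded dtypes and return (df, metadata)."""
    with open(csv_path + ".meta.json") as fh:
        sidecar = json.load(fh)
    df = pd.read_csv(csv_path, index_col=0, dtype=sidecar["dtypes"])
    return df, sidecar["metadata"]


# Usage with a small stand-in frame (hypothetical data).
df = pd.DataFrame({"x": [1.5, 2.5], "n": [1, 2]})
path = os.path.join(tempfile.mkdtemp(), "frame.csv")
save_with_metadata(df, {"my_attribute": "can I recover this attribute after saving?"}, path)
loaded_df, loaded_meta = load_with_metadata(path)
```

The two files have to be kept together by hand, which is the sense in which this is more ad hoc than the HDF5 variant.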

3) xarray, an n-d extension of pandas, supports saving to the NetCDF file format, which has a more built-in notion of metadata.

import xarray
ds = xarray.Dataset.from_dataframe(df)
ds.attrs['my_attribute'] = df.my_attribute

ds.to_netcdf('test.cdf')
ds = xarray.open_dataset('test.cdf')
ds
Out[8]: 
<xarray.Dataset>
Dimensions:            (index: 150)
Coordinates:
  * index              (index) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
Data variables:
    sepal length (cm)  (index) float64 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 ...
    sepal width (cm)   (index) float64 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 ...
    petal length (cm)  (index) float64 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 ...
    petal width (cm)   (index) float64 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 ...
Attributes:
    my_attribute:  can I recover this attribute after saving?

Saving/loading a DataFrame with custom attributes
