What is a suitable Pandas data structure for storing heterogeneous data (DataFrame (2D) and Series (1D)) in one object?

Time: 2016-02-23 16:10:32

Tags: python pandas data-structures

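I use Pandas for a wide range of applications and really appreciate it, because it makes my life much easier.

In most cases I work with homogeneous data and know which data structure fits best. So far I have mainly worked with (multi-indexed) DataFrames and Series side by side, which has worked well.

In my current project, however, I have run into a problem: it would be very helpful to handle heterogeneous data (1D and 2D data) within one common object.

I tried using a Panel (3D) object; the example below will hopefully show what I am looking for: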

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

# dataframes
df = pd.DataFrame(np.random.randn(6, 3))
df['concept'] = np.repeat(['A', 'B', 'C'], 2)
df.set_index(['concept'], inplace=True)
df.sort_index(inplace=True)
df.columns = ['C1', 'C2', 'C3']
df

               C1        C2        C3
concept                              
A       -0.555291 -1.026308 -0.016192
A       -1.759410  0.023008 -0.168303
B       -0.471165  1.160105  0.862017
B       -2.583058  0.595113  0.729354
C        0.706030  1.518058 -1.760176
C       -0.290667 -0.737529 -0.177824

df2 = pd.DataFrame(np.random.randn(6, 3))
df2['concept'] = np.repeat(['A', 'B', 'C'], 2)
df2.set_index(['concept'], inplace=True)
df2.sort_index(inplace=True)
df2.columns = ['C4', 'C5', 'C6']
df2

               C4        C5        C6
concept                              
A        0.784534 -0.590447 -0.661132
A       -0.443176  0.423495 -1.171204
B        1.103484  1.295225  0.112374
B        0.097899 -0.879873  0.213401
C       -1.117570 -0.577390  1.714902
C        1.476986  1.191201  0.973319

# combine dataframes in a panel object (combine homogeneous data)
data = {'Item1': df, 'Item2': df2}
my_panel = pd.Panel(data)
my_panel.describe
my_panel.ix['Item2', 'A', 'C4']

concept
A    0.784534
A   -0.443176

# add a series to the dataframes (combine heterogeneous data)
s = pd.Series(['gpsol', 125, 'my_simulation_x'],
              index=['solver', 'runtime', 'simulation_name'])
s

solver                       gpsol
runtime                        125
simulation_name    my_simulation_x

# this doesn't work: a panel is not the right data structure for
# objects of different dimensions, so it throws
#  "AssertionError: Length of data and index must match"
data = {'Item1': df, 'Item2': df2, 'Item3': s}
my_panel = pd.Panel(data)

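I know that a Panel is not meant to hold data of different dimensions, but it would be great to have a (sliceable) data structure that can integrate both 1D and 2D objects.

Is there something like this in pandas, or do I have to use separate pandas objects for this?

If the answer is "No, pandas is not made for that," that is fine too. I just want to know whether there is something suitable for this purpose.

Thanks in advance!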

1 answer:

Answer 0: (score: 0)

I have found a suitable solution for my case: simply add the (dict of) Series as an attribute to the DataFrame / Panel object.

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

# dataframes
df = pd.DataFrame(np.random.randn(6, 3))
df['concept'] = np.repeat(['A', 'B', 'C'], 2)
df.set_index(['concept'], inplace=True)
df.sort_index(inplace=True)
df.columns = ['C1', 'C2', 'C3']
df

df2 = pd.DataFrame(np.random.randn(6, 3))
df2['concept'] = np.repeat(['A', 'B', 'C'], 2)
df2.set_index(['concept'], inplace=True)
df2.sort_index(inplace=True)
df2.columns = ['C4', 'C5', 'C6']
df2

# combine dataframes in a panel object (combine homogeneous data)
data = {'Item1': df, 'Item2': df2}
opt_results = pd.Panel(data)


# define a series with the run parameters (heterogeneous data)
opt_params = pd.Series(['gpsol', 125, 'my_simulation_x'],
                       index=['solver', 'runtime', 'simulation_name'])

# this doesn't work and throws an error because of different indexes/dimensions
#data = {'Item1': df, 'Item2': df2, 'Item3': opt_params}
#my_panel = pd.Panel(data)

# but setting the series as an attribute is sufficient for me
opt_results.info = opt_params
opt_results.info
solver                       gpsol
runtime                        125
simulation_name    my_simulation_x
dtype: object

opt_results.ix['Item2', 'A', 'C4']
concept
A   -0.660582
A   -1.174828
Name: C4, dtype: float64

Maybe this was a bit confusing, since the answer is so obvious.
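
Note for newer pandas versions: `pd.Panel` was removed in pandas 0.25 and `.ix` in pandas 1.0, so the code above only runs on old releases. Below is a minimal sketch of the same idea on current pandas. It rests on two assumptions that are not part of the original answer: the two DataFrames are joined with `pd.concat`, so that the dict keys become an outer column level standing in for the Panel items, and the run parameters go into the `attrs` dictionary, which the pandas documentation marks as experimental. Plain Python values are stored there instead of a Series.

```python
import numpy as np
import pandas as pd

# Two 2D blocks sharing the same 'concept' index, as in the question.
concept = pd.Index(np.repeat(['A', 'B', 'C'], 2), name='concept')
df = pd.DataFrame(np.random.randn(6, 3), index=concept,
                  columns=['C1', 'C2', 'C3'])
df2 = pd.DataFrame(np.random.randn(6, 3), index=concept,
                   columns=['C4', 'C5', 'C6'])

# Assumption: a column MultiIndex replaces the Panel. The dict keys
# ('Item1', 'Item2') become the outer column level.
opt_results = pd.concat({'Item1': df, 'Item2': df2}, axis=1)

# Assumption: the 1D metadata is kept in .attrs (experimental) as plain
# Python values rather than as a Series.
opt_results.attrs.update({
    'solver': 'gpsol',
    'runtime': 125,
    'simulation_name': 'my_simulation_x',
})

# Rough equivalent of opt_results.ix['Item2', 'A', 'C4'] from the answer:
# pick the item first, then rows labelled 'A' and column 'C4'.
selection = opt_results['Item2'].loc['A', 'C4']

# The metadata can be turned back into a Series whenever that is handy.
opt_params = pd.Series(opt_results.attrs)
```

An attribute set by plain assignment, as in `opt_results.info = opt_params`, lives only on that one object and is not carried over to objects derived from it. Whether `attrs` survives a given operation is likewise worth checking for the pandas version in use.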