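I use pandas for a variety of applications and really appreciate it, because it makes my life much easier.
Most of the time I work with homogeneous data and know which data structure fits best. So far I have mostly worked with (multi-indexed) DataFrames and Series side by side, which has worked well.
However, I am stuck in my current project, where it would be helpful to handle heterogeneous data (1D and 2D data) in one common object.
I tried using a Panel (3D) object, which hopefully shows what I am looking for: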
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
# dataframes
df = pd.DataFrame(np.random.randn(6, 3))
df['concept'] = np.repeat(np.repeat(['A', 'B', 'C'], 2), 1)
df.set_index(['concept'], inplace=True)
df.sort_index(inplace=True)
df.columns = ['C1', 'C2', 'C3']
df
C1 C2 C3
concept
A -0.555291 -1.026308 -0.016192
A -1.759410 0.023008 -0.168303
B -0.471165 1.160105 0.862017
B -2.583058 0.595113 0.729354
C 0.706030 1.518058 -1.760176
C -0.290667 -0.737529 -0.177824
df2 = pd.DataFrame(np.random.randn(6, 3))
df2['concept'] = np.repeat(np.repeat(['A', 'B', 'C'], 2), 1)
df2.set_index(['concept'], inplace=True)
df2.sort_index(inplace=True)
df2.columns = ['C4', 'C5', 'C6']
df2
C4 C5 C6
concept
A 0.784534 -0.590447 -0.661132
A -0.443176 0.423495 -1.171204
B 1.103484 1.295225 0.112374
B 0.097899 -0.879873 0.213401
C -1.117570 -0.577390 1.714902
C 1.476986 1.191201 0.973319
# combine dataframes in a panel object (combine homogeneous data)
data = {'Item1': df, 'Item2': df2}
my_panel = pd.Panel(data)
my_panel.describe
my_panel.ix['Item2', 'A', 'C4']
concept
A 0.784534
A -0.443176
# add a series to the panel (combine heterogeneous data)
s = pd.Series(['gpsol', 125, 'my_simulation_x'],
index=['solver', 'runtime', 'simulation_name'])
s
solver gpsol
runtime 125
simulation_name my_simulation_x
# this doesn't work and throws an error as a panel is not the right
# data structure
# "AssertionError: Length of data and index must match"
data = {'Item1': df, 'Item2': df2, 'Item3': s}
my_panel = pd.Panel(data)
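I know that a Panel (3D) is not meant to hold data of different dimensions, but it would be great to have a (sliceable) data structure that can integrate 1D and 2D objects.
Is there something like this in pandas, or do I have to use separate pandas objects for this?
If the answer is "No, pandas is not made for this," that would be OK too. I just want to know whether there is something suited to this purpose.
Thanks in advance!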
Answer 0 (score: 0)
I have found a suitable solution for my case: simply add the Series (or a dict of Series) as an attribute to the DataFrame / Panel object.
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
# dataframes
df = pd.DataFrame(np.random.randn(6, 3))
df['concept'] = np.repeat(np.repeat(['A', 'B', 'C'], 2), 1)
df.set_index(['concept'], inplace=True)
df.sort_index(inplace=True)
df.columns = ['C1', 'C2', 'C3']
df
df2 = pd.DataFrame(np.random.randn(6, 3))
df2['concept'] = np.repeat(np.repeat(['A', 'B', 'C'], 2), 1)
df2.set_index(['concept'], inplace=True)
df2.sort_index(inplace=True)
df2.columns = ['C4', 'C5', 'C6']
df2
# combine dataframes in a panel object (combine homogeneous data)
data = {'Item1': df, 'Item2': df2}
opt_results = pd.Panel(data)
# add a series to the panel (combine heterogeneous data)
opt_params = pd.Series(['gpsol', 125, 'my_simulation_x'],
index=['solver', 'runtime', 'simulation_name'])
# this doesn't work and throws an error because of different indexes/dimensions
# data = {'Item1': df, 'Item2': df2, 'Item3': opt_params}
# my_panel = pd.Panel(data)
# but setting the series as an attribute is sufficient for me
opt_results.info = opt_params
opt_results.info
solver gpsol
runtime 125
simulation_name my_simulation_x
dtype: object
opt_results.ix['Item2', 'A', 'C4']
concept
A -0.660582
A -1.174828
Name: C4, dtype: float64
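One caveat with the attribute approach is worth sketching. The example below uses a plain DataFrame rather than a Panel, and the attribute name `opt_params` is chosen for illustration (on a DataFrame, `info` is already a method name, so assigning to it would shadow that method). An attribute set this way lives only on that one Python object; pandas does not carry it over to copies.

```python
import warnings

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 3), columns=['C1', 'C2', 'C3'])
opt_params = pd.Series(['gpsol', 125, 'my_simulation_x'],
                       index=['solver', 'runtime', 'simulation_name'])

with warnings.catch_warnings():
    # pandas warns when a list-like value is assigned to a new attribute
    # name, because it could be mistaken for creating a column
    warnings.simplefilter('ignore', UserWarning)
    df.opt_params = opt_params  # plain Python attribute, not a column

# the attribute exists on the original object ...
has_on_original = hasattr(df, 'opt_params')
# ... but a copy is a new object that does not inherit it
has_on_copy = hasattr(df.copy(), 'opt_params')

print(has_on_original, has_on_copy)
```

So this works well as long as the same object is passed around, but the metadata has to be re-attached after any operation that returns a new object.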
Perhaps this was a little confusing because the answer is so obvious.
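For readers on a recent pandas, where `pd.Panel` has been deprecated and later removed, the same idea can be sketched without a Panel. This is only one possible layout, not something from the original answer: it assumes pandas 1.0 or later, combines the two frames with `pd.concat` under a two-level column index, and keeps the 1D parameters in `DataFrame.attrs`, which the pandas documentation marks as experimental. The parameters are stored as a plain dict (via `Series.to_dict()`) rather than as a Series, which is a choice made here to keep the metadata simple.

```python
import numpy as np
import pandas as pd

# shared row index, as in the question: two rows per concept
index = pd.Index(np.repeat(['A', 'B', 'C'], 2), name='concept')

df = pd.DataFrame(np.random.randn(6, 3), index=index,
                  columns=['C1', 'C2', 'C3'])
df2 = pd.DataFrame(np.random.randn(6, 3), index=index,
                   columns=['C4', 'C5', 'C6'])

# 2D data: the dict keys become the outer level of the column MultiIndex
opt_results = pd.concat({'Item1': df, 'Item2': df2}, axis=1)

# 1D data: kept as metadata on the frame (experimental pandas feature)
opt_params = pd.Series(['gpsol', 125, 'my_simulation_x'],
                       index=['solver', 'runtime', 'simulation_name'])
opt_results.attrs['info'] = opt_params.to_dict()

# rough equivalent of my_panel.ix['Item2', 'A', 'C4']
selection = opt_results['Item2'].loc['A', 'C4']

print(selection)
print(opt_results.attrs['info'])
```

Here `selection` is a Series with the two rows for concept `A`, and the parameters stay reachable through `opt_results.attrs['info']`.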