I have the following DataFrame:
import pandas as pd
url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df'
zz = pd.read_csv(url)
zz.head(5)
date feccandid feccandcfscore.dyn pacid paccfscore cid catcode type_x di amtsum state log_diff_unemployment party type_y bills years_exp disposition billsum
0 2006 S8NV00073 0.496 C00000422 0.330 N00006619 H1100 24K D 5000 NV -0.024693 Republican rep s22-109 12 support 3
1 2006 S8NV00073 0.496 C00375360 0.176 N00006619 H1100 24K D 4500 NV -0.024693 Republican rep s22-109 12 support 3
2 2006 S8NV00073 0.496 C00113803 0.269 N00006619 H1130 24K D 2500 NV -0.024693 Republican rep s22-109 12 support 2
3 2006 S8NV00073 0.496 C00249342 0.421 N00006619 H1130 24K D 5000 NV -0.024693 Republican rep s22-109 12 support 2
4 2006 S8NV00073 0.496 C00255752 0.254 N00006619 H1130 24K D 4000 NV -0.024693 Republican rep s22-109 12 support 2
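I want to reshape it so that the date column is the index, the feccandid values are the column headers (I will later make them a second index so that I can convert the frame to a Panel), and the other column headers become rows. The desired output looks something like this: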
date feccandid S8NV00072 S8NV00074 S8NV00075 S8NV00076 S8NV00077
2006 feccandcfscore.dyn 0.496 0.496 0.496 0.496 0.496
2006 pacid C00000422 C00375360 C00113803 C00249342 C00255752
2006 paccfscore 0.33 0.176 0.269 0.421 0.254
2006 cid N00006619 N00006619 N00006619 N00006619 N00006619
2006 catcode H1100 H1100 H1130 H1130 H1130
2006 type_x 24K 24K 24K 24K 24K
2006 di D D D D D
2006 amtsum 5000 4500 2500 5000 4000
2006 state NV NV NV NV NV
2006 log_diff_unemployment -0.024693 -0.024693 -0.024693 -0.024693 -0.024693
2006 party Republican Republican Republican Republican Republican
2006 type_y rep rep rep rep rep
2006 bills s22-109 s22-109 s22-109 s22-109 s22-109
2006 years_exp 12 12 12 12 12
2006 disposition support support support support support
2006 billsum 3 3 2 2 2
Following jezrael's recommendation, I tried the following:
zz = zz.pivot_table(index='date', columns='feccandid', aggfunc=np.mean)
zz.head()
feccandcfscore.dyn ... billsum
feccandid H0AL02087 H0AL07060 H0AR01083 H0AR02107 H0AR03055 H0AR04038 H0AZ01259 H0AZ03362 H0CA15148 H0CA19173 ... S8MI00158 S8MN00438 S8MS00055 S8MT00010 S8NC00239 S8NE00117 S8NM00010 S8NV00073 S8OR00207 S8WI00026
date
2005 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2006 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN 2.125 NaN NaN
2007 NaN 0.016 NaN NaN NaN -0.151 NaN NaN -0.777 NaN ... 1.000000 NaN 1.666667 1.552632 NaN NaN 2.0 1.000 NaN 2.0
2008 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.285714 NaN NaN 5.431373 NaN NaN NaN NaN NaN NaN
2009 NaN NaN NaN NaN NaN -0.086 NaN NaN -0.790 NaN ... NaN NaN NaN 2.433333 NaN NaN NaN NaN 3.0 2.8
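This is very close to what I want, except that I am trying to have feccandid as the only column header, with the original column headers (the topmost level of column headers in the last example) converted to rows.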
Answer 0 (score: 1)
I think you can use pivot_table (the default aggregation function is np.mean):
df = zz.pivot_table(index='date', columns='feccandid', fill_value=0, aggfunc=np.mean)
# Flatten the two-level column index into single names such as 'billsum_S8NV00073'
df.columns = ['_'.join(col) for col in df.columns.values]
print(df)
If you need to replace NaN with 0:
print(zz.pivot_table(index='date', columns='feccandid', fill_value=0, aggfunc=np.mean))
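A mean is only defined for numeric columns, so string columns from the question such as party or state cannot be averaged. The following is a minimal sketch, not part of the original answer, that uses aggfunc='first' to keep one value per (date, feccandid) cell instead. The small frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the question's data, including one string column.
sample = pd.DataFrame({
    'date': [2006, 2006, 2007, 2007],
    'feccandid': ['S8NV00073', 'S8NV00073', 'S8NV00073', 'S8MT00010'],
    'party': ['Republican', 'Republican', 'Republican', 'Democrat'],
    'amtsum': [5000, 4500, 2500, 4000],
})

# 'first' keeps the first value seen in each (date, feccandid) group,
# so it also works for columns that cannot be averaged.
first_wide = sample.pivot_table(index='date', columns='feccandid', aggfunc='first')
print(first_wide)
```

Whether taking the first value is acceptable depends on the data: it only loses nothing when the column is constant within each (date, feccandid) group.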
Edit:
I created a small sample DataFrame. As ptrj noted, you can use T and to_panel to create a Panel. You may then need transpose:
import numpy as np
import pandas as pd
zz = pd.DataFrame({'date': {0: 2001, 1: 2001, 2: 2002, 3: 2002},
                   'feccandid': {0: 'S8NV00072', 1: 'S8NV00074',
                                 2: 'S8NV00072', 3: 'S8NV00074'},
                   'pacid': {0: 0.3, 1: 0.1, 2: 0.7, 3: 0.4},
                   'billsum': {0: 1, 1: 2, 2: 5, 3: 6}})
print(zz)
billsum date feccandid pacid
0 1 2001 S8NV00072 0.3
1 2 2001 S8NV00074 0.1
2 5 2002 S8NV00072 0.7
3 6 2002 S8NV00074 0.4
zz = zz.pivot_table(index='date',
                    columns='feccandid',
                    fill_value=0,
                    aggfunc=np.mean)
print(zz.T)
date 2001 2002
feccandid
billsum S8NV00072 1.0 5.0
S8NV00074 2.0 6.0
pacid S8NV00072 0.3 0.7
S8NV00074 0.1 0.4
wp = zz.T.to_panel()
print(wp)
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2001 to 2002
Major_axis axis: billsum to pacid
Minor_axis axis: S8NV00072 to S8NV00074
print(wp.transpose(2, 0, 1))
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: S8NV00072 to S8NV00074
Major_axis axis: 2001 to 2002
Minor_axis axis: billsum to pacid
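Panel was deprecated in pandas 0.20 and removed in 0.25, so the to_panel step above only runs on older versions. As a rough alternative sketch (an assumption about what the asker is after, not part of the original answer), stacking the outer column level of the pivoted frame gives rows keyed by date and original column name, with one column per feccandid. The level name 'field' is purely illustrative.

```python
import pandas as pd

zz = pd.DataFrame({
    'date': [2001, 2001, 2002, 2002],
    'feccandid': ['S8NV00072', 'S8NV00074', 'S8NV00072', 'S8NV00074'],
    'pacid': [0.3, 0.1, 0.7, 0.4],
    'billsum': [1, 2, 5, 6],
})

# Same pivot as in the answer: date rows, (original column, feccandid) columns.
wide = zz.pivot_table(index='date', columns='feccandid', aggfunc='mean')

# Move the outer column level (the original column names) into the row index.
by_date_and_field = wide.stack(level=0)
by_date_and_field.index.names = ['date', 'field']  # 'field' is an illustrative name
print(by_date_and_field)
```

The result is an ordinary DataFrame with a two-level row index, so a single year or field can be selected with by_date_and_field.loc[2001] or by_date_and_field.xs('billsum', level='field').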