Question

我有以下数据框：

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df'

zz=pd.read_csv(url)
zz.head(5)

    date    feccandid   feccandcfscore.dyn  pacid   paccfscore  cid     catcode     type_x  di  amtsum  state   log_diff_unemployment   party   type_y  bills   years_exp   disposition     billsum
0   2006    S8NV00073   0.496   C00000422   0.330   N00006619   H1100   24K     D   5000    NV  -0.024693   Republican  rep     s22-109     12  support     3
1   2006    S8NV00073   0.496   C00375360   0.176   N00006619   H1100   24K     D   4500    NV  -0.024693   Republican  rep     s22-109     12  support     3
2   2006    S8NV00073   0.496   C00113803   0.269   N00006619   H1130   24K     D   2500    NV  -0.024693   Republican  rep     s22-109     12  support     2
3   2006    S8NV00073   0.496   C00249342   0.421   N00006619   H1130   24K     D   5000    NV  -0.024693   Republican  rep     s22-109     12  support     2
4   2006    S8NV00073   0.496   C00255752   0.254   N00006619   H1130   24K     D   4000    NV  -0.024693   Republican  rep     s22-109     12  support     2

我想操纵它，使得date列是索引，feccandid值是列标题（我稍后会将它们作为第二个索引，因此我可以将帧发送到面板）其他列标题成为行。期望的输出看起来这样的东西：

date    feccandid              S8NV00072    S8NV00074   S8NV00075   S8NV00076   S8NV00077
2006    feccandcfscore.dyn        0.496        0.496        0.496     0.496       0.496
2006    pacid                  C00000422    C00375360   C00113803   C00249342   C00255752
2006    paccfscore                  0.33        0.176      0.269         0.421    0.254
2006    cid N00006619           N00006619   N00006619   N00006619   N00006619
2006    catcode                  H1100      H1100          H1130    H1130      H1130
2006    type_x                    24K         24K            24K    24K     24K
2006    di                           D          D              D        D       D
2006    amtsum                      5000      4500          2500        5000       4000
2006    state                        NV        NV           NV        NV         NV
2006    log_diff_unemployment   -0.024693   -0.024693   -0.024693   -0.024693   -0.024693
2006    party                     Republican    Republican  Republican  Republican  Republican
2006    type_y                            rep         rep         rep       rep      rep
2006    bills                           s22-109      s22-109    s22-109    s22-109     s22-109
2006    years_exp                             12        12        12       12      12
2006    disposition                      support       support  support support support
2006    billsum                            3               3        2      2       2

我按照 jezrael

的推荐尝试了以下内容

zz=zz.pivot_table(index='date', columns='feccandid', aggfunc=np.mean)

zz.head()

    feccandcfscore.dyn  ...     billsum
feccandid   H0AL02087   H0AL07060   H0AR01083   H0AR02107   H0AR03055   H0AR04038   H0AZ01259   H0AZ03362   H0CA15148   H0CA19173   ...     S8MI00158   S8MN00438   S8MS00055   S8MT00010   S8NC00239   S8NE00117   S8NM00010   S8NV00073   S8OR00207   S8WI00026
date                                                                                    
2005    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2006    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     2.125   NaN     NaN
2007    NaN     0.016   NaN     NaN     NaN     -0.151  NaN     NaN     -0.777  NaN     ...     1.000000    NaN     1.666667    1.552632    NaN     NaN     2.0     1.000   NaN     2.0
2008    NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     1.285714    NaN     NaN     5.431373    NaN     NaN     NaN     NaN     NaN     NaN
2009    NaN     NaN     NaN     NaN     NaN     -0.086  NaN     NaN     -0.790  NaN     ...     NaN     NaN     NaN     2.433333    NaN     NaN     NaN     NaN     3.0     2.8

除了我试图将feccandid作为唯一的列标题和原始列标题（在最后一个示例中为 - ）之外，这与我想要的非常接近最顶层的列标题）作为行转换。

Answer 1

我认为您可以使用pivot_table（默认聚合函数为np.mean）：

df = zz.pivot_table(index='date', columns='feccandid', fill_value='0', aggfunc=np.mean)
df.columns = ['_'.join(col) for col in df.columns.values]
print df

如果您需要将NaN替换为0：

print zz.pivot_table(index='date', columns='feccandid', fill_value='0', aggfunc=np.mean)

编辑：

我创建小样本DataFrame正如ptrj所述，您可以使用T和to_panel来创建panel。那么也许你需要transpose：

import pandas as pd

zz = pd.DataFrame({'date': {0: 2001, 1: 2001, 2: 2002, 3: 2002}, 
                   'feccandid': {0: 'S8NV00072', 1: 'S8NV00074', 
                                 2: 'S8NV00072', 3: 'S8NV00074'}, 
                   'pacid': {0: 0.3, 1: 0.1, 2: 0.7, 3: 0.4},
                   'billsum': {0: 1, 1: 2, 2: 5, 3: 6}})

print zz
   billsum  date  feccandid  pacid
0        1  2001  S8NV00072    0.3
1        2  2001  S8NV00074    0.1
2        5  2002  S8NV00072    0.7
3        6  2002  S8NV00074    0.4

zz = zz.pivot_table(index='date', 
                         columns='feccandid',
                         fill_value=0, 
                         aggfunc=np.mean)
print zz.T   
date               2001  2002
        feccandid            
billsum S8NV00072   1.0   5.0
        S8NV00074   2.0   6.0
pacid   S8NV00072   0.3   0.7
        S8NV00074   0.1   0.4

wp = zz.T.to_panel()
print wp
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: 2001 to 2002
Major_axis axis: billsum to pacid
Minor_axis axis: S8NV00072 to S8NV00074

print wp.transpose(2, 0, 1)

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 2 (major_axis) x 2 (minor_axis)
Items axis: S8NV00072 to S8NV00074
Major_axis axis: 2001 to 2002
Minor_axis axis: billsum to pacid

熊猫：重塑数据框架

1 个答案: