Merging a pivot table with "long" format data

Time: 2018-02-15 08:40:30

Tags: python pandas pivot-table

Update!

Remember to convert the year to int in both df and the pivot table (after unstacking). A type mismatch here caused me some trouble :)

Data for the values:

import pandas as pd

d = {'ID': [1, 1, 1, 2, 2, 2],
     'Date': ['01-01-2013', '01-02-2013', '01-03-2013',
              '01-01-2008', '01-02-2008', '01-03-2008'],
     'CUSIP': ['X1', 'X1', 'X1', 'X2', 'X2', 'X2'],
     'X': ['bla', 'bla', 'bla', 'bla', 'bla', 'bla']}
df = pd.DataFrame(data=d)

I have a DataFrame:

   Identifier CUSIP    X       Date
0           1    X1  bla 2013-01-01
1           1    X1  bla 2013-01-02
2           1    X1  bla 2013-01-03
3           2    X2  bla 2008-01-01
4           2    X2  bla 2008-01-02
5           2    X2  bla 2008-01-03

and a pivot table:

       2008  2009  2010  2011  2012  2013
CUSIP                                    
X1        1     1     1     1     1     1
X2        2     2     2     2     2     2

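I would like to end up with the following layout, where Values holds the pivot table entry for each row's CUSIP and year: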

   Identifier CUSIP    X       Date  Values
0           1    X1  bla 2013-01-01       1
1           1    X1  bla 2013-01-02       1
2           1    X1  bla 2013-01-03       1
3           2    X2  bla 2008-01-01       2
4           2    X2  bla 2008-01-02       2
5           2    X2  bla 2008-01-03       2
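
The type mismatch mentioned in the update at the top can be reproduced with a small sketch. The pivot table here is hypothetical: its column labels are strings, as might happen after reading it from a file, while the years derived from Date are integers.

```python
import pandas as pd

# Hypothetical pivot table with string column labels (an assumption for illustration).
piv = pd.DataFrame(
    [[1, 1], [2, 2]],
    index=pd.Index(["X1", "X2"], name="CUSIP"),
    columns=["2008", "2013"],
)

df = pd.DataFrame({
    "CUSIP": ["X1", "X2"],
    "Date": pd.to_datetime(["2013-01-01", "2008-01-01"]),
})
df["year"] = df["Date"].dt.year  # integer years

# String labels never equal integer years, so no key matches.
before = bool(df["year"].isin(piv.columns).any())

# Convert the labels to int so both sides use the same type.
piv.columns = piv.columns.astype(int)
after = bool(df["year"].isin(piv.columns).all())

print(before, after)
```

With mismatched types a join or lookup silently finds nothing, which is why the conversion matters.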

3 answers:

Answer 0 (score: 2)

You can reshape the pivot table with stack and then use join to attach it to the original DataFrame:

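# if necessary, convert Date from string to datetime
df['Date'] = pd.to_datetime(df['Date'])
df['year'] = df.Date.dt.year

# df1 is the pivot table here; stack() turns it into a Series indexed by (CUSIP, year)
df1 = df.join(df1.stack().rename('val'), on=['CUSIP', 'year'])
print(df1)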
   Identifier       Date CUSIP    X  year  val
0           1 2013-01-01    X1  bla  2013    1
1           1 2013-01-02    X1  bla  2013    1
2           1 2013-04-03    X1  bla  2013    1
3           2 2008-01-01    X2  bla  2008    2
4           2 2008-01-02    X2  bla  2008    2
5           2 2008-03-03    X2  bla  2008    2

Alternative solution, without adding a year column:

# df1 is again the original pivot table; the join keys are passed as Series
df1 = df.join(df1.stack().rename('val'), on=[df['CUSIP'], df['Date'].dt.year])
print(df1)
   Identifier       Date CUSIP    X  val
0           1 2013-01-01    X1  bla    1
1           1 2013-01-02    X1  bla    1
2           1 2013-04-03    X1  bla    1
3           2 2008-01-01    X2  bla    2
4           2 2008-01-02    X2  bla    2
5           2 2008-03-03    X2  bla    2
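
For reference, here is a self-contained sketch of the stack-and-join approach using the data from the question. The names piv, stacked and out are chosen for illustration only, and the pivot table is rebuilt by hand because the question does not show how it was created.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-01-03",
         "2008-01-01", "2008-01-02", "2008-01-03"]
    ),
    "CUSIP": ["X1", "X1", "X1", "X2", "X2", "X2"],
    "X": ["bla"] * 6,
})

# Pivot table from the question: one row per CUSIP, one integer column per year.
piv = pd.DataFrame(
    [[1] * 6, [2] * 6],
    index=pd.Index(["X1", "X2"], name="CUSIP"),
    columns=list(range(2008, 2014)),
)

# Long form: a Series whose index levels are (CUSIP, year).
stacked = piv.stack().rename("val")

# Left join: the CUSIP and year columns of df are matched against the index levels.
out = df.assign(year=df["Date"].dt.year).join(stacked, on=["CUSIP", "year"])
print(out["val"].tolist())  # [1, 1, 1, 2, 2, 2]
```

Using a separate name for the result keeps the pivot table available, so either variant can be rerun.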

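If the values can be computed from df itself, I believe you can also use transform, grouping by CUSIP and year with a function such as size, mean or sum: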

df['Date'] = pd.to_datetime(df['Date'])

df['Vals'] = df.groupby(['CUSIP', df['Date'].dt.year])['X'].transform('size')
print(df)
   Identifier       Date CUSIP    X  Vals
0           1 2013-01-01    X1  bla     5
1           1 2013-01-02    X1  bla     5
2           1 2013-04-03    X1  bla     5
3           1 2013-04-04    X1  bla     5
4           1 2013-05-05    X1  bla     5
5           2 2008-01-01    X2  bla     4
6           2 2008-01-02    X2  bla     4
7           2 2008-03-03    X2  bla     4
8           2 2008-03-04    X2  bla     4
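
A minimal sketch of what transform does here, on made-up rows: it computes one number per (CUSIP, year) group and broadcasts it back to every row of that group. Note that this derives the values from df rather than reading them from an existing pivot table.

```python
import pandas as pd

# Made-up sample rows for illustration.
df = pd.DataFrame({
    "CUSIP": ["X1", "X1", "X1", "X2", "X2"],
    "Date": pd.to_datetime(
        ["2013-01-01", "2013-01-02", "2013-04-03", "2008-01-01", "2008-03-03"]
    ),
    "X": ["bla"] * 5,
})

# Group by CUSIP and calendar year; 'size' counts the rows in each group.
counts = df.groupby(["CUSIP", df["Date"].dt.year])["X"].transform("size")
print(counts.tolist())  # [3, 3, 3, 2, 2]
```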

Answer 1 (score: 2)

This is how I would do it. It looks complicated, but it really isn't; I am just explaining each step.
Start with a DataFrame like this:

   Identifier CUSIP    X       Date
0           1    X1  bla 2013-01-01
1           1    X1  bla 2013-01-02
2           1    X1  bla 2013-01-03
3           2    X2  bla 2008-01-01
4           2    X2  bla 2008-01-02
5           2    X2  bla 2008-01-03

Add a year column with df['year'] = df.Date.dt.year:

   Identifier CUSIP    X       Date  year
0           1    X1  bla 2013-01-01  2013
1           1    X1  bla 2013-01-02  2013
2           1    X1  bla 2013-01-03  2013
3           2    X2  bla 2008-01-01  2008
4           2    X2  bla 2008-01-02  2008
5           2    X2  bla 2008-01-03  2008

Then take your pivot table and stack it. (If you work with pivot tables, understanding stack/unstack will help you a great deal.)

       2008  2009  2010  2011  2012  2013
CUSIP                                    
X1        1     1     1     1     1     1
X2        2     2     2     2     2     2

>>> piv.stack()
CUSIP      
X1     2008    1
       2009    1
       2010    1
       2011    1
       2012    1
       2013    1
X2     2008    2
       2009    2
       2010    2
       2011    2
       2012    2
       2013    2
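
As a small illustration of the stack/unstack relationship, the sketch below uses a reduced two-year version of the pivot table (an assumption made to keep it short). stack moves the column labels into an inner index level, and unstack moves them back.

```python
import pandas as pd

piv = pd.DataFrame(
    [[1, 1], [2, 2]],
    index=pd.Index(["X1", "X2"], name="CUSIP"),
    columns=pd.Index([2008, 2013], name="year"),
)

# Wide -> long: each (CUSIP, year) pair becomes one entry of a Series.
long = piv.stack()
print(long.loc[("X2", 2008)])  # 2

# Long -> wide: the inner index level becomes the columns again.
wide = long.unstack()
print(wide)
```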

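Then reindex the stacked Series by the CUSIP and year of each row, so that the values come out in the same order as the rows of the DataFrame.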

>>> piv.stack().reindex(df[['CUSIP', 'year']])
CUSIP      
X1     2013    1
       2013    1
       2013    1
X2     2008    2
       2008    2
       2008    2
dtype: int64

All together:

>>> df['pivot_values'] = piv.stack().reindex(df[['CUSIP', 'year']]).values
>>> df
   Identifier CUSIP    X       Date  year  pivot_values
0           1    X1  bla 2013-01-01  2013             1
1           1    X1  bla 2013-01-02  2013             1
2           1    X1  bla 2013-01-03  2013             1
3           2    X2  bla 2008-01-01  2008             2
4           2    X2  bla 2008-01-02  2008             2
5           2    X2  bla 2008-01-03  2008             2
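
Whether reindex accepts a two-column DataFrame directly may depend on the pandas version; I have not checked this across versions. A variant that does not rely on it builds the keys explicitly with pd.MultiIndex.from_frame. This is a sketch on reduced data, not the answer's original code:

```python
import pandas as pd

df = pd.DataFrame({
    "CUSIP": ["X1", "X1", "X2"],
    "year": [2013, 2013, 2008],
})
piv = pd.DataFrame(
    [[1, 1], [2, 2]],
    index=pd.Index(["X1", "X2"], name="CUSIP"),
    columns=[2008, 2013],
)

# One (CUSIP, year) key per row of df, in row order; duplicates are allowed.
keys = pd.MultiIndex.from_frame(df[["CUSIP", "year"]])

# Pull the matching pivot value for every key and assign by position.
df["pivot_values"] = piv.stack().reindex(keys).to_numpy()
print(df["pivot_values"].tolist())  # [1, 1, 2]
```

Keys that are missing from the pivot table come back as NaN rather than raising an error.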

Answer 2 (score: 2)

Assume my DataFrame is df

df

  CUSIP        Date  ID    X
0    X1  01-01-2013   1  bla
1    X1  01-02-2013   1  bla
2    X1  01-03-2013   1  bla
3    X2  01-01-2008   2  bla
4    X2  01-02-2008   2  bla
5    X2  01-03-2008   2  bla

and the pivot table is pv

pv

       2008  2009  2010  2011  2012  2013
CUSIP                                    
X1        1     1     1     1     1     1
X2        2     2     2     2     2     2

Solution

Use pd.DataFrame.lookup

Since your dates are just strings, I pass them through pd.to_datetime. I also make sure the pv columns are integers.

df.assign(
    PV_Values=
    pv.rename(columns=int).lookup(
        df.CUSIP, pd.to_datetime(df.Date).dt.year
    )
)

  CUSIP        Date  ID    X  PV_Values
0    X1  01-01-2013   1  bla          1
1    X1  01-02-2013   1  bla          1
2    X1  01-03-2013   1  bla          1
3    X2  01-01-2008   2  bla          2
4    X2  01-02-2008   2  bla          2
5    X2  01-03-2008   2  bla          2

Note
If the pv columns are already int and df.Date is already datetime, this is simply:

df.assign(PV_Values=pv.lookup(df.CUSIP, df.Date.dt.year))

  CUSIP        Date  ID    X  PV_Values
0    X1  01-01-2013   1  bla          1
1    X1  01-02-2013   1  bla          1
2    X1  01-03-2013   1  bla          1
3    X2  01-01-2008   2  bla          2
4    X2  01-02-2008   2  bla          2
5    X2  01-03-2008   2  bla          2
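
DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the calls above only run on older versions. One possible replacement, sketched here as an assumption rather than as part of the original answer, translates the labels into positions with Index.get_indexer and then indexes the underlying NumPy array:

```python
import pandas as pd

df = pd.DataFrame({
    "CUSIP": ["X1", "X1", "X2", "X2"],
    "Date": ["01-01-2013", "01-02-2013", "01-01-2008", "01-02-2008"],
})
pv = pd.DataFrame(
    [[1, 1], [2, 2]],
    index=pd.Index(["X1", "X2"], name="CUSIP"),
    columns=[2008, 2013],
)

years = pd.to_datetime(df["Date"], format="%m-%d-%Y").dt.year

# Positions of each row's CUSIP and year in pv; -1 marks a label that is absent.
rows = pv.index.get_indexer(df["CUSIP"])
cols = pv.columns.get_indexer(years)
if (rows < 0).any() or (cols < 0).any():
    raise KeyError("some CUSIP/year pairs are not present in pv")

# Pairwise positional indexing: one value per row of df.
result = df.assign(PV_Values=pv.to_numpy()[rows, cols])
print(result["PV_Values"].tolist())  # [1, 1, 2, 2]
```

The explicit check matters because a position of -1 would otherwise silently pick the last row or column.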