Pandas groupby, resample, calculate pct_change, then store the result back in the original-frequency DataFrame

Time: 2019-08-29 06:07:04

Tags: python pandas pandas-groupby resampling

I have a dataframe of daily stock data, indexed by a DatetimeIndex.

There are entries for multiple stocks, so there are duplicate DatetimeIndex values.

I am looking for a way to:

  1. Group the dataframe by stock symbol
  2. Resample each symbol's prices to monthly frequency
  3. Perform a pct_change calculation on each symbol group's monthly prices
  4. Store the result as a new column "monthly_return" in the original dataframe.

I have been able to manage the first three operations. It is storing the result back in the original dataframe that I am having trouble with.

To illustrate this, I have created a toy dataset that includes a "dummy" index column (idx), which is used later to help create the desired output shown in the third code block.

    import random
    import pandas as pd
    import numpy as np

    PER = 62  # not defined in the original snippet; 62 daily periods (2018-01-01 to 2018-03-03) match the sample output below
    datelist = pd.date_range(pd.Timestamp('2018-01-01'), periods=PER).to_pydatetime().tolist() * 2
    ids = [random.choice(['A', 'B']) for i in range(len(datelist))]
    prices = random.sample(range(200), len(datelist))
    idx = range(len(datelist))
    df1 = pd.DataFrame(data=list(zip(idx, ids, prices)), index=datelist, columns='idx label prices'.split())

    print(df1.head(10))

df1

                idx label  prices
    2018-01-01    0     B      40
    2018-01-02    1     A     190
    2018-01-03    2     A     159
    2018-01-04    3     A      25
    2018-01-05    4     A      89
    2018-01-06    5     B     164
    ...
    2018-01-31   30     A     102
    2018-02-01   31     A     117
    2018-02-02   32     A     120
    2018-02-03   33     B      75
    2018-02-04   34     B     170
    ...

Desired output

                 idx label  prices  monthly_return
    2018-01-01    0     B      40        0.000000
    2018-01-02    1     A     190        0.000000
    2018-01-03    2     A     159        0.000000
    2018-01-04    3     A      25        0.000000
    2018-01-05    4     A      89        0.000000
    2018-01-06    5     B     164        0.000000
    ...
    2018-01-31   30     A     102       -0.098039
    2018-02-01   31     A     117        0.000000
    2018-02-02   32     A     120        0.000000
    ...
    2018-02-26   56     B     152        0.000000
    2018-02-27   57     B       2        0.000000
    2018-02-28   58     B      49       -0.040816
    2018-03-01   59     B     188        0.000000
    ...
    2018-01-28   89     A      88        0.000000
    2018-01-29   90     A      26        0.000000
    2018-01-30   91     B     128        0.000000
    2018-01-31   92     A     144       -0.098039
    ...
    2018-02-26  118     A      92        0.000000
    2018-02-27  119     B     111        0.000000
    2018-02-28  120     B      34       -0.040816
    ...

What I have tried so far:

    dfX = df1.copy(deep=True)
    dfX = df1.groupby('label').resample('M')['prices'].last().pct_change(1).shift(-1)
    print(dfX)

Which outputs:

    label            
    A      2018-01-31   -0.067961
           2018-02-28   -0.364583
           2018-03-31    0.081967
    B      2018-01-31    1.636364
           2018-02-28   -0.557471
           2018-03-31         NaN

This is very close to what I am trying to do, but I only get the pct_change data at the month ends, which is awkward to store back in the original dataframe (df1) as a new column.

Something like the following does not work:

    dfX = df1.copy(deep=True)
    dfX['monthly_return'] = df1.groupby('label').resample('M')['prices'].last().pct_change(1).shift(-1)

because it produces the error:

    TypeError: incompatible index of inserted column with frame index

I have considered "upsampling" the "monthly_return" data back to a daily series, but since the original dataset can have missing dates (e.g. weekends), that would likely end in the error above as well. Furthermore, resetting the index to get around the error still causes problems, because the grouped dfX has a different number of rows / frequency than the original df1 (which is at daily frequency).
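
For context, a quick inspection of the two indexes (a small diagnostic sketch reusing the chain from the attempt above) makes the mismatch visible: the grouped/resampled result carries a (label, month-end) MultiIndex, while df1 keeps a flat daily DatetimeIndex with duplicates.

    # Reusing the chain from the attempt above: the result is indexed by
    # (label, month-end), while df1 keeps a flat daily DatetimeIndex with
    # duplicates -- that mismatch is what raises the TypeError.
    dfX = df1.groupby('label').resample('M')['prices'].last().pct_change(1).shift(-1)
    print(dfX.index)   # MultiIndex of (label, month-end timestamp)
    print(df1.index)   # DatetimeIndex of (duplicated) daily dates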

I have a hunch this can be done using a MultiIndex and a dataframe merge, but I am not sure how to go about it.
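
For reference, here is one way the hinted merge could look (a sketch only; the intermediate names such as monthly and date are illustrative, and it has not been run against the exact data above). It computes pct_change and shift within each label so values do not bleed across symbols, then merges back onto the daily rows by (label, exact month-end date):

    # Monthly last price per label -> pct_change -> shift(-1), all within each
    # label, then merged back onto the daily rows on (label, date).
    monthly = (df1.groupby('label').resample('M')['prices'].last()
                  .groupby(level=0).pct_change(1)
                  .groupby(level=0).shift(-1)
                  .rename('monthly_return')
                  .reset_index())
    monthly.columns = ['label', 'date', 'monthly_return']

    out = (df1.reset_index().rename(columns={'index': 'date'})
              .merge(monthly, on=['label', 'date'], how='left')
              .set_index('date'))
    out['monthly_return'] = out['monthly_return'].fillna(0)  # non-month-end rows -> 0, as in the desired output

If the calendar month-end dates can be missing from the data (e.g. weekends), the same merge could instead be done on a month-period helper column (dt.to_period('M')) rather than the exact date.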

1 Answer:

Answer 0 (score: 0)

This generates the output I want, but it is not as clean as I would like.

df1 is generated as before (with the code given in the question):

                idx label  prices
    2018-01-01    0     A     145
    2018-01-02    1     B      86
    2018-01-03    2     B     141
    ...
    2018-01-25   86     B      12
    2018-01-26   87     B      71
    2018-01-27   88     B     186
    2018-01-28   89     B     151
    2018-01-29   90     A     161
    2018-01-30   91     B     143
    2018-01-31   92     B      88
    ...

Then:

    def fun(x):
        # Keep the group's original (possibly duplicated) dates so the daily
        # rows can be restored after the monthly resample.
        dates = x.date
        x = x.set_index('date', drop=True)
        # Month-end prices -> pct_change -> shift(-1); the assignment aligns on
        # matching dates, so only month-end rows receive a value.
        x['monthly_return'] = x.resample('M').last()['prices'].pct_change(1).shift(-1)
        # Restore the original daily rows; all other dates get NaN.
        x = x.reindex(dates)
        return x

    dfX = df1.copy(deep=True)
    dfX.reset_index(inplace=True)                    # move the DatetimeIndex into a column
    dfX.columns = 'date idx label prices'.split()

    dfX = dfX.groupby('label').apply(fun).droplevel(level='label')
    print(dfX)

This outputs the desired result (unsorted):

                idx label  prices  monthly_return
    date                                         
    2018-01-01    0     A     145             NaN
    2018-01-06    5     A      77             NaN
    2018-01-08    7     A      48             NaN
    2018-01-09    8     A      31             NaN
    2018-01-11   10     A      20             NaN
    2018-01-12   11     A      27             NaN
    2018-01-14   13     A     109             NaN
    2018-01-15   14     A     166             NaN
    2018-01-17   16     A     130             NaN
    2018-01-18   17     A     139             NaN
    2018-01-19   18     A     191             NaN
    2018-01-21   20     A     164             NaN
    2018-01-22   21     A     112             NaN
    2018-01-23   22     A     167             NaN
    2018-01-25   24     A     140             NaN
    2018-01-26   25     A      42             NaN
    2018-01-30   29     A     107             NaN
    2018-02-04   34     A       9             NaN
    2018-02-07   37     A      84             NaN
    2018-02-08   38     A      23             NaN
    2018-02-10   40     A      30             NaN
    2018-02-12   42     A      89             NaN
    2018-02-15   45     A      79             NaN
    2018-02-16   46     A     115             NaN
    2018-02-19   49     A     197             NaN
    2018-02-21   51     A      11             NaN
    2018-02-26   56     A     111             NaN
    2018-02-27   57     A     126             NaN
    2018-03-01   59     A     135             NaN
    2018-03-03   61     A      28             NaN
    2018-01-01   62     A     120             NaN
    2018-01-03   64     A     170             NaN
    2018-01-05   66     A      45             NaN
    2018-01-07   68     A     173             NaN
    2018-01-08   69     A     158             NaN
    2018-01-09   70     A      63             NaN
    2018-01-11   72     A      62             NaN
    2018-01-12   73     A     168             NaN
    2018-01-14   75     A     169             NaN
    2018-01-15   76     A     142             NaN
    2018-01-17   78     A      83             NaN
    2018-01-18   79     A      96             NaN
    2018-01-21   82     A      25             NaN
    2018-01-22   83     A      90             NaN
    2018-01-23   84     A      59             NaN
    2018-01-29   90     A     161             NaN
    2018-02-01   93     A     150             NaN
    2018-02-04   96     A      85             NaN
    2018-02-06   98     A     124             NaN
    2018-02-14  106     A     195             NaN
    2018-02-16  108     A     136             NaN
    2018-02-17  109     A     134             NaN
    2018-02-18  110     A     183             NaN
    2018-02-19  111     A      32             NaN
    2018-02-24  116     A     102             NaN
    2018-02-25  117     A      72             NaN
    2018-02-27  119     A      38             NaN
    2018-03-02  122     A     137             NaN
    2018-03-03  123     A     171             NaN
    2018-01-02    1     B      86             NaN
    2018-01-03    2     B     141             NaN
    2018-01-04    3     B     189             NaN
    2018-01-05    4     B      60             NaN
    2018-01-07    6     B       1             NaN
    2018-01-10    9     B      87             NaN
    2018-01-13   12     B      44             NaN
    2018-01-16   15     B     147             NaN
    2018-01-20   19     B      92             NaN
    2018-01-24   23     B      81             NaN
    2018-01-27   26     B     190             NaN
    2018-01-28   27     B      24             NaN
    2018-01-29   28     B     116             NaN
    2018-01-31   30     B      98        1.181818
    2018-02-01   31     B     121             NaN
    2018-02-02   32     B     110             NaN
    2018-02-03   33     B      66             NaN
    2018-02-05   35     B       4             NaN
    2018-02-06   36     B      13             NaN
    2018-02-09   39     B     114             NaN
    2018-02-11   41     B      16             NaN
    2018-02-13   43     B     174             NaN
    2018-02-14   44     B      78             NaN
    2018-02-17   47     B     144             NaN
    2018-02-18   48     B      14             NaN
    2018-02-20   50     B     133             NaN
    2018-02-22   52     B     156             NaN
    2018-02-23   53     B     159             NaN
    2018-02-24   54     B     177             NaN
    2018-02-25   55     B      43             NaN
    2018-02-28   58     B      19       -0.338542
    2018-03-02   60     B     127             NaN
    2018-01-02   63     B       2             NaN
    2018-01-04   65     B      97             NaN
    2018-01-06   67     B       8             NaN
    2018-01-10   71     B      54             NaN
    2018-01-13   74     B     106             NaN
    2018-01-16   77     B      74             NaN
    2018-01-19   80     B     188             NaN
    2018-01-20   81     B     172             NaN
    2018-01-24   85     B      51             NaN
    2018-01-25   86     B      12             NaN
    2018-01-26   87     B      71             NaN
    2018-01-27   88     B     186             NaN
    2018-01-28   89     B     151             NaN
    2018-01-30   91     B     143             NaN
    2018-01-31   92     B      88        1.181818
    2018-02-02   94     B      75             NaN
    2018-02-03   95     B     103             NaN
    2018-02-05   97     B      82             NaN
    2018-02-07   99     B     128             NaN
    2018-02-08  100     B     123             NaN
    2018-02-09  101     B      52             NaN
    2018-02-10  102     B      18             NaN
    2018-02-11  103     B      21             NaN
    2018-02-12  104     B      50             NaN
    2018-02-13  105     B      64             NaN
    2018-02-15  107     B     185             NaN
    2018-02-20  112     B     125             NaN
    2018-02-21  113     B     108             NaN
    2018-02-22  114     B     132             NaN
    2018-02-23  115     B     180             NaN
    2018-02-26  118     B      67             NaN
    2018-02-28  120     B     192       -0.338542
    2018-03-01  121     B      58             NaN

Perhaps there is a cleaner, more Pythonic way to do this.
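
For what it's worth, one possibly tidier variant (a sketch along the same lines, not verified against the exact data above) computes the monthly series once and looks it up per row through a (label, date) MultiIndex, avoiding the per-group apply/reindex:

    # Monthly forward return per (label, month-end), computed within each label.
    monthly = (df1.groupby('label').resample('M')['prices'].last()
                  .groupby(level=0).pct_change(1)
                  .groupby(level=0).shift(-1))

    # Look up each daily row's (label, date) pair in the monthly series;
    # only exact month-end dates match, everything else becomes NaN (here 0).
    key = pd.MultiIndex.from_arrays([df1['label'], df1.index])
    out = df1.copy()
    out['monthly_return'] = monthly.reindex(key).fillna(0).to_numpy()

As with the merge sketch in the question, this assumes the calendar month-end dates actually appear in the data; otherwise a month-period key would be needed instead.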