How to apply resampling and groupby at the same time with pandas?

Asked: 2016-06-20 09:18:10

Tags: python-2.7 pandas indexing group-by resampling

My goal is to add rows in pandas so that, while resampling the dates, missing days are filled with the previous day's data. My data contains several product IDs, and I have to do a groupby first because I need to keep a separate time series per productId. Example: here is my DataFrame:

   productId    popularity  converted_timestamp     date
0     1            5         2015-12-01           2015-12-01
1     1            8         2015-12-02           2015-12-02
2     1            6         2015-12-04           2015-12-04
3     1            9         2015-12-07           2015-12-07
4     2            5         2015-12-01           2015-12-01
5     2           10         2015-12-03           2015-12-03
6     2            6         2015-12-04           2015-12-04
7     2           12         2015-12-07           2015-12-07
8     2           11         2015-12-09           2015-12-09

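For reference, the sample frame above can be reconstructed like this (a sketch; the dates are parsed as datetimes so they can later drive resampling):

```python
import pandas as pd

# Sample data from the question; 'date' mirrors 'converted_timestamp'.
df = pd.DataFrame({
    'productId': [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'popularity': [5, 8, 6, 9, 5, 10, 6, 12, 11],
    'converted_timestamp': pd.to_datetime(
        ['2015-12-01', '2015-12-02', '2015-12-04', '2015-12-07',
         '2015-12-01', '2015-12-03', '2015-12-04', '2015-12-07',
         '2015-12-09']),
})
df['date'] = df['converted_timestamp']
print(df)
```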
This is what I want:

      date     productId  popularity    converted_timestamp
0   2015-12-01    1          5          2015-12-01
1   2015-12-02    1          8          2015-12-02
2   2015-12-03    1          8          2015-12-02
3   2015-12-04    1          6          2015-12-04
4   2015-12-05    1          6          2015-12-04
5   2015-12-06    1          6          2015-12-04
6   2015-12-07    1          9          2015-12-07
7   2015-12-01    2          5          2015-12-01
8   2015-12-02    2          5          2015-12-01
9   2015-12-03    2         10          2015-12-03
10  2015-12-04    2          6          2015-12-04
11  2015-12-05    2          6          2015-12-04
12  2015-12-06    2          6          2015-12-04
13  2015-12-07    2         12          2015-12-07
14  2015-12-08    2         12          2015-12-07
15  2015-12-09    2         11          2015-12-09

Here is my code:

df.set_index('date').groupby('productId', group_keys=False).apply(lambda df: df.resample('D').ffill()).reset_index()
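Using the sample data from the question, that one-liner can be exercised end-to-end; each product is resampled to daily frequency on its own, with gaps forward-filled (a self-contained sketch):

```python
import pandas as pd

# Sample data from the question: unique dates per product.
df = pd.DataFrame({
    'productId': [1, 1, 1, 1, 2, 2, 2, 2, 2],
    'popularity': [5, 8, 6, 9, 5, 10, 6, 12, 11],
    'converted_timestamp': pd.to_datetime(
        ['2015-12-01', '2015-12-02', '2015-12-04', '2015-12-07',
         '2015-12-01', '2015-12-03', '2015-12-04', '2015-12-07',
         '2015-12-09']),
})
df['date'] = df['converted_timestamp']

# Resample each product's series to daily frequency, forward-filling
# the missing days, then restore 'date' as a column.
out = (df.set_index('date')
         .groupby('productId', group_keys=False)
         .apply(lambda g: g.resample('D').ffill())
         .reset_index())
print(out)
```

Product 1 spans 7 days and product 2 spans 9 days, so the result has 16 rows, matching the desired output above.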

It works, and the result is exactly what I want! But my new data looks like this:

           productId    popularity  converted_timestamp    date
11960909    15620743.0  526888.0    2016-01-11          2016-01-11
11960910    15620743.0  487450.0    2016-02-26          2016-02-26
11960911    15620743.0  487450.0    2016-02-26          2016-02-26
12355593    17175984.0  751990.0    2016-01-28          2016-01-28
12355594    17175984.0  584549.0    2016-01-26          2016-01-26
12355595    17175984.0  587289.0    2016-01-26          2016-01-26
12355596    17175984.0  574454.0    2016-01-26          2016-01-26
12355597    17175984.0  570663.0    2016-01-26          2016-01-26
12355598    17175984.0  566914.0    2016-01-26          2016-01-26
12355599    17175984.0  591241.0    2016-01-26          2016-01-26
12355600    17175984.0  590637.0    2016-01-26          2016-01-26
12355601    17175984.0  556794.0    2016-01-27          2016-01-27
12355602    17175984.0  512403.0    2016-02-10          2016-02-10
12355603    17175984.0  510561.0    2016-02-10          2016-02-10
12355604    17175984.0  513907.0    2016-02-10          2016-02-10
12355605    17175984.0  512403.0    2016-02-10          2016-02-10
12355606    17175984.0  511038.0    2016-02-10          2016-02-10
12355607    17175984.0  510561.0    2016-02-10          2016-02-10
12355608    17175984.0  554359.0    2016-01-27          2016-01-27
17028384    16013607.0  563480.0    2016-02-21          2016-02-21
17028385    16013607.0  563480.0    2016-02-21          2016-02-21
17028386    16013607.0  563480.0    2016-02-21          2016-02-21
17028387    16013607.0  563480.0    2016-02-21          2016-02-21
17028388    16013607.0  563480.0    2016-02-21          2016-02-21
17028389    16013607.0  563480.0    2016-02-21          2016-02-21
17028390    16013607.0  563480.0    2016-02-21          2016-02-21
17028391    16013607.0  563480.0    2016-02-21          2016-02-21
17028392    16013607.0  546230.0    2016-02-14          2016-02-14
17028393    16013607.0  546230.0    2016-02-14          2016-02-14
17028394    16013607.0  546230.0    2016-02-14          2016-02-14
17028395    16013607.0  546230.0    2016-02-14          2016-02-14
17028396    16013607.0  546230.0    2016-02-14          2016-02-14
17028397    16013607.0  546230.0    2016-02-14          2016-02-14
17028398    16013607.0  546230.0    2016-02-14          2016-02-14
17028399    16013607.0  546230.0    2016-02-14          2016-02-14

and the same code raises this error: ValueError: cannot reindex a non-unique index with a method or limit

Why? Any help? Thanks.
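The failure can be reproduced on a tiny frame with a duplicated date for one product, as in the data above (e.g. 2016-02-26 appears twice for product 15620743); the values here are illustrative placeholders:

```python
import pandas as pd

# Two rows share the same date for the same product.
dup = pd.DataFrame({
    'productId': [1.0, 1.0, 1.0],
    'popularity': [5.0, 8.0, 6.0],
    'date': pd.to_datetime(['2016-02-26', '2016-02-26', '2016-02-28']),
})

raised = False
try:
    (dup.set_index('date')
        .groupby('productId', group_keys=False)
        .apply(lambda g: g.resample('D').ffill()))
except ValueError as err:
    # Forward-filling an upsampled index reindexes the original index,
    # which fails when that index contains duplicate timestamps.
    raised = True
    print(type(err).__name__, err)

print('duplicate timestamps raised ValueError:', raised)
```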

1 answer:

Answer 0 (score: 1)

There are duplicates - one possible solution is to aggregate them away first:

df = (df.groupby(['productId','converted_timestamp','date'], as_index=False)['popularity']
        .mean())
print (df)
    productId converted_timestamp       date     popularity
0  15620743.0          2016-01-11 2016-01-11  526888.000000
1  15620743.0          2016-02-26 2016-02-26  487450.000000
2  16013607.0          2016-02-14 2016-02-14  546230.000000
3  16013607.0          2016-02-21 2016-02-21  563480.000000
4  17175984.0          2016-01-26 2016-01-26  580821.000000
5  17175984.0          2016-01-27 2016-01-27  555576.500000
6  17175984.0          2016-01-28 2016-01-28  751990.000000
7  17175984.0          2016-02-10 2016-02-10  511812.166667
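Averaging is one choice; if you would rather keep a single representative row per (productId, date) pair, dropping duplicates is an alternative (a sketch with placeholder values, keeping the last observation for each day):

```python
import pandas as pd

df = pd.DataFrame({
    'productId': [1.0, 1.0, 1.0],
    'popularity': [526888.0, 487450.0, 487451.0],
    'date': pd.to_datetime(['2016-01-11', '2016-02-26', '2016-02-26']),
})

# Keep only the last row observed for each (productId, date) pair
# instead of averaging the duplicates.
dedup = df.drop_duplicates(subset=['productId', 'date'], keep='last')
print(dedup)
```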

Then you can use (pandas 0.18.1):

df = (df.set_index('date')
        .groupby('productId', group_keys=False)
        .resample('D')
        .ffill()
        .reset_index())
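Putting the two steps together - collapse duplicates, then resample per product - can be sketched on a small frame with a duplicated date (hypothetical values; the per-group `apply` form is used here, as in the question):

```python
import pandas as pd

raw = pd.DataFrame({
    'productId': [1, 1, 1, 2, 2],
    'popularity': [4.0, 6.0, 9.0, 5.0, 7.0],
    'date': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-01-04',
                            '2016-01-01', '2016-01-03']),
})

# Step 1: collapse duplicate (productId, date) pairs by averaging popularity.
clean = (raw.groupby(['productId', 'date'], as_index=False)['popularity']
            .mean())

# Step 2: resample each product to daily frequency, forward-filling gaps;
# the index is now unique per group, so ffill no longer raises.
daily = (clean.set_index('date')
              .groupby('productId', group_keys=False)
              .apply(lambda g: g.resample('D').ffill())
              .reset_index())
print(daily)
```

Product 1 covers Jan 1-4 (4 rows, the two Jan 1 rows averaged to 5.0) and product 2 covers Jan 1-3 (3 rows), for 7 rows in total.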