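I'm struggling with a look-back (rolling) problem on a DataFrame, which can perhaps be solved with groupby.
Here is a simple example of the DataFrame I have: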
fruit amount
20140101 apple 3
20140102 apple 5
20140102 orange 10
20140104 banana 2
20140104 apple 10
20140104 orange 4
20140105 orange 6
20140105 grape 1
…
20141231 apple 3
20141231 grape 2
For each day, I need to compute the mean of 'amount' over the preceding 3 days for each fruit, and build the following DataFrame:
fruit average_in_last 3 days
20140104 apple 4
20140104 orange 10
...
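For example, on 20140104 the preceding 3 days are 20140101, 20140102 and 20140103 (note that the dates in the DataFrame are not contiguous and 20140103 does not exist). The average amount for apple is (3 + 5) / 2 = 4 and for orange it is 10 / 1 = 10; for the rest it is 0.
The sample DataFrame is very simple, but the real one is more complex and much larger. I hope someone can shed some light on this. Thanks in advance!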
Answer 0 (score: 6)
Suppose we start with a similar DataFrame:
>>> df
fruit amount
2017-06-01 apple 1
2017-06-03 apple 16
2017-06-04 apple 12
2017-06-05 apple 8
2017-06-06 apple 14
2017-06-08 apple 1
2017-06-09 apple 4
2017-06-02 orange 13
2017-06-03 orange 9
2017-06-04 orange 9
2017-06-05 orange 2
2017-06-06 orange 11
2017-06-07 orange 6
2017-06-08 orange 3
2017-06-09 orange 3
2017-06-10 orange 13
2017-06-02 grape 14
2017-06-03 grape 16
2017-06-07 grape 4
2017-06-09 grape 15
2017-06-10 grape 5
>>> import pandas as pd
>>> dates = [i.date() for i in pd.date_range('2017-06-01', '2017-06-10')]
>>> temp = (df.groupby('fruit')['amount']
...           .apply(lambda x: x.reindex(dates)   # fill in the missing dates for each group
...                             .fillna(0)        # treat each missing day as 0
...                             .rolling(3)
...                             .sum())           # rolling sum over 3 days, current day included
...           .reset_index()
...           .rename(columns={'amount': 'sum_of_3_days',
...                            'level_1': 'date'}))  # rename the date index level to a 'date' column
>>> temp.head()
fruit date sum_of_3_days
0 apple 2017-06-01 NaN
1 apple 2017-06-02 NaN
2 apple 2017-06-03 17.0
3 apple 2017-06-04 28.0
4 apple 2017-06-05 36.0
# convert the date index into a 'date' column, then attach the rolling sums
>>> df = df.reset_index().rename(columns={'index': 'date'})
>>> df = df.merge(temp, on=['fruit', 'date'])  # merge returns a new frame, so assign it back
>>> df
date fruit amount sum_of_3_days
0 2017-06-01 apple 1 NaN
1 2017-06-03 apple 16 17.0
2 2017-06-04 apple 12 28.0
3 2017-06-05 apple 8 36.0
4 2017-06-06 apple 14 34.0
5 2017-06-08 apple 1 15.0
6 2017-06-09 apple 4 5.0
7 2017-06-02 orange 13 NaN
8 2017-06-03 orange 9 22.0
9 2017-06-04 orange 9 31.0
10 2017-06-05 orange 2 20.0
11 2017-06-06 orange 11 22.0
12 2017-06-07 orange 6 19.0
13 2017-06-08 orange 3 20.0
14 2017-06-09 orange 3 12.0
15 2017-06-10 orange 13 19.0
16 2017-06-02 grape 14 NaN
17 2017-06-03 grape 16 30.0
18 2017-06-07 grape 4 4.0
19 2017-06-09 grape 15 19.0
20 2017-06-10 grape 5 20.0
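Note that the rolling sum above includes the current day and counts missing days as 0. If the goal is instead the mean over the three preceding calendar days, using only the days on which a fruit actually appears, one possible variant is sketched below. This is a minimal sketch and not part of the original answer: the sample frame is rebuilt from the question, the helper name `prev_3_day_mean` is made up for illustration, and it assumes at most one row per fruit per day.

```python
import pandas as pd

# Sample data rebuilt from the question, with a proper DatetimeIndex.
df = pd.DataFrame(
    {
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    },
    index=pd.to_datetime(
        ["20140101", "20140102", "20140102", "20140104",
         "20140104", "20140104", "20140105", "20140105"],
        format="%Y%m%d",
    ),
)

# Every calendar day in the covered range, including days absent from the data.
days = pd.date_range(df.index.min(), df.index.max(), freq="D")


def prev_3_day_mean(s):
    """Mean of the 3 calendar days before each day (hypothetical helper)."""
    return (
        s.reindex(days)                  # missing days become NaN, not 0
         .rolling(3, min_periods=1)      # NaN days are skipped by mean()
         .mean()
         .shift(1)                       # exclude the current day
    )


# One daily series per fruit, stacked into a (fruit, date) indexed Series.
parts = {fruit: prev_3_day_mean(s) for fruit, s in df.groupby("fruit")["amount"]}
result = pd.concat(parts).rename_axis(["fruit", "date"]).dropna()
print(result)
```

On the question's sample this gives 4.0 for apple and 10.0 for orange on 2014-01-04, matching the expected output in the question. Fruits with no data in the window are dropped here rather than reported as 0.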
Answer 1 (score: 5)
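I also wanted to use a rolling window with groupby, which is how I landed on this page, but I believe my workaround is better than the previous suggestion.
You can do the following: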
pivoted_df = pd.pivot_table(df, index='date', columns='fruit', values='amount')
average_fruits = pivoted_df.rolling(window=3).mean().stack()
.stack() is not required, but it converts the pivot table back into a regular df.
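One caveat with `rolling(window=3)` on the pivoted frame is that it counts rows, not calendar days, so it drifts when dates are missing (as 20140103 is in the question). A possible variant, sketched here under assumptions, uses a time-based window instead: `'3D'` with `closed='left'` covers the three days before each date and leaves the current day out. The sample frame and the `date` column are rebuilt from the question for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question, with the dates in a 'date' column.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["20140101", "20140102", "20140102", "20140104",
             "20140104", "20140104", "20140105", "20140105"],
            format="%Y%m%d",
        ),
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    }
)

# One column per fruit, one row per date present in the data.
pivoted = pd.pivot_table(df, index="date", columns="fruit", values="amount")

# Time-based window: the 3 days before each date, current day excluded.
# NaN cells (fruit absent that day) are ignored by mean().
avg = pivoted.rolling("3D", closed="left").mean()

# Back to long format, indexed by (date, fruit); drop empty windows.
avg_long = avg.stack().dropna()
print(avg_long)
```

Because the window is defined by the index timestamps, no reindexing to a full daily calendar is needed.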
Answer 2 (score: 3)
>>> df
fruit amount
20140101 apple 3
20140102 apple 5
20140102 orange 10
20140104 banana 2
20140104 apple 10
20140104 orange 4
20140105 orange 6
20140105 grape 1
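>>> g = df.set_index('fruit', append=True).groupby(level=1)
>>> # pd.rolling_mean(s, 3, 1) is a legacy API that was later removed from pandas;
>>> # its equivalent there is s.rolling(3, min_periods=1).mean()
>>> res = g['amount'].apply(pd.rolling_mean, 3, 1).reset_index('fruit')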
>>> res
fruit 0
20140101 apple 3.000000
20140102 apple 4.000000
20140102 orange 10.000000
20140104 banana 2.000000
20140104 apple 6.000000
20140104 orange 7.000000
20140105 orange 6.666667
20140105 grape 1.000000
Update
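Well, as @cphlewis mentioned in the comments, my code does not give the result you want. I have looked at different approaches, and the one I have found so far is this (not sure about performance, though):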
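>>> df.index = [pd.to_datetime(str(x), format='%Y%m%d') for x in df.index]
>>> df.reset_index(inplace=True)   # the dates become a column named 'index'
>>> def avg_3_days(x):
...     # rows dated within the 3 days before x's date (current day excluded), same fruit
...     recent = (df['index'] >= x['index'] - pd.DateOffset(days=3)) & (df['index'] < x['index'])
...     return df.loc[recent & (df['fruit'] == x['fruit']), 'amount'].mean()
...
>>> df['res'] = df.apply(avg_3_days, axis=1)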
>>> df
index fruit amount res
0 2014-01-01 apple 3 NaN
1 2014-01-02 apple 5 3
2 2014-01-02 orange 10 NaN
3 2014-01-04 banana 2 NaN
4 2014-01-04 apple 10 4
5 2014-01-04 orange 4 10
6 2014-01-05 orange 6 7
7 2014-01-05 grape 1 NaN
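Since the row-wise `apply` above rescans the whole frame for every row, it may get slow on large data. A possible vectorized alternative, offered as a sketch rather than as the answer's own method, is a per-fruit time-based rolling window: `'3D'` with `closed='left'` selects the three days before each row's date and excludes the row's own day. The sample frame is rebuilt from the question, and the column names `date` and `res` are chosen here for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question, with the dates in a 'date' column.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["20140101", "20140102", "20140102", "20140104",
             "20140104", "20140104", "20140105", "20140105"],
            format="%Y%m%d",
        ),
        "fruit": ["apple", "apple", "orange", "banana",
                  "apple", "orange", "orange", "grape"],
        "amount": [3, 5, 10, 2, 10, 4, 6, 1],
    }
)

rolled = (
    df.sort_values("date")               # time-based windows need sorted dates
      .set_index("date")
      .groupby("fruit")["amount"]
      .rolling("3D", closed="left")      # the 3 days before each date, current day excluded
      .mean()
      .rename("res")
      .reset_index()                     # columns: fruit, date, res
)

# Attach the per-fruit averages back onto the original rows.
out = df.merge(rolled, on=["fruit", "date"], how="left")
print(out)
```

On the sample data the `res` column comes out the same as in the table above (NaN, 3, NaN, NaN, 4, 10, 7, NaN). The merge assumes at most one row per fruit per date; duplicates would multiply rows.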