Question

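Given a pandas DataFrame with two columns, "atbats" and "hits", indexed by date, is it possible to get the most recent historical batting average (average hits per at-bat)? For example, the historical batting average could be taken over the fewest most recent at-bats that total more than 10. It is a bit like a rolling window with a conditional number of look-back periods. For example, given: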

      date, atbats, hits, 
2017-01-01,      5,    2,
2017-01-02,      6,    3,
2017-01-03,      1,    1,
2017-01-04,      16,   4,
2017-01-04,      1,    0,

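On the first day there are no historical at-bats. On the second day there are only 5. Since both are less than 10, the result can be NaN or just 0.

On the third day, we look back two days and see 5 + 6 at-bats, for an average of (2 + 3) / (5 + 6) = 0.45 hits per at-bat.

For the next row (the first 2017-01-04 entry), we look back three days and get (2 + 3 + 1) / (5 + 6 + 1) = 0.5 hits per at-bat.

For the last row, we look back one row and get 4 / 16 = 0.25 hits per at-bat. Since that row alone has more than 10 at-bats (16), we don't need to look any further back.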
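
The look-back rule walked through above can be sketched as a plain loop. This is only an illustration of the rule, not a pandas solution: it assumes the rows are already in chronological order and that the threshold means "at least 10 at-bats" (none of the sample windows lands on exactly 10, so "more than 10" would give the same numbers here). The function name `look_back` is made up for this example.

```python
def look_back(atbats, hits, min_atbats=10):
    """For each row, sum earlier rows (nearest first) until min_atbats is reached."""
    results = []
    for i in range(len(atbats)):
        total_ab = 0
        total_h = 0
        j = i - 1
        # Walk backwards over earlier rows and stop as soon as the window is big enough.
        while j >= 0 and total_ab < min_atbats:
            total_ab += atbats[j]
            total_h += hits[j]
            j -= 1
        if total_ab >= min_atbats:
            results.append((total_ab, total_h, total_h / total_ab))
        else:
            # Not enough history yet: report zeros, as in the example.
            results.append((0, 0, 0.0))
    return results


rows = look_back([5, 6, 1, 16, 1], [2, 3, 1, 4, 0])
for past_atbats, past_hits, avg in rows:
    print(past_atbats, past_hits, round(avg, 2))
```

Each row rescans its history, so this is quadratic in the worst case; it is meant as a reference for what the result should be rather than something to run on a large frame.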

The final DataFrame would look like this:

      date, atbats, hits,  pastAtbats, pastHits, avg,
2017-01-01,      5,    2,           0,       0,   0,
2017-01-02,      6,    3,           0,       0,   0,
2017-01-03,      1,    1,          11,       5,   0.45,
2017-01-04,      16,   4,          12,       6,   0.50,
2017-01-04,      1,    0,          16,       4,   0.25,

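Is this kind of calculation possible in pandas?

The only solution I can think of is pure brute force: split up the at-bats in each row by copying each row x times, where x = atbats, and then just use a rolling window of 10. But in my DataFrame "atbats" averages about 80 per day, so this would massively increase the size of the DataFrame and the total number of windows to compute.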
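
To make the size concern concrete, here is a rough sketch of just the expansion step on the sample data. It assumes positional repetition via `np.repeat` (the date index has a duplicate label, so label-based selection would be ambiguous); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"atbats": [5, 6, 1, 16, 1], "hits": [2, 3, 1, 4, 0]},
    index=pd.to_datetime(
        ["2017-01-01", "2017-01-02", "2017-01-03", "2017-01-04", "2017-01-04"]
    ),
)

# Repeat each row position once per at-bat, so a row with 16 at-bats becomes 16 rows.
positions = np.repeat(np.arange(len(df)), df["atbats"].to_numpy())
expanded = df.iloc[positions]

print(len(df), "->", len(expanded))
```

The expanded frame has as many rows as there are at-bats in total, so at roughly 80 at-bats per day it would be about 80 times longer than the original before any rolling window runs.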

Answer 1

Use iterrows to achieve what you need. See below:

Original DataFrame:

index atbats  hits
1       5     2
2       6     3
3       1     1
4      16     4
4       1     0
5       1     0
6      14     2
7       5     1

Code:

data = []        # one [past_atbats, past_hits] pair per row of df
last = [0, 0]    # running [atbats, hits] totals over the rows seen so far
for i, row in df.iterrows():
    # Report the running totals only once they cover at least 10 at-bats.
    if last[0] >= 10:
        data.append(last.copy())
    else:
        data.append([0, 0])

    # A row with 10 or more at-bats is a big enough window on its own,
    # so restart the totals from it; otherwise keep accumulating.
    if row['atbats'] >= 10:
        last[0] = row['atbats']
        last[1] = row['hits']
    else:
        last[0] += row['atbats']
        last[1] += row['hits']

# Assign by position instead of merging on the index: the index label 4
# appears twice, and an index merge would pair each of those rows with both matches.
df['past_atbats'] = [past[0] for past in data]
df['past_hits'] = [past[1] for past in data]
df['avg'] = df['past_hits'].divide(df['past_atbats'])

Result:

index atbats  hits  past_atbats  past_hits       avg
1       5     2            0          0       NaN
2       6     3            0          0       NaN
3       1     1           11          5  0.454545
4      16     4           12          6  0.500000
4       1     0           16          4  0.250000
5       1     0           17          4  0.235294
6      14     2           18          4  0.222222
7       5     1           14          2  0.142857

It could probably be optimized, but I think this will help you.
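
One possible direction for that optimization, offered as a sketch rather than a tested drop-in: the loop above only restarts its totals when a single row has 10 or more at-bats, so with many small rows in a row its window keeps growing instead of staying minimal. Because cumulative at-bats never decrease, the start of the smallest qualifying window can be found with `np.searchsorted` on the cumulative sums. The function name `min_window_stats` is hypothetical, and the code assumes non-negative at-bats, rows in chronological order, and a threshold of "at least 10".

```python
import numpy as np
import pandas as pd


def min_window_stats(df, min_atbats=10):
    """Smallest look-back window of earlier rows totalling at least min_atbats."""
    # cum_ab[k] is the total at-bats in the first k rows (cum_ab[0] == 0).
    cum_ab = np.concatenate(([0], np.cumsum(df["atbats"].to_numpy())))
    cum_h = np.concatenate(([0], np.cumsum(df["hits"].to_numpy())))
    prior_ab = cum_ab[:-1]  # at-bats strictly before each row
    prior_h = cum_h[:-1]    # hits strictly before each row

    # Latest start j with prior_ab - cum_ab[j] >= min_atbats; -1 if there is none.
    start = np.searchsorted(cum_ab, prior_ab - min_atbats, side="right") - 1
    valid = start >= 0
    safe = np.where(valid, start, 0)  # placeholder index for rows without enough history

    out = df.copy()
    out["past_atbats"] = np.where(valid, prior_ab - cum_ab[safe], 0)
    out["past_hits"] = np.where(valid, prior_h - cum_h[safe], 0)
    # Rows without enough history have 0 / 0 here, which pandas reports as NaN.
    out["avg"] = out["past_hits"] / out["past_atbats"]
    return out


sample = pd.DataFrame(
    {"atbats": [5, 6, 1, 16, 1, 1, 14, 5], "hits": [2, 3, 1, 4, 0, 0, 2, 1]},
    index=[1, 2, 3, 4, 4, 5, 6, 7],
)
out = min_window_stats(sample)
print(out)
```

On the sample data this gives the same past_atbats and past_hits as the loop; the two would differ on inputs such as at-bats of 5, 6, 5, 6, where the minimal window stays at 11 while the loop's running total keeps climbing.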
