Python Pandas: a faster alternative for rolling calculations

Time: 2021-02-26 08:18:29

Tags: python pandas

Here is the original data:

          date name  score
0   2021-01-02    A    100
1   2021-01-03    A    120
2   2021-01-04    A    130
3   2021-01-05    A    115
4   2021-01-06    A    120
5   2021-01-07    A     70
6   2021-01-08    A     60
7   2021-01-09    A     30
8   2021-01-10    A     10
9   2021-01-11    A    100
10  2021-01-02    B     50
11  2021-01-03    B     40
12  2021-01-04    B     80
13  2021-01-05    B    115
14  2021-01-06    B    100
15  2021-01-07    B     50
16  2021-01-08    B     20
17  2021-01-09    B     40
18  2021-01-10    B    120
19  2021-01-11    B     20
20  2021-01-02    C     80
21  2021-01-03    C    100
22  2021-01-04    C    120
23  2021-01-05    C    115
24  2021-01-06    C     90
25  2021-01-07    C     80
26  2021-01-08    C    150
27  2021-01-09    C    200
28  2021-01-10    C     30
29  2021-01-11    C     40

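I want to get the following output, which adds a new column holding the trailing 3-day average for each name. I would also like to add some new columns for logical calculations, for example df.score.shift(1) <= 100 evaluated within each name.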

          date name  score  3_day_average previous_score<=100
0   2021-01-02    A    100            NaN               False
1   2021-01-03    A    120            NaN                True
2   2021-01-04    A    130     116.666667               False
3   2021-01-05    A    115     121.666667               False
4   2021-01-06    A    120     121.666667               False
5   2021-01-07    A     70     101.666667               False
6   2021-01-08    A     60      83.333333                True
7   2021-01-09    A     30      53.333333                True
8   2021-01-10    A     10      33.333333                True
9   2021-01-11    A    100      46.666667                True
10  2021-01-02    B     50            NaN               False
11  2021-01-03    B     40            NaN                True
12  2021-01-04    B     80      56.666667                True
13  2021-01-05    B    115      78.333333                True
14  2021-01-06    B    100      98.333333               False
15  2021-01-07    B     50      88.333333                True
16  2021-01-08    B     20      56.666667                True
17  2021-01-09    B     40      36.666667                True
18  2021-01-10    B    120      60.000000                True
19  2021-01-11    B     20      60.000000               False
20  2021-01-02    C     80            NaN               False
21  2021-01-03    C    100            NaN                True
22  2021-01-04    C    120     100.000000                True
23  2021-01-05    C    115     111.666667               False
24  2021-01-06    C     90     108.333333               False
25  2021-01-07    C     80      95.000000                True
26  2021-01-08    C    150     106.666667                True
27  2021-01-09    C    200     143.333333               False
28  2021-01-10    C     30     126.666667               False
29  2021-01-11    C     40      90.000000                True

I am currently using df.groupby('name') together with df.apply. Is there an alternative approach that would reduce the execution time? Thanks in advance!

2 answers:

Answer 0: (score: 0)

First use groupby followed by rolling, then DataFrameGroupBy.shift:

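# groupby + rolling returns a Series indexed by (name, original index);
# drop the name level so the result aligns with df on assignment.
df['3_day_average'] = (df.groupby('name')['score']
                         .rolling(3)
                         .mean()
                         .reset_index(level=0, drop=True))
# Shift within each name, so the first row of every group compares NaN <= 100 (False).
df['previous_score<=100'] = df.groupby('name')['score'].shift() <= 100
print(df.head(15))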
         date name  score  3_day_average  previous_score<=100
0  2021-01-02    A    100            NaN                False
1  2021-01-03    A    120            NaN                 True
2  2021-01-04    A    130     116.666667                False
3  2021-01-05    A    115     121.666667                False
4  2021-01-06    A    120     121.666667                False
5  2021-01-07    A     70     101.666667                False
6  2021-01-08    A     60      83.333333                 True
7  2021-01-09    A     30      53.333333                 True
8  2021-01-10    A     10      33.333333                 True
9  2021-01-11    A    100      46.666667                 True
10 2021-01-02    B     50            NaN                False
11 2021-01-03    B     40            NaN                 True
12 2021-01-04    B     80      56.666667                 True
13 2021-01-05    B    115      78.333333                 True
14 2021-01-06    B    100      98.333333                False
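
For reference, the snippet above can be run end to end as follows. This is a minimal sketch that rebuilds only the first five rows of groups A and B from the question; the variable name `rolled` is introduced here for illustration. It shows why the `name` level has to be dropped: the grouped rolling result carries a two-level index, and only the inner level matches the row labels of `df`.

```python
import pandas as pd

# First five rows of groups A and B from the question's sample data.
df = pd.DataFrame({
    "date": ["2021-01-02", "2021-01-03", "2021-01-04",
             "2021-01-05", "2021-01-06"] * 2,
    "name": ["A"] * 5 + ["B"] * 5,
    "score": [100, 120, 130, 115, 120, 50, 40, 80, 115, 100],
})

# The grouped rolling mean is indexed by (name, original row label).
rolled = df.groupby("name")["score"].rolling(3).mean()
print(rolled.index.nlevels)

# Dropping the name level leaves the original row labels,
# so the assignment lines up row by row.
df["3_day_average"] = rolled.reset_index(level=0, drop=True)

# Per-group shift: the first row of each name has no previous score (NaN),
# and NaN <= 100 evaluates to False.
df["previous_score<=100"] = df.groupby("name")["score"].shift() <= 100
print(df)
```

Because the window is counted in rows, it only equals three calendar days when each name has exactly one row per day, as in the sample data.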

Answer 1: (score: 0)

import pandas as pd

data=[(0   ,'2021-01-02','A',100),
(1   ,'2021-01-03','A',120),
(2   ,'2021-01-04','A',130),
(3   ,'2021-01-05','A',115),
(4   ,'2021-01-06','A',120),
(5   ,'2021-01-07','A', 70),
(6   ,'2021-01-08','A', 60),
(7   ,'2021-01-09','A', 30),
(8   ,'2021-01-10','A', 10),
(9   ,'2021-01-11','A',100),
(10  ,'2021-01-02','B', 50),
(11  ,'2021-01-03','B', 40),
(12  ,'2021-01-04','B', 80),
(13  ,'2021-01-05','B',115),
(14  ,'2021-01-06','B',100),
(15  ,'2021-01-07','B', 50),
(16  ,'2021-01-08','B', 20),
(17  ,'2021-01-09','B', 40),
(18  ,'2021-01-10','B',120),
(19  ,'2021-01-11','B', 20),
(20  ,'2021-01-02','C', 80),
(21  ,'2021-01-03','C',100),
(22  ,'2021-01-04','C',120),
(23  ,'2021-01-05','C',115),
(24  ,'2021-01-06','C', 90),
(25  ,'2021-01-07','C', 80),
(26  ,'2021-01-08','C',150),
(27  ,'2021-01-09','C',200),
(28  ,'2021-01-10','C', 30),
(29  ,'2021-01-11','C', 40)]
header=['id','date','name','score']
df=pd.DataFrame(data,columns=header)

  
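# Rolling mean over the score column of the whole frame (not grouped by name).
df['3d_rolling_avg'] = df['score'].rolling(
    window=3,
    center=False
).mean()

# Previous row's score; NaN for the very first row.
df['shift'] = df['score'].shift(1)
# NaN <= 100 evaluates to False, so no explicit null check is needed.
df['prev_score_lessthan_100'] = df['shift'] <= 100
print(df)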

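Output (computed without grouping by name, so the values in rows 10, 11, 20 and 21 differ from the requested output):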

     id        date name  score  3d_rolling_avg  shift  prev_score_lessthan_100
 0    0  2021-01-02    A    100             NaN    NaN                    False
 1    1  2021-01-03    A    120             NaN  100.0                     True
 2    2  2021-01-04    A    130      116.666667  120.0                    False
 3    3  2021-01-05    A    115      121.666667  130.0                    False
 4    4  2021-01-06    A    120      121.666667  115.0                    False
 5    5  2021-01-07    A     70      101.666667  120.0                    False
 6    6  2021-01-08    A     60       83.333333   70.0                     True
 7    7  2021-01-09    A     30       53.333333   60.0                     True
 8    8  2021-01-10    A     10       33.333333   30.0                     True
 9    9  2021-01-11    A    100       46.666667   10.0                     True
 10  10  2021-01-02    B     50       53.333333  100.0                     True
 11  11  2021-01-03    B     40       63.333333   50.0                     True
 12  12  2021-01-04    B     80       56.666667   40.0                     True
 13  13  2021-01-05    B    115       78.333333   80.0                     True
 14  14  2021-01-06    B    100       98.333333  115.0                    False
 15  15  2021-01-07    B     50       88.333333  100.0                     True
 16  16  2021-01-08    B     20       56.666667   50.0                     True
 17  17  2021-01-09    B     40       36.666667   20.0                     True
 18  18  2021-01-10    B    120       60.000000   40.0                     True
 19  19  2021-01-11    B     20       60.000000  120.0                    False
 20  20  2021-01-02    C     80       73.333333   20.0                     True
 21  21  2021-01-03    C    100       66.666667   80.0                     True
 22  22  2021-01-04    C    120      100.000000  100.0                     True
 23  23  2021-01-05    C    115      111.666667  120.0                    False
 24  24  2021-01-06    C     90      108.333333  115.0                    False
 25  25  2021-01-07    C     80       95.000000   90.0                     True
 26  26  2021-01-08    C    150      106.666667   80.0                     True
 27  27  2021-01-09    C    200      143.333333  150.0                    False
 28  28  2021-01-10    C     30      126.666667  200.0                    False
 29  29  2021-01-11    C     40       90.000000   30.0                     True
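
To match the requested per-name result, the same two columns can be computed on a groupby object instead of the whole frame. The sketch below keeps this answer's column names and runs on synthetic data (an assumption made only to have enough rows to time). The `per_group` function is a hypothetical stand-in for the groupby + apply code mentioned in the question, which is not shown there; it is used only to check that both approaches agree and to give a rough timing comparison.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (assumption for illustration): 200 names, 50 rows each.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "name": np.repeat(np.arange(200), 50),
    "score": rng.integers(0, 200, size=200 * 50),
})


def vectorized(frame):
    """Grouped rolling mean and grouped shift, no Python-level loop."""
    out = frame.copy()
    grouped = out.groupby("name")["score"]
    out["3d_rolling_avg"] = (grouped.rolling(3).mean()
                                    .reset_index(level=0, drop=True))
    out["prev_score_lessthan_100"] = grouped.shift(1) <= 100
    return out


def per_group(g):
    # Hypothetical baseline: called once per name by apply.
    g = g.copy()
    g["3d_rolling_avg"] = g["score"].rolling(3).mean()
    g["prev_score_lessthan_100"] = g["score"].shift(1) <= 100
    return g


def with_apply(frame):
    return frame.groupby("name", group_keys=False)[["score"]].apply(per_group)


fast = vectorized(df)
slow = with_apply(df)

same = (
    np.allclose(fast["3d_rolling_avg"], slow["3d_rolling_avg"], equal_nan=True)
    and fast["prev_score_lessthan_100"].equals(slow["prev_score_lessthan_100"])
)
print("same result:", same)

# Timings depend on the machine and the pandas version, so none are quoted here.
print("vectorized:", timeit.timeit(lambda: vectorized(df), number=3))
print("apply:     ", timeit.timeit(lambda: with_apply(df), number=3))
```

The first two rows of every name get NaN for the average and the first row gets False for the flag, which is the behavior shown in the question's expected output.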