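I would like to perform a local-range calculation for every row of a DataFrame while avoiding a slow for loop. For example, for each row in the data below, I want to find the maximum temperature over the next 3 days (including the current day) and the total rainfall over the next 3 days: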
Day Temperature Rain
0 30 4
1 31 14
2 31 0
3 30 0
4 33 5
5 34 0
6 32 0
7 33 2
8 31 5
9 29 9
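The ideal output would be the new columns in the table below. TempMax for Day 0 shows the highest temperature between Day 0 and Day 2, and RainTotal shows the sum of the rainfall between Day 0 and Day 2.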
Day Temperature Rain TempMax RainTotal
0 30 4 31 18
1 31 14 31 14
2 31 0 33 5
3 30 0 34 5
4 33 5 34 5
5 34 0 34 2
6 32 0 33 7
7 33 2 33 16
8 31 5 31 14
9 29 9 29 9
Currently I am using a for loop:
import numpy as np

# Make empty arrays to store each row's max & sum values
temp_max = np.zeros(len(df))
rain_total = np.zeros(len(df))
# Loop through the df and do operations in the local range [i:i+3]
# (a 3-day window including the current day, matching the expected output)
for i in range(len(df)):
    temp_max[i] = df['Temperature'].iloc[i:i+3].max()
    rain_total[i] = df['Rain'].iloc[i:i+3].sum()
# Insert the arrays to df
df['TempMax'] = temp_max
df['RainTotal'] = rain_total
The for loop gets the job done, but it takes 50 minutes on my DataFrame. Is it possible to vectorize this or otherwise make it faster?
Thank you!
Answer 0 (score: 3)
Use Series.rolling on each column in reversed order (reversed by indexing with iloc[::-1]), combined with max and sum:
df['TempMax'] = df['Temperature'].iloc[::-1].rolling(3, min_periods=1).max()
df['RainTotal'] = df['Rain'].iloc[::-1].rolling(3, min_periods=1).sum()
print (df)
Day Temperature Rain TempMax RainTotal
0 0 30 4 31.0 18.0
1 1 31 14 31.0 14.0
2 2 31 0 33.0 5.0
3 3 30 0 34.0 5.0
4 4 33 5 34.0 5.0
5 5 34 0 34.0 2.0
6 6 32 0 33.0 7.0
7 7 33 2 33.0 16.0
8 8 31 5 31.0 14.0
9 9 29 9 29.0 9.0
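The reversal works because pandas rolling windows look backward by default: on the reversed series, a trailing window of 3 covers the current day and the two days after it, and assigning the result back to df realigns it on the index. As an alternative sketch, assuming pandas 1.1 or later, a forward-looking window can be requested directly through FixedForwardWindowIndexer. The sample frame is rebuilt here from the question's data so the snippet runs on its own.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# Sample data copied from the question
df = pd.DataFrame({
    "Day": range(10),
    "Temperature": [30, 31, 31, 30, 33, 34, 32, 33, 31, 29],
    "Rain": [4, 14, 0, 0, 5, 0, 0, 2, 5, 9],
})

# Each window covers the current row plus the next two rows
indexer = FixedForwardWindowIndexer(window_size=3)

# min_periods=1 keeps the shorter windows at the end of the frame
df["TempMax"] = df["Temperature"].rolling(indexer, min_periods=1).max()
df["RainTotal"] = df["Rain"].rolling(indexer, min_periods=1).sum()
print(df)
```

This avoids the double reversal, at the cost of requiring a newer pandas version.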
Another, faster solution uses strides in NumPy to build a 2D array of windows, then applies numpy.nanmax and numpy.nansum:
n = 2
t = np.concatenate([df['Temperature'].values, [np.nan] * (n)])
r = np.concatenate([df['Rain'].values, [np.nan] * (n)])
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
df['TempMax'] = np.nanmax(rolling_window(t, n + 1), axis=1)
df['RainTotal'] = np.nansum(rolling_window(r, n + 1), axis=1)
print (df)
Day Temperature Rain TempMax RainTotal
0 0 30 4 31.0 18.0
1 1 31 14 31.0 14.0
2 2 31 0 33.0 5.0
3 3 30 0 34.0 5.0
4 4 33 5 34.0 5.0
5 5 34 0 34.0 2.0
6 6 32 0 33.0 7.0
7 7 33 2 33.0 16.0
8 8 31 5 31.0 14.0
9 9 29 9 29.0 9.0
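The NaN padding is what lets the last rows use shorter windows: nanmax and nansum ignore the padded values. As a hedged variation on the same idea, assuming NumPy 1.20 or later, sliding_window_view builds the same 2D window array without computing shape and strides by hand. The helper name forward_windows is made up for this sketch, and the arrays are copied from the question's data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

temperature = np.array([30, 31, 31, 30, 33, 34, 32, 33, 31, 29], dtype=float)
rain = np.array([4, 14, 0, 0, 5, 0, 0, 2, 5, 9], dtype=float)

n = 2  # look-ahead days; each window holds n + 1 values


def forward_windows(values, lookahead):
    """Return a 2D view whose row i is values[i:i + lookahead + 1], padded with NaN."""
    padded = np.concatenate([values, np.full(lookahead, np.nan)])
    return sliding_window_view(padded, lookahead + 1)


# NaN-aware reductions skip the padding in the last rows
temp_max = np.nanmax(forward_windows(temperature, n), axis=1)
rain_total = np.nansum(forward_windows(rain, n), axis=1)
print(temp_max)
print(rain_total)
```

The view is read-only and shares memory with the padded array, so no copy of the windows is made until the reduction runs.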
Performance:
#[100000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [23]: %%timeit
...: df['TempMax'] = np.nanmax(rolling_window(t, n + 1), axis=1)
...: df['RainTotal'] = np.nansum(rolling_window(r, n + 1), axis=1)
...:
8.36 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [24]: %%timeit
...: df['TempMax'] = df['Temperature'].iloc[::-1].rolling(3, min_periods=1).max()
...: df['RainTotal'] = df['Rain'].iloc[::-1].rolling(3, min_periods=1).sum()
...:
20.4 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Answer 1 (score: 1)
For the case where Day holds data for consecutive days, fast NumPy and SciPy tools come to the rescue:
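import numpy as np
from scipy.ndimage import maximum_filter1d  # the scipy.ndimage.filters namespace is deprecated
N = 2 # number of look-ahead days; the window length is N + 1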
temp = df['Temperature'].to_numpy()
rain = df['Rain'].to_numpy()
df['TempMax'] = maximum_filter1d(temp,N+1,origin=-1,mode='nearest')
df['RainTotal'] = np.convolve(rain,np.ones(N+1,dtype=int))[N:]
Sample output:
In [27]: df
Out[27]:
Day Temperature Rain TempMax RainTotal
0 0 30 4 31 18
1 1 31 14 31 14
2 2 31 0 33 5
3 3 30 0 34 5
4 4 33 5 34 5
5 5 34 0 34 2
6 6 32 0 33 7
7 7 33 2 33 16
8 8 31 5 31 14
9 9 29 9 29 9
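Two details make this work. A full convolution with a kernel of N + 1 ones yields trailing sums, so position k holds the sum of rain[k-N] through rain[k]; dropping the first N entries turns them into forward-looking sums, with the implicit zero padding handling the last rows. For the maximum, origin=-1 shifts the 3-wide filter so that it starts at the current element instead of being centered on it. A small sketch, assuming SciPy is installed, that compares both results against a brute-force loop on the question's data:

```python
import numpy as np
from scipy.ndimage import maximum_filter1d

temp = np.array([30, 31, 31, 30, 33, 34, 32, 33, 31, 29])
rain = np.array([4, 14, 0, 0, 5, 0, 0, 2, 5, 9])
N = 2  # look-ahead days; the window length is N + 1

# Vectorized versions, as in the answer
fast_max = maximum_filter1d(temp, N + 1, origin=-1, mode="nearest")
fast_sum = np.convolve(rain, np.ones(N + 1, dtype=int))[N:]

# Brute-force reference: slice each forward window explicitly
slow_max = np.array([temp[i:i + N + 1].max() for i in range(len(temp))])
slow_sum = np.array([rain[i:i + N + 1].sum() for i in range(len(rain))])

print(np.array_equal(fast_max, slow_max), np.array_equal(fast_sum, slow_sum))
```

Because both inputs are integer arrays, the results stay integers, which is why this answer's output shows 31 and 18 rather than 31.0 and 18.0.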