How do I count the values in the previous n rows that are greater than the current row's value?
Suppose we have a DataFrame like this:
col_a
0 8.4
1 11.3
2 7.2
3 6.5
4 4.5
5 8.9
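I am trying to get a table like the following, where n = 3. For each row, col_b is the number of values in the (up to) n preceding rows that are greater than that row's col_a.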
col_a col_b
0 8.4 0
1 11.3 0
2 7.2 2
3 6.5 3
4 4.5 3
5 8.9 0
Thanks.
Answer 0 (score: 3)
In pandas it is best to avoid explicit loops because they are slow; use rolling with a custom function instead:
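n = 3
# The window holds the current row plus the n rows before it.
# raw=True passes NumPy arrays to the lambda, so x[-1] is positional.
df['new'] = (df['col_a'].rolling(n + 1, min_periods=1)
                        .apply(lambda x: (x[-1] < x[:-1]).sum(), raw=True)
                        .astype(int))
print(df)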
col_a new
0 8.4 0
1 11.3 0
2 7.2 2
3 6.5 3
4 4.5 3
5 8.9 0
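The window is n + 1 wide because it contains the current row plus the n rows before it, and min_periods=1 lets the first rows use shorter windows. As a minimal plain-Python sketch of the same counting rule, handy as a reference when checking the vectorized versions (the function name count_greater_in_previous is made up for illustration and does not come from the answer):

```python
def count_greater_in_previous(values, n):
    """For each position i, count how many of the up-to-n previous values exceed values[i]."""
    counts = []
    for i, current in enumerate(values):
        previous = values[max(0, i - n):i]  # at most n earlier values
        counts.append(sum(1 for v in previous if v > current))
    return counts


col_a = [8.4, 11.3, 7.2, 6.5, 4.5, 8.9]
print(count_greater_in_previous(col_a, 3))
```

On the sample data this should reproduce the col_b column from the question.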
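If performance matters, use strides:
import numpy as np

n = 3
# Pad the front with n NaNs so every row has a full window; NaN > x is False.
x = np.concatenate([[np.nan] * n, df['col_a'].values])

def rolling_window(a, window):
    # Build a strided view of overlapping windows without copying the data.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

arr = rolling_window(x, n + 1)
# Compare the n previous values against the current (last) column.
df['new'] = (arr[:, :-1] > arr[:, [-1]]).sum(axis=1)
print(df)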
col_a new
0 8.4 0
1 11.3 0
2 7.2 2
3 6.5 3
4 4.5 3
5 8.9 0
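The strided approach can also be sketched with numpy.lib.stride_tricks.sliding_window_view, which computes the shape and strides itself and returns a read-only view. This is a variant rather than the answer's own code, and it assumes NumPy 1.20 or newer:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

n = 3
df = pd.DataFrame({'col_a': [8.4, 11.3, 7.2, 6.5, 4.5, 8.9]})

# Pad the front with n NaNs so every row has a full window of n + 1 values.
padded = np.concatenate([np.full(n, np.nan), df['col_a'].to_numpy()])

# One row per original value; each row is [n previous values..., current value].
windows = sliding_window_view(padded, n + 1)

# Comparisons against NaN are False, so the padding never adds to the count.
df['new'] = (windows[:, :-1] > windows[:, [-1]]).sum(axis=1)
print(df)
```

The comparison step is unchanged from the as_strided version; only the window construction differs.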
Performance: measured with perfplot for a small window, n = 3:
import numpy as np
import pandas as pd
import perfplot

np.random.seed(1256)
n = 3

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def roll(df):
    df['new'] = (df['col_a'].rolling(n + 1, min_periods=1)
                            .apply(lambda x: (x[-1] < x[:-1]).sum(), raw=True)
                            .astype(int))
    return df

def list_comp(df):
    df['count'] = [(j < df['col_a'].iloc[max(0, i - n):i]).sum()
                   for i, j in df['col_a'].items()]
    return df

def strides(df):
    x = np.concatenate([[np.nan] * n, df['col_a'].values])
    arr = rolling_window(x, n + 1)
    df['new1'] = (arr[:, :-1] > arr[:, [-1]]).sum(axis=1)
    return df

def make_df(size):
    # Parameter named size so it does not shadow the window size n.
    df = pd.DataFrame(np.random.randint(20, size=size), columns=['col_a'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[list_comp, roll, strides],
    n_range=[2**k for k in range(2, 15)],
    logx=True,
    logy=True,
    xlabel='len(df)')
I was also curious about the performance with a large window, n = 100:
Answer 1 (score: 1)
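n = 3
# Assumes the default integer index (0, 1, 2, ...), since row.name is used as a position.
# Strict < counts only previous values greater than the current one (ties are not counted).
df['col_b'] = df.apply(lambda row: sum(row.col_a < df.col_a.loc[row.name - n: row.name - 1]), axis=1)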
Out[]:
col_a col_b
0 8.4 0
1 11.3 0
2 7.2 2
3 6.5 3
4 4.5 3
5 8.9 0
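The row-wise apply above calls a Python function once per row. A minimal sketch of an alternative that loops only over the n offsets, using Series.shift (this is a swapped-in technique, not part of the original answer):

```python
import pandas as pd

n = 3
df = pd.DataFrame({'col_a': [8.4, 11.3, 7.2, 6.5, 4.5, 8.9]})

# shift(k) aligns each row with the value k rows earlier. Rows without
# such a predecessor get NaN, and NaN > x evaluates to False.
df['col_b'] = sum((df['col_a'].shift(k) > df['col_a']).astype(int)
                  for k in range(1, n + 1))
print(df)
```

This does n vectorized comparisons over the whole column, so it tends to suit small windows; for large n the strided approach is likely the better fit.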
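Answer 2 (score: 1)
Using a list comprehension with pd.Series.items:
n = 3
# i is the index label, used here as a position, so this assumes the default integer index.
df['count'] = [(j < df['col_a'].iloc[max(0, i - n):i]).sum()
               for i, j in df['col_a'].items()]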
Equivalently, using enumerate:
n = 3
df['count'] = [(j < df['col_a'].iloc[max(0, i - n):i]).sum()
               for i, j in enumerate(df['col_a'])]
Result:
print(df)
col_a count
0 8.4 0
1 11.3 0
2 7.2 2
3 6.5 3
4 4.5 3
5 8.9 0