How can I efficiently get the indices of DataFrame rows that satisfy a cumulative condition?

Time: 2016-11-23 18:03:25

Tags: python performance pandas vectorization

For example, I want the letter from column b for each row where a run of at least two consecutive decreases in column a begins.

Example data:

   a  b
0  3  a
1  2  b
2  3  c
3  2  d
4  1  e
5  0  f
6 -1  g
7  3  h
8  1  i
9  0  j

Example solution with a simple loop:

import pandas as pd

df = pd.DataFrame({'a': [3, 2, 3, 2, 1, 0, -1, 3, 1, 0], 'b': list('abcdefghij')})

less = 0  # length of the current run of consecutive decreases
l = []    # letters of the rows where a qualifying run starts
prev_prev_row = df.iloc[0]
prev_row = df.iloc[1]
if prev_row['a'] < prev_prev_row['a']:
    less = 1
for i, row in df.iloc[2:].iterrows():
    if row['a'] < prev_row['a']:
        less = less + 1
    else:
        less = 0
    # exactly two decreases so far: the run started two rows back
    if less == 2:
        l.append(prev_prev_row['b'])
    prev_prev_row = prev_row
    prev_row = row

This gives the list l:

['c', 'h']
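To make the condition itself explicit, here is a minimal plain-Python sketch of the same rule. The function name `run_starts` and the `min_drops` parameter are hypothetical, introduced only for illustration; the loop above hard-codes the value 2.

```python
def run_starts(values, labels, min_drops=2):
    """Return the labels of rows where a run of at least
    `min_drops` consecutive decreases in `values` begins."""
    starts = []
    drops = 0  # length of the current run of decreases
    for i in range(1, len(values)):
        if values[i] < values[i - 1]:
            drops += 1
            if drops == min_drops:
                # the run began min_drops rows before row i
                starts.append(labels[i - min_drops])
        else:
            drops = 0
    return starts


a = [3, 2, 3, 2, 1, 0, -1, 3, 1, 0]
b = list('abcdefghij')
result = run_starts(a, b)
```

Each run is reported once, at the moment it reaches the required length, so longer runs do not produce duplicate entries.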

2 answers:

Answer 0: (score: 3)

Here is one approach, using NumPy and SciPy:

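import numpy as np
from scipy.ndimage import binary_closing

arr = df.a.values
# mask1[k] is True when arr[k] < arr[k-1]; padded with False at both ends
mask1 = np.hstack((False, arr[1:] < arr[:-1], False))
# remove isolated single decreases, keeping only runs of two or more
mask2 = mask1 & (~binary_closing(~mask1, [1, 1]))
# a rising edge in mask2 marks the row where a run starts
final_mask = mask2[1:] > mask2[:-1]
out = list(df.b[final_mask])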
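If SciPy is not available, the same result can be sketched with NumPy alone. This is a different technique from the binary closing used in this answer: it compares each "drop" flag with its two neighbours. The names `dec`, `nxt`, `prv` and `starts` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 3, 2, 1, 0, -1, 3, 1, 0], 'b': list('abcdefghij')})

arr = df['a'].to_numpy()
# dec[i] is True when the value drops from row i to row i + 1;
# the last row cannot start a drop, so it is padded with False
dec = np.append(arr[1:] < arr[:-1], False)
# nxt[i] is True when a drop also starts at row i + 1
nxt = np.append(dec[1:], False)
# prv[i] is True when a drop started at row i - 1
prv = np.insert(dec[:-1], 0, False)
# a run starts where two drops follow each other and none precedes them
starts = dec & nxt & ~prv
out = df.loc[starts, 'b'].tolist()
```

For the example data this selects rows 2 and 7, matching the loop in the question.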

Answer 1: (score: 2)

Use rolling(2) on the reversed series:

s = df.a[::-1].diff().gt(0).rolling(2).sum().eq(2)
df.b.loc[s & (s != s.shift(-1))]

2    c
7    h
Name: b, dtype: object

If you really want a list:

df.b.loc[s & (s != s.shift(-1))].tolist()

['c', 'h']
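The one-liner in this answer can be unpacked into named steps, which may make the reversal easier to follow. This is a sketch under two assumptions that differ slightly from the original chain: the boolean flags are cast to int before rolling, and `shift` is given `fill_value=False` so the shifted series stays boolean. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 2, 3, 2, 1, 0, -1, 3, 1, 0], 'b': list('abcdefghij')})

# Reverse the column so a backward-looking rolling window
# effectively looks forward from each row in the original order.
rev = df['a'][::-1]
# A rise in reversed order is a drop in the original order.
drops = rev.diff().gt(0)
# True at label i when rows i+1 and i+2 are both drops, i.e. a[i] > a[i+1] > a[i+2].
two_in_a_row = drops.astype(int).rolling(2).sum().eq(2)
# In reversed order, shift(-1) brings in the flag of label i-1,
# so this keeps only the first row of each run.
run_start = two_in_a_row & (two_in_a_row != two_in_a_row.shift(-1, fill_value=False))
# Boolean indexing with .loc aligns on the index labels.
result = df['b'].loc[run_start]
```

Calling `result.tolist()` then gives the plain list, as in the answer.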