Setting the current row in a DataFrame based on future values

Time: 2016-09-15 15:35:23

Tags: python pandas

Suppose I have:

    import pandas

    d = {
        'datetime': ['2010-01-08 09:45:00', '2010-01-08 10:00:00',
                     '2010-01-08 10:15:00', '2010-01-08 10:30:00',
                     '2010-01-08 10:45:00', '2010-01-08 11:00:00',
                     '2010-01-08 11:15:00', '2010-01-08 11:30:00',
                     '2010-01-08 11:45:00', '2010-01-08 12:00:00',
                     '2010-01-08 12:15:00', '2010-01-08 12:30:00',
                     '2010-01-08 12:45:00', '2010-01-08 13:00:00',
                     '2010-01-08 13:15:00', '2010-01-08 13:30:00',
                     '2010-01-08 13:45:00', '2010-01-08 14:00:00',
                     '2010-01-08 14:15:00', '2010-01-08 14:30:00',
                     '2010-01-08 14:45:00', '2010-01-08 15:00:00',
                     '2010-01-08 15:15:00', '2010-01-08 15:30:00',
                     '2010-01-08 15:45:00', '2010-01-08 16:00:00',
                     '2010-01-08 16:15:00'],
        'Total-tops': [0,-1,-1,2,3,0,0,4,0,0,0,0,5,6,7,8,-1,0,0,0,0,0,0,0,-1,-1,2]
    }

    df = pandas.DataFrame(d)
    # Parse the strings so the index holds real timestamps
    # (the function further down calls index.time()).
    df['datetime'] = pandas.to_datetime(df['datetime'])
    df = df.set_index('datetime')

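I want to add another column, a boolean, that says whether the row will break. A break means that the number of tops is greater than 1 and a -1 then appears somewhere in the future. For example, the first 2 will break at the next -1 it encounters. This is the desired DataFrame: (image: desired_dataframe)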

This is the function I am currently using, but it runs very slowly because I iterate over all the rows.

def does_break(data):
    cur_breaks = []

    for index, row in data.iterrows():
        if row['Total-tops'] > 1:
            # Get all rows later in the day where Total-tops is -1 (the breaks)
            breaks = data[(data['Total-tops'] == -1) & (data.index.time > index.time())]
            if len(breaks) > 0:
                cur_breaks.append(True)
            else:
                cur_breaks.append(False)
        else:
            cur_breaks.append(False)
    return cur_breaks

2 Answers:

Answer 0 (score: 1)

You could use this unwieldy expression:

In [56]: import numpy as np

In [57]: ((np.cumsum((df['Total-tops'] == -1)[:: -1])[:: -1] > 0) & (df['Total-tops'] > 0)).astype(int)
Out[57]: 
datetime
2010-01-08 09:45:00    0
2010-01-08 10:00:00    0
2010-01-08 10:15:00    0
2010-01-08 10:30:00    1
2010-01-08 10:45:00    1
2010-01-08 11:00:00    0
2010-01-08 11:15:00    0
2010-01-08 11:30:00    1
2010-01-08 11:45:00    0
2010-01-08 12:00:00    0
2010-01-08 12:15:00    0
2010-01-08 12:30:00    0
2010-01-08 12:45:00    1
2010-01-08 13:00:00    1
2010-01-08 13:15:00    1
2010-01-08 13:30:00    1
2010-01-08 13:45:00    0
2010-01-08 14:00:00    0
2010-01-08 14:15:00    0
2010-01-08 14:30:00    0
2010-01-08 14:45:00    0
2010-01-08 15:00:00    0
2010-01-08 15:15:00    0
2010-01-08 15:30:00    0
2010-01-08 15:45:00    0
2010-01-08 16:00:00    0
2010-01-08 16:15:00    0
Name: Total-tops, dtype: int64

(Of course, to store this as a new column you can use df['breaks'] = ... .)

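It works as follows:

  1. We find where the value is -1, then reverse. Now anything we compute over the past (specifically cumsum) is really computed over the future.
  2. We take the cumulative sum, then reverse again. At this point each entry says how many times we will see -1 from that row onward. The count includes the current row, which is harmless because step 4 requires the current entry to be positive.
  3. We find where the result is greater than 0, because we don't care how many times we will see -1, only whether we will see it.
  4. Finally, we also require that the current entry is positive. This is just the definition from your question.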
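The four steps above can be traced one at a time. The following is a minimal sketch on a short made-up series (the values and the intermediate variable names are assumptions for illustration, not taken from the question); it casts the boolean mask to int before the cumulative sum so that the count is an ordinary integer.

```python
import pandas as pd

# Small made-up series (not the question's data), just to trace the steps.
tops = pd.Series([0, 2, -1, 3, 0, -1, 4], name="Total-tops")

# Step 1: mark the -1 entries, then reverse so that "past" means "future".
is_break_reversed = (tops == -1).iloc[::-1]

# Step 2: running count of -1 entries, then reverse back to original order.
breaks_from_here_on = is_break_reversed.astype(int).cumsum().iloc[::-1]

# Step 3: only whether a -1 lies ahead matters, not how many.
break_ahead = breaks_from_here_on > 0

# Step 4: the current entry must itself be positive.
will_break = break_ahead & (tops > 0)

print(breaks_from_here_on.tolist())  # [2, 2, 2, 1, 1, 1, 0]
print(will_break.tolist())           # [False, True, False, True, False, False, False]
```

The last entry (4) is positive but has no -1 after it, so it is not flagged. Note that the question asks for values greater than 1 while the expression in this answer tests for values greater than 0; on the question's data the two agree, since no entry equals 1.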

Answer 1 (score: 1)

How about this:

latest_break = df.index[(df['Total-tops'] == -1)].max()
df['break'] = 1
df['break'] = df['break'].where((df['Total-tops'] > 0) & (df.index < latest_break), 0)

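This sets break to 1 for every positive value that occurs before the latest break (the last -1 in the frame), and to 0 everywhere else.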
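To see the idea on a small case, here is a sketch on a made-up five-row frame (the data is hypothetical). It is a compact variation rather than the answer's exact code: it writes the condition directly and casts it to int instead of going through where, and it assumes the index is sorted in time order.

```python
import pandas as pd

# Hypothetical five-row frame (not the question's data).
df = pd.DataFrame(
    {"Total-tops": [2, -1, 3, -1, 4]},
    index=pd.to_datetime([
        "2010-01-08 09:45", "2010-01-08 10:00", "2010-01-08 10:15",
        "2010-01-08 10:30", "2010-01-08 10:45",
    ]),
)

# Timestamp of the last -1 in the frame.
latest_break = df.index[(df["Total-tops"] == -1).values].max()

# Flag positive rows that come strictly before that timestamp.
df["break"] = ((df["Total-tops"] > 0) & (df.index < latest_break)).astype(int)

print(df["break"].tolist())  # [1, 0, 1, 0, 0]
```

The final 4 comes after the last -1, so it stays 0, which matches what the reversed cumulative sum approach gives for a trailing positive value.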