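Conditionally create a DataFrame column based on values

Time: 2019-01-12 13:15:43

Tags: python pandas conditional

I have a pandas DataFrame time series like the one below (about 1,000 rows and the four columns shown):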

Date          Values  Avg    +1 Stdev
01/01/2010    1.01    1.00   1.05
02/01/2010    1.02    1.00   1.05
03/01/2010    1.04    1.00   1.05
04/01/2010    -0.97   1.00   1.05
05/01/2010    1.12    1.00   1.05
06/01/2010    1.08    1.00   1.05
....

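What I want to do is create a fifth column (called "Trigger Date") that returns the date (from the index column) when the value in column 2 exceeds the threshold set in column 4, and returns nothing otherwise. An additional constraint is that the fifth column should also not return a date if the previous value had already exceeded the threshold in column 4.

In other words, pseudocode for the problem would be: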

If df['Values'] > df['+1 Stdev']
AND
If df['Values'] (for the row above) < df['+1 Stdev']
THEN
Return df['Date'] in new column df['Trigger Date']
ELSE
Leave row in df['Trigger Date'] blank

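Any help with solving this would be greatly appreciated.

Edit: a follow-up question. Is there any way to add a third constraint, so that no trigger date is returned if a trigger date has already occurred within the past XX days (for example, the past 30 days)? The expected output would then look like this: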

         Date  Values  Avg  +1 Stdev Trigger Date
0  01/01/2010    1.01  1.0      1.05          NaN
1  02/01/2010    1.02  1.0      1.05          NaN
2  03/01/2010    1.04  1.0      1.05          NaN
3  04/01/2010   -0.97  1.0      1.05          NaN
4  05/01/2010    1.12  1.0      1.05   05/01/2010
5  06/01/2010    1.08  1.0      1.05          NaN
6  07/01/2010    1.03  1.0      1.05          NaN
7  08/01/2010    1.07  1.0      1.05          NaN <- above threshold, but trigger occurred within last 30 days so don't return date
...
50 20/02/2010    1.12  1.0      1.05          20/02/2010 <- more than 30 days later, no trigger dates in between, so return date

1 answer:

Answer 0 (score: 0)

Use numpy.where, with shift to get the value from the row above:

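import numpy as np

# m1: the current value is above the threshold
m1 = df['Values'] > df['+1 Stdev']
# m2: the previous row's value is below the current row's threshold
m2 = df['Values'].shift() < df['+1 Stdev']

# Keep the date where both conditions hold, NaN elsewhere
df['Trigger Date'] = np.where(m1 & m2, df['Date'], np.nan)
print(df)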
         Date  Values  Avg  +1 Stdev Trigger Date
0  01/01/2010    1.01  1.0      1.05          NaN
1  02/01/2010    1.02  1.0      1.05          NaN
2  03/01/2010    1.04  1.0      1.05          NaN
3  04/01/2010   -0.97  1.0      1.05          NaN
4  05/01/2010    1.12  1.0      1.05   05/01/2010
5  06/01/2010    1.08  1.0      1.05          NaN
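
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same crossing idea. It rebuilds the six sample rows from the question and uses Series.where instead of numpy.where. Two things differ from the snippet above and are my assumptions, not part of the original answer: the previous row is compared against its own threshold (not the current row's), and a first row that is already above the threshold counts as a crossing because it has no predecessor.

```python
import pandas as pd

# Sample data copied from the question (first six rows only).
df = pd.DataFrame({
    "Date": ["01/01/2010", "02/01/2010", "03/01/2010",
             "04/01/2010", "05/01/2010", "06/01/2010"],
    "Values": [1.01, 1.02, 1.04, -0.97, 1.12, 1.08],
    "Avg": [1.00] * 6,
    "+1 Stdev": [1.05] * 6,
})

# True where the current value is above the threshold.
above = df["Values"] > df["+1 Stdev"]

# True where the previous row was above its threshold. The first row has
# no predecessor, so fill_value=False treats it as "not above" (assumption).
prev_above = above.shift(fill_value=False)

# A crossing is a row that is above while the previous row was not.
crossing = above & ~prev_above

# Series.where keeps the date where the mask is True and NaN elsewhere.
df["Trigger Date"] = df["Date"].where(crossing)
print(df)
```

On the sample rows only 05/01/2010 qualifies, since 06/01/2010 is above the threshold but so was the row before it.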

Edit:

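import numpy as np
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

m1 = df['Values'] > df['+1 Stdev']
m2 = df['Values'].shift() < df['+1 Stdev']
# Start of the 30-day look-back window for each row
a = df['Date'] - pd.Timedelta(days=30)
# One boolean Series per window: is the next row's date inside that window?
L = [df['Date'].shift(-1).isin(pd.date_range(x, y, freq='D')) for x, y in zip(a, df['Date'])]
# True where the next row's date falls inside at least one window
m3 = np.logical_or.reduce(L)

# ~m3 is True only for the last row, whose shifted date is NaT
mask = (m1 & m2) | ~m3

df.loc[mask, 'Trigger Date'] = df['Date']
print(df)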
        Date  Values  Avg  +1 Stdev Trigger Date
0 2010-01-01    1.01  1.0      1.05          NaT
1 2010-01-02    1.02  1.0      1.05          NaT
2 2010-01-03    1.04  1.0      1.05          NaT
3 2010-01-04   -0.97  1.0      1.05          NaT
4 2010-01-05    1.12  1.0      1.05   2010-01-05
5 2010-01-06    1.08  1.0      1.05          NaT
6 2010-02-20    1.12  1.0      1.05   2010-02-20
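
The 30-day rule depends on which earlier triggers were actually kept, so it is sequential by nature. Below is a minimal sketch that makes this explicit with a plain loop over the candidate crossings. The helper name apply_cooldown is hypothetical, the 19/02/2010 row is invented for illustration (the question elides the rows before 20/02/2010), and I am assuming that a suppressed candidate does not restart the 30-day window and that a gap of exactly 30 days still counts as "within the last 30 days".

```python
import pandas as pd


def apply_cooldown(dates, candidates, days=30):
    """Keep a candidate only if no kept trigger lies within `days` before it.

    Hypothetical helper: `dates` is a datetime Series, `candidates` a
    boolean Series aligned with it. Returns a boolean Series.
    """
    window = pd.Timedelta(days=days)
    last_trigger = None  # date of the most recently kept trigger
    keep = []
    for date, is_candidate in zip(dates, candidates):
        ok = bool(is_candidate) and (
            last_trigger is None or date - last_trigger > window
        )
        if ok:
            last_trigger = date  # only kept triggers restart the window
        keep.append(ok)
    return pd.Series(keep, index=dates.index)


# Rows taken from the question, plus one assumed row (19/02/2010) so that
# the 20/02/2010 value is a fresh crossing.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["04/01/2010", "05/01/2010", "06/01/2010", "07/01/2010",
         "08/01/2010", "19/02/2010", "20/02/2010"],
        format="%d/%m/%Y",
    ),
    "Values": [-0.97, 1.12, 1.08, 1.03, 1.07, 1.00, 1.12],
    "+1 Stdev": [1.05] * 7,
})

# Candidate crossings: above the threshold now, not above in the previous row.
above = df["Values"] > df["+1 Stdev"]
candidates = above & ~above.shift(fill_value=False)

keep = apply_cooldown(df["Date"], candidates, days=30)
df["Trigger Date"] = df["Date"].where(keep)  # NaT where no trigger is kept
print(df)
```

Here 08/01/2010 is a crossing but falls three days after the 05/01/2010 trigger, so it is dropped, while 20/02/2010 is 46 days after the last kept trigger and is returned. The loop is linear in the number of rows, which is fine for roughly 1,000 rows.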