Pandas: check the values in the last N rows and create a new column based on the result

Date: 2017-11-04 17:16:11

Tags: python pandas loops dataframe shift

I have a DataFrame, Df2. I am trying to check each of the last 10 rows of the column Lead_Lag below. If any of those rows holds a value other than null, I want the new column Position to equal 'Y':

def run_HG_AUDUSD_15M_Aggregate():
    Df1 = pd.read_csv(max(glob.iglob(r"C:\Users\cost9\OneDrive\Documents\PYTHON\Daily Tasks\Pairs Trading\HG_AUDUSD\CSV\15M\Lead_Lag\*.csv"), key=os.path.getctime))
    Df2 = Df1[['Date', 'Close_HG', 'Close_AUDUSD', 'Lead_Lag']]

    Df2['Position'] = ''

    for index,row in Df2.iterrows():
        if Df2.loc[Df2.index.shift(-10):index,"Lead_Lag"].isnull():
            continue
        else:
            Df2.loc[index, 'Position'] = "Y"

A sample of the data is below:

Date             Close_HG  Close_AUDUSD  Lead_Lag
7/19/2017 12:59  2.7       0.7956
7/19/2017 13:59  2.7       0.7955
7/19/2017 14:14  2.7       0.7954
7/20/2017 3:14   2.7       0.791
7/20/2017 5:44   2.7       0.791
7/20/2017 7:44   2.71      0.7925
7/20/2017 7:59   2.7       0.7924
7/20/2017 8:44   2.7       0.7953        Short_Both
7/20/2017 10:44  2.71      0.7964        Short_Both
7/20/2017 11:14  2.71      0.7963        Short_Both
7/20/2017 11:29  2.71      0.7967        Short_Both
7/20/2017 13:14  2.71      0.796         Short_Both
7/20/2017 13:29  2.71      0.7956        Short_Both
7/20/2017 14:29  2.71      0.7957        Short_Both

So in this case I would want the last two values of the new column Position to be 'Y', because at least one of the last 10 rows has a value in the Lead_Lag column. I want to apply this on a rolling basis: for example, the 'Position' value for row 13 would look at rows 12-3, the 'Position' value for row 12 would look at rows 11-2, and so on.

Instead, I get the error:

NotImplementedError: Not supported for type RangeIndex

I have tried several variations of the shift method (defining it before the loop, etc.) and cannot get it to work.

Edit: the solution I settled on is the loop posted in Answer 1 below.

3 Answers:

Answer 0 (score: 2)

Use numpy.where with a boolean mask built by chaining two conditions:

m = df["Lead_Lag"].notnull() & df.index.isin(df.index[-10:])

Or select the last rows by position with iloc, then use reindex to fill in False for all the other rows:

m = df["Lead_Lag"].iloc[-10:].notnull().reindex(df.index, fill_value=False)
df['new'] = np.where(m, 'Y', '')

print (df)
               Date  Close_HG  Close_AUDUSD    Lead_Lag new
0   7/19/2017 12:59      2.70        0.7956         NaN    
1   7/19/2017 13:59      2.70        0.7955         NaN    
2   7/19/2017 14:14      2.70        0.7954         NaN    
3    7/20/2017 3:14      2.70        0.7910         NaN    
4    7/20/2017 5:44      2.70        0.7910         NaN    
5    7/20/2017 7:44      2.71        0.7925         NaN    
6    7/20/2017 7:59      2.70        0.7924         NaN    
7    7/20/2017 8:44      2.70        0.7953  Short_Both   Y
8   7/20/2017 10:44      2.71        0.7964  Short_Both   Y
9   7/20/2017 11:14      2.71        0.7963  Short_Both   Y
10  7/20/2017 11:29      2.71        0.7967  Short_Both   Y
11  7/20/2017 13:14      2.71        0.7960  Short_Both   Y
12  7/20/2017 13:29      2.71        0.7956  Short_Both   Y
13  7/20/2017 14:29      2.71        0.7957  Short_Both   Y
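
Note that the masks above only flag rows that sit within the final 10 rows of the whole frame. For the rolling behaviour the question describes, where each row looks back at the 10 rows before it, one possible sketch (not part of the original answer) is to shift the not-null flags down by one row and take a rolling maximum. The toy frame, the names `has_value` and `prev_any`, and the choice to exclude the current row are assumptions based on the question's "row 13 looks at rows 12-3" description.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the sample: 7 empty rows, then 7 rows with a signal.
df = pd.DataFrame({"Lead_Lag": [np.nan] * 7 + ["Short_Both"] * 7})

N = 10

# 1 where Lead_Lag holds a value, 0 where it is null.
has_value = df["Lead_Lag"].notna().astype(int)

# shift(1) drops the current row from its own window, so row i
# ends up looking at rows i-N .. i-1 (fewer at the top of the frame).
prev_any = (
    has_value.shift(1, fill_value=0)
    .rolling(N, min_periods=1)
    .max()
    .astype(bool)
)

df["Position"] = np.where(prev_any, "Y", "")
print(df)
```

If the current row should count as part of its own window, dropping the `shift(1, fill_value=0)` step should be enough.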

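Answer 1 (score: 0)

This is what I ended up doing:

# Inside run_HG_AUDUSD_15M_Aggregate(), once Df2 has been built.
N = 10
Df2['Position'] = ''

for index,row in Df2.iterrows():
    # Any value other than "N" in rows index-N .. index counts as a hit.
    if (Df2.loc[index-N:index,"Lead_Lag"] != "N").any():
        Df2.loc[index, 'Position'] = "Y"
    else:
        Df2.loc[index, 'Position'] = "N"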
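
One detail of this loop is worth checking: `.loc` slices by label and includes both endpoints, so `Df2.loc[index-N:index]` spans N + 1 rows once enough history exists, and it includes the current row. A minimal sketch on a hypothetical Series `s` (made up for illustration, with a default integer index like the one Df2 has):

```python
import pandas as pd

s = pd.Series(list("abcdefgh"))  # default RangeIndex 0..7
N = 3

label_window = s.loc[5 - N:5]      # label slice: rows 2, 3, 4, 5
position_window = s.iloc[5 - N:5]  # positional slice: rows 2, 3, 4
start_window = s.loc[1 - N:1]      # start label -2 does not exist: rows 0, 1

print(len(label_window), len(position_window), len(start_window))
```

So with N = 10 the loop inspects up to 11 rows per step; using N - 1 in the slice would be one way to get a window of exactly N rows.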

Answer 2 (score: 0)

Sample:

np.random.seed(123)
M = 20
Df2 = pd.DataFrame({'Lead_Lag':np.random.choice([np.nan, 'N'], p=[.3,.7], size=M)})

Solution 1 - pandas:

Explanation: first compare the column with Series.ne ("not equal") to get a boolean Series, then use Series.rolling with Series.any to test the values in each window, and finally set N and Y with numpy.where:

N = 3
a = (Df2['Lead_Lag'].ne('N')
                    .rolling(N, min_periods=1)
                    .apply(lambda x: x.any(), raw=False))
Df2['Pos1'] = np.where(a, 'Y','N')

Another solution uses numpy strides, with False values prepended so that the first N - 1 rows are handled correctly:

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

x = np.concatenate([[False] * (N - 1), Df2['Lead_Lag'].ne('N').values])
arr = np.any(rolling_window(x, N), axis=1)

Df2['Pos2'] = np.where(arr, 'Y','N')

Compare the output:

print (Df2)
   Lead_Lag Pos1 Pos2
0         N    N    N
1       nan    Y    Y
2       nan    Y    Y
3         N    Y    Y
4         N    Y    Y
5         N    N    N
6         N    N    N
7         N    N    N
8         N    N    N
9         N    N    N
10        N    N    N
11        N    N    N
12        N    N    N
13      nan    Y    Y
14        N    Y    Y
15        N    Y    Y
16      nan    Y    Y
17      nan    Y    Y
18        N    Y    Y
19        N    Y    Y

Details of the numpy solution:

Prepend N - 1 False values so that the first rows can be tested:

print (np.concatenate([[False] * (N - 1), Df2['Lead_Lag'].ne('N').values]))
[False False False  True  True False False False False False False False
 False False False  True False False  True  True False False]

The strides return a 2D boolean array:

print (rolling_window(x, N))
[[False False False]
 [False False  True]
 [False  True  True]
 [ True  True False]
 [ True False False]
 [False False False]
 [False False False]
 [False False False]
 [False False False]
 [False False False]
 [False False False]
 [False False False]
 [False False False]
 [False False  True]
 [False  True False]
 [ True False False]
 [False False  True]
 [False  True  True]
 [ True  True False]
 [ True False False]]

numpy.any then tests each row for at least one True:

print (np.any(rolling_window(x, N), axis=1))
[False  True  True  True  True False False False False False False False
 False  True  True  True  True  True  True  True]

Edit:

If you test against the iterrows solution, the output will differ. The reason is that the iterrows solution tests a window of N + 1 rows, so to get the same output you must add 1 to N:

N = 3
Df2['Position'] = ''

for index,row in Df2.iterrows():
    #for check windows
    #print (Df2.loc[index-N:index,"Lead_Lag"])
    if (Df2.loc[index-N:index,"Lead_Lag"] != "N").any():
        Df2.loc[index, 'Position'] = "Y"
    else:
        Df2.loc[index, 'Position'] = "N"

a = (Df2['Lead_Lag'].ne('N')
                    .rolling(N + 1, min_periods=1)
                    .apply(lambda x: x.any(), raw=False)  )      
Df2['Pos1'] = np.where(a, 'Y','N')

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

x = np.concatenate([[False] * (N), Df2['Lead_Lag'].ne('N').values])
arr = np.any(rolling_window(x, N + 1), axis=1)

Df2['Pos2'] = np.where(arr, 'Y','N')
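
For comparison, the same windowed test can be written a little more compactly. The sketch below is an alternative, not the answer's code: it swaps `apply(lambda ...)` for a rolling `max` over the flags cast to integers, and the hand-written stride helper for `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later). The names `W`, `flags`, `pos_roll`, `pos_view` and `pos_loop` are made up for this example, and a window of N + 1 is assumed so that the results line up with the iterrows loop.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

np.random.seed(123)
M = 20
Df2 = pd.DataFrame({"Lead_Lag": np.random.choice([np.nan, "N"], p=[.3, .7], size=M)})

N = 3
W = N + 1  # the label-based loop window covers N + 1 rows

# True wherever Lead_Lag is anything other than "N".
flags = Df2["Lead_Lag"].ne("N")

# pandas: the rolling max of 0/1 flags is 1 when any flag in the window is set.
rolled = flags.astype(int).rolling(W, min_periods=1).max().astype(bool)
pos_roll = np.where(rolled, "Y", "N")

# numpy: pad with W - 1 False values, then view the array as overlapping windows.
padded = np.concatenate([np.zeros(W - 1, dtype=bool), flags.to_numpy()])
pos_view = np.where(sliding_window_view(padded, W).any(axis=1), "Y", "N")

# Reference: the explicit label-based window used by the iterrows loop.
pos_loop = np.array(["Y" if flags.loc[i - N:i].any() else "N" for i in Df2.index])

print((pos_roll == pos_view).all(), (pos_roll == pos_loop).all())
```

Unlike `as_strided`, `sliding_window_view` returns a read-only view by default and checks the window shape, which makes it harder to read past the end of the buffer by mistake.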