Question

I have a time-series pandas.DataFrame, 'ES_Summary_Index1', which looks like this:

     Ticker_x                Date  Close_x 15M_Long 1H_Long Net_Long
0       ES H7 2016-10-18 13:44:59  2128.00        N     NaN         
1       ES H7 2016-10-18 13:59:59  2128.75        N     NaN         
2       ES H7 2016-10-18 14:14:59  2125.75        N     NaN         
3       ES H7 2016-10-18 14:29:59  2126.50        N       N         
4       ES H7 2016-10-18 14:44:59  2126.50        N     NaN         
5       ES H7 2016-10-18 16:14:59  2126.00        N     NaN         
6       ES H7 2016-10-18 16:44:59  2126.25        N     NaN         
7       ES H7 2016-10-18 17:59:59  2126.50        N     NaN         
8       ES H7 2016-10-18 18:14:59  2127.00        N     NaN         
9       ES H7 2016-10-18 19:14:59  2127.75        N     NaN         
10      ES H7 2016-10-18 19:44:59  2127.75        N     NaN         
11      ES H7 2016-10-18 19:59:59  2127.75        N     NaN         
12      ES H7 2016-10-18 20:44:59  2129.00        N     NaN         
13      ES H7 2016-10-18 21:29:59  2128.75        N       N         
14      ES H7 2016-10-18 21:44:59  2129.00        N     NaN

Looking at the 15M_Long and 1H_Long columns: if both say 'Y', I want the Net_Long column to say 'Long'. If only one or neither says 'Y', I want the Net_Long column to stay blank or say 'N' (preferred).

First, I set the Net_Long column to blank:

ES_Summary_Index1['Net_Long'] = ''

Next, I wrote a for loop to fill the Net_Long column:

for index, row in ES_Summary_Index1.iterrows():
    if ES_Summary_Index1.loc[index, '15M_Long'] is 'Y' & ES_Summary_Index1.loc[index, '1H_Long'] is 'Y':
        ES_Summary_Index1.loc['Net_Long'] = 'Long'
    else:
        ES_Summary_Index1.loc['Net_Long'] = 'N'

Unfortunately, I get the following error:

TypeError: unsupported operand type(s) for &: 'str' and 'float'

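...referring to the if statement above (if ES_Summary_Index1 ...). I tried changing & to and, but that does not populate the Net_Long column the way I expected. I also tried == instead of is, and that did not work either. Can anyone help?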

Answer 1

You need a boolean mask combined with numpy.where, which is fast and vectorized:

mask = (df['15M_Long'] == 'Y') & (df['1H_Long'] == 'Y')
df['Net_Long'] = np.where(mask, 'Long', 'N')

print (df)
  Ticker_x                 Date  Close_x 15M_Long 1H_Long Net_Long
0    ES_H7  2016-10-18T13:44:59  2128.00        N     NaN        N
1    ES_H7  2016-10-18T13:59:59  2128.75        N     NaN        N
2    ES_H7  2016-10-18T19:59:59  2127.75        Y     NaN        N
3    ES_H7  2016-10-18T20:44:59  2129.00        N       Y        N
4    ES_H7  2016-10-18T21:29:59  2128.75        Y       Y     Long
5    ES_H7  2016-10-18T21:44:59  2129.00        N     NaN        N
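The mask approach above can be sketched as a self-contained script. The four-row frame is a hypothetical sample, not the asker's data, and it uses a real np.nan for the missing 1H_Long values to show that a comparison against NaN simply yields False:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; np.nan marks a genuinely missing 1H_Long value.
df = pd.DataFrame({
    "15M_Long": ["N", "Y", "N", "Y"],
    "1H_Long": [np.nan, np.nan, "Y", "Y"],
})

# Each comparison returns a boolean Series; NaN == 'Y' evaluates to False.
# The parentheses are required because & binds more tightly than ==.
mask = (df["15M_Long"] == "Y") & (df["1H_Long"] == "Y")

# Pick 'Long' where the mask is True and 'N' everywhere else.
df["Net_Long"] = np.where(mask, "Long", "N")
print(df)
```

Only the last row, where both columns hold 'Y', ends up with 'Long'.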

Timings:

#length of df is 600 rows
In [183]: %timeit (iterate(df))
10 loops, best of 3: 67.1 ms per loop

In [184]: %timeit (vectorize(df1))
1000 loops, best of 3: 1.49 ms per loop

#length of df is 6000 rows
In [177]: %timeit (iterate(df))
1 loop, best of 3: 681 ms per loop

In [178]: %timeit (vectorize(df1))
100 loops, best of 3: 3.23 ms per loop

#length of df is 60000 rows 
In [180]: %timeit (iterate(df))
1 loop, best of 3: 6.87 s per loop

In [181]: %timeit (vectorize(df1))
10 loops, best of 3: 20.8 ms per loop

Code for timings:

import numpy as np
import pandas as pd

data = [x.strip().split() for x in """
    Ticker_x             Date  Close_x 15M_Long 1H_Long
    ES_H7 2016-10-18T13:44:59  2128.00        N     NaN
    ES_H7 2016-10-18T13:59:59  2128.75        N     NaN
    ES_H7 2016-10-18T19:59:59  2127.75        Y     NaN
    ES_H7 2016-10-18T20:44:59  2129.00        N       Y
    ES_H7 2016-10-18T21:29:59  2128.75        Y       Y
    ES_H7 2016-10-18T21:44:59  2129.00        N     NaN
""".split('\n')[1:-1]]
df = pd.DataFrame(data=data[1:], columns=data[0])
#multiply by 100 for 600 rows, by 1000 for 6000 rows, by 10000 for 60k rows
df = pd.concat([df]*1000).reset_index(drop=True)
print (df)
df1 = df.copy()

def vectorize(df):
    mask = (df['15M_Long'] == 'Y') & (df['1H_Long'] == 'Y')
    df['Net_Long'] = np.where(mask, 'Long', 'N')
    return (df)

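def iterate(df):
    df['Net_Long'] = ''

    for idx, row in df.iterrows():
        # compare with == (value equality); `is` tests object identity
        if row['15M_Long'] == 'Y' and row['1H_Long'] == 'Y':
            # write to the frame itself; `row` may be a copy
            df.at[idx, 'Net_Long'] = 'Long'
        else:
            df.at[idx, 'Net_Long'] = 'N'
    return df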

print (iterate(df)) 
print (vectorize(df1))
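Outside IPython, where %timeit is not available, a similar comparison can be run with the standard-library timeit module. This is only a sketch: the 600-row frame and the repeat counts are assumptions chosen for illustration, and the numbers it prints will differ from the timings quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical three-row pattern repeated to 600 rows.
base = pd.DataFrame({
    "15M_Long": ["N", "Y", "Y"],
    "1H_Long": [np.nan, "Y", np.nan],
})
df = pd.concat([base] * 200, ignore_index=True)


def vectorize(frame):
    # One boolean mask over the whole column, then a single np.where call.
    mask = (frame["15M_Long"] == "Y") & (frame["1H_Long"] == "Y")
    frame["Net_Long"] = np.where(mask, "Long", "N")
    return frame


def iterate(frame):
    # Row-by-row loop that writes each cell back through .at.
    frame["Net_Long"] = ""
    for idx, row in frame.iterrows():
        both = row["15M_Long"] == "Y" and row["1H_Long"] == "Y"
        frame.at[idx, "Net_Long"] = "Long" if both else "N"
    return frame


# Time each function on a fresh copy so the runs do not affect each other.
seconds_vec = timeit.timeit(lambda: vectorize(df.copy()), number=10) / 10
seconds_loop = timeit.timeit(lambda: iterate(df.copy()), number=3) / 3
print(f"vectorize: {seconds_vec:.6f} s per call")
print(f"iterate:   {seconds_loop:.6f} s per call")
```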

Answer 2

Replace the following line:

if ES_Summary_Index1.loc[index, '15M_Long'] is 'Y' & ES_Summary_Index1.loc[index, '1H_Long'] is 'Y':

with

if ES_Summary_Index1.loc[index, '15M_Long']=='Y' and ES_Summary_Index1.loc[index, '1H_Long']=='Y':
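The TypeError in the question comes from operator precedence: & binds more tightly than is or ==, so Python tries to evaluate 'Y' & <the 1H_Long value> first, and that value is a float NaN. The sketch below reproduces this with two plain variables (hypothetical stand-ins for the two cells) rather than a DataFrame:

```python
# Hypothetical cell values: a string flag and a missing value (NaN is a float).
flag_15m = "N"
flag_1h = float("nan")

# `&` binds more tightly than `==`, so this line is parsed as
# flag_15m == ("Y" & flag_1h) == "Y" and fails before any comparison runs.
try:
    flag_15m == "Y" & flag_1h == "Y"
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# `and` has lower precedence than `==`, so each comparison is evaluated first.
both_long = flag_15m == "Y" and flag_1h == "Y"
print(both_long)
```

Using == rather than is also matters: is checks whether two names refer to the same object, which is not guaranteed for equal strings read from a DataFrame.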

Answer 3

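Besides getting the logical test right, you should also write to the current row directly. Your current code assigns through .loc['Net_Long'] on every pass through the loop, which never targets the Net_Long cell of the row being examined: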

Code:

df['Net_Long'] = ''

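for idx, row in df.iterrows():
    # compare with == (value equality); `is` tests object identity
    if row['15M_Long'] == 'Y' and row['1H_Long'] == 'Y':
        # write to the frame itself; `row` may be a copy
        df.at[idx, 'Net_Long'] = 'Long'
    else:
        df.at[idx, 'Net_Long'] = 'N'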

Test data:

import pandas as pd

data = [x.strip().split() for x in """
    Ticker_x             Date  Close_x 15M_Long 1H_Long
    ES_H7 2016-10-18T13:44:59  2128.00        N     NaN
    ES_H7 2016-10-18T13:59:59  2128.75        N     NaN
    ES_H7 2016-10-18T19:59:59  2127.75        Y     NaN
    ES_H7 2016-10-18T20:44:59  2129.00        N       Y
    ES_H7 2016-10-18T21:29:59  2128.75        Y       Y
    ES_H7 2016-10-18T21:44:59  2129.00        N     NaN
""".split('\n')[1:-1]]
df = pd.DataFrame(data=data[1:], columns=data[0])

Produces:

  Ticker_x                 Date  Close_x 15M_Long 1H_Long Net_Long
0    ES_H7  2016-10-18T13:44:59  2128.00        N     NaN        N
1    ES_H7  2016-10-18T13:59:59  2128.75        N     NaN        N
2    ES_H7  2016-10-18T19:59:59  2127.75        Y     NaN        N
3    ES_H7  2016-10-18T20:44:59  2129.00        N       Y        N
4    ES_H7  2016-10-18T21:29:59  2128.75        Y       Y     Long
5    ES_H7  2016-10-18T21:44:59  2129.00        N     NaN        N
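To see why the loop in the question never fills individual cells, it may help to check what .loc does when it is given a single label. The sketch below uses a hypothetical two-row frame with string columns only; it compares the single-label assignment from the question with an assignment that names both the row and the column:

```python
import pandas as pd

# Hypothetical two-row frame with string columns only.
df = pd.DataFrame({
    "15M_Long": ["Y", "N"],
    "1H_Long": ["Y", "Y"],
    "Net_Long": ["", ""],
})

by_label = df.copy()
# A single label passed to .loc selects a row, so this adds a row
# labelled 'Net_Long' instead of filling the Net_Long column.
by_label.loc["Net_Long"] = "Long"
print(by_label)

by_cell = df.copy()
# Row label plus column label addresses exactly one cell.
by_cell.loc[0, "Net_Long"] = "Long"
print(by_cell)
```

In the first frame the original two rows keep their blank Net_Long values; only the second form writes 'Long' into row 0.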

Pandas for loop error - embedded and/if statement

3 answers: