I'm new to Python and have the following pandas DataFrame. I'm trying to write code that populates the 'signal' column as shown below:
Days long_entry_flag long_exit_flag signal
1 FALSE TRUE
2 FALSE FALSE
3 TRUE FALSE 1
4 TRUE FALSE 1
5 FALSE FALSE 1
6 TRUE FALSE 1
7 TRUE FALSE 1
8 FALSE TRUE
9 FALSE TRUE
10 TRUE FALSE 1
11 TRUE FALSE 1
12 TRUE FALSE 1
13 FALSE FALSE 1
14 FALSE TRUE
15 FALSE FALSE
16 FALSE TRUE
17 TRUE FALSE 1
18 TRUE FALSE 1
19 FALSE FALSE 1
20 FALSE FALSE 1
21 FALSE TRUE
22 FALSE FALSE
23 FALSE FALSE
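My pseudo-code version would take the following steps
Ideas are welcome on ways to populate the 'signal' column quickly where possible (using vectorisation?). This is a subset of a large DataFrame with tens of thousands of rows, and it is one of many DataFrames being analysed in sequence.
Many thanks!
Answer 0 (score: 7)
You can do this: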
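For anyone who wants to run the answers below, here is a minimal sketch that rebuilds the sample frame from the table above. The helper names `entry_days` and `exit_days` are introduced here purely for illustration, and the flag columns are assumed to be plain booleans.

```python
import numpy as np
import pandas as pd

# Days on which each flag is TRUE in the table above (illustrative helpers)
entry_days = {3, 4, 6, 7, 10, 11, 12, 17, 18}
exit_days = {1, 8, 9, 14, 16, 21}

days = list(range(1, 24))
df = pd.DataFrame({
    "Days": days,
    "long_entry_flag": [d in entry_days for d in days],
    "long_exit_flag": [d in exit_days for d in days],
})
# Empty target column, to be filled by whichever approach is used
df["signal"] = np.nan
```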
# Assuming we're starting from the "outside"
inside = False
for ix, row in df.iterrows():
    inside = (not row['long_exit_flag']
              if inside
              else row['long_entry_flag']
              and not row['long_exit_flag'])  # [True, True] case
    df.at[ix, 'signal'] = 1 if inside else np.nan
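This gives you exactly the output you posted.
Inspired by @jezrael's answer, I created a slightly more performant version of the above, while still trying to keep it tidy: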
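The update rule inside the loop relies on how Python parses a conditional expression: `a if c else b and d` is read as `a if c else (b and d)`. The sketch below isolates that rule in a small function (the name `step` is hypothetical, not part of the answer) and enumerates every combination, so the [True, True] case can be checked directly.

```python
def step(inside, entry, exit_):
    """Return the next inside/outside state for one row."""
    # Parsed as: (not exit_) if inside else (entry and not exit_)
    return not exit_ if inside else entry and not exit_

# Enumerate every (inside, entry, exit) combination
table = {
    (inside, entry, exit_): step(inside, entry, exit_)
    for inside in (False, True)
    for entry in (False, True)
    for exit_ in (False, True)
}

for key, value in table.items():
    print(key, "->", value)
```

In every combination an exit flag leaves the state "outside", whether or not an entry flag is set on the same row.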
# Same assumption of starting from the "outside"
df.at[0, 'signal'] = df.at[0, 'long_entry_flag']
for ix in df.index[1:]:
    df.at[ix, 'signal'] = (not df.at[ix, 'long_exit_flag']
                           if df.at[ix - 1, 'signal']
                           else df.at[ix, 'long_entry_flag']
                           and not df.at[ix, 'long_exit_flag'])  # [True, True] case
# Adjust to match the requested output exactly
df['signal'] = df['signal'].replace([True, False], [1, np.nan])
Answer 1 (score: 5)
For better performance, use a Numba solution:
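import numpy as np
from numba import jit

arr = df[['long_exit_flag','long_entry_flag']].values

@jit
def f(A):
    # Column 0 is long_exit_flag, column 1 is long_entry_flag
    inside = False
    out = np.ones(len(A), dtype=float)
    for i in range(len(A)):  # iterate over the argument, not the global arr
        inside = not A[i, 0] if inside else A[i, 1]
        if not inside:
            out[i] = np.nan
    return out

df['signal'] = f(arr)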
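Note that the rule above enters a position whenever the entry flag is set while "outside", even if the exit flag is set on the same row, which differs from the [True, True] handling in Answer 0. The following is a minimal sketch, not the answerer's code, of a variant in which an exit always wins. The function name `fill_signal` is hypothetical, and the plain-Python fallback for `njit` is an assumption added so the sketch still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumed fallback: run the same loop as ordinary Python
    def njit(func):
        return func

@njit
def fill_signal(entry, exit_):
    """Return 1.0 while inside a position and NaN while outside."""
    n = len(entry)
    out = np.full(n, np.nan)
    inside = False  # start from the "outside"
    for i in range(n):
        if exit_[i]:
            inside = False  # an exit wins, including the [True, True] case
        elif entry[i]:
            inside = True
        # neither flag set: keep the previous state
        if inside:
            out[i] = 1.0
    return out

entry = np.array([False, True, False, True, True, False])
exit_ = np.array([True, False, False, True, True, False])
out = fill_signal(entry, exit_)
```

With a DataFrame, the inputs would be something like `df['long_entry_flag'].to_numpy(dtype=bool)` and `df['long_exit_flag'].to_numpy(dtype=bool)`.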
Performance:
#[21000 rows x 5 columns]
df = pd.concat([df] * 1000, ignore_index=True)
In [189]: %%timeit
...: inside = False
...: for ix, row in df.iterrows():
...:     inside = not row['long_exit_flag'] if inside else row['long_entry_flag']
...:     df.at[ix, 'signal'] = 1 if inside else np.nan
...:
1.58 s ± 9.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [190]: %%timeit
...: arr = df[['long_exit_flag','long_entry_flag']].values
...:
...: @jit
...: def f(A):
...:     inside = False
...:     out = np.ones(len(A), dtype=float)
...:     for i in range(len(arr)):
...:         inside = not A[i, 0] if inside else A[i, 1]
...:         if not inside:
...:             out[i] = np.nan
...:     return out
...:
...: df['signal'] = f(arr)
...:
171 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [200]: %%timeit
...: df['d'] = np.where(~df['long_exit_flag'],df['long_entry_flag'] | df['long_exit_flag'],np.nan)
...:
...: df['new_select']= np.where(df['d']==0, np.select([df['d'].shift()==0, df['d'].shift()==1],[1,1], np.nan), df['d'])
...:
2.4 ms ± 561 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can also do the shift in NumPy, which simplifies @Dark's code as well:
In [222]: %%timeit
...: d = np.where(~df['long_exit_flag'].values, df['long_entry_flag'].values | df['long_exit_flag'].values, np.nan)
...: shifted = np.insert(d[:-1], 0, np.nan)
...: m = (shifted==0) | (shifted==1)
...: df['signal1']= np.select([d!=0, m], [d, 1], np.nan)
...:
590 µs ± 35.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
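EDIT:
You can also check "Does iterrows have performance issues?" for a general order of precedence for the performance of various operations in pandas.
Answer 2 (score: 3)
Here is an approach that uses only boolean operations; it is vectorised and fast.
Step 1:
If long_exit_flag is True, return np.nan; otherwise take the `or` of long_entry_flag and long_exit_flag.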
df['d'] = np.where(df['long_exit_flag'], np.nan, df['long_entry_flag'] | df['long_exit_flag'])
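Step 2: Now consider the state where both columns are False. We need to ignore it and replace the value with the previous state. This can be done using where and select.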
df['new_signal'] = np.where(df['d'] == 0,
                            np.select([df['d'].shift() == 0, df['d'].shift() == 1], [1, 1], np.nan),
                            df['d'])
Days long_entry_flag long_exit_flag signal d new_signal
0 1 False True NaN NaN NaN
1 2 False False NaN 0.0 NaN
2 3 True False 1.0 1.0 1.0
3 4 True False 1.0 1.0 1.0
4 5 False False 1.0 0.0 1.0
5 6 True False 1.0 1.0 1.0
6 7 True False 1.0 1.0 1.0
7 8 False True NaN NaN NaN
8 9 False True NaN NaN NaN
9 10 True False 1.0 1.0 1.0
10 11 True False 1.0 1.0 1.0
11 12 True False 1.0 1.0 1.0
12 13 False False 1.0 0.0 1.0
13 14 False True NaN NaN NaN
14 15 False False NaN 0.0 NaN
15 16 False True NaN NaN NaN
16 17 True False 1.0 1.0 1.0
17 18 True False 1.0 1.0 1.0
18 19 False False 1.0 0.0 1.0
19 20 False False 1.0 0.0 1.0
20 21 False True NaN NaN NaN
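Step 2 looks back only one row, so it appears to break when two or more all-False rows follow an exit: for days 22 and 23 of the question's data (not shown in the table above), `shift()` sees a 0 on day 22 and day 23 would be set to 1. A different vectorised technique, forward-filling the last entry or exit event, avoids this. The sketch below is an alternative rather than this answer's method, and `signal_by_ffill` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def signal_by_ffill(df):
    """Return 1.0 while inside a position and NaN otherwise."""
    entry = df["long_entry_flag"].to_numpy(dtype=bool)
    exit_ = df["long_exit_flag"].to_numpy(dtype=bool)
    # Mark only the rows where the state changes; leave the rest undefined
    event = pd.Series(np.nan, index=df.index)
    event[entry] = 1.0
    event[exit_] = 0.0  # written last, so an exit wins the [True, True] case
    # Carry the last event forward; rows before the first event are "outside"
    state = event.ffill().fillna(0.0)
    return state.where(state == 1.0)

# Sample data rebuilt from the question's table (helper sets are illustrative)
entry_days = {3, 4, 6, 7, 10, 11, 12, 17, 18}
exit_days = {1, 8, 9, 14, 16, 21}
days = list(range(1, 24))
df = pd.DataFrame({
    "Days": days,
    "long_entry_flag": [d in entry_days for d in days],
    "long_exit_flag": [d in exit_days for d in days],
})
df["signal"] = signal_by_ffill(df)
```

This works because the row-by-row rule is a simple latch: an exit resets it, an entry sets it, and a row with neither flag keeps the previous state, which is exactly what a forward fill reproduces.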
Answer 3 (score: 0)
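# Let long_exit_flag_r be 0 (False) wherever the exit flag is True
df['long_exit_flag_r'] = ~df['long_exit_flag']
df['temp'] = 0
for i in range(1, len(df.index)):
    # Carry the previous state, add any new entry, and zero it out on an exit
    df.at[i, 'temp'] = (df.at[i - 1, 'temp'] + df.at[i, 'long_entry_flag']) * df.at[i, 'long_exit_flag_r']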
If temp is positive, the signal should be 1; otherwise the signal should be empty. (I'm a bit stuck here.)
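One way to finish this idea, sketched under the assumption that the running value carries the previous row's state forward: multiplying by the inverted exit flag resets it to 0 on an exit, and any positive value maps to a signal of 1. The small frame and the name `keep` below are illustrative only.

```python
import numpy as np
import pandas as pd

# Small illustrative frame, including a [True, True] row
df = pd.DataFrame({
    "long_entry_flag": [False, True, False, False, True, False],
    "long_exit_flag": [True, False, False, True, True, False],
})

keep = ~df["long_exit_flag"]  # False (0) wherever an exit occurs
temp = np.zeros(len(df), dtype=int)
prev = 0  # start from the "outside"
for i, (entry, k) in enumerate(zip(df["long_entry_flag"], keep)):
    # Previous state plus any new entry, zeroed out by an exit
    prev = (prev + int(entry)) * int(k)
    temp[i] = prev

df["temp"] = temp
# Positive temp means inside a position; everything else stays empty
df["signal"] = np.where(df["temp"] > 0, 1.0, np.nan)
```

Because temp is never negative with this recurrence, the only test needed is whether it is greater than zero.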