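To illustrate, consider the following SQL example: a table StockPrices in which BarSeqId is a sequence number, and each increment holds the information from the next minute of trading.
The goal is to transform this data in a pandas DataFrame: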
StockPrice  BarSeqId  LongProfitTarget
105         0         109
100         1         105
103         2         107
105         3         108
104         4         110
107         5         113
into this data:
StockPrice  BarSeqId  LongProfitTarget  TargetHitBarSeqId
105         0         109               NaN
100         1         105               3
103         2         107               5
105         3         108               NaN
104         4         110               NaN
107         5         113               NaN
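That is, create a new column that gives, for each bar, the earliest later bar at which the price reaches that bar's profit target.
Here is how it is done in SQL: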
SELECT S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget,
       MIN(S2.BarSeqId) AS TargetHitBarSeqId
FROM StockPrices S1
LEFT OUTER JOIN StockPrices S2
  ON S1.BarSeqId < S2.BarSeqId
 AND S2.StockPrice >= S1.LongProfitTarget
GROUP BY S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget
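For reference, the query can be checked from Python against an in-memory SQLite database. This is only a sketch: the choice of SQLite and the added ORDER BY clause (to make the row order predictable) are my assumptions and not part of the original query.

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({
    'StockPrice': [105, 100, 103, 105, 104, 107],
    'BarSeqId': [0, 1, 2, 3, 4, 5],
    'LongProfitTarget': [109, 105, 107, 108, 110, 113],
})

# Same self-join as the SQL above; ORDER BY is added only for a stable row order.
query = """
SELECT S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget,
       MIN(S2.BarSeqId) AS TargetHitBarSeqId
FROM StockPrices S1
LEFT OUTER JOIN StockPrices S2
  ON S1.BarSeqId < S2.BarSeqId
 AND S2.StockPrice >= S1.LongProfitTarget
GROUP BY S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget
ORDER BY S1.BarSeqId
"""

conn = sqlite3.connect(':memory:')
try:
    # Load the frame into a table named StockPrices, then run the query.
    df.to_sql('StockPrices', conn, index=False)
    sql_result = pd.read_sql_query(query, conn)
finally:
    conn.close()

print(sql_result)
```

Bars whose target is never reached come back as NULL from the left join, which pandas shows as a missing value.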
I would like an answer of the form:
someDataFrame['TargetHitBarSeqId'] = (pandas expression here ...)
Assume someDataFrame already has the columns StockPrice, BarSeqId, and LongProfitTarget.
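Edit: I changed the data to illustrate the case. In the second row the result should be
100 1 105 3
and not
100 1 105 0
because bar 3 comes after the current bar and bar 0 does not.
It is important that the bar in question occurs in the future (its BarSeqId is greater than the current BarSeqId).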
import numpy as np
import pandas as pd

df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5],
                   'LongProfitTarget': [109, 105, 107, 108, 110, 113]})

def get_barseqid(longProfitTarget):
    try:
        # First row anywhere in the frame (past or future) whose price reaches the target
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.loc[idx, 'BarSeqId']
    except IndexError:
        # No row reaches the target
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)
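To make the difference concrete, here is a small sketch for the second row only (BarSeqId 1, target 105). The variable names `hits_any_time` and `hits_future` are mine, chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'StockPrice': [105, 100, 103, 105, 104, 107],
    'BarSeqId': [0, 1, 2, 3, 4, 5],
    'LongProfitTarget': [109, 105, 107, 108, 110, 113],
})

row = df.iloc[1]  # StockPrice 100, BarSeqId 1, LongProfitTarget 105

# Every bar whose price reaches the target, regardless of when it occurs.
hits_any_time = df.loc[df.StockPrice >= row.LongProfitTarget, 'BarSeqId']

# Only bars that come after the current one.
hits_future = df.loc[
    (df.StockPrice >= row.LongProfitTarget) & (df.BarSeqId > row.BarSeqId),
    'BarSeqId',
]

print(hits_any_time.tolist())  # [0, 3, 5]
print(hits_future.tolist())    # [3, 5]
```

Taking the first element of the unrestricted list gives 0, which lies in the past; with the extra condition the first element is 3, the value I am after.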
Answer 0: (score: 0)
Here is one solution:
import pandas as pd
import numpy as np

df = <your input data frame>

def get_barseqid(longProfitTarget):
    try:
        # Label of the first row whose price is at or above the target
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.loc[idx, 'BarSeqId']
    except IndexError:
        # No row reaches the target
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)
Output:
   StockPrice  BarSeqId  LongProfitTarget  TargetHitBarSeqId
0         100         1               105                3.0
1         103         2               107                5.0
2         105         3               108                NaN
3         104         4               110                NaN
4         107         5               113                NaN
Answer 1: (score: 0)
import numpy as np
import pandas as pd

df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5],
                   'LongProfitTarget': [109, 105, 107, 108, 110, 113]})

def get_barseqid(longProfitTarget, barseq):
    try:
        # First row that reaches the target AND comes after the current bar
        idx = df[(df.StockPrice >= longProfitTarget) & (df.BarSeqId > barseq)].index[0]
        return df.loc[idx, 'BarSeqId']
    except IndexError:
        # No later bar reaches the target
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget'], row['BarSeqId']), axis=1)
print(df)
For me, the key misunderstanding was that the element-wise & operator is needed here rather than the regular 'and' keyword.
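If the row-wise apply becomes slow, the same condition can be written with NumPy broadcasting. This is a sketch rather than a drop-in replacement: it builds an n-by-n boolean matrix, so it assumes the frame is small enough for that to fit in memory, and the names `hit` and `earliest` are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'StockPrice': [105, 100, 103, 105, 104, 107],
    'BarSeqId': [0, 1, 2, 3, 4, 5],
    'LongProfitTarget': [109, 105, 107, 108, 110, 113],
})

price = df['StockPrice'].to_numpy()
seq = df['BarSeqId'].to_numpy()
target = df['LongProfitTarget'].to_numpy()

# hit[i, j] is True when bar j is later than bar i
# and bar j's price reaches bar i's target.
hit = (seq[None, :] > seq[:, None]) & (price[None, :] >= target[:, None])

# Keep the BarSeqId of each hit, use +inf elsewhere, and take the row-wise minimum.
earliest = np.where(hit, seq[None, :], np.inf).min(axis=1)

# Rows with no hit still hold +inf; turn those into NaN.
df['TargetHitBarSeqId'] = np.where(np.isinf(earliest), np.nan, earliest)
print(df)
```

Note that the masks are again combined with &, for the same reason as above.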
Answer 2: (score: 0)
Assuming the data is of manageable size, consider a cross join followed by a filter and a groupby, which replicates the SQL query:
cdf = (
    pd.merge(df.assign(key=1), df.assign(key=1), on='key', suffixes=('', '_'))
      .query('(BarSeqId < BarSeqId_) & (LongProfitTarget <= StockPrice_)')
      .groupby(['StockPrice', 'BarSeqId', 'LongProfitTarget'])['BarSeqId_'].min()
)
print(cdf)
# StockPrice  BarSeqId  LongProfitTarget
# 100         1         105                 3
# 103         2         107                 5
# Name: BarSeqId_, dtype: int64
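The result above only lists bars whose target is eventually reached. To get a column on the original frame, with NaN for the remaining bars as the SQL left join would give, the minimum can be joined back. The sketch below assumes pandas 1.2 or later (for `how='cross'`) and that BarSeqId is unique; the names `pairs`, `first_hit` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'StockPrice': [105, 100, 103, 105, 104, 107],
    'BarSeqId': [0, 1, 2, 3, 4, 5],
    'LongProfitTarget': [109, 105, 107, 108, 110, 113],
})

# Pair every bar with every other bar; right-hand columns get a trailing underscore.
pairs = df.merge(df[['BarSeqId', 'StockPrice']], how='cross', suffixes=('', '_'))

# Keep pairs where the right-hand bar is later and its price reaches the target.
hits = pairs[(pairs['BarSeqId'] < pairs['BarSeqId_'])
             & (pairs['LongProfitTarget'] <= pairs['StockPrice_'])]

# Earliest qualifying bar per original bar.
first_hit = hits.groupby('BarSeqId')['BarSeqId_'].min().rename('TargetHitBarSeqId')

# Join back on BarSeqId; bars without a hit end up as NaN.
result = df.join(first_hit, on='BarSeqId')
print(result)
```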