How to merge a pandas DataFrame onto itself based on a condition and groupby

Time: 2018-05-07 05:54:31

Tags: python pandas

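To illustrate the problem, consider the following SQL example: a table StockPrices in which BarSeqId is a sequence number, and each increment holds the trade information for the next minute.

The goal is to transform this data in a pandas DataFrame: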

StockPrice    BarSeqId    LongProfitTarget
105           0           109
100           1           105
103           2           107
105           3           108
104           4           110
107           5           113

into this:

StockPrice    BarSeqId    LongProfitTarget  TargetHitBarSeqId
105           0           109               NaN
100           1           105               3
103           2           107               5
105           3           108               NaN
104           4           110               NaN
107           5           113               NaN

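That is, create a new column that gives, for each bar, the earliest later bar at which the price reaches that bar's profit target.

Here is how it can be done in SQL: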

SELECT S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget,
       MIN(S2.BarSeqId) AS TargetHitBarSeqId
FROM StockPrices S1
LEFT OUTER JOIN StockPrices S2
    ON S1.BarSeqId < S2.BarSeqId
   AND S2.StockPrice >= S1.LongProfitTarget
GROUP BY S1.StockPrice, S1.BarSeqId, S1.LongProfitTarget

I am hoping for an answer along these lines:

someDataFrame['TargetHitBarSeqId'] = (pandas expression here ...)

Assume someDataFrame already has the columns StockPrice, BarSeqId and LongProfitTarget.

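Edit: the data was changed to illustrate the following case. In the second row the result should be

100           1           105               3

rather than

100           1           105               0

because bar 3 comes after the current bar and bar 0 does not.

The important point is that the matching BarSeqId must lie in the future (greater than the current BarSeqId).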

import numpy as np
import pandas as pd

df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5],
                   'LongProfitTarget': [109, 105, 107, 108, 110, 113]})

def get_barseqid(longProfitTarget):
    # Searches every bar, not only later ones, so bar 1 gets 0 instead of 3.
    try:
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.iloc[idx].BarSeqId
    except IndexError:
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)

3 answers:

Answer 0: (score: 0)

Here is a solution:

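import pandas as pd
import numpy as np

df = <your input data frame>

def get_barseqid(longProfitTarget):
    # First row whose price reaches the target. Note that this searches all rows,
    # not only the rows after the current bar.
    try:
        idx = df.StockPrice[df.StockPrice >= longProfitTarget].index[0]
        return df.loc[idx, 'BarSeqId']
    except IndexError:  # no row reaches the target
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget']), axis=1)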

Output (for an input containing bars 1 through 5):

   StockPrice  BarSeqId  LongProfitTarget  TargetHitBarSeqId
0         100         1               105                3.0
1         103         2               107                5.0
2         105         3               108                NaN
3         104         4               110                NaN
4         107         5               113                NaN

Answer 1: (score: 0)

import numpy as np
import pandas as pd

df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5],
                   'LongProfitTarget': [109, 105, 107, 108, 110, 113]})

def get_barseqid(longProfitTarget, barseq):
    # First later bar (BarSeqId > barseq) whose price reaches the target.
    try:
        idx = df[(df.StockPrice >= longProfitTarget) & (df.BarSeqId > barseq)].index[0]
        return df.loc[idx, 'BarSeqId']
    except IndexError:  # no later bar reaches the target
        return np.nan

df['TargetHitBarSeqId'] = df.apply(lambda row: get_barseqid(row['LongProfitTarget'], row['BarSeqId']), axis=1)
df

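For me, the key misunderstanding was that combining the two conditions requires the & operator rather than the regular 'and' keyword.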
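To make the operator point concrete, here is a minimal sketch using the same sample prices. The variable names (price_ok, later, first_hit, and_failed) and the fixed target of 105 for bar 1 are chosen purely for illustration. It shows that & combines two boolean Series row by row, while Python's and tries to reduce a whole Series to a single truth value, which pandas rejects with a ValueError.

```python
import pandas as pd

# Same sample prices as in the answer above.
df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5]})

# Illustrative case: bar 1 with a profit target of 105.
price_ok = df.StockPrice >= 105   # boolean Series, one value per row
later = df.BarSeqId > 1           # boolean Series, one value per row

# Element-wise AND keeps only the rows where both conditions hold.
mask = price_ok & later
first_hit = df.loc[mask, 'BarSeqId'].iloc[0]

# `and` needs a single True/False for each operand, so pandas raises.
try:
    price_ok and later
    and_failed = False
except ValueError:
    and_failed = True

print(first_hit, and_failed)
```

When the comparisons are written inline, as in the answer, each one needs its own parentheses because & binds more tightly than >= and >.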

Answer 2: (score: 0)

Assuming the data is of a manageable size, consider a cross join followed by a filter and a groupby, which replicates the SQL query:

cdf = pd.merge(df.assign(key=1), df.assign(key=1), on='key', suffixes=['','_'])\
            .query('(BarSeqId < BarSeqId_) & (LongProfitTarget <= StockPrice_)')\
            .groupby(['StockPrice', 'BarSeqId', 'LongProfitTarget'])['BarSeqId_'].min()

print(cdf)
# StockPrice  BarSeqId  LongProfitTarget
# 100         1         105                 3
# 103         2         107                 5
# Name: BarSeqId_, dtype: int64
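The grouped result above lists only the bars that have a hit, whereas the SQL left outer join keeps every bar and leaves the others empty. One possible way to get the column the question asks for is to map the hits back onto the original frame. This is a sketch that assumes BarSeqId is unique per row, which lets the groupby use BarSeqId alone. The name hits is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'StockPrice': [105, 100, 103, 105, 104, 107],
                   'BarSeqId': [0, 1, 2, 3, 4, 5],
                   'LongProfitTarget': [109, 105, 107, 108, 110, 113]})

# Cross join and filter as in the answer, but grouped by BarSeqId only
# (assumption: BarSeqId identifies a row uniquely).
hits = (pd.merge(df.assign(key=1), df.assign(key=1), on='key', suffixes=('', '_'))
          .query('(BarSeqId < BarSeqId_) & (LongProfitTarget <= StockPrice_)')
          .groupby('BarSeqId')['BarSeqId_'].min())

# Look up each bar in `hits`; bars without a later hit get NaN,
# which mirrors the unmatched rows of the left outer join.
df['TargetHitBarSeqId'] = df['BarSeqId'].map(hits)

print(df)
```

With the sample data, bars 1 and 2 get 3.0 and 5.0 and all other bars get NaN. The column is float because NaN cannot be stored in a plain integer column.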