Question

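First of all, sorry about the title; I wasn't sure how to phrase it. The problem itself isn't complicated, so it should be easy to understand. I have two data frames; the first looks like this: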

    df1
id    date
1    20190101
1    20190201
1    20190301
2    20180101
2    20180301
2    20180401

df1 has 10 million rows, which is why efficiency matters. I then have a second data frame that looks like this:

       df2
id    date      price
1    20180801    150
1    20181001    140
1    20190201    100
2    20180301    90
2    20180401    120

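For each row in df1, I need the maximum price listed in df2 for the same id, where the df2 date falls between date - 6 months and date. So in this case, my ideal output is: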

    df1
id    date       price
1    20190101     150
1    20190201     150
1    20190301     140 #The first row of df2 was 7 months ago, so max(price) is 140.
2    20180101     nan #There's no price between 20170701 and 20180101
2    20180301     90
2    20180401     120
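The window rule behind the comment on the third row can be checked directly. This is a minimal sketch, assuming the dates are parsed to pandas timestamps and that both ends of the window are inclusive (as `between` is by default); the variable names are illustrative only.

```python
import pandas as pd

# Row of df1 being evaluated: id 1, date 20190301.
date = pd.Timestamp("2019-03-01")

# Calendar-aware offset: six months before 2019-03-01 is 2018-09-01.
window_start = date - pd.DateOffset(months=6)

# Dates of the df2 rows for id 1.
candidates = pd.to_datetime(["20180801", "20181001", "20190201"], format="%Y%m%d")

# Inclusive window check; the 20180801 row falls outside, so 150 is excluded.
in_window = (candidates >= window_start) & (candidates <= date)
print(window_start.date(), list(in_window))
```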

I achieved this with a function and an apply, but it took more than 30 minutes. The gist is:

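import numpy as np
import pandas as pd

def get_max_price(date, id, df2):
    # Window: the 6 months up to and including `date` (dates must be datetime64).
    min_date = date - pd.DateOffset(months=6)
    aux = df2.loc[(df2['id'] == id) &
                  (df2['date'].between(min_date, date))]

    # Highest price inside the window, or NaN if no row matches.
    if len(aux) > 0:
        return aux.price.max()
    else:
        return np.nan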

and then

df1.apply(lambda x: get_max_price(x['date'], x['id'], df2), axis=1)

Is there a better way? Maybe some vectorized operation or some kind of merge? Thanks!
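For illustration, one merge-based direction could look like the sketch below. It is not benchmarked and rests on several assumptions: the sample frames are rebuilt from the tables above, dates are parsed to datetime64, and the helper column names `row` and `date_price` are made up for this example. Note that the merge produces one row per (df1 row, df2 row with the same id) pair, so on 10 million rows it may need to be run in chunks of ids to fit in memory.

```python
import numpy as np
import pandas as pd

# Sample data copied from the tables in the question.
df1 = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime(
        ["20190101", "20190201", "20190301", "20180101", "20180301", "20180401"],
        format="%Y%m%d"),
})
df2 = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["20180801", "20181001", "20190201", "20180301", "20180401"],
        format="%Y%m%d"),
    "price": [150, 140, 100, 90, 120],
})

# Keep each df1 row's label so results can be mapped back after the merge.
left = df1.reset_index().rename(columns={"index": "row"})

# Pair every df1 row with all df2 rows sharing its id.
# df1's date stays "date"; df2's date becomes "date_price".
merged = left.merge(df2, on="id", how="left", suffixes=("", "_price"))

# Keep only pairs whose price date lies in [date - 6 months, date].
window_start = merged["date"] - pd.DateOffset(months=6)
in_window = merged["date_price"].between(window_start, merged["date"])

# Max price per original df1 row; rows with no match are absent here.
max_price = merged.loc[in_window].groupby("row")["price"].max()

# Reindexing restores the missing rows as NaN.
df1["price"] = max_price.reindex(df1.index).to_numpy()
print(df1)
```

On the sample data this yields 150, 150, 140, NaN, 90, 120, the same values the apply-based function returns row by row.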

How to query a data frame to get values from another data frame for each row

0 answers: