Question

I have a pandas data frame (Python 3.5) with the date as index. electricity_use is the label I should predict. For example:

            City Country  electricity_use
DATE
7/1/2014       X       A             1.02
7/1/2014       Y       A             0.25
7/2/2014       X       A             1.21
7/2/2014       Y       A             0.27
7/3/2014       X       A             1.25
7/3/2014       Y       A             0.20
7/4/2014       X       A             0.97
7/4/2014       Y       A             0.43
7/5/2014       X       A             0.54
7/5/2014       Y       A             0.45
7/6/2014       X       A             1.33
7/6/2014       Y       A             0.55
7/7/2014       X       A             2.01
7/7/2014       Y       A             0.21
7/8/2014       X       A             1.11
7/8/2014       Y       A             0.34
7/9/2014       X       A             1.35
7/9/2014       Y       A             0.18
7/10/2014      X       A             1.22
7/10/2014      Y       A             0.27

Of course, the real data is larger.
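My goal is to add to each row the last 3 electricity_use values of its ('City', 'Country') group, with a gap of 5 days (i.e. take the last 3 values dated 5 or more days earlier). The dates may be non-consecutive, but they are ordered.
For example, for the last two rows the result should be:

            City Country  electricity_use  prev_1  prev_2  prev_3
DATE
7/10/2014      X       A             1.22    0.54    0.97    1.25
7/10/2014      Y       A             0.27    0.45    0.43    0.20

The date is 7/10/2014 and the gap is 5 days, so we start looking from 7/5/2014, and these are the last 3 values up to that date for each group (in this example the groups are (X,A) and (Y,A)).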

I implemented this with a loop over each group, but I have a feeling it can be done more efficiently.
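For reference, the loop-based baseline described here might look roughly like the sketch below. The asker's actual code is not shown, so the function name add_prev_loop and its parameters are hypothetical; the sample frame is rebuilt from the table in the question, and the cutoff date is treated as inclusive because the expected output includes the 7/5/2014 value.

```python
import numpy as np
import pandas as pd

# Sample frame from the question: groups (X, A) and (Y, A), one row per day.
dates = pd.date_range("2014-07-01", "2014-07-10", freq="D")
x_use = [1.02, 1.21, 1.25, 0.97, 0.54, 1.33, 2.01, 1.11, 1.35, 1.22]
y_use = [0.25, 0.27, 0.20, 0.43, 0.45, 0.55, 0.21, 0.34, 0.18, 0.27]
rows = []
for d, xv, yv in zip(dates, x_use, y_use):
    rows.append((d, "X", "A", xv))
    rows.append((d, "Y", "A", yv))
df = pd.DataFrame(
    rows, columns=["DATE", "City", "Country", "electricity_use"]
).set_index("DATE")


def add_prev_loop(frame, gap_days=5, n=3):
    """Hypothetical baseline: loop over every group and every row."""
    out = frame.reset_index()              # DATE becomes a column, index is 0..len-1
    prev = np.full((len(out), n), np.nan)  # NaN where fewer than n values exist
    for _, grp in out.groupby(["City", "Country"]):
        grp = grp.sort_values("DATE")
        for pos, date in zip(grp.index, grp["DATE"]):
            cutoff = date - pd.Timedelta(days=gap_days)
            # values of this group dated on or before the cutoff
            past = grp.loc[grp["DATE"] <= cutoff, "electricity_use"].to_numpy()
            last = past[::-1][:n]          # up to n values, newest first
            prev[pos, :len(last)] = last
    for k in range(n):
        out["prev_" + str(k + 1)] = prev[:, k]
    return out.set_index("DATE")


result = add_prev_loop(df)
print(result.tail(2))
```

This works with non-consecutive dates, but it filters the group once per row, which is why it gets slow on larger data.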

Answer 1

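A naive way to do this is to re-index the data frame and merge iteratively n times:

from datetime import timedelta
import pandas as pd

# make sure the index is in datetime format
df.index = pd.to_datetime(df.index)
df['date'] = df.index

for i in range(3):
    # copy of the values, with every date moved forward by 5+i days
    # so that it lines up with the row that should receive it
    prev = df[['City','Country','date','electricity_use']].copy()
    prev['date'] = prev['date'] + timedelta(days=5+i)
    prev = prev.rename(columns={'electricity_use': 'prev_'+str(i+1)})
    # exact-date match: a row only gets a value if one exists 5+i days earlier
    df = df.merge(prev,on=['City','Country','date'],how='left')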
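The exact-date merge above only finds a value when a row exists exactly 5+i days earlier. Since the question allows non-consecutive dates, one possible alternative is pandas.merge_asof, which matches each row to the most recent earlier row instead. The sketch below is not part of the original answer; the function name add_prev_asof and the helper columns cutoff and source_date are hypothetical.

```python
import pandas as pd

# Sample frame from the question: groups (X, A) and (Y, A), one row per day.
dates = pd.date_range("2014-07-01", "2014-07-10", freq="D")
x_use = [1.02, 1.21, 1.25, 0.97, 0.54, 1.33, 2.01, 1.11, 1.35, 1.22]
y_use = [0.25, 0.27, 0.20, 0.43, 0.45, 0.55, 0.21, 0.34, 0.18, 0.27]
rows = []
for d, xv, yv in zip(dates, x_use, y_use):
    rows.append((d, "X", "A", xv))
    rows.append((d, "Y", "A", yv))
df = pd.DataFrame(
    rows, columns=["DATE", "City", "Country", "electricity_use"]
).set_index("DATE")


def add_prev_asof(frame, gap_days=5, n=3):
    """Hypothetical helper: last n group values dated <= DATE - gap_days."""
    keys = ["City", "Country"]
    data = frame.reset_index().sort_values(keys + ["DATE"])

    # Right side: every row carries its own value plus the n-1 values
    # that precede it inside the same group.
    right = data[["DATE"] + keys].copy()
    right["prev_1"] = data["electricity_use"]
    for k in range(1, n):
        right["prev_" + str(k + 1)] = data.groupby(keys)["electricity_use"].shift(k)
    right = right.rename(columns={"DATE": "source_date"}).sort_values("source_date")

    # Left side: the date each row is allowed to look back to.
    left = data.assign(cutoff=data["DATE"] - pd.Timedelta(days=gap_days))
    left = left.sort_values("cutoff")

    # For each left row, take the latest right row of the same group whose
    # source_date is on or before the cutoff.
    merged = pd.merge_asof(
        left, right,
        left_on="cutoff", right_on="source_date",
        by=keys, direction="backward",
    )
    merged = merged.drop(columns=["cutoff", "source_date"])
    return merged.sort_values(["DATE"] + keys).set_index("DATE")


result = add_prev_asof(df)
print(result.tail(2))
```

Rows with no data old enough get NaN in the prev_ columns, and a missing day simply means the lookup reaches further back.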

A faster way is to use shift and then remove the bogus values:

import pandas as pd

df['date'] = pd.to_datetime(df.index)

df.sort_values(by=['City','Country','date'],inplace=True)

# pick the oldest date of every City + Country group
temp = df[['City','Country','date']].groupby(['City','Country']).first()

# merge returns a new frame, so the result has to be assigned back
df = df.merge(temp,left_on=['City','Country'],right_index=True,suffixes=('','_first'))

# number of days since the first date of the group
df['diff_date'] = (df['date'] - df['date_first']).dt.days

# shift by 5, 6 and 7 rows
# (shift moves by rows, so this assumes one row per day in each group)
for i in range(5,8):
    df['days_prior_'+str(i)] = df['electricity_use'].shift(i)
    # the first i values of every City + Country group are bogus,
    # because they were shifted in from the group before it
    df.loc[df['diff_date'] < i,'days_prior_'+str(i)] = 0
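A possible variant of the shift approach, not part of the original answer, is to shift inside each group with groupby(...).shift(i). Values then never cross from one group into the next, so no cleanup step is needed, and the first i rows of each group come out as NaN rather than 0. Like the plain shift, this minimal sketch assumes one row per day in every group, because it moves by rows, not by calendar days.

```python
import pandas as pd

# Sample frame from the question: groups (X, A) and (Y, A), one row per day.
dates = pd.date_range("2014-07-01", "2014-07-10", freq="D")
x_use = [1.02, 1.21, 1.25, 0.97, 0.54, 1.33, 2.01, 1.11, 1.35, 1.22]
y_use = [0.25, 0.27, 0.20, 0.43, 0.45, 0.55, 0.21, 0.34, 0.18, 0.27]
rows = []
for d, xv, yv in zip(dates, x_use, y_use):
    rows.append((d, "X", "A", xv))
    rows.append((d, "Y", "A", yv))
df = pd.DataFrame(
    rows, columns=["DATE", "City", "Country", "electricity_use"]
).set_index("DATE")

# Work on a unique integer index so column assignment aligns unambiguously.
flat = df.reset_index().sort_values(["City", "Country", "DATE"])

for i in range(5, 8):
    # Shift i rows back within each (City, Country) group only.
    flat["days_prior_" + str(i)] = (
        flat.groupby(["City", "Country"])["electricity_use"].shift(i)
    )

result = flat.sort_values(["DATE", "City"]).set_index("DATE")
print(result.tail(2))
```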

Pandas - get last n values from a group with an offset.
