I have a dataframe my_df, generated by the following code:
import pandas as pd

my_df = pd.DataFrame(data={
    'city': ['city_a', 'city_a', 'city_b'],
    'datetime': [pd.to_datetime('2020/07/10'), pd.to_datetime('2020/07/11'), pd.to_datetime('2020/07/11')],
    'value': [2, 5, 4]
})
I am trying to resample the datetime column at a 6h frequency, so that every day has rows at 00h, 06h, 12h and 18h.
The following code almost gives me the expected output:
my_df = my_df.set_index(['datetime', 'city'])
my_df = my_df.unstack(-1).resample('6H').ffill()  # ffill() is the current name of pad()
my_df = my_df.stack().reset_index()
my_df = my_df[['city', 'datetime', 'value']]
my_df = my_df.sort_values(['city', 'datetime'])
Output:
city datetime value
0 city_a 2020-07-10 00:00:00 2.0
1 city_a 2020-07-10 06:00:00 2.0
2 city_a 2020-07-10 12:00:00 2.0
3 city_a 2020-07-10 18:00:00 2.0
4 city_a 2020-07-11 00:00:00 5.0
5 city_b 2020-07-11 00:00:00 4.0
However, we can see that the day 2020-07-11 is not complete. I would like the rows for 2020-07-11 06:00:00, 12:00:00 and 18:00:00 to appear in the output as well.
So my expected output should be:
city datetime value
0 city_a 2020-07-10 00:00:00 2.0
1 city_a 2020-07-10 06:00:00 2.0
2 city_a 2020-07-10 12:00:00 2.0
3 city_a 2020-07-10 18:00:00 2.0
4 city_a 2020-07-11 00:00:00 5.0
6 city_a 2020-07-11 06:00:00 5.0
8 city_a 2020-07-11 12:00:00 5.0
10 city_a 2020-07-11 18:00:00 5.0
5 city_b 2020-07-11 00:00:00 4.0
7 city_b 2020-07-11 06:00:00 4.0
9 city_b 2020-07-11 12:00:00 4.0
11 city_b 2020-07-11 18:00:00 4.0
Is there an elegant way to do this with pandas?
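For context, the cut-off last day can be reproduced on a single series. The sketch below is only an illustration, using the two city_a dates from the question and the lowercase '6h' alias: upsampling creates slots between the first and the last timestamp and nothing beyond, so the last day contributes only its 00:00 row.

```python
import pandas as pd

# One value per day, using the city_a dates and values from the question.
s = pd.Series([2, 5], index=pd.to_datetime(["2020-07-10", "2020-07-11"]))

# Upsampling only spans first timestamp .. last timestamp, so nothing is
# generated after 2020-07-11 00:00.
upsampled = s.resample("6h").ffill()
print(upsampled)
```

This is why both answers add a timestamp one day after the last date before resampling.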
Answer 0 (score: 5)
Use:
# STEP A
df1 = (df.groupby('city')['datetime'].max() + pd.Timedelta(days=1)).reset_index()
# STEP B
df1 = pd.concat([df, df1]).set_index('datetime')
# STEP C
df1 = df1.groupby('city', as_index=False).resample('6H').ffill()
# STEP D
df1 = df1.reset_index().drop('level_0', axis=1).dropna(subset=['value'])
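Details:
STEP A: Use DataFrame.groupby to group the dataframe on city and find the maximum date in each group, then add 1 day to each maximum. This extra timestamp is needed so that resampling covers the whole last day.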
# print(df1)
city datetime
0 city_a 2020-07-12
1 city_b 2020-07-12
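STEP B: Use pd.concat to concatenate the original dataframe df with the newly created dataframe df1, and set datetime as the index. This is required because we have to resample the dataframe in STEP C.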
# print(df1)
city value
datetime
2020-07-10 city_a 2.0
2020-07-11 city_a 5.0
2020-07-11 city_b 4.0
2020-07-12 city_a NaN
2020-07-12 city_b NaN
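Steps A and B can be run on their own as in the sketch below. The names sentinels and extended are illustrative only (the answer reuses df1 for both). The sketch also shows a side effect worth knowing: the appended rows carry no value, so the column is upcast from int to float, which is why the outputs show 2.0 rather than 2.

```python
import pandas as pd

# Sample frame from the question, named df as in this answer.
df = pd.DataFrame({
    "city": ["city_a", "city_a", "city_b"],
    "datetime": pd.to_datetime(["2020-07-10", "2020-07-11", "2020-07-11"]),
    "value": [2, 5, 4],
})

# STEP A: one extra timestamp per city, one day after that city's last date.
sentinels = (df.groupby("city")["datetime"].max() + pd.Timedelta(days=1)).reset_index()

# STEP B: append the extra rows; they have no 'value', so it becomes NaN there
# and the column dtype changes from int64 to float64.
extended = pd.concat([df, sentinels]).set_index("datetime")
print(extended)
print(extended["value"].dtype)
```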
STEP C: Use DataFrame.resample to resample the dataframe, grouped on city, at a frequency of 6H, and use ffill to forward-fill the values.
# print(df1)
city value
datetime
0 2020-07-10 00:00:00 city_a 2.0
2020-07-10 06:00:00 city_a 2.0
2020-07-10 12:00:00 city_a 2.0
2020-07-10 18:00:00 city_a 2.0
2020-07-11 00:00:00 city_a 5.0
2020-07-11 06:00:00 city_a 5.0
2020-07-11 12:00:00 city_a 5.0
2020-07-11 18:00:00 city_a 5.0
2020-07-12 00:00:00 city_a NaN
1 2020-07-11 00:00:00 city_b 4.0
2020-07-11 06:00:00 city_b 4.0
2020-07-11 12:00:00 city_b 4.0
2020-07-11 18:00:00 city_b 4.0
2020-07-12 00:00:00 city_b NaN
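The 2020-07-12 rows stay NaN in the printout above even after forward-filling. A minimal single-series sketch of that behaviour, assuming the same values as city_b and the lowercase '6h' alias: the resampler's ffill fills the newly created time slots from the closest earlier row, but it does not overwrite a NaN that already sits on an existing timestamp.

```python
import pandas as pd

# city_b after STEP B: one real value plus the NaN row added one day later.
s = pd.Series(
    [4.0, float("nan")],
    index=pd.to_datetime(["2020-07-11", "2020-07-12"]),
)

# The new 06:00, 12:00 and 18:00 slots are filled from the 00:00 row;
# the pre-existing NaN at 2020-07-12 00:00 is left untouched.
up = s.resample("6h").ffill()
print(up)
```

Because only the added rows remain NaN, dropping NaN rows in STEP D removes exactly those rows, provided the original data had no missing values.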
STEP D: Finally, use DataFrame.reset_index, then DataFrame.drop with axis=1 to remove the unused column level_0, and DataFrame.dropna to drop the rows that have NaN in the value column.
# print(df1)
datetime city value
0 2020-07-10 00:00:00 city_a 2.0
1 2020-07-10 06:00:00 city_a 2.0
2 2020-07-10 12:00:00 city_a 2.0
3 2020-07-10 18:00:00 city_a 2.0
4 2020-07-11 00:00:00 city_a 5.0
5 2020-07-11 06:00:00 city_a 5.0
6 2020-07-11 12:00:00 city_a 5.0
7 2020-07-11 18:00:00 city_a 5.0
9 2020-07-11 00:00:00 city_b 4.0
10 2020-07-11 06:00:00 city_b 4.0
11 2020-07-11 12:00:00 city_b 4.0
12 2020-07-11 18:00:00 city_b 4.0
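The four steps can also be wrapped in one function. The sketch below is a hedged variant, not the answer's exact code: the name complete_days is hypothetical, and STEP C is written as an explicit loop over cities instead of groupby(...).resample(...), on the assumption that the two are equivalent for this data. It also assumes the input has no missing values, so that the only NaN rows are the added ones.

```python
import pandas as pd


def complete_days(df, freq="6h"):
    """Upsample each city to `freq` so that its last day is fully covered.

    Hypothetical helper; assumes columns 'city', 'datetime', 'value'
    and no NaN in 'value'.
    """
    # STEP A: one extra timestamp per city, one day after its last date.
    sentinels = (df.groupby("city")["datetime"].max() + pd.Timedelta(days=1)).reset_index()
    # STEP B: append the extra rows (value is NaN) and index by time.
    extended = pd.concat([df, sentinels]).set_index("datetime").sort_index()
    # STEP C: upsample every city separately and forward-fill the new slots.
    pieces = []
    for city, group in extended.groupby("city"):
        piece = group[["value"]].resample(freq).ffill()
        piece.insert(0, "city", city)
        pieces.append(piece)
    # STEP D: drop the extra rows, which are the only ones still NaN.
    out = pd.concat(pieces).reset_index().dropna(subset=["value"])
    return out[["city", "datetime", "value"]].reset_index(drop=True)


df = pd.DataFrame({
    "city": ["city_a", "city_a", "city_b"],
    "datetime": pd.to_datetime(["2020-07-10", "2020-07-11", "2020-07-11"]),
    "value": [2, 5, 4],
})
result = complete_days(df)
print(result)
```

Adding the extra timestamp per city, rather than once for the whole frame, means each city is extended by one day from its own last date.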
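Answer 1 (score: 4)
The only way I see is to add an empty row whose datetime equals the latest existing datetime plus one day. Then you can do almost exactly the same thing (pivot is a convenient replacement for set_index followed by unstack).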
# adding a row where datetime corresponds to the max datetime + 1 day
df.loc[len(df), 'datetime'] = df.datetime.max() + pd.Timedelta(days=1)
# pivot to replace set_index & unstack
df = (df.pivot(index='datetime', columns='city')
        .resample('6H')
        .ffill(limit=3)  # written as pad(3) in older pandas
        .stack()
        .reset_index()
        .sort_values(['city', 'datetime']))
df[['city', 'datetime', 'value']]
city datetime value
0 city_a 2020-07-10 00:00:00 2.0
1 city_a 2020-07-10 06:00:00 2.0
2 city_a 2020-07-10 12:00:00 2.0
3 city_a 2020-07-10 18:00:00 2.0
4 city_a 2020-07-11 00:00:00 5.0
6 city_a 2020-07-11 06:00:00 5.0
8 city_a 2020-07-11 12:00:00 5.0
10 city_a 2020-07-11 18:00:00 5.0
5 city_b 2020-07-11 00:00:00 4.0
7 city_b 2020-07-11 06:00:00 4.0
9 city_b 2020-07-11 12:00:00 4.0
11 city_b 2020-07-11 18:00:00 4.0
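A variant of the same idea is sketched below, with two swapped-in choices that are not in the answer: the extra timestamp is added after pivoting (via reindex) rather than as a row with an empty city, and the wide table is turned back into long form with melt plus dropna instead of stack. Both are assumptions made so the example does not depend on how a given pandas version treats NaN column labels or NaN rows in stack.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["city_a", "city_a", "city_b"],
    "datetime": pd.to_datetime(["2020-07-10", "2020-07-11", "2020-07-11"]),
    "value": [2, 5, 4],
})

# Wide layout: one row per timestamp, one column per city.
wide = df.pivot(index="datetime", columns="city", values="value")

# Add an all-NaN row one day after the latest timestamp.
sentinel = wide.index.max() + pd.Timedelta(days=1)
wide = wide.reindex(wide.index.union([sentinel])).rename_axis(index="datetime")

# Forward-fill at most 3 slots (06:00, 12:00, 18:00) after each daily row.
filled = wide.resample("6h").ffill(limit=3)

# Back to long form; NaN cells (added row, days a city has no data) are dropped.
long = (filled.reset_index()
              .melt(id_vars="datetime", var_name="city", value_name="value")
              .dropna(subset=["value"])
              .sort_values(["city", "datetime"])
              .reset_index(drop=True))
print(long[["city", "datetime", "value"]])
```

With limit=3 each daily value covers exactly its own day, so a gap of more than one day between observations would stay empty rather than being filled.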