I have a DataFrame structured like this:
df_all:
day_time LCLid energy(kWh/hh)
2014-02-08 23:00:00 MAC000006 0.077
2014-02-08 23:30:00 MAC000006 0.079
...
2014-02-08 23:00:00 MAC000007 0.045
...
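Four consecutive datetimes are missing from the data (for every LCLid), and I want to fill them with the preceding and following values.
If I split the DataFrame into sub-DataFrames (df), one per LCLid, for example: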
gb = df_all.groupby('LCLid')
df_list = [gb.get_group(x) for x in gb.groups]
then I can do this for each df in df_list:
#valid data before gap
prev_row = df.loc['2013-09-09 22:30:00'].copy()
#valid data after gap
post_row = df.loc['2013-09-10 01:00:00'].copy()
df.loc[pd.to_datetime('2013-09-09 23:00:00')] = prev_row
df.loc[pd.to_datetime('2013-09-09 23:30:00')] = prev_row
df.loc[pd.to_datetime('2013-09-10 00:00:00')] = post_row
df.loc[pd.to_datetime('2013-09-10 00:30:00')] = post_row
df = df.sort_index()
How can I do this on df_all so that the missing data is filled only with "valid" data from the same LCLid?
Answer 0 (score: 1)
Input DataFrame:
LCLid energy(kWh/hh)
day_time
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
What you need to do:
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30T')
df = (
df
.groupby('LCLid', as_index=False)
.apply(lambda group: group.reindex(full_idx, method='nearest'))
.reset_index(level=0, drop=True)
.sort_index()
)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000007 0.276678
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000007 0.276678
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:00:00 MAC000007 0.027490
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000007 0.027490
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
import numpy as np
import pandas as pd
# Building an example DataFrame that looks like yours
df = pd.DataFrame({
'day_time': [
pd.Timestamp(2014, 1, 1, 0, 0),
pd.Timestamp(2014, 1, 1, 0, 0),
pd.Timestamp(2014, 1, 1, 0, 30),
pd.Timestamp(2014, 1, 1, 0, 30),
pd.Timestamp(2014, 1, 1, 3, 0),
pd.Timestamp(2014, 1, 1, 3, 0),
pd.Timestamp(2014, 1, 1, 3, 30),
pd.Timestamp(2014, 1, 1, 3, 30),
],
'LCLid': [
'MAC000006',
'MAC000007',
'MAC000006',
'MAC000007',
'MAC000006',
'MAC000007',
'MAC000006',
'MAC000007',
],
'energy(kWh/hh)': np.random.rand(8)
},
).set_index('day_time')
Result:
LCLid energy(kWh/hh)
day_time
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
Notice that we are missing the following timestamps:
2014-01-01 01:00:00
2014-01-01 01:30:00
2014-01-01 02:00:00
2014-01-01 02:30:00
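One way to list such gaps programmatically is to compare the observed index against the expected half-hourly grid. The following is only a sketch on a toy index that mirrors the example; the names observed, expected and missing are illustrative, and '30min' is the same offset as the '30T' alias used in this answer (newer pandas versions deprecate the 'T' spelling).

```python
import pandas as pd

# Toy index mirroring the example: 00:00, 00:30, 03:00, 03:30
observed = pd.DatetimeIndex([
    "2014-01-01 00:00", "2014-01-01 00:30",
    "2014-01-01 03:00", "2014-01-01 03:30",
])

# Every half-hour slot between the first and last observation
expected = pd.date_range(observed.min(), observed.max(), freq="30min")

# Slots that are expected but never observed
missing = expected.difference(observed)
print(missing)
```

On the real data you would pass df.index instead of the toy index.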
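The first thing to understand is that df.reindex() lets you fill in missing index values, inserting NaN by default for the rows that were missing. In your case, you want to supply the full index of timestamps, including the ones that do not appear in the starting DataFrame.
Here I used pd.date_range() to list every timestamp between your minimum and maximum starting index values, at 30-minute intervals. Warning: doing it this way means that if your missing timestamps are at the very beginning or end, they will not be added back! So you may want to specify start and end explicitly.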
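If the expected bounds are known ahead of time, the index can be built from them rather than from the data. A minimal sketch, assuming (purely for illustration) that one full day of half-hourly readings is expected:

```python
import pandas as pd

# Assumed bounds, not taken from the data: one full day of half-hourly slots
start = pd.Timestamp("2014-01-01 00:00")
end = pd.Timestamp("2014-01-01 23:30")

# Both endpoints are included by default
full_idx = pd.date_range(start=start, end=end, freq="30min")
print(len(full_idx))  # 48 half-hour slots in a day
```

Keep in mind that with method='nearest', slots before the first or after the last observation are filled from that first or last row, however far away it is. DataFrame.reindex accepts a tolerance argument if you want to cap that distance.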
full_idx = pd.date_range(start=df.index.min(), end=df.index.max(), freq='30T')
Result:
DatetimeIndex(['2014-01-01 00:00:00', '2014-01-01 00:30:00',
'2014-01-01 01:00:00', '2014-01-01 01:30:00',
'2014-01-01 02:00:00', '2014-01-01 02:30:00',
'2014-01-01 03:00:00', '2014-01-01 03:30:00'],
dtype='datetime64[ns]', freq='30T')
Now, if we use it to reindex one of your grouped sub-DataFrames, we get the following:
grouped_df = df[df.LCLid == 'MAC000006']
grouped_df.reindex(full_idx)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 01:00:00 NaN NaN
2014-01-01 01:30:00 NaN NaN
2014-01-01 02:00:00 NaN NaN
2014-01-01 02:30:00 NaN NaN
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:30:00 MAC000006 0.688879
You said you want to fill the missing values using the closest available surrounding values. This can be done during reindexing, as follows:
grouped_df.reindex(full_idx, method='nearest')
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:30:00 MAC000006 0.688879
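For a gap of exactly four slots, 'nearest' gives the split the question asks for: the first two slots take the previous value and the last two take the next one. If you would rather state that rule explicitly, a forward fill followed by a backward fill, each limited to two slots, should produce the same thing. This is a sketch on a made-up Series, not on your data:

```python
import pandas as pd

# Made-up readings with a four-slot gap between 00:30 and 03:00
s = pd.Series(
    [0.1, 0.2, 0.7, 0.8],
    index=pd.DatetimeIndex([
        "2014-01-01 00:00", "2014-01-01 00:30",
        "2014-01-01 03:00", "2014-01-01 03:30",
    ]),
)
full_idx = pd.date_range(s.index.min(), s.index.max(), freq="30min")

# Explicit rule: up to two slots from the previous value, then up to two from the next
filled = s.reindex(full_idx).ffill(limit=2).bfill(limit=2)

# Same gap filled by distance to the closest observed timestamp
nearest = s.reindex(full_idx, method="nearest")

print(filled.equals(nearest))
```

The explicit version leaves NaN in the middle of any gap longer than four slots, which may be preferable to silently stretching a value across a long outage.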
Now we want to apply this transformation to every group in the DataFrame, where a group is defined by its LCLid.
(
df
.groupby('LCLid', as_index=False) # use LCLid as groupby key, but don't add it as a group index
.apply(lambda group: group.reindex(full_idx, method='nearest')) # do this for each group
.reset_index(level=0, drop=True) # get rid of the automatic index generated during groupby
.sort_index() # This is optional, just in case you want timestamps in chronological order
)
Result:
LCLid energy(kWh/hh)
2014-01-01 00:00:00 MAC000006 0.270453
2014-01-01 00:00:00 MAC000007 0.170603
2014-01-01 00:30:00 MAC000006 0.716418
2014-01-01 00:30:00 MAC000007 0.276678
2014-01-01 01:00:00 MAC000006 0.716418
2014-01-01 01:00:00 MAC000007 0.276678
2014-01-01 01:30:00 MAC000006 0.716418
2014-01-01 01:30:00 MAC000007 0.276678
2014-01-01 02:00:00 MAC000006 0.819146
2014-01-01 02:00:00 MAC000007 0.027490
2014-01-01 02:30:00 MAC000006 0.819146
2014-01-01 02:30:00 MAC000007 0.027490
2014-01-01 03:00:00 MAC000006 0.819146
2014-01-01 03:00:00 MAC000007 0.027490
2014-01-01 03:30:00 MAC000006 0.688879
2014-01-01 03:30:00 MAC000007 0.868017
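The same per-group fill can also be written as an explicit loop, which avoids relying on how groupby(...).apply() treats the grouping column (that behaviour has changed between pandas versions). This is a sketch with made-up energy values; the names pieces and result are illustrative:

```python
import pandas as pd

times = pd.to_datetime([
    "2014-01-01 00:00", "2014-01-01 00:30",
    "2014-01-01 03:00", "2014-01-01 03:30",
])
# Two meters, each observed at the same four timestamps (values are made up)
df = pd.DataFrame({
    "day_time": times.repeat(2),
    "LCLid": ["MAC000006", "MAC000007"] * 4,
    "energy(kWh/hh)": [0.1, 0.5, 0.2, 0.6, 0.7, 0.3, 0.8, 0.4],
}).set_index("day_time")

full_idx = pd.date_range(df.index.min(), df.index.max(), freq="30min")

pieces = []
for lclid, group in df.groupby("LCLid"):
    # Each group only ever sees its own rows, so fills never cross meters
    pieces.append(group.reindex(full_idx, method="nearest"))

result = pd.concat(pieces).sort_index()
print(result)
```

Each meter needs a unique, sorted time index within its group for the 'nearest' fill to work.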
Relevant documentation:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.apply.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_index.html