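I have rather unusual time series data that is both irregular and has a number of missing values. Data points are measured only on weekdays, three times a day (10:00 AM, 2:00 PM and 6:00 PM). Most days are missing one or two measurements, and some days are missing entirely.
My df looks like this: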
date time | value
0 2020-07-30 10:00:00 5
1 2020-07-30 14:00:00 3
2 2020-07-31 10:00:00 6
3 2020-07-31 14:00:00 4.5
4 2020-07-31 18:00:00 7
5 2020-08-03 14:00:00 5.5
6 2020-08-04 14:00:00 5
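I am trying to figure out how to add rows for the missing measurements, each with the missing timestamp and an NA value, without adding any extra times of day or any Saturdays or Sundays, so that my df ends up looking like this: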
date time | value
0 2020-07-30 10:00:00 5
1 2020-07-30 14:00:00 3
2 2020-07-30 18:00:00 NA
3 2020-07-31 10:00:00 6
4 2020-07-31 14:00:00 4.5
5 2020-07-31 18:00:00 7
6 2020-08-03 10:00:00 NA
7 2020-08-03 14:00:00 5.5
8 2020-08-03 18:00:00 NA
9 2020-08-04 10:00:00 NA
10 2020-08-04 14:00:00 5
11 2020-08-04 18:00:00 NA
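The only approach I can think of is convoluted: write a loop that generates one row per measurement slot for every date in the desired range (3 rows per date), formatted as datetimes, along with an extra day-of-week counter. Convert that to a df, drop all rows where "day of week" is 6 or 7, then join this new df with my original df on the datetime column (outer or left join, whichever keeps all the rows).
Is there a more elegant way?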
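For reference, the generate-filter-join idea described in the question could be sketched roughly as follows. This is only an illustration: the sample data is copied from the question, the names `days`, `grid` and `full` are made up for this sketch, and pandas numbers weekdays from Monday=0, so Saturday and Sunday are 5 and 6 rather than 6 and 7.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date time": pd.to_datetime([
        "2020-07-30 10:00:00", "2020-07-30 14:00:00", "2020-07-31 10:00:00",
        "2020-07-31 14:00:00", "2020-07-31 18:00:00", "2020-08-03 14:00:00",
        "2020-08-04 14:00:00",
    ]),
    "value": [5, 3, 6, 4.5, 7, 5.5, 5],
})

# Every calendar day between the first and last observation.
days = pd.date_range(df["date time"].min().normalize(),
                     df["date time"].max().normalize(), freq="D")

# Three rows per day (10:00, 14:00, 18:00) plus a day-of-week column.
grid = pd.DataFrame({
    "date time": [day + pd.Timedelta(hours=h) for day in days for h in (10, 14, 18)]
})
grid["weekday"] = grid["date time"].dt.dayofweek  # Monday=0 ... Sunday=6

# Drop Saturday (5) and Sunday (6), then left-join the measurements.
grid = grid[grid["weekday"] < 5].drop(columns="weekday")
full = grid.merge(df, on="date time", how="left").reset_index(drop=True)
```

A left join from the generated grid keeps every expected slot and leaves NaN wherever no measurement exists. An outer join would additionally keep any measurement taken outside the three expected slots.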
Answer 0 (score: 1)
import datetime

import pandas as pd

df = pd.DataFrame([
    {"date time": datetime.datetime.strptime("2020-07-30 10:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5},
    {"date time": datetime.datetime.strptime("2020-07-30 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 3},
    {"date time": datetime.datetime.strptime("2020-07-31 10:00:00", '%Y-%m-%d %H:%M:%S'), "value": 6},
    {"date time": datetime.datetime.strptime("2020-07-31 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 4.5},
    {"date time": datetime.datetime.strptime("2020-07-31 18:00:00", '%Y-%m-%d %H:%M:%S'), "value": 7},
    # note: 2020-08-02 is a Sunday, so the weekend filter below drops this row
    {"date time": datetime.datetime.strptime("2020-08-02 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5.5},
    {"date time": datetime.datetime.strptime("2020-08-03 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5},
]
).set_index("date time")  # reindex below matches on the index, not on a column
# define your range of dates you're working with
range_dates = pd.date_range('2020-07-30', '2020-08-04', freq='D')
# remove weekend days (weekday 5 = Saturday, 6 = Sunday)
range_dates = range_dates[~range_dates.weekday.isin([5,6])]
range_dates = pd.Series(range_dates)
# here we will create a range of your 3 hours of measurements
# (lowercase 'h': recent pandas versions deprecate the uppercase 'H' alias)
range_times = pd.date_range('10:00:00', '18:00:00', freq='4h')
range_times = pd.Series(range_times.time)
# we combine our two ranges: one timestamp per (date, time) pair
index = range_dates.apply(
    lambda date: range_times.apply(
        lambda time: datetime.datetime.combine(date, time)
    )
).unstack()
# we reindex the dataframe and sort it
df = df.reindex(index=index).sort_index()
Output (weekend days are excluded by the weekday filter):
value
2020-07-30 10:00:00 5.0
2020-07-30 14:00:00 3.0
2020-07-30 18:00:00 NaN
2020-07-31 10:00:00 6.0
2020-07-31 14:00:00 4.5
2020-07-31 18:00:00 7.0
2020-08-03 10:00:00 NaN
2020-08-03 14:00:00 5.0
2020-08-03 18:00:00 NaN
2020-08-04 10:00:00 NaN
2020-08-04 14:00:00 NaN
2020-08-04 18:00:00 NaN
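As a possible alternative to the nested `apply` calls, the target index can also be built from business days and fixed time offsets. The sketch below is not the answerer's method. It assumes the question's sample data, uses `pd.bdate_range` (Monday to Friday, with no holiday handling), and the names `slots` and `result` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question, indexed by timestamp.
df = pd.DataFrame(
    {"value": [5, 3, 6, 4.5, 7, 5.5, 5]},
    index=pd.to_datetime([
        "2020-07-30 10:00:00", "2020-07-30 14:00:00", "2020-07-31 10:00:00",
        "2020-07-31 14:00:00", "2020-07-31 18:00:00", "2020-08-03 14:00:00",
        "2020-08-04 14:00:00",
    ]),
)

# Business days (Mon-Fri) in the range of interest; weekends never appear.
days = pd.bdate_range("2020-07-30", "2020-08-04")

# Offsets of the three daily measurement slots from midnight.
slots = pd.to_timedelta([10, 14, 18], unit="h")

# Cross every business day with every slot; the result is already sorted.
index = pd.DatetimeIndex([day + slot for day in days for slot in slots])

# Slots without a measurement become NaN.
result = df.reindex(index)
```

Because the weekends are never generated in the first place, no filtering step is needed afterwards.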
Answer 1 (score: 1)
You can create a filtered date range and reindex by it:
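# assumes df has a datetime column named 'date time', as in the question
# extend the range to whole days so the 18:00 slot on the last day is not cut off
start = df['date time'].min().normalize()
end = df['date time'].max().normalize() + pd.Timedelta(hours=23)
all_ts = pd.date_range(start=start, end=end, freq='h')
weekday_ts = all_ts[~all_ts.weekday.isin([5,6])]
filtered_ts = weekday_ts[weekday_ts.hour.isin([10, 14, 18])]
df.set_index('date time').reindex(filtered_ts).rename_axis('date time').reset_index()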
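To try the filtered-range idea end to end, it can be wrapped in a small helper and run against the question's sample data. This is a sketch under assumptions: the function name `fill_missing_slots` and its parameters are hypothetical, and the range is padded to whole days so that slots before the first or after the last observation are included.

```python
import pandas as pd

def fill_missing_slots(df, column="date time", hours=(10, 14, 18)):
    """Return df with one row per weekday measurement slot; gaps become NaN."""
    # Cover whole days from the first to the last observed date.
    start = df[column].min().normalize()
    end = df[column].max().normalize() + pd.Timedelta(hours=23)
    hourly = pd.date_range(start=start, end=end, freq="h")
    # Keep Monday-Friday (weekday 0-4) and only the expected hours.
    keep = (hourly.weekday < 5) & hourly.hour.isin(hours)
    return (
        df.set_index(column)
        .reindex(hourly[keep])
        .rename_axis(column)
        .reset_index()
    )

# Sample data copied from the question.
df = pd.DataFrame({
    "date time": pd.to_datetime([
        "2020-07-30 10:00:00", "2020-07-30 14:00:00", "2020-07-31 10:00:00",
        "2020-07-31 14:00:00", "2020-07-31 18:00:00", "2020-08-03 14:00:00",
        "2020-08-04 14:00:00",
    ]),
    "value": [5, 3, 6, 4.5, 7, 5.5, 5],
})

filled = fill_missing_slots(df)
```

Generating an hourly range and filtering it is simple but creates many timestamps that are immediately discarded. For long date ranges, building only the needed slots may be preferable.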