Adding rows for missing timestamps to a df in pandas

Time: 2020-07-30 09:18:13

Tags: python pandas datetime

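I have very unusual time series data that is both irregular and has several missing values. Data points are measured 3 times a day on weekdays only (at 10:00 AM, 2:00 PM, and 6:00 PM). Most days are missing one or two measurements, and some days are missing entirely.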

My df looks like this:

      date time            | value 
0     2020-07-30 10:00:00      5 
1     2020-07-30 14:00:00      3 
2     2020-07-31 10:00:00      6 
3     2020-07-31 14:00:00     4.5 
4     2020-07-31 18:00:00      7 
5     2020-08-03 14:00:00     5.5 
6     2020-08-04 14:00:00      5 

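I'm trying to figure out how to fill in the missing measurements by adding rows that carry the missing timestamp and an NA value, without adding any extra times of day or any Saturdays or Sundays, so that my df ends up looking like this: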

      date time            | value 
0     2020-07-30 10:00:00      5 
1     2020-07-30 14:00:00      3 
2     2020-07-30 18:00:00      NA  
3     2020-07-31 10:00:00      6 
4     2020-07-31 14:00:00     4.5  
5     2020-07-31 18:00:00      7 
6     2020-08-03 10:00:00      NA 
7     2020-08-03 14:00:00     5.5 
8     2020-08-03 18:00:00      NA
9     2020-08-04 10:00:00      NA  
10    2020-08-04 14:00:00      5 
11    2020-08-04 18:00:00      NA 

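The only thing I can think of is convoluted: write a loop that generates a row for every date in the desired date range, times 3 (one per measurement), formatted as datetime, plus an extra day-of-week counter. Convert that to a df, drop all rows where "day of week" is 6 or 7, then join this new df with my original df on the datetime column (outer or left, whichever one keeps everything).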
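A minimal sketch of that loop-and-join idea, using the sample data from the tables above. The names `skeleton` and `looped`, the helper column `day_of_week`, and the hard-coded date range are illustrative assumptions; the day-of-week counter is assumed to follow ISO numbering (Monday = 1, Sunday = 7).

```python
import datetime

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date time": pd.to_datetime([
        "2020-07-30 10:00:00", "2020-07-30 14:00:00", "2020-07-31 10:00:00",
        "2020-07-31 14:00:00", "2020-07-31 18:00:00", "2020-08-03 14:00:00",
        "2020-08-04 14:00:00",
    ]),
    "value": [5, 3, 6, 4.5, 7, 5.5, 5],
})

# Loop over every calendar day in the range and emit one row per measurement hour.
rows = []
day = datetime.date(2020, 7, 30)       # assumed start of the range
last_day = datetime.date(2020, 8, 4)   # assumed end of the range
while day <= last_day:
    for hour in (10, 14, 18):
        rows.append({
            "date time": datetime.datetime.combine(day, datetime.time(hour)),
            "day_of_week": day.isoweekday(),  # Monday=1 ... Sunday=7
        })
    day += datetime.timedelta(days=1)

skeleton = pd.DataFrame(rows)
# Drop Saturday (6) and Sunday (7), then discard the helper column.
skeleton = skeleton[~skeleton["day_of_week"].isin([6, 7])].drop(columns="day_of_week")

# A left join keeps every skeleton timestamp; slots with no measurement get NaN.
looped = skeleton.merge(df, on="date time", how="left")
print(looped)
```

It works, but it needs an explicit loop, a helper column, and a join for what is essentially a reindexing problem.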

Is there a more elegant way?

2 Answers:

Answer 0: (score: 1)

import datetime

import pandas as pd

df = pd.DataFrame([
    {"date time": datetime.datetime.strptime("2020-07-30 10:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5},
    {"date time": datetime.datetime.strptime("2020-07-30 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 3},
    {"date time": datetime.datetime.strptime("2020-07-31 10:00:00", '%Y-%m-%d %H:%M:%S'), "value": 6},
    {"date time": datetime.datetime.strptime("2020-07-31 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 4.5},
    {"date time": datetime.datetime.strptime("2020-07-31 18:00:00", '%Y-%m-%d %H:%M:%S'), "value": 7},
    # note: 2020-08-02 is a Sunday, so this row is dropped by the weekday-only reindex below
    {"date time": datetime.datetime.strptime("2020-08-02 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5.5},
    {"date time": datetime.datetime.strptime("2020-08-03 14:00:00", '%Y-%m-%d %H:%M:%S'), "value": 5},
])
# reindex aligns on the index, so the timestamps must be the index
df = df.set_index("date time")

# define your range of dates you're working with
range_dates = pd.date_range('2020-07-30', '2020-08-04', freq='D')
# remove weekend days (weekday 5 = Saturday, 6 = Sunday)
range_dates = range_dates[~range_dates.weekday.isin([5, 6])]
range_dates = pd.Series(range_dates)

# here we will create a range of your 3 hours of measurements
range_times = pd.date_range('10:00:00', '18:00:00', freq='4h')
range_times = pd.Series(range_times.time)

# we combine our two ranges: every date paired with every time of day
index = range_dates.apply(
    lambda date: range_times.apply(
        lambda time: datetime.datetime.combine(date, time)
    )
).unstack()

# we reindex the dataframe and sort it
df = df.reindex(index=index).sort_index()

Output:

                     value
2020-07-30 10:00:00    5.0
2020-07-30 14:00:00    3.0
2020-07-30 18:00:00    NaN
2020-07-31 10:00:00    6.0
2020-07-31 14:00:00    4.5
2020-07-31 18:00:00    7.0
2020-08-03 10:00:00    NaN
2020-08-03 14:00:00    5.0
2020-08-03 18:00:00    NaN
2020-08-04 10:00:00    NaN
2020-08-04 14:00:00    NaN
2020-08-04 18:00:00    NaN
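The same weekday-by-time index can probably be built without nested `apply` calls. A rough sketch, assuming that `pd.bdate_range` (Monday to Friday by default, holidays not excluded) matches the "weekdays only" requirement; `days`, `offsets`, and `full_index` are illustrative names.

```python
import pandas as pd

# Business days (Monday to Friday) in the range; holidays are NOT excluded.
days = pd.bdate_range("2020-07-30", "2020-08-04")
# The three measurement times, expressed as offsets from midnight.
offsets = pd.to_timedelta([10, 14, 18], unit="h")

# Pair every day with every offset; iterating days first keeps chronological order.
full_index = pd.DatetimeIndex(
    [day + offset for day in days for offset in offsets]
)
print(full_index)
```

With the timestamps set as the index, something like `df.reindex(full_index)` should then insert the missing slots as NaN.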

Answer 1: (score: 1)

You can create a filtered date range and reindex by it:

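# assumes df has a datetime64 column named 'datetime'; the range is widened to
# whole days so the first and last day also get all three time slots
start = df['datetime'].min().normalize()
end = df['datetime'].max().normalize() + pd.Timedelta(days=1)
all_ts = pd.date_range(start=start, end=end, freq='h')
weekday_ts = all_ts[~all_ts.weekday.isin([5,6])]
filtered_ts = weekday_ts[weekday_ts.hour.isin([10, 14, 18])]
df = df.set_index(df['datetime']).reindex(filtered_ts).drop('datetime', axis=1).reset_index()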
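A self-contained run of this idea on the question's sample data, as a sketch rather than a definitive implementation. It assumes the column is named `date time` as in the question, and it widens the range to whole days so that slots after the last observation (here 2020-08-04 18:00) are included; `result` is an illustrative name.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date time": pd.to_datetime([
        "2020-07-30 10:00:00", "2020-07-30 14:00:00", "2020-07-31 10:00:00",
        "2020-07-31 14:00:00", "2020-07-31 18:00:00", "2020-08-03 14:00:00",
        "2020-08-04 14:00:00",
    ]),
    "value": [5, 3, 6, 4.5, 7, 5.5, 5],
})

# Hourly grid from midnight of the first day to midnight after the last day.
start = df["date time"].min().normalize()
end = df["date time"].max().normalize() + pd.Timedelta(days=1)
all_ts = pd.date_range(start=start, end=end, freq="h")

# Keep Monday-Friday (weekday 0-4) and only the three measurement hours.
weekday_ts = all_ts[all_ts.weekday < 5]
filtered_ts = weekday_ts[weekday_ts.hour.isin([10, 14, 18])]

# Reindex on the filtered grid; slots with no measurement become NaN.
result = (
    df.set_index("date time")
      .reindex(filtered_ts)
      .rename_axis("date time")
      .reset_index()
)
print(result)
```

Building an hourly grid and filtering it is wasteful for long date ranges, but it keeps the logic to a few readable lines.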