Add future dates for missing rows in a DataFrame

Time: 2021-02-17 10:25:38

Tags: python python-3.x pandas dataframe

How can I impute the missing dates in a DataFrame with the next consecutive dates?

wtg_at1.tail(10)
AmbientTemperatue Date
818 31.237499 2020-03-28
819 32.865974 2020-03-29
820 32.032558 2020-03-30
821 31.671166 NaN
822 31.389927 NaN
823 31.243660 NaN
824 31.206777 NaN
825 31.241503 NaN
826 31.309531 NaN
827 31.382531 NaN

I expect my output DataFrame to look like the one below. After March 30, the next date should be March 31.

AmbientTemperatue Date
818 31.237499 2020-03-28
819 32.865974 2020-03-29
820 32.032558 2020-03-30
821 31.671166 2020-03-31
822 31.389927 2020-04-01
823 31.243660 2020-04-02
824 31.206777 2020-04-03
825 31.241503 2020-04-04
826 31.309531 2020-04-05
827 31.382531 2020-04-06

I tried the code below, but it did not give the desired output.

wtg_at1.append(pd.DataFrame({'Date': pd.date_range(start=wtg_at1.Date.iloc[-8], periods=7, freq='D', closed='right')}))
wtg_at1
AmbientTemperatue Date
0 32.032558 2017-12-31
1 26.667757 2018-01-01
2 25.655754 2018-01-02
3 25.514013 2018-01-03
4 24.927652 2018-01-04
... ... ...
823 31.243660 NaN
824 31.206777 NaN
825 31.241503 NaN
826 31.309531 NaN
827 31.382531 NaN

1 Answer:

Answer 0 (score: 1)

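If there is only one group of missing values, you can forward-fill them and add a counter, built from the cumulative sum of the missing-value mask and converted to a timedelta in days: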

df['Date'] = pd.to_datetime(df['Date'])

df['Date'] = df['Date'].ffill() + pd.to_timedelta(df['Date'].isna().cumsum(), unit='d')
print (df)
     AmbientTemperatue       Date
818          31.237499 2020-03-28
819          32.865974 2020-03-29
820          32.032558 2020-03-30
821          31.671166 2020-03-31
822          31.389927 2020-04-01
823          31.243660 2020-04-02
824          31.206777 2020-04-03
825          31.241503 2020-04-04
826          31.309531 2020-04-05
827          31.382531 2020-04-06
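
To make the counter easier to follow, here is a minimal self-contained sketch of the same idea with the intermediate step kept in its own variable. The shortened sample frame and the names `offset` and `filled` are assumptions made for illustration; they are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the question: three known dates, then a trailing gap.
df = pd.DataFrame(
    {
        "AmbientTemperatue": [31.237499, 32.865974, 32.032558, 31.671166, 31.389927],
        "Date": ["2020-03-28", "2020-03-29", "2020-03-30", np.nan, np.nan],
    }
)
df["Date"] = pd.to_datetime(df["Date"])

# Running count of missing dates: 0 on rows before the gap, then 1, 2, ... inside it.
offset = df["Date"].isna().cumsum()

# Carry the last known date forward and add the per-row offset in days.
filled = df["Date"].ffill() + pd.to_timedelta(offset, unit="D")
print(filled)
```

Because the counter never resets, this only gives sensible results when all missing values sit in one contiguous block.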

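Another possible idea is to reassign the whole column from the minimum datetime and the length of the DataFrame. This assumes the existing dates are consecutive daily values starting at the first row: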

df['Date'] = pd.date_range(df['Date'].min(), periods=len(df))
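
Since `pd.date_range` overwrites every row, not only the missing ones, it may be worth checking that the rebuilt column still agrees with the dates that were already known. The sketch below shows one way to do that; the small sample frame and the names `original`, `known` and `matches_known` are hypothetical.

```python
import pandas as pd

# Hypothetical sample: consecutive daily dates followed by a trailing gap.
df = pd.DataFrame(
    {"Date": pd.to_datetime(["2020-03-28", "2020-03-29", "2020-03-30", None, None])},
    index=range(818, 823),
)

original = df["Date"].copy()

# Rebuild the whole column as daily dates starting at the earliest known date.
df["Date"] = pd.date_range(original.min(), periods=len(df))

# Sanity check: rows that already had a date must be unchanged.
known = original.notna()
matches_known = bool((df.loc[known, "Date"] == original[known]).all())
print(matches_known)
```

If `matches_known` comes out `False`, the original dates were not one per day from the first row, and the forward-fill approach is the safer choice.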

If there are multiple groups of missing values:

print (df)
     AmbientTemperatue        Date
818          31.237499  2020-03-28
819          32.865974  2020-03-29
820          32.032558  2020-03-30
821          31.671166         NaN
822          31.389927         NaN
823          31.243660         NaN
824          31.206777  2020-05-08
825          31.241503         NaN
826          31.309531         NaN
827          31.382531         NaN

df['Date'] = pd.to_datetime(df['Date'])

m = df['Date'].notna()
s = (~m).groupby(m.cumsum()).cumsum()
df['Date'] = df['Date'].ffill() + pd.to_timedelta(s, unit='d')
print (df)
     AmbientTemperatue       Date
818          31.237499 2020-03-28
819          32.865974 2020-03-29
820          32.032558 2020-03-30
821          31.671166 2020-03-31
822          31.389927 2020-04-01
823          31.243660 2020-04-02
824          31.206777 2020-05-08
825          31.241503 2020-05-09
826          31.309531 2020-05-10
827          31.382531 2020-05-11
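
The grouped counter can be hard to read in one line, so here is a small sketch that exposes each intermediate Series. The five-row sample and the names `group` and `filled` are assumptions for illustration, and the explicit `astype(int)` is an addition that only makes the numeric cumulative sum explicit.

```python
import pandas as pd

# Hypothetical sample with two separate gaps.
df = pd.DataFrame(
    {"Date": pd.to_datetime(["2020-03-30", None, None, "2020-05-08", None])}
)

# True where a date is present.
m = df["Date"].notna()

# Every known date starts a new group; the missing rows after it share its label.
group = m.cumsum()

# Count missing rows within each group, so the counter restarts after every known date.
s = (~m).astype(int).groupby(group).cumsum()

# Last known date in each group plus the per-group offset in days.
filled = df["Date"].ffill() + pd.to_timedelta(s, unit="D")
print(filled)
```

Missing values that appear before the first known date stay `NaT`, because the forward fill has nothing to carry into them.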