Add missing dates to a dataframe that has multiple rows per date

Time: 2021-05-28 16:09:59

Tags: python pandas dataframe numpy

Rewritten question:

I am extracting data from several Excel files that track my daily calorie intake. I managed to generate the dates using a list comprehension.

My input:


         Date       Time  calories  duration
0   22/5/2021    Morning       420        50
1   22/5/2021  Afternoon       380        40
2   24/5/2021    Morning       390        45
3   24/5/2021  Afternoon       400        50
4   26/5/2021    Morning       350        45
5   26/5/2021  Afternoon       280        50
6   27/5/2021    Morning       300        44
7   27/5/2021  Afternoon       430        58

The output should look like this:

         Date       Time calories duration
0   22/5/2021    Morning      420       50
1   22/5/2021  Afternoon      380       40
2   23/5/2021    Morning      NaN      NaN
3   23/5/2021  Afternoon      NaN      NaN
4   24/5/2021    Morning      390       45
5   24/5/2021  Afternoon      400       50
6   25/5/2021    Morning      NaN      NaN
7   25/5/2021  Afternoon      NaN      NaN
8   26/5/2021    Morning      350       45
9   26/5/2021  Afternoon      280       50
10  27/5/2021    Morning      300       44
11  27/5/2021  Afternoon      430       58

2 answers:

Answer 0 (score: 2)

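Build two DatetimeIndex objects: one spanning the first to the last date of the original dataframe (the full index), and one built from the existing Date / Time columns (the sparse index). You can then join the two dataframes and keep the data from the calories and duration columns.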

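import pandas as pd

# assumes Date holds the question's day/month/year strings; parse them first,
# otherwise min()/max() compare text and the DateOffset addition fails
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# full index from first and last dates, two slots per day (00:00 and 12:00)
dti = pd.date_range(df["Date"].min(),
                    df["Date"].max() + pd.DateOffset(hours=12),
                    freq="12h")  # the uppercase "12H" alias is deprecated in pandas 2.2+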

# new dataframe with the full index
df1 = pd.DataFrame({"Date": dti.date,
                    "Time": dti.map(lambda dt: "Afternoon" if dt.hour == 12 else "Morning")},
                    index=dti)

# set index from existing Date / Time columns
df2 = df.set_index(pd.to_datetime(df["Date"].astype(str) + " " + df["Time"] 
                     .replace({"Morning": "00:00:00", "Afternoon": "12:00:00"})))

# merge dataframes and keep data
out = df1.join(df2[["calories", "duration"]]).reset_index(drop=True)
>>> out
          Date       Time  calories  duration
0   2021-05-22    Morning     420.0      50.0
1   2021-05-22  Afternoon     380.0      40.0
2   2021-05-23    Morning       NaN       NaN
3   2021-05-23  Afternoon       NaN       NaN
4   2021-05-24    Morning     390.0      45.0
5   2021-05-24  Afternoon     400.0      50.0
6   2021-05-25    Morning       NaN       NaN
7   2021-05-25  Afternoon       NaN       NaN
8   2021-05-26    Morning     350.0      45.0
9   2021-05-26  Afternoon     280.0      50.0
10  2021-05-27    Morning     300.0      44.0
11  2021-05-27  Afternoon     430.0      58.0
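
For comparison, the same gap-filling can be sketched without mapping Morning / Afternoon to clock times. This is not the answerer's method but a swapped-in alternative: it builds the full set of (Date, Time) pairs with `pd.MultiIndex.from_product` and reindexes against it. The sample frame and the names `all_days` and `full_index` are assumptions made for illustration.

```python
import pandas as pd

# Sample data in the question's layout (dates are day/month/year strings).
df = pd.DataFrame({
    "Date": ["22/5/2021", "22/5/2021", "24/5/2021", "24/5/2021",
             "26/5/2021", "26/5/2021", "27/5/2021", "27/5/2021"],
    "Time": ["Morning", "Afternoon"] * 4,
    "calories": [420, 380, 390, 400, 350, 280, 300, 430],
    "duration": [50, 40, 45, 50, 45, 50, 44, 58],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Every calendar day between the first and last date, crossed with both slots.
all_days = pd.date_range(df["Date"].min(), df["Date"].max(), freq="D")
full_index = pd.MultiIndex.from_product(
    [all_days, ["Morning", "Afternoon"]], names=["Date", "Time"]
)

# Reindexing inserts NaN rows for the (Date, Time) pairs that have no data.
out = df.set_index(["Date", "Time"]).reindex(full_index).reset_index()
print(out)
```

Because the row order comes from `from_product`, Morning precedes Afternoon for each day without any extra sorting. As with the join above, calories and duration become floats once NaN values appear.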

Answer 1 (score: 2)

A solution using the .unstack() and .stack() methods:

Code to create the sample dataframe:

import numpy as np
import pandas as pd

from io import StringIO
data = StringIO("""
     Date       Time  calories  duration
22/5/2021    Morning       420        50
22/5/2021  Afternoon       380        40
24/5/2021    Morning       390        45
24/5/2021  Afternoon       400        50
26/5/2021    Morning       350        45
26/5/2021  Afternoon       280        50
27/5/2021    Morning       300        44
27/5/2021  Afternoon       430        58
""")

# raw string avoids an invalid-escape warning for \s on recent Python versions
df = pd.read_table(data, sep=r'\s+')
df
#       Date        Time         calories   duration
# 0     22/5/2021   Morning      420        50
# 1     22/5/2021   Afternoon    380        40
# 2     24/5/2021   Morning      390        45
# 3     24/5/2021   Afternoon    400        50
# 4     26/5/2021   Morning      350        45
# 5     26/5/2021   Afternoon    280        50
# 6     27/5/2021   Morning      300        44
# 7     27/5/2021   Afternoon    430        58

Solution:

# convert date column to datetime
df['Date'] = pd.to_datetime(df.Date, format="%d/%m/%Y")


(df
    .set_index(['Date', 'Time'])
    .unstack(fill_value=np.nan)
    .asfreq('D', fill_value=np.nan)
    .stack(dropna=False)
    .sort_index(ascending=[True, False])
    .reset_index()
)
#       Date        Time        calories    duration
# 0     2021-05-22  Morning     420.0       50.0
# 1     2021-05-22  Afternoon   380.0       40.0
# 2     2021-05-23  Morning     NaN         NaN
# 3     2021-05-23  Afternoon   NaN         NaN
# 4     2021-05-24  Morning     390.0       45.0
# 5     2021-05-24  Afternoon   400.0       50.0
# 6     2021-05-25  Morning     NaN         NaN
# 7     2021-05-25  Afternoon   NaN         NaN
# 8     2021-05-26  Morning     350.0       45.0
# 9     2021-05-26  Afternoon   280.0       50.0
# 10    2021-05-27  Morning     300.0       44.0
# 11    2021-05-27  Afternoon   430.0       58.0
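
To see why the chain above works, it may help to look at the intermediate frames. The sketch below rebuilds the sample data and stops before the `.stack()` step; the names `wide` and `wide_daily` are assumptions made for illustration and do not appear in the answer.

```python
import pandas as pd

# Sample data in the question's layout (dates are day/month/year strings).
df = pd.DataFrame({
    "Date": ["22/5/2021", "22/5/2021", "24/5/2021", "24/5/2021",
             "26/5/2021", "26/5/2021", "27/5/2021", "27/5/2021"],
    "Time": ["Morning", "Afternoon"] * 4,
    "calories": [420, 380, 390, 400, 350, 280, 300, 430],
    "duration": [50, 40, 45, 50, 45, 50, 44, 58],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")

# Step 1: unstack moves Time into the columns, leaving one row per date
# and one column per (value, Time) pair.
wide = df.set_index(["Date", "Time"]).unstack()

# Step 2: with a plain DatetimeIndex, asfreq("D") can add one all-NaN row
# for every calendar day missing between the first and last date.
wide_daily = wide.asfreq("D")

print(wide.shape)        # 4 dates, 4 (value, Time) columns
print(wide_daily.shape)  # 6 dates, 4 (value, Time) columns
```

Stacking `wide_daily` back then turns each daily row into a Morning and an Afternoon row, which is how the missing days end up with two NaN rows each.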