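Python: adding missing rows to a DataFrame

Time: 2021-03-17 19:18:09

Tags: python pandas

I have a DataFrame that may sometimes contain incomplete data. For example, the one below stops at hour 22 instead of hour 23: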

           Date  Hour  Interval     Source         ID  Number of Messages
0    2020-05-19     0         0          1  413379290                  23
1    2020-05-19     0        15          1  413379290                  36
2    2020-05-19     0        30          1  413379290                  31
3    2020-05-19     0        45          1  413379290                  14
4    2020-05-19     1         0          1  413379290                   3
..          ...   ...       ...        ...        ...                 ...
183  2020-05-20    21        45          1  413379290                   6
184  2020-05-20    22         0          1  413379290                   8
185  2020-05-20    22        15          1  413379290                   4
186  2020-05-20    22        30          1  413379290                   6
187  2020-05-20    22        45          1  413379290                   9

How can I use Pandas to make it look like this?

           Date  Hour  Interval     Source         ID  Number of Messages
0    2020-05-19     0         0          1  413379290                  23
1    2020-05-19     0        15          1  413379290                  36
2    2020-05-19     0        30          1  413379290                  31
3    2020-05-19     0        45          1  413379290                  14
4    2020-05-19     1         0          1  413379290                   3
..          ...   ...       ...        ...        ...                 ...
183  2020-05-20    21        45          1  413379290                   6
184  2020-05-20    22         0          1  413379290                   8
185  2020-05-20    22        15          1  413379290                   4
186  2020-05-20    22        30          1  413379290                   6
187  2020-05-20    22        45          1  413379290                   9
188  2020-05-20    23         0          1  413379290                 NaN
189  2020-05-20    23        15          1  413379290                 NaN
190  2020-05-20    23        30          1  413379290                 NaN
191  2020-05-20    23        45          1  413379290                 NaN

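3 Answers:

Answer 0 (score: 2)

You can use reindex to create the missing hours. Build a MultiIndex.from_frame from the existing values of every column except Number of Messages, then use MultiIndex.from_product on its levels, replacing the existing Hour values with range(24), to generate every combination. Finally, call set_index on the DataFrame with those columns and reindex it with the full index.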

import pandas as pd

# df is the DataFrame from the question.
# Key columns: all except the one you want NaN in.
cols = ['Date', 'Hour', 'Interval', 'Source', 'ID']

# Create the MultiIndex with every combination of the key values,
# replacing the existing Hour values with the full range 0 to 23.
new_idx = (
    pd.MultiIndex.from_product(
        [lv if col != 'Hour' else range(24)
         for col, lv in zip(cols, pd.MultiIndex.from_frame(df[cols]).levels)
         ], names=cols)
)

# Reindex the original df; combinations that did not exist get NaN.
# You can reassign to the same df if you want.
new_df = (
    df.set_index(cols)
      .reindex(new_idx)
      .reset_index()
)
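
For readers who want to try the reindex approach without the original data, here is a minimal self-contained sketch. The sample frame is made up (a single day that stops at hour 22, with a constant message count of 5), and the names key_cols, full_index and filled are illustrative only. It takes the distinct values per column with unique() rather than the MultiIndex levels, which should give the same product for this data.

```python
import pandas as pd

# Hypothetical sample data: one day that stops at hour 22 instead of 23.
rows = [
    ("2020-05-20", hour, interval, 1, 413379290, 5)
    for hour in range(23)              # hours 0..22 only
    for interval in (0, 15, 30, 45)
]
df = pd.DataFrame(
    rows,
    columns=["Date", "Hour", "Interval", "Source", "ID", "Number of Messages"],
)

key_cols = ["Date", "Hour", "Interval", "Source", "ID"]

# Distinct values of each key column, with Hour forced to cover 0..23.
levels = [
    range(24) if col == "Hour" else sorted(df[col].unique())
    for col in key_cols
]
full_index = pd.MultiIndex.from_product(levels, names=key_cols)

# Key combinations that are absent from df come back with NaN in the value column.
filled = df.set_index(key_cols).reindex(full_index).reset_index()

print(filled.tail(4))
```

Note that from_product builds the Cartesian product of all levels, so with several Source or ID values it creates every Source/ID pairing, including pairings that never occur in the data.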

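Answer 1 (score: 2)

The approach I would take is to find the minimum and maximum timestamps, then create a date range at 15-minute intervals between them. Use df.merge to bring all the values from df into the newly created DataFrame.

Note that in this example the data starts at 2020-05-19 01:00:00 rather than 00:00:00, so the final output also starts at 01:00:00 instead of 00:00:00.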

import pandas as pd
c = ['Date','Hour','Interval','Source','ID','Number of Messages']
d = [
['2020-05-19',     1,         0,          1,  413379290,                  23],
['2020-05-19',     1,        15,          1,  413379290,                  36],
['2020-05-19',     1,        30,          1,  413379290,                  31],
['2020-05-19',     1,        45,          1,  413379290,                  14],
['2020-05-19',     2,         0,          1,  413379290,                   3],
['2020-05-20',    21,        45,          1,  413379290,                   6],
['2020-05-20',    22,         0,          1,  413379290,                   8],
['2020-05-20',    22,        15,          1,  413379290,                   4],
['2020-05-20',    22,        30,          1,  413379290,                   6],
['2020-05-20',    22,        45,          1,  413379290,                   9]]

df = pd.DataFrame(d,columns=c)
df['Date'] = pd.to_datetime(df['Date'])

print (df)

# First build the full timestamp by adding Hour and Interval to Date
df['DateFull'] = df.Date + pd.to_timedelta(df.Hour,unit='h') + pd.to_timedelta(df.Interval,unit='m')

# Create a range of timestamps at 15-minute intervals, from the first timestamp to 23:45 on the last day
# ('15min' is used because the older alias '15T' is deprecated in recent pandas releases)
df1 = pd.DataFrame({'DateFull':pd.date_range(df.DateFull.min(), df.DateFull.max().floor('D') + pd.to_timedelta('23:45:00'), freq='15min')})

#Create columns with Hour and Interval based on new Date Range
df1['Hour'] = df1.DateFull.dt.hour
df1['Interval'] = df1.DateFull.dt.minute

#Merge on DateFull, Hour, Interval to get the full set merged with original DF
df1 = df1.merge(df, how='left', on=['DateFull','Hour','Interval'])

# Derive Date from the full timestamp
# (forward-filling Date would mislabel rows when a gap spans midnight)
df1['Date'] = df1['DateFull'].dt.normalize()

# Forward fill Source and ID
df1[['Source','ID']] = df1[['Source','ID']].ffill()

# Convert Source and ID back to int
df1[['Source','ID']] = df1[['Source','ID']].astype(int)

# Drop DateFull as it is no longer needed
df1.drop(columns ='DateFull',inplace=True)

# Restore the original column order
df1 = df1.reindex(c, axis=1)

print (df1)

Original DataFrame:

        Date  Hour  Interval  Source         ID  Number of Messages
0 2020-05-19     1         0       1  413379290                  23
1 2020-05-19     1        15       1  413379290                  36
2 2020-05-19     1        30       1  413379290                  31
3 2020-05-19     1        45       1  413379290                  14
4 2020-05-19     2         0       1  413379290                   3
5 2020-05-20    21        45       1  413379290                   6
6 2020-05-20    22         0       1  413379290                   8
7 2020-05-20    22        15       1  413379290                   4
8 2020-05-20    22        30       1  413379290                   6
9 2020-05-20    22        45       1  413379290                   9

Final DataFrame:

          Date  Hour  Interval  Source         ID  Number of Messages
0   2020-05-19     1         0       1  413379290                23.0
1   2020-05-19     1        15       1  413379290                36.0
2   2020-05-19     1        30       1  413379290                31.0
3   2020-05-19     1        45       1  413379290                14.0
4   2020-05-19     2         0       1  413379290                 3.0
..         ...   ...       ...     ...        ...                 ...
183 2020-05-20    22        45       1  413379290                 9.0
184 2020-05-20    23         0       1  413379290                 NaN
185 2020-05-20    23        15       1  413379290                 NaN
186 2020-05-20    23        30       1  413379290                 NaN
187 2020-05-20    23        45       1  413379290                 NaN

df1.tail(20) gives you:

          Date  Hour  Interval  Source         ID  Number of Messages
168 2020-05-20    19         0       1  413379290                 NaN
169 2020-05-20    19        15       1  413379290                 NaN
170 2020-05-20    19        30       1  413379290                 NaN
171 2020-05-20    19        45       1  413379290                 NaN
172 2020-05-20    20         0       1  413379290                 NaN
173 2020-05-20    20        15       1  413379290                 NaN
174 2020-05-20    20        30       1  413379290                 NaN
175 2020-05-20    20        45       1  413379290                 NaN
176 2020-05-20    21         0       1  413379290                 NaN
177 2020-05-20    21        15       1  413379290                 NaN
178 2020-05-20    21        30       1  413379290                 NaN
179 2020-05-20    21        45       1  413379290                 6.0
180 2020-05-20    22         0       1  413379290                 8.0
181 2020-05-20    22        15       1  413379290                 4.0
182 2020-05-20    22        30       1  413379290                 6.0
183 2020-05-20    22        45       1  413379290                 9.0
184 2020-05-20    23         0       1  413379290                 NaN
185 2020-05-20    23        15       1  413379290                 NaN
186 2020-05-20    23        30       1  413379290                 NaN
187 2020-05-20    23        45       1  413379290                 NaN
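
Before filling anything in, it can help to list which 15-minute slots are actually absent. The sketch below reuses the same timestamp idea (Date + Hour + Interval) and compares it against a full date range with DatetimeIndex.difference. The sample frame is made up (one day ending at hour 22), and the names stamps, expected and missing are illustrative. Unlike the code above, it assumes you want whole days, so the expected range starts at midnight of the first day.

```python
import pandas as pd

# Hypothetical sample data: one day that stops at hour 22 instead of 23.
df = pd.DataFrame(
    [("2020-05-20", h, m) for h in range(23) for m in (0, 15, 30, 45)],
    columns=["Date", "Hour", "Interval"],
)

# Full timestamp of every existing row.
stamps = (
    pd.to_datetime(df["Date"])
    + pd.to_timedelta(df["Hour"], unit="h")
    + pd.to_timedelta(df["Interval"], unit="min")
)

# Every 15-minute slot from midnight of the first day to 23:45 of the last day.
expected = pd.date_range(
    stamps.min().floor("D"),
    stamps.max().floor("D") + pd.Timedelta(hours=23, minutes=45),
    freq="15min",
)

# Slots that should exist but have no row in df.
missing = expected.difference(pd.DatetimeIndex(stamps))

print(missing)
```

If missing is empty, the frame is already complete and no merge is needed.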

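Answer 2 (score: 1)

You can do this by creating a new DataFrame that holds all the appropriate values for the first five columns, then merging it with the original DataFrame to pull in the values of the Number of Messages column for the matching rows.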

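import pandas as pd

df = pd.read_csv('test.csv')

dates = df['Date'].unique()

# Every hour repeated 4 times (once per interval), for each date
hrs = [hr for hr in range(24) for i in range(4)]*len(dates)

# The four 15-minute intervals, repeated for every hour of every date
intervals = [0, 15, 30, 45] * 24 *len(dates)

# Build the complete frame: 24*4 rows per date
new_df = pd.DataFrame()
new_df['Date'] = [dt for dt in dates for i in range(24*4)]

new_df['Hour'] = hrs
new_df['Interval'] = intervals
# Assumes a single Source and ID, taken from the first row (ID is stored as a string here)
new_df['Source'] = df['Source'].iloc[0]
new_df['ID'] = str(df['ID'].iloc[0])

# The left merge keeps every generated row; the duplicated Source/ID columns from df are dropped
new_df = new_df.merge(df, how='left', on=['Date', 'Hour', 'Interval']).drop(['Source_y', 'ID_y'], axis=1)
new_df.rename(columns={'Source_x':'Source', 'ID_x':'ID'},inplace=True)

new_df.to_excel('testit.xlsx')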
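
The code above assumes a single Source and ID. If the data can contain several, one possible variation is to build the grid with cross joins and merge on all five key columns, which also avoids the _x/_y suffix cleanup. This is only a sketch: it assumes pandas 1.2 or later (for how='cross'), the sample frame is made up (999 is an invented second ID), and keys, slots, grid and filled are illustrative names.

```python
import pandas as pd

# Hypothetical sample data: two IDs on one date, each covering only part of hour 0.
df = pd.DataFrame({
    "Date": ["2020-05-20"] * 4,
    "Hour": [0, 0, 0, 0],
    "Interval": [0, 15, 0, 15],
    "Source": [1, 1, 1, 1],
    "ID": [413379290, 413379290, 999, 999],   # 999 is a made-up second ID
    "Number of Messages": [8, 4, 6, 9],
})

# Distinct (Date, Source, ID) combinations that actually occur in the data.
keys = df[["Date", "Source", "ID"]].drop_duplicates()

# All 96 (Hour, Interval) slots of a day.
slots = pd.DataFrame({"Hour": range(24)}).merge(
    pd.DataFrame({"Interval": [0, 15, 30, 45]}), how="cross"
)

# Every slot for every key combination, then pull in the existing counts.
grid = keys.merge(slots, how="cross")
filled = grid.merge(
    df, how="left", on=["Date", "Source", "ID", "Hour", "Interval"]
)

# Restore the original column order.
filled = filled[df.columns]

print(filled.head())
```

Because keys only contains combinations present in df, this does not invent Source/ID pairings that never occurred.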