I have a dataframe that may sometimes contain incomplete data. For example, the one below stops at hour 22 instead of hour 23:
Date Hour Interval Source ID Number of Messages
0 2020-05-19 0 0 1 413379290 23
1 2020-05-19 0 15 1 413379290 36
2 2020-05-19 0 30 1 413379290 31
3 2020-05-19 0 45 1 413379290 14
4 2020-05-19 1 0 1 413379290 3
.. ... ... ... ... ... ...
183 2020-05-20 21 45 1 413379290 6
184 2020-05-20 22 0 1 413379290 8
185 2020-05-20 22 15 1 413379290 4
186 2020-05-20 22 30 1 413379290 6
187 2020-05-20 22 45 1 413379290 9
How can I use Pandas to make it look like this?
Date Hour Interval Source ID Number of Messages
0 2020-05-19 0 0 1 413379290 23
1 2020-05-19 0 15 1 413379290 36
2 2020-05-19 0 30 1 413379290 31
3 2020-05-19 0 45 1 413379290 14
4 2020-05-19 1 0 1 413379290 3
.. ... ... ... ... ... ...
183 2020-05-20 21 45 1 413379290 6
184 2020-05-20 22 0 1 413379290 8
185 2020-05-20 22 15 1 413379290 4
186 2020-05-20 22 30 1 413379290 6
187 2020-05-20 22 45 1 413379290 9
188 2020-05-20 23 0 1 413379290 NaN
189 2020-05-20 23 15 1 413379290 NaN
190 2020-05-20 23 30 1 413379290 NaN
191 2020-05-20 23 45 1 413379290 NaN
Answer 0 (score: 2)
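You can use reindex to create the missing hours. Build a MultiIndex.from_frame from the existing values in every column except the message count, then use MultiIndex.from_product to recreate all combinations, replacing the existing values in Hour with range(24). Finally, call set_index on the dataframe and reindex it with the full set of values.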
import pandas as pd

# df is the dataframe from the question
# all columns except the one you want NaN in
cols = ['Date', 'Hour', 'Interval', 'Source', 'ID']
# create the MultiIndex with all values
new_idx = (
    pd.MultiIndex.from_product(
        [lv if col != 'Hour' else range(24)  # replace existing values by range 0 to 23
         for col, lv in zip(cols, pd.MultiIndex.from_frame(df[cols]).levels)
        ], names=cols)
)
# reindex the original df, you can reassign to the same df if you want
new_df = (
    df.set_index(cols)
      .reindex(new_idx)
      .reset_index()
)
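To make the reindex steps above easier to try out, here is a minimal, self-contained sketch. The four-row sample frame is hypothetical (a single day that stops at hour 22), not the asker's full data. Note that from_product builds every combination of the observed levels, so with several Source/ID pairs or dates it may also create combinations that never occurred together.

```python
import pandas as pd

# Hypothetical sample: one day whose data stops at hour 22.
df = pd.DataFrame({
    'Date': ['2020-05-20'] * 4,
    'Hour': [22, 22, 22, 22],
    'Interval': [0, 15, 30, 45],
    'Source': 1,
    'ID': 413379290,
    'Number of Messages': [8, 4, 6, 9],
})

cols = ['Date', 'Hour', 'Interval', 'Source', 'ID']

# Unique observed values of each key column.
levels = pd.MultiIndex.from_frame(df[cols]).levels

# Full grid: observed values for every column, but all 24 hours for Hour.
new_idx = pd.MultiIndex.from_product(
    [range(24) if col == 'Hour' else lv for col, lv in zip(cols, levels)],
    names=cols,
)

# Rows missing from df appear with NaN in 'Number of Messages'.
new_df = df.set_index(cols).reindex(new_idx).reset_index()
print(new_df.tail(8))
```

The Interval level only contains values that were actually observed, so if a quarter-hour slot (say 45) never appears anywhere in the data, it would need to be supplied explicitly, for example as [0, 15, 30, 45].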
Answer 1 (score: 2)
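The approach I would take is to find the minimum and maximum of the dates, then create a date range with a 15-minute interval. Use df.merge to add all the values from df to the newly created dataframe.
Note that in the sample below the data starts at 2020-05-19 01:00:00 rather than 00:00:00, so the final output also starts at 01:00:00 instead of 00:00:00.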
import pandas as pd
c = ['Date', 'Hour', 'Interval', 'Source', 'ID', 'Number of Messages']
d = [
    ['2020-05-19', 1, 0, 1, 413379290, 23],
    ['2020-05-19', 1, 15, 1, 413379290, 36],
    ['2020-05-19', 1, 30, 1, 413379290, 31],
    ['2020-05-19', 1, 45, 1, 413379290, 14],
    ['2020-05-19', 2, 0, 1, 413379290, 3],
    ['2020-05-20', 21, 45, 1, 413379290, 6],
    ['2020-05-20', 22, 0, 1, 413379290, 8],
    ['2020-05-20', 22, 15, 1, 413379290, 4],
    ['2020-05-20', 22, 30, 1, 413379290, 6],
    ['2020-05-20', 22, 45, 1, 413379290, 9]]
df = pd.DataFrame(d, columns=c)
df['Date'] = pd.to_datetime(df['Date'])
print(df)
# first get the start and end period by adding Hour and Interval to Date
df['DateFull'] = df.Date + pd.to_timedelta(df.Hour, unit='h') + pd.to_timedelta(df.Interval, unit='m')
# create a range of dates with 15 min interval from Start Date (including Hour & Min) to Last Day + 23:45
# ('15min' replaces the alias '15T', which newer pandas versions deprecate)
df1 = pd.DataFrame({'DateFull': pd.date_range(df.DateFull.min(), df.DateFull.max().floor('d') + pd.to_timedelta('23:45:00'), freq='15min')})
# create columns with Hour and Interval based on new Date Range
df1['Hour'] = df1.DateFull.dt.hour
df1['Interval'] = df1.DateFull.dt.minute
# merge on DateFull, Hour, Interval to get the full set merged with original DF
df1 = df1.merge(df, how='left', on=['DateFull', 'Hour', 'Interval'])
# forward fill Date, Source and ID
df1[['Date', 'Source', 'ID']] = df1[['Date', 'Source', 'ID']].ffill()
# convert Source and ID to int
df1[['Source', 'ID']] = df1[['Source', 'ID']].astype(int)
# drop DateFull as it is no longer needed
df1.drop(columns='DateFull', inplace=True)
# restore the original column order
df1 = df1.reindex(c, axis=1)
print(df1)
Original dataframe:
Date Hour Interval Source ID Number of Messages
0 2020-05-19 1 0 1 413379290 23
1 2020-05-19 1 15 1 413379290 36
2 2020-05-19 1 30 1 413379290 31
3 2020-05-19 1 45 1 413379290 14
4 2020-05-19 2 0 1 413379290 3
5 2020-05-20 21 45 1 413379290 6
6 2020-05-20 22 0 1 413379290 8
7 2020-05-20 22 15 1 413379290 4
8 2020-05-20 22 30 1 413379290 6
9 2020-05-20 22 45 1 413379290 9
Final dataframe:
Date Hour Interval Source ID Number of Messages
0 2020-05-19 1 0 1 413379290 23.0
1 2020-05-19 1 15 1 413379290 36.0
2 2020-05-19 1 30 1 413379290 31.0
3 2020-05-19 1 45 1 413379290 14.0
4 2020-05-19 2 0 1 413379290 3.0
.. ... ... ... ... ... ...
183 2020-05-20 22 45 1 413379290 9.0
184 2020-05-20 23 0 1 413379290 NaN
185 2020-05-20 23 15 1 413379290 NaN
186 2020-05-20 23 30 1 413379290 NaN
187 2020-05-20 23 45 1 413379290 NaN
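df1.tail(20) gives you the output below. Because Date is forward-filled, the gap rows 168 to 178 show 2020-05-19 even though their timestamps fall on 2020-05-20: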
Date Hour Interval Source ID Number of Messages
168 2020-05-19 19 0 1 413379290 NaN
169 2020-05-19 19 15 1 413379290 NaN
170 2020-05-19 19 30 1 413379290 NaN
171 2020-05-19 19 45 1 413379290 NaN
172 2020-05-19 20 0 1 413379290 NaN
173 2020-05-19 20 15 1 413379290 NaN
174 2020-05-19 20 30 1 413379290 NaN
175 2020-05-19 20 45 1 413379290 NaN
176 2020-05-19 21 0 1 413379290 NaN
177 2020-05-19 21 15 1 413379290 NaN
178 2020-05-19 21 30 1 413379290 NaN
179 2020-05-20 21 45 1 413379290 6.0
180 2020-05-20 22 0 1 413379290 8.0
181 2020-05-20 22 15 1 413379290 4.0
182 2020-05-20 22 30 1 413379290 6.0
183 2020-05-20 22 45 1 413379290 9.0
184 2020-05-20 23 0 1 413379290 NaN
185 2020-05-20 23 15 1 413379290 NaN
186 2020-05-20 23 30 1 413379290 NaN
187 2020-05-20 23 45 1 413379290 NaN
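In the tail output above, the filled rows before index 179 carry the Date of the last observed row, because Date is forward-filled across the gap. If the data can have gaps that span midnight, one possible variation is to derive Date from the generated timestamp instead. The sketch below is an assumption-laden variant of this answer, not part of it: the two-row sample and the names full and out are hypothetical.

```python
import pandas as pd

# Hypothetical sparse sample with a gap that spans midnight.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-05-19', '2020-05-20']),
    'Hour': [23, 1],
    'Interval': [30, 0],
    'Source': [1, 1],
    'ID': [413379290, 413379290],
    'Number of Messages': [23, 6],
})

# Full timestamp of each observed slot.
df['DateFull'] = (df['Date']
                  + pd.to_timedelta(df['Hour'], unit='h')
                  + pd.to_timedelta(df['Interval'], unit='min'))

# Every 15-minute slot between the first and last observation.
full = pd.DataFrame({
    'DateFull': pd.date_range(df['DateFull'].min(), df['DateFull'].max(), freq='15min')
})

# Left-merge so that missing slots get NaN for the message count.
out = full.merge(df[['DateFull', 'Source', 'ID', 'Number of Messages']],
                 how='left', on='DateFull')

# Derive Date, Hour and Interval from the timestamp rather than forward-filling Date.
out['Date'] = out['DateFull'].dt.normalize()
out['Hour'] = out['DateFull'].dt.hour
out['Interval'] = out['DateFull'].dt.minute

# Source and ID are constant in this sample, so forward-filling them is safe here.
out[['Source', 'ID']] = out[['Source', 'ID']].ffill().astype(int)

out = out[['Date', 'Hour', 'Interval', 'Source', 'ID', 'Number of Messages']]
print(out)
```

With this variant the slots from 00:00 onward are labelled 2020-05-20, since the date comes from the slot's own timestamp.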
Answer 2 (score: 1)
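You can do this by creating a new dataframe that holds all the appropriate values for the first five columns, and then merging it with the original dataframe to pull in the Number of Messages values for the matching rows.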
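import pandas as pd
df = pd.read_csv('test.csv')
dates = df['Date'].unique()
# each hour repeated 4 times (one per quarter-hour slot), for every date
hrs = [hr for hr in range(24) for i in range(4)] * len(dates)
# the four quarter-hour slots, repeated for every hour of every date
intervals = [0, 15, 30, 45] * 24 * len(dates)
new_df = pd.DataFrame()
# each date repeated 96 times (24 hours x 4 slots)
new_df['Date'] = [dt for dt in dates for i in range(24*4)]
new_df['Hour'] = hrs
new_df['Interval'] = intervals
# Source and ID are taken from the first row (assumes a single Source/ID in the file)
new_df['Source'] = df['Source'].iloc[0]
new_df['ID'] = str(df['ID'].iloc[0])
# left merge keeps every slot; slots missing from df get NaN for the message count
new_df = new_df.merge(df, how='left', on=['Date', 'Hour', 'Interval']).drop(['Source_y', 'ID_y'], axis=1)
new_df.rename(columns={'Source_x': 'Source', 'ID_x': 'ID'}, inplace=True)
new_df.to_excel('testit.xlsx')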
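The same build-a-grid-then-merge idea can be tried without a CSV file. The following is only a sketch under assumptions: the three-row sample is made up, the name grid is hypothetical, and it uses MultiIndex.from_product(...).to_frame(index=False) in place of the list comprehensions above to generate the Date/Hour/Interval combinations.

```python
import pandas as pd

# Hypothetical in-memory sample standing in for test.csv.
df = pd.DataFrame({
    'Date': ['2020-05-19', '2020-05-19', '2020-05-20'],
    'Hour': [0, 0, 22],
    'Interval': [0, 15, 45],
    'Source': 1,
    'ID': 413379290,
    'Number of Messages': [23, 36, 9],
})

# Every (Date, Hour, Interval) slot: each observed date x 24 hours x 4 quarter-hours.
grid = pd.MultiIndex.from_product(
    [df['Date'].unique(), range(24), [0, 15, 30, 45]],
    names=['Date', 'Hour', 'Interval'],
).to_frame(index=False)

# Assumes a single Source/ID pair, as in the question's data.
grid['Source'] = df['Source'].iloc[0]
grid['ID'] = df['ID'].iloc[0]

# Merge only the message counts, so no duplicated Source/ID columns need dropping.
new_df = grid.merge(
    df[['Date', 'Hour', 'Interval', 'Number of Messages']],
    how='left',
    on=['Date', 'Hour', 'Interval'],
)
print(new_df.head())
```

Selecting only the key columns plus Number of Messages before merging avoids the _x/_y suffixes, so the drop and rename steps are not needed in this variant.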