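I am trying to forward-fill missing rows to complete the missing time-series rows in a dataset.
The dataset is huge: more than 100 million rows.
The original source dataset looks like this.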
col1 col2 col3 col4 col5 col6
0 2020-01-01 b1 c1 1 9 17
1 2020-01-05 b1 c1 2 10 18
2 2020-01-02 b2 c2 3 11 19
3 2020-01-04 b2 c2 4 12 20
4 2020-01-10 b3 c3 5 13 21
5 2020-01-15 b3 c3 6 14 22
6 2020-01-16 b4 c4 7 15 23
7 2020-01-30 b4 c4 8 16 24
The desired output is as follows:
col1 col2 col3 col4 col5 col6
0 2020-01-01 b1 c1 1.0 9.0 17.0
1 2020-01-02 b1 c1 1.0 9.0 17.0
2 2020-01-03 b1 c1 1.0 9.0 17.0
3 2020-01-04 b1 c1 1.0 9.0 17.0
4 2020-01-05 b1 c1 2.0 10.0 18.0
5 2020-01-02 b2 c2 3.0 11.0 19.0
6 2020-01-03 b2 c2 3.0 11.0 19.0
7 2020-01-04 b2 c2 4.0 12.0 20.0
8 2020-01-10 b3 c3 5.0 13.0 21.0
9 2020-01-11 b3 c3 5.0 13.0 21.0
10 2020-01-12 b3 c3 5.0 13.0 21.0
11 2020-01-13 b3 c3 5.0 13.0 21.0
12 2020-01-14 b3 c3 5.0 13.0 21.0
13 2020-01-15 b3 c3 6.0 14.0 22.0
14 2020-01-16 b4 c4 7.0 15.0 23.0
15 2020-01-17 b4 c4 7.0 15.0 23.0
16 2020-01-18 b4 c4 7.0 15.0 23.0
17 2020-01-19 b4 c4 7.0 15.0 23.0
18 2020-01-20 b4 c4 7.0 15.0 23.0
19 2020-01-21 b4 c4 7.0 15.0 23.0
20 2020-01-22 b4 c4 7.0 15.0 23.0
21 2020-01-23 b4 c4 7.0 15.0 23.0
22 2020-01-24 b4 c4 7.0 15.0 23.0
23 2020-01-25 b4 c4 7.0 15.0 23.0
24 2020-01-26 b4 c4 7.0 15.0 23.0
25 2020-01-27 b4 c4 7.0 15.0 23.0
26 2020-01-28 b4 c4 7.0 15.0 23.0
27 2020-01-29 b4 c4 7.0 15.0 23.0
28 2020-01-30 b4 c4 8.0 16.0 24.0
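I need to group by col2 and col3 so that the missing time-series rows in col1 are filled in for each combination.
Currently I have the code below working, but it is very slow because of the for loop.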
import pandas as pd
import numpy as np

def fill_missing_timeseries(subset_df, date_col):
    # work on a copy so the caller's slice is not modified in place
    subset_df = subset_df.copy()
    if subset_df[date_col].dtype != 'datetime64[ns]':
        subset_df[date_col] = pd.to_datetime(subset_df[date_col])
    min_date = subset_df[date_col].min()
    max_date = subset_df[date_col].max()
    # generate a continuous date column between the min and max date values
    date_range = pd.date_range(start=min_date, end=max_date, freq='D')
    new_df = pd.DataFrame()
    new_df[date_col] = date_range
    # join newly generated df with input df to get all the columns
    new_df = pd.merge(new_df, subset_df, how='left')
    # forward fill missing NaN values
    new_df = new_df.ffill()
    return new_df

orig_df = pd.DataFrame({'col1': ['2020-01-01','2020-01-05', '2020-01-02','2020-01-04','2020-01-10','2020-01-15','2020-01-16','2020-01-30'],
                        'col2': ['b1','b1','b2','b2','b3','b3','b4','b4'],
                        'col3': ['c1','c1','c2','c2','c3','c3','c4','c4'],
                        'col4': [1,2,3,4,5,6,7,8],
                        'col5': [9,10,11,12,13,14,15,16],
                        'col6': [17,18,19,20,21,22,23,24],
                        })

data = []
grouped_by_df = orig_df.groupby(['col2', 'col3']).size().reset_index().rename(columns={0:'count'})
for index, row in grouped_by_df.iterrows():
    # select the rows of one (col2, col3) combination by column label
    subset_df = orig_df[(orig_df.col2 == row['col2']) & (orig_df.col3 == row['col3'])]
    subset_filled_df = fill_missing_timeseries(subset_df, date_col='col1')
    data.append(subset_filled_df)
desired_df = pd.concat(data, ignore_index=True)
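Is there a way to avoid the for loop and pass the whole dataset at once to create the missing rows and ffill()?
Thanks, I appreciate the help.
Update: The code above works, but it is too slow. It takes more than 30 minutes for only 300k rows. So I am looking for help to make it faster and avoid the for loop.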
Answer 0 (score: 2)
It looks like resample on a groupby would work:
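# df is presumably the question's frame with col1 already parsed by pd.to_datetime;
# resample needs a datetime-like index
(df.set_index('col1').groupby(['col2', 'col3'])
   .resample('D').ffill()
   .reset_index(['col2', 'col3'], drop=True)
   .reset_index()
)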
Output:
col1 col2 col3 col4 col5 col6
0 2020-01-01 b1 c1 1 9 17
1 2020-01-02 b1 c1 1 9 17
2 2020-01-03 b1 c1 1 9 17
3 2020-01-04 b1 c1 1 9 17
4 2020-01-05 b1 c1 2 10 18
5 2020-01-02 b2 c2 3 11 19
6 2020-01-03 b2 c2 3 11 19
7 2020-01-04 b2 c2 4 12 20
8 2020-01-10 b3 c3 5 13 21
9 2020-01-11 b3 c3 5 13 21
10 2020-01-12 b3 c3 5 13 21
11 2020-01-13 b3 c3 5 13 21
12 2020-01-14 b3 c3 5 13 21
13 2020-01-15 b3 c3 6 14 22
14 2020-01-16 b4 c4 7 15 23
15 2020-01-17 b4 c4 7 15 23
16 2020-01-18 b4 c4 7 15 23
17 2020-01-19 b4 c4 7 15 23
18 2020-01-20 b4 c4 7 15 23
19 2020-01-21 b4 c4 7 15 23
20 2020-01-22 b4 c4 7 15 23
21 2020-01-23 b4 c4 7 15 23
22 2020-01-24 b4 c4 7 15 23
23 2020-01-25 b4 c4 7 15 23
24 2020-01-26 b4 c4 7 15 23
25 2020-01-27 b4 c4 7 15 23
26 2020-01-28 b4 c4 7 15 23
27 2020-01-29 b4 c4 7 15 23
28 2020-01-30 b4 c4 8 16 24
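For a frame with very many groups, it may also be worth trying a version that does not go through resample at all. The sketch below is an alternative technique, not the answer's method: it builds the full per-group daily grid with NumPy repeats, left-merges the observed rows onto it, and forward-fills within each group. The function name fill_daily_gaps is hypothetical, and the sketch assumes a daily frequency and at most one row per (col2, col3) combination per day. Whether it is actually faster on 100 million rows is untested here.

```python
import numpy as np
import pandas as pd


def fill_daily_gaps(df, date_col, group_cols):
    """Insert missing daily rows per group and forward-fill the other columns.

    Hypothetical helper: assumes at most one row per group per day.
    """
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col])

    # First and last observed date of every group (one row per group).
    bounds = (
        df.groupby(group_cols, sort=False)[date_col]
        .agg(["min", "max"])
        .reset_index()
    )
    n_days = ((bounds["max"] - bounds["min"]).dt.days + 1).to_numpy()

    # Repeat each group's keys once per calendar day in its span.
    grid = bounds.loc[bounds.index.repeat(n_days), group_cols].reset_index(drop=True)

    # Day offsets 0..n-1 inside each group, built without a Python loop.
    group_start = np.repeat(np.cumsum(n_days) - n_days, n_days)
    offsets = np.arange(n_days.sum()) - group_start
    first_day = bounds["min"].repeat(n_days).reset_index(drop=True)
    dates = first_day + pd.to_timedelta(offsets, unit="D")
    # Keep the same datetime dtype as the input so the merge keys match.
    grid[date_col] = dates.astype(df[date_col].dtype)

    # Attach the observed values, then fill the gaps within each group only.
    full = grid.merge(df, on=group_cols + [date_col], how="left")
    value_cols = [c for c in df.columns if c not in group_cols + [date_col]]
    full[value_cols] = full.groupby(group_cols, sort=False)[value_cols].ffill()
    return full[list(df.columns)]


orig_df = pd.DataFrame({
    "col1": ["2020-01-01", "2020-01-05", "2020-01-02", "2020-01-04",
             "2020-01-10", "2020-01-15", "2020-01-16", "2020-01-30"],
    "col2": ["b1", "b1", "b2", "b2", "b3", "b3", "b4", "b4"],
    "col3": ["c1", "c1", "c2", "c2", "c3", "c3", "c4", "c4"],
    "col4": [1, 2, 3, 4, 5, 6, 7, 8],
    "col5": [9, 10, 11, 12, 13, 14, 15, 16],
    "col6": [17, 18, 19, 20, 21, 22, 23, 24],
})

filled = fill_daily_gaps(orig_df, date_col="col1", group_cols=["col2", "col3"])
print(filled)
```

The value columns come back as floats, as in the question's desired output, because the left merge introduces NaN before the fill. Groups appear in order of first occurrence since sort=False is used.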