python - Forward fill with groupby in pandas

Time: 2020-07-16 19:05:13

Tags: python pandas numpy dataframe pandas-groupby

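I am trying to forward fill in order to add the time-series rows that are missing from my dataset.

The dataset is huge: more than 100 million rows.

The original source dataset looks like this: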

         col1 col2 col3  col4  col5  col6
0  2020-01-01   b1   c1     1     9    17
1  2020-01-05   b1   c1     2    10    18
2  2020-01-02   b2   c2     3    11    19
3  2020-01-04   b2   c2     4    12    20
4  2020-01-10   b3   c3     5    13    21
5  2020-01-15   b3   c3     6    14    22
6  2020-01-16   b4   c4     7    15    23
7  2020-01-30   b4   c4     8    16    24

The desired output is as follows:

         col1 col2 col3  col4  col5  col6
0  2020-01-01   b1   c1   1.0   9.0  17.0
1  2020-01-02   b1   c1   1.0   9.0  17.0
2  2020-01-03   b1   c1   1.0   9.0  17.0
3  2020-01-04   b1   c1   1.0   9.0  17.0
4  2020-01-05   b1   c1   2.0  10.0  18.0
5  2020-01-02   b2   c2   3.0  11.0  19.0
6  2020-01-03   b2   c2   3.0  11.0  19.0
7  2020-01-04   b2   c2   4.0  12.0  20.0
8  2020-01-10   b3   c3   5.0  13.0  21.0
9  2020-01-11   b3   c3   5.0  13.0  21.0
10 2020-01-12   b3   c3   5.0  13.0  21.0
11 2020-01-13   b3   c3   5.0  13.0  21.0
12 2020-01-14   b3   c3   5.0  13.0  21.0
13 2020-01-15   b3   c3   6.0  14.0  22.0
14 2020-01-16   b4   c4   7.0  15.0  23.0
15 2020-01-17   b4   c4   7.0  15.0  23.0
16 2020-01-18   b4   c4   7.0  15.0  23.0
17 2020-01-19   b4   c4   7.0  15.0  23.0
18 2020-01-20   b4   c4   7.0  15.0  23.0
19 2020-01-21   b4   c4   7.0  15.0  23.0
20 2020-01-22   b4   c4   7.0  15.0  23.0
21 2020-01-23   b4   c4   7.0  15.0  23.0
22 2020-01-24   b4   c4   7.0  15.0  23.0
23 2020-01-25   b4   c4   7.0  15.0  23.0
24 2020-01-26   b4   c4   7.0  15.0  23.0
25 2020-01-27   b4   c4   7.0  15.0  23.0
26 2020-01-28   b4   c4   7.0  15.0  23.0
27 2020-01-29   b4   c4   7.0  15.0  23.0
28 2020-01-30   b4   c4   8.0  16.0  24.0

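I need to group by col2 and col3 so that the missing time-series rows in col1 are filled in for each combination.

Currently I have the code below, which works, but it is very slow because of the for loop.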

import pandas as pd
import numpy as np

def fill_missing_timeseries(subset_df, date_col):
    # work on a copy so the caller's slice is not modified in place
    subset_df = subset_df.copy()
    if subset_df[date_col].dtype != 'datetime64[ns]':
        subset_df[date_col] = pd.to_datetime(subset_df[date_col])
    min_date = subset_df[date_col].min()
    max_date = subset_df[date_col].max()

    # generate a continuous date column between the min and max date values
    date_range = pd.date_range(start=min_date, end=max_date, freq='D')
    new_df = pd.DataFrame()
    new_df[date_col] = date_range
    
    # left join on the date column to bring in all the other columns
    new_df = pd.merge(new_df, subset_df, how='left', on=date_col)
    
    # forward fill missing NaN values
    new_df = new_df.ffill()
    return new_df

orig_df = pd.DataFrame({'col1': ['2020-01-01','2020-01-05', '2020-01-02','2020-01-04','2020-01-10','2020-01-15','2020-01-16','2020-01-30'],
                        'col2': ['b1','b1','b2','b2','b3','b3','b4','b4'],
                        'col3': ['c1','c1','c2','c2','c3','c3','c4','c4'],
                        'col4': [1,2,3,4,5,6,7,8],
                        'col5': [9,10,11,12,13,14,15,16],
                        'col6': [17,18,19,20,21,22,23,24],
                       })
data = []
grouped_by_df = orig_df.groupby(['col2', 'col3']).size().reset_index().rename(columns={0:'count'})
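for index, row in grouped_by_df.iterrows():
    # look the keys up by label; positional access such as row[0] is deprecated on a labelled Series
    subset_df = orig_df[(orig_df.col2 == row['col2']) & (orig_df.col3 == row['col3'])]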
    subset_filled_df = fill_missing_timeseries(subset_df, date_col='col1')
    data.append(subset_filled_df)
desired_df = pd.concat(data, ignore_index=True)

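Is there any way to avoid the for loop and pass the whole dataset at once to create the missing rows and ffill()?

Thanks, any help is appreciated.

Update: The code above works, but it is too slow. For just 300,000 rows it takes more than 30 minutes. So I am looking for help to make it faster and avoid the for loop.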

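1 answer:

Answer 0 (score: 2)

It looks like a resample on the groupby would work. Note that resample needs a datetime-like index, so col1 has to be converted first (for example with pd.to_datetime); df here is the question's orig_df: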

(df.set_index('col1').groupby(['col2', 'col3'])
   .resample('D').ffill()
   .reset_index(['col2','col3'], drop=True)
   .reset_index()
)

Output:

         col1 col2 col3  col4  col5  col6
0  2020-01-01   b1   c1     1     9    17
1  2020-01-02   b1   c1     1     9    17
2  2020-01-03   b1   c1     1     9    17
3  2020-01-04   b1   c1     1     9    17
4  2020-01-05   b1   c1     2    10    18
5  2020-01-02   b2   c2     3    11    19
6  2020-01-03   b2   c2     3    11    19
7  2020-01-04   b2   c2     4    12    20
8  2020-01-10   b3   c3     5    13    21
9  2020-01-11   b3   c3     5    13    21
10 2020-01-12   b3   c3     5    13    21
11 2020-01-13   b3   c3     5    13    21
12 2020-01-14   b3   c3     5    13    21
13 2020-01-15   b3   c3     6    14    22
14 2020-01-16   b4   c4     7    15    23
15 2020-01-17   b4   c4     7    15    23
16 2020-01-18   b4   c4     7    15    23
17 2020-01-19   b4   c4     7    15    23
18 2020-01-20   b4   c4     7    15    23
19 2020-01-21   b4   c4     7    15    23
20 2020-01-22   b4   c4     7    15    23
21 2020-01-23   b4   c4     7    15    23
22 2020-01-24   b4   c4     7    15    23
23 2020-01-25   b4   c4     7    15    23
24 2020-01-26   b4   c4     7    15    23
25 2020-01-27   b4   c4     7    15    23
26 2020-01-28   b4   c4     7    15    23
27 2020-01-29   b4   c4     7    15    23
28 2020-01-30   b4   c4     8    16    24
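
For reference, here is a self-contained sketch of the same resample-on-groupby idea that can be run directly on the question's sample data. It rests on a few assumptions that are not stated above: the helper name ffill_daily_per_group is hypothetical, col1 is converted with pd.to_datetime first, rows are sorted by date within each group, dates are unique within each group, and the value columns are selected explicitly so that the result does not depend on whether the installed pandas version carries the grouping columns through resample.

```python
import pandas as pd


def ffill_daily_per_group(df, date_col, group_cols):
    """Hypothetical helper: add the missing daily rows per group and forward fill."""
    group_cols = list(group_cols)
    out = df.copy()
    # resample needs a datetime-like index, so convert the date strings first
    out[date_col] = pd.to_datetime(out[date_col])
    # sort so that dates are increasing within each group
    out = out.sort_values(group_cols + [date_col])
    # every column that is neither a group key nor the date gets forward filled
    value_cols = [c for c in out.columns if c not in group_cols and c != date_col]
    filled = (
        out.set_index(date_col)
           .groupby(group_cols)[value_cols]
           .resample('D')
           .ffill()
           .reset_index()  # group keys and dates come back as ordinary columns
    )
    # restore the original column order
    return filled[list(df.columns)]


# sample data copied from the question
orig_df = pd.DataFrame({
    'col1': ['2020-01-01', '2020-01-05', '2020-01-02', '2020-01-04',
             '2020-01-10', '2020-01-15', '2020-01-16', '2020-01-30'],
    'col2': ['b1', 'b1', 'b2', 'b2', 'b3', 'b3', 'b4', 'b4'],
    'col3': ['c1', 'c1', 'c2', 'c2', 'c3', 'c3', 'c4', 'c4'],
    'col4': [1, 2, 3, 4, 5, 6, 7, 8],
    'col5': [9, 10, 11, 12, 13, 14, 15, 16],
    'col6': [17, 18, 19, 20, 21, 22, 23, 24],
})

result = ffill_daily_per_group(orig_df, 'col1', ['col2', 'col3'])
```

On the sample data this should produce the same 29 rows as the output above. As far as I know the grouped resample still handles one group at a time internally, so it is worth timing it on a slice of the real data before running it on all 100 million rows.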