Filling in missing dates and buckets at the same time

Time: 2020-09-14 20:11:05

Tags: python pandas jupyter-notebook

I have a data frame like this:

Start_Month  Bucket  Count  Complete  Partial
10/01/2015   0       57     91        0.66
11/01/2015   0       678    8         0.99
02/01/2016   0       68     12        0.12
10/01/2015   1       78     79        0.22
11/01/2015   1       99     56        0.67
1/01/2016    1       789    67        0.78
10/01/2015   3       678    178       0.780
11/01/2015   3       2880   578       0.678

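I basically need to fill in every Start_Month (12/01/2015, 01/01/2016, ... are missing) and every missing bucket such as 2. The remaining columns (Count, Complete, Partial) should be zero for the missing buckets and start months. I think relativedelta(months=+1) would help, but I am not sure how to use it.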
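For reference, here is a minimal sketch of what relativedelta(months=+1) does when stepping through the months, next to the pandas call that produces the same sequence. It assumes the python-dateutil package is available (it is installed as a pandas dependency). The names start, end, months and rng are only illustrative.

```python
from datetime import date

import pandas as pd
from dateutil.relativedelta import relativedelta

start, end = date(2015, 10, 1), date(2016, 2, 1)

# Step forward one calendar month at a time until the end date is passed.
months = []
current = start
while current <= end:
    months.append(current)
    current += relativedelta(months=+1)

# pandas equivalent: 'MS' is the "month start" frequency.
rng = pd.date_range('2015-10-01', '2016-02-01', freq='MS')

print(months)
print([ts.date() for ts in rng])
```

Both give the five month starts from October 2015 to February 2016, so with pandas the explicit relativedelta loop is probably not needed.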

import pandas as pd
data = [['10/01/2015', 0, 57, 91, 0.66],
        ['11/01/2015', 0, 678, 8, 0.99],
        ['02/01/2016', 0, 68, 12, 0.12],
        ['10/01/2015', 1, 78, 79, 0.22],
        ['11/01/2015', 1, 99, 56, 0.67],
        ['1/01/2016', 1, 789, 67, 0.78],
        ['10/01/2015', 3, 678, 178, 0.780],
        ['11/01/2015', 3, 2880, 578, 0.678]]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count',
                                 'Complete', 'Partial'])

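Basically, I want every Start_Month and Bucket combination to be present, with 0 in the other columns wherever a combination is missing. That is, every month from 10/01/2015 to 02/01/2016 should be there (12/01/2015 and 01/01/2016 are missing), and every bucket from 0 to 3 should be there (2 is missing).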

I tried this, and it does part of what I need:

df['Start_Month'] = pd.to_datetime(df['Start_Month'])
s = df.groupby(['Bucket', pd.Grouper(key='Start_Month', freq='MS')])[['Count', 'Complete', 'Partial']].sum()
df1 = (s.reset_index(level=0)
    .groupby('Bucket')[['Count', 'Complete', 'Partial']]
    .apply(lambda x: x.asfreq('MS'))
    .reset_index())

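This adds some of the missing months, but it does not repeat the full set of months for every bucket, and it does not add the bucket integers missing in between.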

2 Answers:

Answer 0: (score: 1)

This is frankly a hack of a solution, until someone posts how it should really be done:

The starting df

data =  [
    ['10/01/2015',0 ,57 ,91,0.66],
    ['11/01/2015',0, 678, 8,0.99],
    ['02/01/2016',0,68,12,0.12],
    ['10/01/2015' ,1, 78,79,0.22],
    ['11/01/2015' ,1 ,99,56, 0.67],
    ['1/01/2016', 1 ,789,67,0.78],
    ['10/01/2015', 3,678, 178, 0.780],
    ['11/01/2015' ,3, 2880,578,0.678]
]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count', 'Complete','Partial'])#.set_index('Start_Month')
df['Start_Month'] = pd.to_datetime(df['Start_Month'])

    Start_Month Bucket  Count   Complete    Partial
0   2015-10-01     0      57        91       0.660
1   2015-11-01     0      678        8       0.990
2   2016-02-01     0      68        12       0.120
3   2015-10-01     1      78        79       0.220
4   2015-11-01     1      99        56       0.670
5   2016-01-01     1      789       67       0.780
6   2015-10-01     3      678      178       0.780
7   2015-11-01     3      2880     578       0.678

Make a separate df with the complete set of dates for bucket 0

df0 = pd.DataFrame({'Start_Month': pd.date_range('2015-10-01', '2016-02-01', freq='MS'),
                    'Bucket': 0})

    Start_Month Bucket
0   2015-10-01  0
1   2015-11-01  0
2   2015-12-01  0
3   2016-01-01  0
4   2016-02-01  0

Filter the original df by the corresponding dates and bucket, and merge the result with df0

df_filt = df[(df['Start_Month'].isin(df0['Start_Month'])) & (df['Bucket'] == 0)]
df0 = pd.merge(df0, df_filt, left_on='Start_Month', right_on='Start_Month', how='outer')
df0 = df0.drop('Bucket_y', axis=1).rename(columns={'Bucket_x': 'Bucket'})

    Start_Month Bucket  Count   Complete Partial
0   2015-10-01  0        57.0   91.0     0.66
1   2015-11-01  0        678.0  8.0      0.99
2   2015-12-01  0        NaN    NaN      NaN
3   2016-01-01  0        NaN    NaN      NaN
4   2016-02-01  0        68.0   12.0     0.12

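Repeat this process for buckets 1, 2 and 3, creating df1, df2 and df3.

(Not shown because it is repetitive; you can of course do this in a loop.) Then concatenate all 4 dfs and fill the NaN values with zeros.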

# Concat
df_final = pd.concat([df0, df1, df2, df3], axis=0).fillna(0)

    Start_Month Bucket       Count   Complete   Partial
0   2015-10-01       0       57.0        91.0   0.660
1   2015-11-01       0       678.0       8.0    0.990
2   2015-12-01       0       0.0         0.0    0.000
3   2016-01-01       0       0.0         0.0    0.000
4   2016-02-01       0       68.0        12.0   0.120
0   2015-10-01       1       78.0        79.0   0.220
1   2015-11-01       1       99.0        56.0   0.670
2   2015-12-01       1       0.0         0.0    0.000
3   2016-01-01       1       789.0       67.0   0.780
4   2016-02-01       1       0.0         0.0    0.000
0   2015-10-01       2       0.0         0.0    0.000
1   2015-11-01       2       0.0         0.0    0.000
2   2015-12-01       2       0.0         0.0    0.000
3   2016-01-01       2       0.0         0.0    0.000
4   2016-02-01       2       0.0         0.0    0.000
0   2015-10-01       3       678.0       178.0  0.780
1   2015-11-01       3       2880.0      578.0  0.678
2   2015-12-01       3       0.0         0.0    0.000
3   2016-01-01       3       0.0         0.0    0.000
4   2016-02-01       3       0.0         0.0    0.000

Update: here is the fully looped code, plus an answer to your question from the comments.

def get_separate_df(df, bucket_num):
    # Complete monthly date range for this bucket
    df_bucket = pd.DataFrame({'Start_Month': pd.date_range('2015-10-01', '2016-02-01', freq='MS'),
                              'Bucket': bucket_num})
    # Rows of the original df that belong to this bucket
    df_filt = df[(df['Start_Month'].isin(df_bucket['Start_Month'])) &
                 (df['Bucket'] == bucket_num)]
    # The outer merge keeps months that have no data (as NaN)
    df_bucket = pd.merge(df_bucket, df_filt, on='Start_Month', how='outer')
    df_bucket = df_bucket.drop('Bucket_y', axis=1).rename(columns={'Bucket_x': 'Bucket'})

    return df_bucket

dfs = [get_separate_df(df, i) for i in range(4)] 

# Concat
df_final = pd.concat(dfs, axis=0).fillna(0)
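As a possible alternative to the per-bucket loop (this is not part of the original answer), the same kind of result can be sketched with a single reindex against the full product of buckets and months. The names months, buckets, full_index and df_full are illustrative. The sketch assumes each (Bucket, Start_Month) pair appears at most once, because reindex needs a unique index; if pairs can repeat, aggregate with groupby first.

```python
import pandas as pd

data = [
    ['10/01/2015', 0, 57, 91, 0.66],
    ['11/01/2015', 0, 678, 8, 0.99],
    ['02/01/2016', 0, 68, 12, 0.12],
    ['10/01/2015', 1, 78, 79, 0.22],
    ['11/01/2015', 1, 99, 56, 0.67],
    ['1/01/2016', 1, 789, 67, 0.78],
    ['10/01/2015', 3, 678, 178, 0.780],
    ['11/01/2015', 3, 2880, 578, 0.678],
]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count', 'Complete', 'Partial'])
df['Start_Month'] = pd.to_datetime(df['Start_Month'], format='%m/%d/%Y')

# Every month start between the first and last month in the data.
months = pd.date_range(df['Start_Month'].min(), df['Start_Month'].max(), freq='MS')
# Every integer bucket between the smallest and largest bucket.
buckets = range(df['Bucket'].min(), df['Bucket'].max() + 1)

# All (Bucket, Start_Month) combinations, buckets varying slowest.
full_index = pd.MultiIndex.from_product([buckets, months], names=['Bucket', 'Start_Month'])

# Assumption: (Bucket, Start_Month) pairs are unique in df.
df_full = (
    df.set_index(['Bucket', 'Start_Month'])
      .reindex(full_index, fill_value=0)  # missing combinations become rows of zeros
      .reset_index()
)
print(df_full)
```

Here the date range and bucket range are taken from the data instead of being hard-coded, so the sketch should carry over to other inputs of the same shape.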

For the question in the comments, you can get an empty data frame containing the repeated sequences of dates and buckets like this:

bucket_list = [ele for ele in [0,1,2,3] for i in range(5)]
dates = list(pd.date_range('2015-10-01', '2016-02-01', freq='MS'))*4
df = pd.DataFrame(data=[dates, bucket_list]).T.rename(columns={0:'Start_Month', 1:'Bucket'})

Output:
    Start_Month Bucket
0   2015-10-01  0
1   2015-11-01  0
2   2015-12-01  0
3   2016-01-01  0
4   2016-02-01  0
5   2015-10-01  1
6   2015-11-01  1
7   2015-12-01  1
8   2016-01-01  1
9   2016-02-01  1
10  2015-10-01  2
11  2015-11-01  2
12  2015-12-01  2
13  2016-01-01  2
14  2016-02-01  2
15  2015-10-01  3
16  2015-11-01  3
17  2015-12-01  3
18  2016-01-01  3
19  2016-02-01  3
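The same empty grid can also be sketched with itertools.product, which avoids hard-coding the repeat counts 5 and 4. This is only an illustration; the names buckets, months and grid are not from the answer above.

```python
from itertools import product

import pandas as pd

buckets = [0, 1, 2, 3]
months = pd.date_range('2015-10-01', '2016-02-01', freq='MS')

# product() yields every (bucket, month) pair, with buckets varying slowest,
# so each bucket is followed by the complete run of months.
grid = pd.DataFrame(list(product(buckets, months)), columns=['Bucket', 'Start_Month'])

# Put the columns in the same order as the output above.
grid = grid[['Start_Month', 'Bucket']]
print(grid)
```

Changing the bucket list or the date range then changes the grid size automatically.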

Answer 1: (score: 0)

I wrote something similar, but generalized it a bit:

import pandas as pd
import numpy as np

# convert the date strings to datetimes
df['Start_Month'] = pd.to_datetime(df['Start_Month'])

# full date range from the first to the last month, one entry per month start
rng = pd.date_range(df['Start_Month'].min(), df['Start_Month'].max(), freq='MS')

# data frame holding the dates
df1 = pd.DataFrame({'Start_Month': rng})

# convert the Bucket field to integer
df['Bucket'] = df['Bucket'].astype(int)

# every bucket value from the minimum to the maximum
Bucket = np.arange(df['Bucket'].min(), df['Bucket'].max() + 1, 1)

# repeat the date range once for every bucket
df1 = pd.concat([df1] * len(Bucket))

# repeat each bucket value across the whole date range
df1['Bucket'] = np.repeat(Bucket, len(rng))

# left-merge with the original data frame and fill the gaps with 0
merged_left = pd.merge(left=df1, right=df, how='left', on=['Start_Month', 'Bucket']).fillna(0)