I have a DataFrame like this:
Start_MONTH Bucket Count Complete Partial
10/01/2015 0 57 91 0.66
11/01/2015 0 678 8 0.99
02/01/2016 0 68 12 0.12
10/01/2015 1 78 79 0.22
11/01/2015 1 99 56 0.67
1/01/2016 1 789 67 0.78
10/01/2015 3 678 178 0.780
11/01/2015 3 2880 578 0.678
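I basically need to fill in every missing Start_Month (12/01/2015, 01/01/2016, ... are missing) for each bucket, and also every missing bucket (such as 2). The remaining columns (Count, Complete, Partial) should be zero for the missing bucket and Start_Month combinations. I think relativedelta(months=+1) would help, but I am not sure how to use it.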
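For reference, a minimal sketch of how relativedelta(months=+1) can step through the month starts in that range. The variable names (start, end, months) are illustrative assumptions and not part of the original post; python-dateutil is assumed to be installed, which is normally the case wherever pandas is.

```python
from datetime import date

from dateutil.relativedelta import relativedelta

# Assumed endpoints, taken from the sample data in the question
start = date(2015, 10, 1)
end = date(2016, 2, 1)

# Step forward one calendar month at a time until the end is passed
months = []
current = start
while current <= end:
    months.append(current)
    current += relativedelta(months=+1)

for month in months:
    print(month.isoformat())
```

pd.date_range(start, end, freq='MS') produces the same sequence of month starts without an explicit loop.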
import pandas as pd
data = [['10/01/2015',0 ,57 ,91,0.66],
['11/01/2015',0, 678, 8,0.99],
['02/01/2016',0,68,12,0.12],
['10/01/2015' ,1, 78,79,0.22],
['11/01/2015' ,1 ,99,56, 0.67],
['1/01/2016', 1 ,789,67,0.78],
['10/01/2015', 3,678, 178, 0.780],
['11/01/2015' ,3, 2880,578,0.678]]
df = pd.DataFrame(data, columns = ['Start_Month', 'Bucket', 'Count',
'Complete','Partial'])
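Basically, I want every Start_Month and Bucket combination to appear, with the other values set to 0 where no data exists. That is, all months from 10/01/2015 to 02/01/2016 should be there (12/01/2015 and 01/01/2016 are missing for some buckets), and all buckets 0-3 should be there (2 is missing).
I tried this, and it does part of what I need: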
df['Start_Month'] = pd.to_datetime(df['Start_Month'])
# A list of columns is selected with double brackets; the tuple form is rejected by recent pandas
s = df.groupby(['Bucket', pd.Grouper(key='Start_Month', freq='MS')])[['Count', 'Complete', 'Partial']].sum()
df1 = (s.reset_index(level=0)
        .groupby('Bucket')[['Count', 'Complete', 'Partial']]
        .apply(lambda x: x.asfreq('MS'))
        .reset_index())
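This adds some of the missing months, but it does not repeat the full month range for every bucket, and it does not add the bucket integers that are missing in between.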
Answer 0 (score: 1)
This is frankly a hack of a solution, until someone posts how it should really be done:
data = [
['10/01/2015',0 ,57 ,91,0.66],
['11/01/2015',0, 678, 8,0.99],
['02/01/2016',0,68,12,0.12],
['10/01/2015' ,1, 78,79,0.22],
['11/01/2015' ,1 ,99,56, 0.67],
['1/01/2016', 1 ,789,67,0.78],
['10/01/2015', 3,678, 178, 0.780],
['11/01/2015' ,3, 2880,578,0.678]
]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count', 'Complete','Partial'])#.set_index('Start_Month')
df['Start_Month'] = pd.to_datetime(df['Start_Month'])
Start_Month Bucket Count Complete Partial
0 2015-10-01 0 57 91 0.660
1 2015-11-01 0 678 8 0.990
2 2016-02-01 0 68 12 0.120
3 2015-10-01 1 78 79 0.220
4 2015-11-01 1 99 56 0.670
5 2016-01-01 1 789 67 0.780
6 2015-10-01 3 678 178 0.780
7 2015-11-01 3 2880 578 0.678
# Built from a dict so that Start_Month keeps its datetime64 dtype for the merge below
df0 = pd.DataFrame({'Start_Month': pd.date_range('2015-10-01', '2016-02-01', freq='MS'),
                    'Bucket': 0})
Start_Month Bucket
0 2015-10-01 0
1 2015-11-01 0
2 2015-12-01 0
3 2016-01-01 0
4 2016-02-01 0
df_filt = df[(df['Start_Month'].isin(df0['Start_Month'])) & (df['Bucket'] == 0)]
df0 = pd.merge(df0, df_filt, left_on='Start_Month', right_on='Start_Month', how='outer')
df0 = df0.drop('Bucket_y', axis=1).rename(columns={'Bucket_x': 'Bucket'})
Start_Month Bucket Count Complete Partial
0 2015-10-01 0 57.0 91.0 0.66
1 2015-11-01 0 678.0 8.0 0.99
2 2015-12-01 0 NaN NaN NaN
3 2016-01-01 0 NaN NaN NaN
4 2016-02-01 0 68.0 12.0 0.12
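(The same steps for the other buckets are not shown because they are repetitive; you could of course do this in a loop.) Then concatenate all 4 DataFrames and fill the NaNs with zeros.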
# Concat
df_final = pd.concat([df0, df1, df2, df3], axis=0).fillna(0)
Start_Month Bucket Count Complete Partial
0 2015-10-01 0 57.0 91.0 0.660
1 2015-11-01 0 678.0 8.0 0.990
2 2015-12-01 0 0.0 0.0 0.000
3 2016-01-01 0 0.0 0.0 0.000
4 2016-02-01 0 68.0 12.0 0.120
0 2015-10-01 1 78.0 79.0 0.220
1 2015-11-01 1 99.0 56.0 0.670
2 2015-12-01 1 0.0 0.0 0.000
3 2016-01-01 1 789.0 67.0 0.780
4 2016-02-01 1 0.0 0.0 0.000
0 2015-10-01 2 0.0 0.0 0.000
1 2015-11-01 2 0.0 0.0 0.000
2 2015-12-01 2 0.0 0.0 0.000
3 2016-01-01 2 0.0 0.0 0.000
4 2016-02-01 2 0.0 0.0 0.000
0 2015-10-01 3 678.0 178.0 0.780
1 2015-11-01 3 2880.0 578.0 0.678
2 2015-12-01 3 0.0 0.0 0.000
3 2016-01-01 3 0.0 0.0 0.000
4 2016-02-01 3 0.0 0.0 0.000
def get_separate_df(df, bucket_num):
    # Full monthly range for this bucket; built from a dict so Start_Month stays datetime64
    months = pd.date_range('2015-10-01', '2016-02-01', freq='MS')
    df_bucket = pd.DataFrame({'Start_Month': months, 'Bucket': bucket_num})
    # Rows of the original data that belong to this bucket
    df_filt = df[(df['Start_Month'].isin(df_bucket['Start_Month'])) &
                 (df['Bucket'] == bucket_num)]
    # The outer merge keeps every month; months without data get NaN
    df_bucket = pd.merge(df_bucket, df_filt, on='Start_Month', how='outer')
    df_bucket = df_bucket.drop('Bucket_y', axis=1).rename(columns={'Bucket_x': 'Bucket'})
    return df_bucket
dfs = [get_separate_df(df, i) for i in range(4)]
# Concat
df_final = pd.concat(dfs, axis=0).fillna(0)
For the question in the comments, you can get an empty DataFrame containing the repeated date and bucket sequences like this:
bucket_list = [ele for ele in [0,1,2,3] for i in range(5)]
dates = list(pd.date_range('2015-10-01', '2016-02-01', freq='MS'))*4
df = pd.DataFrame(data=[dates, bucket_list]).T.rename(columns={0:'Start_Month', 1:'Bucket'})
Output:
Start_Month Bucket
0 2015-10-01 0
1 2015-11-01 0
2 2015-12-01 0
3 2016-01-01 0
4 2016-02-01 0
5 2015-10-01 1
6 2015-11-01 1
7 2015-12-01 1
8 2016-01-01 1
9 2016-02-01 1
10 2015-10-01 2
11 2015-11-01 2
12 2015-12-01 2
13 2016-01-01 2
14 2016-02-01 2
15 2015-10-01 3
16 2015-11-01 3
17 2015-12-01 3
18 2016-01-01 3
19 2016-02-01 3
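As a possible alternative for the same empty grid, here is a minimal sketch using itertools.product. It is not part of the original answer, and the names months, buckets and grid are illustrative assumptions. Building the frame from (bucket, month) tuples lets pandas infer an integer Bucket column and a datetime Start_Month column, whereas the transpose-based construction tends to leave both columns with object dtype.

```python
from itertools import product

import pandas as pd

# Assumed ranges, matching the sample data above
months = pd.date_range('2015-10-01', '2016-02-01', freq='MS')
buckets = [0, 1, 2, 3]

# product() varies the last argument fastest, so each bucket gets the full run of months
grid = pd.DataFrame(list(product(buckets, months)), columns=['Bucket', 'Start_Month'])

# Reorder the columns to match the layout shown above
grid = grid[['Start_Month', 'Bucket']]
print(grid)
```

The resulting grid can then be left-merged with the original data on ['Start_Month', 'Bucket'] and filled with zeros.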
Answer 1 (score: 0)
I wrote something similar, but generalized it:
import pandas as pd
import numpy as np
# converting date string to date
df['Start_Month'] = pd.to_datetime(df['Start_Month'])
# finding the date range, stepping by one month start at a time
rng = pd.date_range(df['Start_Month'].min(),df['Start_Month'].max(), freq='MS')
# creating date dataframe
df1 = pd.DataFrame({ 'Start_Month': rng})
# Converting bucket field to integer
df['Bucket'] = df['Bucket'].astype(int)
# finding the bucket values max and min
Bucket=np.arange(df['Bucket'].min(),df['Bucket'].max()+1,1)
# Repeating the date range for every bucket
df1=pd.concat([df1]*len(Bucket))
# repeating bucket values to each date
df1['Bucket']=np.repeat(Bucket, len(rng))
# merging to the previous dataframe and filling it with 0
merged_left = pd.merge(left=df1, right=df, how='left', on=['Start_Month','Bucket']).fillna(0)
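The steps above can also be sketched more compactly with a MultiIndex and reindex. This is a swapped-in technique, not the answer's own method, and the names months, buckets, full_index and filled are illustrative. It assumes the dates are month-first strings and that each (Bucket, Start_Month) pair occurs at most once, since reindex needs a unique index.

```python
import pandas as pd

data = [['10/01/2015', 0, 57, 91, 0.66],
        ['11/01/2015', 0, 678, 8, 0.99],
        ['02/01/2016', 0, 68, 12, 0.12],
        ['10/01/2015', 1, 78, 79, 0.22],
        ['11/01/2015', 1, 99, 56, 0.67],
        ['1/01/2016', 1, 789, 67, 0.78],
        ['10/01/2015', 3, 678, 178, 0.780],
        ['11/01/2015', 3, 2880, 578, 0.678]]
df = pd.DataFrame(data, columns=['Start_Month', 'Bucket', 'Count', 'Complete', 'Partial'])

# Assumption: the strings are month/day/year
df['Start_Month'] = pd.to_datetime(df['Start_Month'], format='%m/%d/%Y')

# Every month start and every bucket between the observed minimum and maximum
months = pd.date_range(df['Start_Month'].min(), df['Start_Month'].max(), freq='MS')
buckets = range(df['Bucket'].min(), df['Bucket'].max() + 1)

# All (bucket, month) combinations
full_index = pd.MultiIndex.from_product([buckets, months], names=['Bucket', 'Start_Month'])

# Reindex onto the full grid; combinations with no data are filled with 0
filled = (df.set_index(['Bucket', 'Start_Month'])
            .reindex(full_index, fill_value=0)
            .reset_index())
print(filled)
```

Compared with merging and then calling fillna(0), passing fill_value=0 to reindex avoids introducing NaN into the data columns in the first place.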