Suppose I have the following dataset stored in a Pandas DataFrame. Note that the last column, [Status], is the column I want to create:
Department  Employee  Issue Date  Submission Date  Status
A           Joe       18/05/2014  25/06/2014       0
A           Joe       1/06/2014   28/06/2014       1
A           Joe       23/06/2014  30/06/2014       2
A           Mark      1/03/2015   13/03/2015       0
A           Mark      23/04/2015  15/04/2015       0
A           William   15/07/2016  30/07/2016       0
A           William   1/08/2016   23/08/2016       0
A           William   20/08/2016  19/08/2016       1
B           Liz       18/05/2014  7/06/2014        0
B           Liz       1/06/2014   15/06/2014       1
B           Liz       23/06/2014  16/06/2014       0
B           John      1/03/2015   13/03/2015       0
B           John      23/04/2015  15/04/2015       0
B           Alex      15/07/2016  30/07/2016       0
B           Alex      1/08/2016   23/08/2016       0
B           Alex      20/08/2016  19/08/2016       1
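I want to create the additional column [Status] based on the following condition:
For example, take employee Joe in Department A. When [Issue Date] = '1/06/2014', the [Submission Date] of the previous row falls after that [Issue Date], so [Status] for row 2 is 1. Similarly, when [Issue Date] = '23/06/2014', the [Submission Date] values of rows 1 and 2 both fall after that [Issue Date], so [Status] for row 3 is 2. This calculation needs to be performed for each unique combination of Department and Employee.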
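Read literally, the rule says: for each row, count the earlier rows of the same Department and Employee whose Submission Date is later than this row's Issue Date. A minimal sketch of that reading follows. The helper name `count_open_submissions` is hypothetical, and the sketch assumes the rows of a group are already ordered by Issue Date.

```python
import pandas as pd

def count_open_submissions(group: pd.DataFrame) -> pd.Series:
    """For each row, count earlier rows in the group submitted after this row's Issue Date."""
    issue = group["Issue Date"].tolist()
    submitted = group["Submission Date"].tolist()
    counts = [
        sum(submitted[j] > issue[i] for j in range(i))  # only rows above row i
        for i in range(len(group))
    ]
    return pd.Series(counts, index=group.index)

# Joe's three rows from the table in the question
fmt = "%d/%m/%Y"
joe = pd.DataFrame({
    "Issue Date": pd.to_datetime(["18/05/2014", "1/06/2014", "23/06/2014"], format=fmt),
    "Submission Date": pd.to_datetime(["25/06/2014", "28/06/2014", "30/06/2014"], format=fmt),
})
joe_status = count_open_submissions(joe).tolist()
print(joe_status)  # [0, 1, 2]
```

This brute-force version is quadratic in the group size, which is fine for small groups and useful as a reference when checking a faster approach.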
Answer 0 (score: 0)
This question was posted 6 months ago, but hopefully my answer can still be of some help.
First, import the libraries and create the DataFrame:
# import libraries
import numpy as np
import pandas as pd

# Make DataFrame
df = pd.DataFrame({'Department' : ['A']*8 + ['B']*8,
                   'Employee' : ['Joe']*3 +
                                ['Mark']*2 +
                                ['William']*3 +
                                ['Liz']*3 +
                                ['John']*2 +
                                ['Alex']*3,
                   'Issue Date' : ['18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016',
                                   '18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016'],
                   'Submission Date' : ['25/06/2014', '28/06/2014', '30/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016',
                                        '7/06/2014', '15/06/2014', '16/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016']})
Second, convert Issue Date and Submission Date to datetime:
# Convert 'Issue Date', 'Submission Date' to datetime
# (assign to df[col] so the column dtype becomes datetime64 instead of staying object)
df['Issue Date'] = pd.to_datetime(df['Issue Date'],
                                  dayfirst = True)
df['Submission Date'] = pd.to_datetime(df['Submission Date'],
                                       dayfirst = True)
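The `dayfirst` argument matters here because a string such as '1/06/2014' is ambiguous. A small sketch of the difference, using one date from the question (the variable names are only for illustration):

```python
import pandas as pd

ambiguous = "1/06/2014"

# With dayfirst=True the string is read as day/month/year: 1 June 2014
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# Without it, pandas reads month/day/year: 6 January 2014
month_first = pd.to_datetime(ambiguous)

print(day_first.date(), month_first.date())
```

If every date in a column is known to follow one layout, passing an explicit `format="%d/%m/%Y"` is a stricter alternative to `dayfirst`.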
Third, reset the index and sort the values by Department, Employee, and Issue Date:
# Sort by 'Department', 'Employee', 'Issue Date', then reset the index;
# assign the result back so that df itself is reordered
df = df.sort_values(by = ['Department',
                          'Employee',
                          'Issue Date']).reset_index(drop = True)
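A caution about sorting in a method chain: `inplace = True` applies to whatever object the method is called on, so when `sort_values` is chained onto `reset_index` it sorts a temporary copy and returns `None`. The sketch below uses a hypothetical two-row frame `demo` to show the effect and the assign-back pattern that avoids it.

```python
import pandas as pd

demo = pd.DataFrame({"Employee": ["Mark", "Joe"], "n": [1, 2]})

# inplace=True acts on the temporary returned by reset_index, not on demo
result = demo.reset_index(drop=True).sort_values(by="Employee", inplace=True)
unchanged_order = demo["Employee"].tolist()   # demo is still in its original order

# Assigning the sorted result back is what actually reorders the rows
demo = demo.sort_values(by="Employee").reset_index(drop=True)
sorted_order = demo["Employee"].tolist()

print(result, unchanged_order, sorted_order)
```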
Fourth, group by Department and Employee, take a cumulative count of the rows, and insert it into the original df:
# Group by 'Department', 'Employee'; cumulative count rows; insert into original df
df.insert(df.shape[1],
          'grouped count',
          df.groupby(['Department',
                      'Employee']).cumcount())
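For readers unfamiliar with `cumcount`, it numbers the rows of each group from 0 in their current order, which is why the sort has to happen first. A tiny illustration on a hypothetical frame:

```python
import pandas as pd

demo = pd.DataFrame({"Employee": ["Joe", "Joe", "Joe", "Mark", "Mark"]})

# Each row gets its 0-based position within its own group
position = demo.groupby("Employee").cumcount().tolist()
print(position)  # [0, 1, 2, 0, 1]
```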
Fifth, create no_issue and no_submission DataFrames and merge them on Department and Employee:
# Create df without 'Issue Date'
no_issue = df.drop('Issue Date', axis = 1)
# Create df without 'Submission Date'
no_submission = df.drop('Submission Date', axis = 1)
# Outer merge no_issue with no_submission on 'Department', 'Employee'
merged = no_issue.merge(no_submission,
                        how = 'outer',
                        on = ['Department',
                              'Employee'])
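This repeats each Submission Date once for every Issue Date within each Department, Employee group. For Joe, 3 Submission Dates paired with 3 Issue Dates give 9 rows. Because both frames carry a grouped count column, the merge renames them to grouped count_x (from no_issue) and grouped count_y (from no_submission).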
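To make the repetition concrete, the sketch below rebuilds only Joe's three rows and applies the same kind of outer merge. It is an illustration on a hand-built frame named `joe`, not output copied from the original answer.

```python
import pandas as pd

fmt = "%d/%m/%Y"
joe = pd.DataFrame({
    "Department": ["A"] * 3,
    "Employee": ["Joe"] * 3,
    "Issue Date": pd.to_datetime(["18/05/2014", "1/06/2014", "23/06/2014"], format=fmt),
    "Submission Date": pd.to_datetime(["25/06/2014", "28/06/2014", "30/06/2014"], format=fmt),
})
joe["grouped count"] = joe.groupby(["Department", "Employee"]).cumcount()

# Every submission row (suffix _x) is paired with every issue row (suffix _y)
joe_merged = joe.drop(columns="Issue Date").merge(
    joe.drop(columns="Submission Date"),
    how="outer",
    on=["Department", "Employee"],
)
print(joe_merged)
```

Printing `joe_merged` shows each of the three Submission Dates next to each of the three Issue Dates, along with the two position columns used by the next step.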
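Sixth, create a DataFrame that keeps only the rows where grouped count_x is less than grouped count_y, then sort by Department, Employee, and Issue Date: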
# Create merged1 df that keeps only rows where 'grouped count_x' < 'grouped count_y'
# (the submission row comes before the issue row); sort by 'Department', 'Employee', 'Issue Date'
merged1 = merged[merged.loc[:, 'grouped count_x'] <
                 merged.loc[:, 'grouped count_y']].sort_values(by = ['Department',
                                                                     'Employee',
                                                                     'Issue Date'])
Seventh, insert a Status column holding a boolean that is True where Issue Date is earlier than Submission Date:
# Insert 'Status' as a boolean that is True when 'Issue Date' < 'Submission Date'
merged1.insert(merged1.shape[1],
               'Status',
               merged1.loc[:, 'Issue Date'] < merged1.loc[:, 'Submission Date'])
Eighth, group by Department, Employee, and Issue Date, sum Status, and reset the index:
# Group by 'Department', 'Employee', 'Issue Date' and sum 'Status'
# (summing booleans counts the True values); reset index
merged1 = merged1.groupby(['Department',
                           'Employee',
                           'Issue Date']).agg({'Status' : 'sum'}).reset_index()
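This returns a DataFrame with all the correct Status values, except that the earliest Issue Date of each Department, Employee group is missing, because no earlier row exists to pair it with.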
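The filter, compare, and sum steps can be traced on Joe's rows alone. The following is a compact sketch of the same idea rather than the answer's exact code: the column `pos` stands in for grouped count, and the frame is hand-built for illustration.

```python
import pandas as pd

fmt = "%d/%m/%Y"
joe = pd.DataFrame({
    "Employee": ["Joe"] * 3,
    "Issue Date": pd.to_datetime(["18/05/2014", "1/06/2014", "23/06/2014"], format=fmt),
    "Submission Date": pd.to_datetime(["25/06/2014", "28/06/2014", "30/06/2014"], format=fmt),
})
joe["pos"] = range(len(joe))  # stands in for 'grouped count'

# Pair every submission (_x) with every issue (_y), then keep earlier submissions only
pairs = joe.drop(columns="Issue Date").merge(joe.drop(columns="Submission Date"), on="Employee")
earlier = pairs[pairs["pos_x"] < pairs["pos_y"]]

# Per issue date, count the earlier submissions that fall after it
status = (
    (earlier["Issue Date"] < earlier["Submission Date"])
    .groupby(earlier["Issue Date"])
    .sum()
)
print(status)
```

The result has entries for 1/06/2014 and 23/06/2014 only. Joe's first Issue Date, 18/05/2014, has no earlier row to pair with, so it drops out and has to be added back with a Status of 0.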
Ninth, group the original merged DataFrame by Department and Employee, find the minimum Issue Date, and reset the index:
# Group merged by 'Department', 'Employee' and find min 'Issue Date'; reset index
merged = merged.groupby(['Department',
                         'Employee']).agg({'Issue Date' : 'min'}).reset_index()
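Tenth, concatenate merged1 with merged, fill NaN with 0 (because the Status for the minimum Issue Date is always 0), and sort by Department, Employee, and Issue Date: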
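# Concatenate merged1 with merged; fill na with 0; cast 'Status' back to int
# (the NaN introduced by concat turns it into float); sort by 'Department', 'Employee', 'Issue Date'
concatenated = pd.concat([merged1, merged]).fillna(0).astype({'Status' : int}).sort_values(by = ['Department',
                                                                                                 'Employee',
                                                                                                 'Issue Date'])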
Eleventh, inner-merge the original DataFrame with the concatenated DataFrame on Department, Employee, and Issue Date, then drop grouped count:
# Merge concatenated with df; drop grouped count
final = df.merge(concatenated,
                 how = 'inner',
                 on = ['Department',
                       'Employee',
                       'Issue Date']).drop('grouped count',
                                           axis = 1)
Voilà! Here is your final DataFrame:
# Final df
final
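For reuse, the steps above can be condensed into one function. This is a sketch of the same self-merge idea, not the answer's code verbatim: the function name `add_status` and the helper column `pos` are hypothetical, and the counts are joined back on row position rather than on Issue Date so that duplicate Issue Dates within a group would not collide.

```python
import pandas as pd

KEYS = ["Department", "Employee"]

def add_status(df: pd.DataFrame) -> pd.DataFrame:
    """Return df sorted by KEYS and Issue Date, with a Status column added."""
    out = df.sort_values(KEYS + ["Issue Date"]).reset_index(drop=True)
    out["pos"] = out.groupby(KEYS).cumcount()  # same role as 'grouped count'

    # Pair every submission (_x) with every issue (_y) inside each group
    pairs = out.drop(columns="Issue Date").merge(
        out.drop(columns="Submission Date"), on=KEYS
    )
    # Keep pairs where the submission row precedes the issue row
    pairs = pairs[pairs["pos_x"] < pairs["pos_y"]]
    pairs = pairs.assign(late=pairs["Submission Date"] > pairs["Issue Date"])

    # Count the late earlier submissions for each issue row
    counts = pairs.groupby(KEYS + ["pos_y"])["late"].sum()
    counts = counts.rename("Status").reset_index().rename(columns={"pos_y": "pos"})

    # The first row of each group has no pairs, so the left merge leaves NaN there
    out = out.merge(counts, how="left", on=KEYS + ["pos"])
    out["Status"] = out["Status"].fillna(0).astype(int)
    return out.drop(columns="pos")

# The 16 rows from the question
raw = [
    ("A", "Joe", "18/05/2014", "25/06/2014"),
    ("A", "Joe", "1/06/2014", "28/06/2014"),
    ("A", "Joe", "23/06/2014", "30/06/2014"),
    ("A", "Mark", "1/03/2015", "13/03/2015"),
    ("A", "Mark", "23/04/2015", "15/04/2015"),
    ("A", "William", "15/07/2016", "30/07/2016"),
    ("A", "William", "1/08/2016", "23/08/2016"),
    ("A", "William", "20/08/2016", "19/08/2016"),
    ("B", "Liz", "18/05/2014", "7/06/2014"),
    ("B", "Liz", "1/06/2014", "15/06/2014"),
    ("B", "Liz", "23/06/2014", "16/06/2014"),
    ("B", "John", "1/03/2015", "13/03/2015"),
    ("B", "John", "23/04/2015", "15/04/2015"),
    ("B", "Alex", "15/07/2016", "30/07/2016"),
    ("B", "Alex", "1/08/2016", "23/08/2016"),
    ("B", "Alex", "20/08/2016", "19/08/2016"),
]
df = pd.DataFrame(raw, columns=KEYS + ["Issue Date", "Submission Date"])
for col in ["Issue Date", "Submission Date"]:
    df[col] = pd.to_datetime(df[col], format="%d/%m/%Y")

result = add_status(df)
print(result)
```

Note that the output is sorted by Department and Employee, so Department B appears in the order Alex, John, Liz rather than in the order used in the question.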