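Pandas DataFrame: create an additional column based on date column comparisons

Date: 2017-03-20 10:38:31

Tags: python python-3.x pandas dataframe

Suppose I have the following dataset stored in a Pandas DataFrame. Note that the last column [Status] is the column I want to create: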

Department  Employee  Issue Date  Submission Date  Status
A           Joe       18/05/2014  25/06/2014       0
A           Joe       1/06/2014   28/06/2014       1
A           Joe       23/06/2014  30/06/2014       2
A           Mark      1/03/2015   13/03/2015       0
A           Mark      23/04/2015  15/04/2015       0
A           William   15/07/2016  30/07/2016       0
A           William   1/08/2016   23/08/2016       0
A           William   20/08/2016  19/08/2016       1
B           Liz       18/05/2014  7/06/2014        0
B           Liz       1/06/2014   15/06/2014       1
B           Liz       23/06/2014  16/06/2014       0
B           John      1/03/2015   13/03/2015       0
B           John      23/04/2015  15/04/2015       0
B           Alex      15/07/2016  30/07/2016       0
B           Alex      1/08/2016   23/08/2016       0
B           Alex      20/08/2016  19/08/2016       1

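I want to create an additional column [Status] based on the following conditions:

  1. For each unique [Department] & [Employee] combination (e.g. there are three rows for Joe in Department A), sort [Issue Date] from oldest to newest.
  2. If the current row's [Issue Date] is greater than ALL previous rows' [Submission Date], flag [Status] as 0; otherwise [Status] = the number of previous rows where [Issue Date] < [Submission Date].
  3. Example: take Employee Joe in Department A. When [Issue Date] = '1/06/2014', the previous row's [Submission Date] is after the [Issue Date], so [Status] = 1 for row 2. Similarly, when [Issue Date] = '23/06/2014', the [Submission Date] values of rows 1 & 2 are both after the [Issue Date], so [Status] = 2 for row 3. This calculation needs to be performed for each unique combination of Department and Employee.

    • Note: the real dataset is not sorted like the example shown.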
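
To make the rule concrete, here is a minimal plain-Python sketch for a single employee. The function name status_for_group and the tuple layout are assumptions made for illustration; they are not part of the dataset or of pandas.

```python
from datetime import date

def status_for_group(rows):
    """rows: list of (issue_date, submission_date) tuples for ONE employee.

    Hypothetical helper: returns the Status values in issue-date order.
    """
    # Condition 1: sort by issue date, oldest to newest.
    ordered = sorted(rows, key=lambda r: r[0])
    statuses = []
    for i, (issue, _) in enumerate(ordered):
        # Condition 2: count earlier rows whose submission date
        # falls after the current row's issue date (0 if there are none).
        statuses.append(sum(1 for _, sub in ordered[:i] if sub > issue))
    return statuses

# Joe's three rows from Department A
joe = [
    (date(2014, 5, 18), date(2014, 6, 25)),
    (date(2014, 6, 1), date(2014, 6, 28)),
    (date(2014, 6, 23), date(2014, 6, 30)),
]
print(status_for_group(joe))  # [0, 1, 2]
```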

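1 Answer:

Answer 0: (score: 0)

This question was posted 6 months ago, but hopefully my answer still provides some help.

First, import the libraries and create the DataFrame: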

# import libraries
import numpy as np
import pandas as pd

# Make DataFrame
df = pd.DataFrame({'Department' : ['A']*8 + ['B']*8,
                   'Employee' : ['Joe']*3 +\
                                ['Mark']*2 +\
                                ['William']*3 +\
                                ['Liz']*3 +\
                                ['John']*2 +\
                                ['Alex']*3,
                   'Issue Date' : ['18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016',
                                   '18/05/2014', '1/06/2014', '23/06/2014',
                                   '1/03/2015', '23/04/2015',
                                   '15/07/2016', '1/08/2016', '20/08/2016'],
                   'Submission Date' : ['25/06/2014', '28/06/2014', '30/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016',
                                        '7/06/2014', '15/06/2014', '16/06/2014',
                                        '13/03/2015', '15/04/2015',
                                        '30/07/2016', '23/08/2016', '19/08/2016']})


Second, convert Issue Date and Submission Date to datetime:

# Convert 'Issue Date', 'Submission Date' to datetime (the strings are day-first)
# Assign whole columns so the dtype becomes datetime64 rather than staying object
df['Issue Date'] = pd.to_datetime(df['Issue Date'], dayfirst = True)
df['Submission Date'] = pd.to_datetime(df['Submission Date'], dayfirst = True)
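
As a quick illustration of why dayfirst matters for these strings, here is a small sketch on one ambiguous value from the dataset. The variable names are illustrative only.

```python
import pandas as pd

# '1/06/2014' is ambiguous: 1 June (day-first) or 6 January (month-first)
with_dayfirst = pd.to_datetime('1/06/2014', dayfirst=True)
default_parse = pd.to_datetime('1/06/2014')

print(with_dayfirst.month)  # 6
print(default_parse.month)  # 1
```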

Third, reset the index and sort the values by Department, Employee and Issue Date:

# Reset index and sort_values by 'Department', 'Employee', 'Issue Date'
# Assign the result back: inplace = True on the chained call would only sort a temporary copy
df = df.reset_index(drop = True).sort_values(by = ['Department',
                                                   'Employee',
                                                   'Issue Date'])
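
One thing worth checking in this step is that the sorted result is actually kept. sort_values returns a new DataFrame, and inplace=True applied to an intermediate result such as the output of reset_index only sorts that temporary object. A small sketch with a made-up three-row frame (df_demo is a hypothetical name):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Employee': ['Joe', 'Joe', 'Joe'],
    'Issue Date': pd.to_datetime(['2014-06-23', '2014-05-18', '2014-06-01']),
})

# Sorts the temporary returned by reset_index; df_demo itself is untouched
df_demo.reset_index(drop=True).sort_values(by='Issue Date', inplace=True)
still_unsorted = not df_demo['Issue Date'].is_monotonic_increasing

# Assigning the result back keeps the sorted rows
sorted_demo = df_demo.sort_values(by=['Employee', 'Issue Date']).reset_index(drop=True)

print(still_unsorted)                                      # True
print(sorted_demo['Issue Date'].is_monotonic_increasing)  # True
```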

Fourth, group by Department and Employee, take a cumulative count of the rows, and insert it into the original df:

# Group by 'Department', 'Employee'; cumulative count rows; insert into original df
df.insert(df.shape[1],
          'grouped count',
          df.groupby(['Department',
                      'Employee']).cumcount())
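
For readers unfamiliar with cumcount, here is a minimal sketch on a made-up frame (demo is a hypothetical name). It numbers the rows within each group starting from 0, in their current row order, which is why the frame needs to be sorted by Issue Date first.

```python
import pandas as pd

demo = pd.DataFrame({'Department': ['A', 'A', 'A', 'B', 'B'],
                     'Employee': ['Joe', 'Joe', 'Mark', 'Liz', 'Liz']})

# Position of each row within its (Department, Employee) group
counts = demo.groupby(['Department', 'Employee']).cumcount()
print(counts.tolist())  # [0, 1, 0, 0, 1]
```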


Fifth, create no_issue and no_submission DataFrames and merge them on Department and Employee:

# Create df without 'Issue Date'
no_issue = df.drop('Issue Date', axis = 1)

# Create df without 'Submission Date'
no_submission = df.drop('Submission Date', axis = 1)

# Outer merge no_issue with no_submission on 'Department', 'Employee'
merged = no_issue.merge(no_submission,
                        how = 'outer',
                        on = ['Department',
                              'Employee'])

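Within each Department, Employee group, this repeats every Submission Date once for each Issue Date in the group, so each group ends up with one row per (Submission Date, Issue Date) pair. Because both frames have a grouped count column, the merge renames them grouped count_x (from no_issue, i.e. the row the Submission Date came from) and grouped count_y (from no_submission, i.e. the row the Issue Date came from).

For Joe, for example, the three original rows become nine rows in merged.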
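
To see the row multiplication concretely, here is a minimal sketch that repeats the merge on Joe's three rows only. The frame joe and the result name pairs are illustrative; the column names match the answer.

```python
import pandas as pd

joe = pd.DataFrame({
    'Department': ['A'] * 3,
    'Employee': ['Joe'] * 3,
    'Issue Date': pd.to_datetime(['2014-05-18', '2014-06-01', '2014-06-23']),
    'Submission Date': pd.to_datetime(['2014-06-25', '2014-06-28', '2014-06-30']),
    'grouped count': [0, 1, 2],
})

# Same merge as above, restricted to one employee
pairs = joe.drop('Issue Date', axis=1).merge(joe.drop('Submission Date', axis=1),
                                             how='outer',
                                             on=['Department', 'Employee'])

print(pairs.shape)  # (9, 6)

# Pairs where the Submission Date comes from an earlier row than the Issue Date
earlier = pairs['grouped count_x'] < pairs['grouped count_y']
print(earlier.sum())  # 3
```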

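Sixth, create a DataFrame that keeps only the rows where grouped count_x is less than grouped count_y (that is, the Submission Date belongs to an earlier row than the Issue Date), then sort by Department, Employee and Issue Date: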

# Create merged1 df that keeps only rows where 'grouped count_x' < 'grouped count_y';
# sort by 'Department', 'Employee', 'Issue Date'
merged1 = merged[merged.loc[:, 'grouped count_x'] <
                 merged.loc[:, 'grouped count_y']].sort_values(by = ['Department',
                                                                     'Employee',
                                                                     'Issue Date'])

Seventh, insert a Status column holding a boolean that is True where Issue Date is less than Submission Date:

# Insert 'Status' as a boolean: True when 'Issue Date' < 'Submission Date'
merged1.insert(merged1.shape[1],
               'Status',
               merged1.loc[:, 'Issue Date'] < merged1.loc[:, 'Submission Date'])

Eighth, group by Department, Employee and Issue Date, sum Status, and reset the index:

# Group by 'Department', 'Employee', 'Issue Date' and sum 'Status'; reset index
# Summing booleans counts the True values; the string 'sum' avoids passing np.sum to agg
merged1 = merged1.groupby(['Department',
                           'Employee',
                           'Issue Date']).agg({'Status' : 'sum'}).reset_index()

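This returns a DataFrame with all the correct statuses, except for the earliest Issue Date of each Department, Employee group (that row has no earlier rows to compare against, so it was filtered out in the sixth step).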
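
The boolean-then-sum idea can be sketched in isolation. The frame below is a hand-written stand-in for Joe's filtered pairs (the name pairs is hypothetical), and it uses a plain column assignment instead of insert for brevity.

```python
import pandas as pd

# Joe's pairs where the Submission Date comes from an earlier row
pairs = pd.DataFrame({
    'Issue Date': pd.to_datetime(['2014-06-01', '2014-06-23', '2014-06-23']),
    'Submission Date': pd.to_datetime(['2014-06-25', '2014-06-25', '2014-06-28']),
})

# True where the earlier submission happened after this issue date
pairs['Status'] = pairs['Issue Date'] < pairs['Submission Date']

# Summing booleans per Issue Date counts the True values
totals = pairs.groupby('Issue Date')['Status'].sum()
print(totals.tolist())  # [1, 2]
```

The earliest Issue Date (18/05/2014) does not appear in the result, which matches the remark about the missing minimum Issue Date.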


Ninth, group the original merged DataFrame by Department and Employee, find the minimum Issue Date, and reset the index:

# Group merged by 'Department', 'Employee' and find min 'Issue Date'; reset index
merged = merged.groupby(['Department',
                         'Employee']).agg({'Issue Date' : 'min'}).reset_index()

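Tenth, concatenate merged1 with merged, fill NaN with 0 (because the Status of the minimum Issue Date is always 0), and sort by Department, Employee and Issue Date: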

# Concatenate merged with merged1; fill na with 0; sort by 'Department', 'Employee', 'Issue Date'
concatenated = pd.concat([merged1, merged]).fillna(0).sort_values(by = ['Department',
                                                                        'Employee',
                                                                        'Issue Date'])

Eleventh, inner merge df with the concatenated DataFrame on Department, Employee and Issue Date, then drop grouped count:

# Merge concatenated with df; drop grouped count
final = df.merge(concatenated,
                 how = 'inner',
                 on = ['Department',
                       'Employee',
                       'Issue Date']).drop('grouped count',
                                           axis = 1)

Voilà! Here is your final DataFrame:

# Final df
final

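
As a cross-check on the merge-based steps, the same Status values can be computed per group with NumPy broadcasting. This is an alternative sketch, not the method used in the answer above; the function name add_status and the frame demo are assumptions for illustration, and it assumes both date columns are already datetime.

```python
import numpy as np
import pandas as pd

def add_status(frame):
    """Hypothetical helper: return a sorted copy of frame with a Status column."""
    out = (frame.sort_values(['Department', 'Employee', 'Issue Date'])
                .reset_index(drop=True))
    out['Status'] = 0
    for _, idx in out.groupby(['Department', 'Employee']).groups.items():
        issue = out.loc[idx, 'Issue Date'].to_numpy()
        sub = out.loc[idx, 'Submission Date'].to_numpy()
        # later[i, j] is True when row j's submission is after row i's issue date
        later = sub[None, :] > issue[:, None]
        # Keep only pairs where row j comes before row i (strict lower triangle)
        out.loc[idx, 'Status'] = np.tril(later, k=-1).sum(axis=1)
    return out

# Joe and William from Department A, deliberately unsorted
demo = pd.DataFrame({
    'Department': ['A'] * 6,
    'Employee': ['William', 'Joe', 'William', 'Joe', 'William', 'Joe'],
    'Issue Date': pd.to_datetime(['2016-08-20', '2014-06-23', '2016-07-15',
                                  '2014-05-18', '2016-08-01', '2014-06-01']),
    'Submission Date': pd.to_datetime(['2016-08-19', '2014-06-30', '2016-07-30',
                                       '2014-06-25', '2016-08-23', '2014-06-28']),
})

result = add_status(demo)
print(result['Status'].tolist())  # [0, 1, 2, 0, 0, 1]
```

The comparison matrix grows with the square of the group size, so this sketch suits data where each employee has a modest number of rows.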