Pandas: add columns based on date

Time: 2017-10-24 19:18:07

Tags: python pandas

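A person, or "worker", can hold several position assignments (jobs) over time. Each position assignment has a valid-from and a valid-to date. I want to get the FullTimeEquivalency of each worker on the 16th of every month.

DataFrame df3 contains a list of persons. DataFrame df4 contains a list of positions. df3.RecId_x equals df4.Worker and should be used to link the two DataFrames.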

df3.columns 
Out[81]: Index(['PersonnelNumber', 'Person', 'RecId_x', 'NameAlias'], dtype='object')

df4.columns
Out[82]: 
Index(['Worker', 'Position', 'ValidFrom_ass', 'ValidTo_ass', 'Description',
       'FullTimeEquivalency', 'Department'],
      dtype='object')

Example:

df3.head(2)
Out[84]: 
  PersonnelNumber      Person     RecId_x NameAlias
0               2  5637162883  5637144780      Mr A
1               6  5637162893  5637144784      Mr B



df4[df4['Worker']==5637144780]
Out[86]: 
         Worker    Position       ValidFrom_ass         ValidTo_ass  \
793  5637144780  5637158077 2017-01-01 01:00:00 2017-02-20 00:59:59   
875  5637144780  5637158076 2017-02-21 01:00:00 2020-01-10 00:59:59   

    Description  FullTimeEquivalency  Department  
793    Position1                  1.0  5637336774  
875    Position2                  0.9  5637336774  

Goal:

My goal is to add columns 'jan_fte', 'feb_fte', ... to df3, holding each worker's FTE for that month.

My attempt:

df3['Jan_fte']= df4[(df4['Worker']==df3.RecId_x) & (df4['ValidFrom_ass'] <= '2017-01-16') & (df4['ValidTo_ass'] >='2017-01-16')]

ValueError: Can only compare identically-labeled Series objects
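The error comes from `df4['Worker']==df3.RecId_x`: comparing two Series with `==` works label by label, and pandas refuses when the two indexes differ. The sketch below reproduces this with two small Series built from the sample values in this question (the names `worker` and `rec_id` are only for illustration) and shows `isin` as one way to test membership instead of elementwise equality.

```python
import pandas as pd

# Stand-ins for df4['Worker'] and df3['RecId_x'], using the sample values.
worker = pd.Series([5637144780, 5637144780], index=[793, 875])
rec_id = pd.Series([5637144780, 5637144784], index=[0, 1])

try:
    worker == rec_id              # index labels differ, so pandas raises
    raised = False
except ValueError as exc:
    raised = True
    print(exc)

# Membership test: which assignment rows belong to any listed worker?
mask = worker.isin(rec_id)
print(mask.tolist())
```

Note that `isin` only answers "is this worker in df3 at all"; pairing each person with their own assignments still needs a merge or a per-worker filter.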

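Expected df3: NameAlias,jan_fte,feb_fte,mar_fte,..

Mr A,1.0,1.0,0.9,..

jan_fte is 1.0 because on 2017-01-16 Mr A was assigned to Position 1, which has FullTimeEquivalency 1.0.
feb_fte is 1.0 because on 2017-02-16 Mr A was still assigned to Position 1, with FullTimeEquivalency 1.0.
mar_fte is 0.9 because on 2017-03-16 Mr A was assigned to Position 2, which has FullTimeEquivalency 0.9.

Data to reproduce: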

import pandas as pd
#df3 dict as df8
df8 = pd.DataFrame({'NameAlias': {0: 'Mr A', 1: 'Mr B'},
 'Person': {0: 5637162883, 1: 5637162893},
 'PersonnelNumber': {0: '2', 1: '6'},
 'RecId_x': {0: 5637144780, 1: 5637144784}})

#df4 filtered on worker 5637144780 dict as df9:
df9 = pd.DataFrame({'Department': {793: 5637336774, 875: 5637336774},
 'Description': {793: 'Position 1', 875: 'Position 2'},
 'FullTimeEquivalency': {793: 1.0, 875: 0.90000000000000002},
 'Position': {793: 5637158077, 875: 5637158076},
 'ValidFrom_ass': {793: pd.Timestamp('2017-01-01 01:00:00'),
  875: pd.Timestamp('2017-02-21 01:00:00')},
 'ValidTo_ass': {793: pd.Timestamp('2017-02-20 00:59:59'),
  875: pd.Timestamp('2020-01-10 00:59:59')},
 'Worker': {793: 5637144780, 875: 5637144780}})

1 answer:

Answer 0 (score: 0)

I found a way to get the desired result.

## SETUP DATA TO REPRODUCE:
import pandas as pd
from pandas import Timestamp
#df3 dict as df8
df8 = pd.DataFrame({'NameAlias': {0: 'anonymous',
  1: 'anonymous',
  2: 'anonymous',
  3: 'anonymous',
  4: 'anonymous'},
 'Person': {0: 5637163197,
  1: 5637198703,
  2: 5637336887,
  3: 5637191544,
  4: 5637163123},

 'RecId_x': {0: 5637144954,
  1: 5637145759,
  2: 5637163507,
  3: 5637145684,
  4: 5637144903}})

#df4 as df9:
df9 = pd.DataFrame({'FullTimeEquivalency': {202: 1.0,
  252: 0.80000000000000004,
  255: 0.80000000000000004,
  258: 0.80000000000000004,
  354: 1.0,
  386: 1.0,
  639: 0.80000000000000004,
  690: 0.0,
  696: 1.0,
  731: 1.0},
 'ValidFrom_ass': {202: Timestamp('2015-11-01 01:00:00'),
  252: Timestamp('2010-01-01 01:00:00'),
  255: Timestamp('2010-01-02 01:00:00'),
  258: Timestamp('2016-01-01 01:00:00'),
  354: Timestamp('2010-01-01 01:00:00'),
  386: Timestamp('2010-09-21 02:00:00'),
  639: Timestamp('2015-01-01 01:00:00'),
  690: Timestamp('2014-04-01 02:00:00'),
  696: Timestamp('2015-01-26 01:00:00'),
  731: Timestamp('2017-05-01 02:00:00')},
 'ValidFrom_pos': {202: Timestamp('2015-11-01 01:00:00'),
  252: Timestamp('2010-01-01 01:00:00'),
  255: Timestamp('2010-01-02 01:00:00'),
  258: Timestamp('2016-01-01 01:00:00'),
  354: Timestamp('2010-01-01 01:00:00'),
  386: Timestamp('2010-09-21 02:00:00'),
  639: Timestamp('2015-01-01 01:00:00'),
  690: Timestamp('2014-04-01 02:00:00'),
  696: Timestamp('2015-01-26 01:00:00'),
  731: Timestamp('2017-05-01 02:00:00')},
 'ValidTo_ass': {202: Timestamp('2154-12-31 00:59:59'),
  252: Timestamp('2010-01-02 00:59:59'),
  255: Timestamp('2016-01-01 00:59:59'),
  258: Timestamp('2154-12-31 00:59:59'),
  354: Timestamp('2010-09-21 01:59:59'),
  386: Timestamp('2154-12-31 00:59:59'),
  639: Timestamp('2154-12-31 00:59:59'),
  690: Timestamp('2015-01-26 00:59:59'),
  696: Timestamp('2017-05-01 01:59:59'),
  731: Timestamp('2154-12-31 00:59:59')},
 'ValidTo_pos': {202: Timestamp('2154-12-31 00:59:59'),
  252: Timestamp('2010-01-02 00:59:59'),
  255: Timestamp('2016-01-01 00:59:59'),
  258: Timestamp('2154-12-31 00:59:59'),
  354: Timestamp('2010-09-21 01:59:59'),
  386: Timestamp('2154-12-31 00:59:59'),
  639: Timestamp('2154-12-31 00:59:59'),
  690: Timestamp('2015-01-26 00:59:59'),
  696: Timestamp('2017-05-01 01:59:59'),
  731: Timestamp('2154-12-31 00:59:59')},
 'Worker': {202: 5637163507,
  252: 5637144903,
  255: 5637144903,
  258: 5637144903,
  354: 5637144954,
  386: 5637144954,
  639: 5637145684,
  690: 5637145759,
  696: 5637145759,
  731: 5637145759}})

print('--Dataframe df8--')
print(df8)
print('--Dataframe df9--')
print(df9)

#SOLUTION:

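cols = []
# The 16th of every month in 2017: month starts shifted by 15 days.
dr = pd.date_range(start='2017-01-01', 
                   end='2017-12-31',
                   freq='MS'
                   ).shift(15, freq='D')

for date in dr:
    # Abbreviated month name, used as the column name (e.g. 'Jan').
    format_date = date.strftime('%b')
    cols.append(format_date)
    for wkr in df8.RecId_x.values:
        # Sum the FTE of all assignments of this worker valid on `date`.
        # .sum() of an empty selection is 0.0, so no try/except is needed.
        val = df9[(df9['Worker']==wkr) & 
                  (df9['ValidFrom_ass'] <= date) & 
                  (df9['ValidTo_ass'] >= date) & 
                  (df9['ValidFrom_pos'] <= date) & 
                  (df9['ValidTo_pos'] >= date)].FullTimeEquivalency.sum()

        df8.loc[df8.RecId_x==wkr,format_date] = val
# Average FTE over the twelve monthly snapshots.
df8['mean_fte'] = df8[cols].mean(axis=1)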

print('--Desired output:--')
print(df8)
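
The nested loop above filters df9 once per worker and month. As a possible alternative, the same table can be built without Python loops by cross-joining the assignments with the twelve snapshot dates, keeping the rows that are valid on each date, and pivoting. This is only a sketch under assumptions: it uses the two-person sample from the question (renamed `workers` and `assignments` here for illustration), checks only ValidFrom_ass/ValidTo_ass and not the `_pos` columns, and relies on `how='cross'`, which requires pandas 1.2 or later.

```python
import pandas as pd

# Sample data from the question, reduced to the columns that are needed.
workers = pd.DataFrame({'NameAlias': ['Mr A', 'Mr B'],
                        'RecId_x': [5637144780, 5637144784]})
assignments = pd.DataFrame({
    'Worker': [5637144780, 5637144780],
    'FullTimeEquivalency': [1.0, 0.9],
    'ValidFrom_ass': pd.to_datetime(['2017-01-01 01:00:00',
                                     '2017-02-21 01:00:00']),
    'ValidTo_ass': pd.to_datetime(['2017-02-20 00:59:59',
                                   '2020-01-10 00:59:59']),
})

# The 16th of each month in 2017.
dates = pd.date_range('2017-01-01', '2017-12-31', freq='MS').shift(15, freq='D')

# Pair every assignment with every snapshot date, keep the valid pairs.
cand = assignments.merge(pd.DataFrame({'date': dates}), how='cross')
valid = cand[(cand['ValidFrom_ass'] <= cand['date']) &
             (cand['ValidTo_ass'] >= cand['date'])]

# One row per worker, one column per snapshot date, FTE summed.
fte = valid.pivot_table(index='Worker', columns='date',
                        values='FullTimeEquivalency', aggfunc='sum')
fte = fte.reindex(columns=dates)                        # keep all 12 months
fte.columns = [d.strftime('%b') for d in fte.columns]   # locale-dependent names
month_cols = list(fte.columns)

# Attach to the person list; workers without any assignment get 0.0.
result = workers.merge(fte.reset_index(), left_on='RecId_x',
                       right_on='Worker', how='left').drop(columns='Worker')
result[month_cols] = result[month_cols].fillna(0.0)
result['mean_fte'] = result[month_cols].mean(axis=1)
print(result)
```

The cross join creates len(assignments) x 12 rows, which is fine for a list of this size but grows quickly if many more snapshot dates are added.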