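Is there a way to do this in pandas? Or should I try SQL (I'm not very familiar with SQL)? Here is what I have so far (a fake example; the real one has about 20,000 people):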
employee_df =
EmpID Name Title ManagerID
abc123 John Head pqr456
pqr456 Jake VP bs92999
zyx987 Jill Lead abc123
bs92999 Bob SVP NaN
Duplicate the dataframe: manager_df = employee_df
roster = pd.merge(manager_df, employee_df, how='outer', left_on ='ManagerID', right_on = 'EmpID')
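My output is messy, although it looks correct (it tells me who each manager is by having the table reference itself rather than a separate table).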
EmpID_x Name_x Title_x ManagerID_x EmpID_y Name_y Title_y ManagerID_y
abc123 John Head pqr456 pqr456 Jake VP bs92999
pqr456 Jake VP bs92999 bs92999 Bob SVP NaN
zyx987 Jill Lead abc123 abc123 John Head pqr456
bs92999 Bob SVP NaN NaN NaN NaN NaN
NaN NaN NaN NaN zyx987 Jill Lead abc123
The most common desired output is:
EmpID | Name | Title | Manager_Name
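But sometimes I also need another level (the boss's boss). The maximum is about 5 levels, which seems crazy, but there is a lot of hierarchy. The higher levels are not always necessary, but I would like to be able to roll the data up to a higher level when I need to: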
EmpID | Name | Title | Manager_Name_1 | Manager_Name_2
The third dataframe is reporting_df:
EmpID | ManagerID | StartDate | EndDate
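Sometimes management changes even happen mid-month, and the result should reflect whoever 'owned' that employee on that day.
file = any file or report that has an EmpID, where I might want to find out who the manager (or their manager) was on the date that is also included in the file. Is this the right way to approach the problem?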
for i in range(len(file)):
    file.ix[i,'Manager'] = reporting_df[(reporting_df.StartDate.shift(-1) > file.StartDate[i]) &(reporting_df.StartDate <= file.Date[i])]
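One possible shape for this date-based lookup, as a rough sketch rather than a tested solution: join the report to reporting_df on EmpID, then keep only the reporting row whose date range covers the report date. The sample rows, the name `report` (used instead of `file`), the inclusive EndDate, and the use of a missing EndDate for the current manager are all assumptions made for illustration. It also assumes the date ranges for one employee never overlap.

```python
import pandas as pd

# Made-up reporting history: Jill (zyx987) changes manager mid-March.
reporting_df = pd.DataFrame({
    "EmpID": ["zyx987", "zyx987", "abc123"],
    "ManagerID": ["abc123", "pqr456", "pqr456"],
    "StartDate": pd.to_datetime(["2020-01-01", "2020-03-16", "2020-01-01"]),
    # Assumption: a missing EndDate means the assignment is still current.
    "EndDate": pd.to_datetime(["2020-03-15", None, None]),
})

# Hypothetical report with one row per (employee, date) of interest.
report = pd.DataFrame({
    "EmpID": ["zyx987", "zyx987", "abc123"],
    "Date": pd.to_datetime(["2020-03-10", "2020-03-20", "2020-02-01"]),
})

# Pair every report row with every reporting period of the same employee,
# keeping the original row label in the "index" column.
candidates = report.reset_index().merge(reporting_df, on="EmpID", how="left")

# Keep the period that covers the report date (EndDate treated as inclusive).
active = (candidates["StartDate"] <= candidates["Date"]) & (
    candidates["EndDate"].isna() | (candidates["Date"] <= candidates["EndDate"])
)
manager_on_date = candidates.loc[active].set_index("index")["ManagerID"]

# Assignment aligns on the original row labels of `report`.
report["ManagerID"] = manager_on_date
print(report)
```

This avoids the row-by-row loop, at the cost of a temporarily larger joined table. If periods can overlap, the final assignment would need a rule for choosing one row per report line.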
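Answer 0 (score: 3)
This part can be a bit tricky, so let's build it up in steps. First, let's rename the columns slightly to make things easier later (just append '_0' to three of the columns):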
EmpID Name_0 Title_0 ManagerID_0
0 abc123 John Head pqr456
1 pqr456 Jake VP bs92999
2 zyx987 Jill Lead abc123
3 bs92999 Bob SVP NaN
The main trick here is that we need a mapping, which can be done with a Series:
df.set_index('EmpID')['Name_0']
The key is that we set 'EmpID' as the index, which gives us a mapping from 'EmpID' to 'Name_0'. We can do the same for 'Title_0' and 'ManagerID_0'.
Try it on one column:
df['ManagerID_0'].map( df.set_index('EmpID')['Name_0'] )
0 Jake
1 Bob
2 John
3 NaN
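For anyone who wants to run this step on its own, here is a small sketch that rebuilds the sample frame and applies the same mapping. The variable names `name_by_id` and `manager_name` are just illustrative. It relies on 'EmpID' being unique, since mapping through a Series with duplicate index values raises an error.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with the '_0' suffix already applied.
df = pd.DataFrame({
    "EmpID": ["abc123", "pqr456", "zyx987", "bs92999"],
    "Name_0": ["John", "Jake", "Jill", "Bob"],
    "Title_0": ["Head", "VP", "Lead", "SVP"],
    "ManagerID_0": ["pqr456", "bs92999", "abc123", np.nan],
})

# Series indexed by EmpID: acts as a lookup table EmpID -> Name_0.
name_by_id = df.set_index("EmpID")["Name_0"]

# Translate each employee's manager ID into the manager's name.
manager_name = df["ManagerID_0"].map(name_by_id)
print(manager_name.tolist())  # ['Jake', 'Bob', 'John', nan]
```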
Now just wrap it in a 'for' loop to get the full version:
for i in range(3):
    for col in ['Name_', 'Title_', 'ManagerID_']:
        df[col + str(i + 1)] = df['ManagerID_' + str(i)].map(
            df.set_index('EmpID')[col + '0'])
EmpID Name_0 Title_0 ManagerID_0 Name_1 Title_1 ManagerID_1 Name_2
0 abc123 John Head pqr456 Jake VP bs92999 Bob
1 pqr456 Jake VP bs92999 Bob SVP NaN NaN
2 zyx987 Jill Lead abc123 John Head pqr456 Jake
3 bs92999 Bob SVP NaN NaN NaN NaN NaN
Title_2 ManagerID_2 Name_3 Title_3 ManagerID_3
0 SVP NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 VP bs92999 Bob SVP NaN
3 NaN NaN NaN NaN NaN
I set the range to 3 because by that point 'ManagerID_3' is NaN for everyone, but if you have more levels you can of course set it higher.
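If you would rather not pick the range by hand, the loop can be wrapped in a function that stops once a level has no managers left. This is only a sketch: `add_manager_levels` and `max_levels` are hypothetical names, and it assumes the original column names from the question ('Name', 'Title', 'ManagerID') and a unique 'EmpID'.

```python
import numpy as np
import pandas as pd

def add_manager_levels(employee_df, max_levels=5):
    """Hypothetical helper: add Name_k / Title_k / ManagerID_k columns per level."""
    out = employee_df.rename(
        columns={"Name": "Name_0", "Title": "Title_0", "ManagerID": "ManagerID_0"}
    )
    # Lookup table keyed by EmpID; built once, before any level columns exist.
    lookup = out.set_index("EmpID")
    for level in range(1, max_levels + 1):
        prev_manager = out[f"ManagerID_{level - 1}"]
        if prev_manager.isna().all():
            break  # nobody has a manager at this level, so stop early
        for col in ["Name_", "Title_", "ManagerID_"]:
            out[f"{col}{level}"] = prev_manager.map(lookup[f"{col}0"])
    return out

employee_df = pd.DataFrame({
    "EmpID": ["abc123", "pqr456", "zyx987", "bs92999"],
    "Name": ["John", "Jake", "Jill", "Bob"],
    "Title": ["Head", "VP", "Lead", "SVP"],
    "ManagerID": ["pqr456", "bs92999", "abc123", np.nan],
})

result = add_manager_levels(employee_df)
print(result.columns.tolist())
```

On the sample data this stops after level 3, which matches the table above. The input frame is left unchanged because `rename` returns a new frame.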
Answer 1 (score: 0)
You can do the join in pandas with the merge function:
# Self-join: match each employee's ManagerID to the manager's EmpID
x = new_df2[['EmpID', 'ManagerID', 'Name']].merge(
    new_df2[['EmpID', 'ManagerID', 'Name']],
    left_on='ManagerID', right_on='EmpID', how='left')
x = x[['EmpID_x', 'Name_x', 'Name_y']].sort_values(by='Name_y')  # sort by manager name
x = x.rename(columns={"Name_x": "Employee_Name", "Name_y": "Manager_Name"})
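To get the EmpID | Name | Title | Manager_Name layout from the question without the _x/_y suffixes, one option is to rename the manager-side columns before merging. This is a sketch on the question's sample data; `managers` and `roster` are just illustrative names.

```python
import numpy as np
import pandas as pd

employee_df = pd.DataFrame({
    "EmpID": ["abc123", "pqr456", "zyx987", "bs92999"],
    "Name": ["John", "Jake", "Jill", "Bob"],
    "Title": ["Head", "VP", "Lead", "SVP"],
    "ManagerID": ["pqr456", "bs92999", "abc123", np.nan],
})

# Manager-side view: each employee's ID and name, renamed so the key lines up
# with ManagerID and no column names collide in the merge.
managers = employee_df[["EmpID", "Name"]].rename(
    columns={"EmpID": "ManagerID", "Name": "Manager_Name"}
)

# A left join keeps every employee, including the one with no manager.
roster = employee_df.merge(managers, on="ManagerID", how="left")
roster = roster[["EmpID", "Name", "Title", "Manager_Name"]]
print(roster)
```

Using how='left' rather than how='outer' also avoids the extra all-NaN row for employees who manage nobody.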