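I want to find the top-level hierarchy for employees in an organization and assign reporting levels using PySpark
We have already solved this problem with Spark GraphX in Scala. We would like to do the same in Python, preferably with DataFrames rather than GraphFrames. Is this possible with Spark DataFrames? If not, we will go with GraphFrames.
There are two DataFrames: employee_df and required_hierarchy_df.
Please refer to the example below:
required_hierarchy_df: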
employee_id | designation | supervisor_id | supervisor_designation
10 | Developer | 05 | Technical Lead
employee_df:
employee_id | designation | supervisor_id | supervisor_designation
10 | Developer | 05 | Technical Lead
05 | Technical Lead | 04 | Manager
04 | Director | 03 | Sr. Director
03 | Sr. Director | 02 | Chairman
02 | Chairman | 01 | CEO
01 | CEO | null | null
Expected output:
Reporting levels for the employee:
report_level_df:
employee_id | level_1_id | level_2_id | level_3_id | level_4_id | level_5_id
10 | 05 | 04 | 03 | 02 | 01
Top-level hierarchy information in the organization:
top_level_df:
employee_id | designation | top_level_id | top_level_designation
10 | Developer | 01 | CEO
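Answer 0 (score: 1)
Consider not using Spark at all, since you only have 2 million rows. A dict-, graph-, or tree-like data structure makes this very simple. I would recommend against doing this with Spark DataFrames.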
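As a rough illustration of the dict-based approach, here is a minimal sketch in plain Python. It assumes the employee-to-supervisor pairs fit in memory; the names `supervisor_of` and `reporting_chain` are made up for this example, and the cycle check is an extra precaution that the question does not ask for.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory copy of the employee -> supervisor pairs from the
# question. None marks the top of the hierarchy.
supervisor_of: Dict[int, Optional[int]] = {
    10: 5, 5: 4, 4: 3, 3: 2, 2: 1, 1: None,
}


def reporting_chain(employee_id: int) -> List[int]:
    """Return the supervisors above employee_id, nearest supervisor first."""
    chain: List[int] = []
    seen = {employee_id}
    current = supervisor_of.get(employee_id)
    while current is not None:
        if current in seen:
            # Guard against bad data such as two employees supervising each other.
            raise ValueError("cycle detected at employee {}".format(current))
        seen.add(current)
        chain.append(current)
        current = supervisor_of.get(current)
    return chain


chain = reporting_chain(10)
top_id = chain[-1] if chain else None
print(chain)   # [5, 4, 3, 2, 1]
print(top_id)  # 1
```

The list corresponds to level_1_id through level_5_id in report_level_df, and its last element is the top_level_id for top_level_df. Designations could be looked up from a second dict in the same way.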
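With Spark DataFrames you can solve this with repeated self-joins that build the DataFrame report_level_df. It is neither a pretty nor an efficient solution.
We are only interested in the employee-supervisor relationships.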
edges = employee_df.select('employee_id', 'supervisor_id')
Going one step up the hierarchy takes a single join.
level_0 = edges \
    .withColumnRenamed('employee_id', 'level_0') \
    .withColumnRenamed('supervisor_id', 'level_1')
level_1 = edges \
    .withColumnRenamed('employee_id', 'level_1') \
    .withColumnRenamed('supervisor_id', 'level_2')
# Join, sort columns and show
level_0 \
    .join(level_1, on='level_1') \
    .select('level_0', 'level_1', 'level_2') \
    .show()
We want to walk the whole chain, repeating that join once per level.
total = edges \
    .withColumnRenamed('employee_id', 'level_0') \
    .withColumnRenamed('supervisor_id', 'level_1')
levels = 10
for i in range(1, levels):
    # Rename the edge columns so they line up with the next level to attach
    level_i = edges \
        .withColumnRenamed('employee_id', 'level_{}'.format(i)) \
        .withColumnRenamed('supervisor_id', 'level_{}'.format(i+1))
    # Left join so employees whose chain has already ended are kept
    total = total \
        .join(level_i, on='level_{}'.format(i), how='left')
# Sort columns and show
total \
    .select(['level_{}'.format(n) for n in range(levels)]) \
    .show()
Except that we do not want to guess the number of levels, so we check after every join whether we are done. That check requires a pass over all the data, so it is slow.
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

# Reuse the active session if there is one
spark = SparkSession.builder.getOrCreate()

schema = 'employee_id int, supervisor_id int'
edges = spark.createDataFrame([[10, 5], [5, 4], [4, 3], [3, 2], [2, 1], [1, None]], schema=schema)
total = edges \
    .withColumnRenamed('employee_id', 'level_0') \
    .withColumnRenamed('supervisor_id', 'level_1')
i = 1
while True:
    this_level = 'level_{}'.format(i)
    next_level = 'level_{}'.format(i+1)
    level_i = edges \
        .withColumnRenamed('employee_id', this_level) \
        .withColumnRenamed('supervisor_id', next_level)
    total = total \
        .join(level_i, on=this_level, how='left')
    # Stop once the newest level is null for every row
    if total.where(f.col(next_level).isNotNull()).count() == 0:
        break
    i += 1
# Sort columns and show
total \
    .select(['level_{}'.format(n) for n in range(i+2)]) \
    .show()
Result
+-------+-------+-------+-------+-------+-------+-------+
|level_5|level_4|level_3|level_2|level_1|level_0|level_6|
+-------+-------+-------+-------+-------+-------+-------+
| null| null| null| null| null| 1| null|
| null| null| null| null| 1| 2| null|
| null| null| null| 1| 2| 3| null|
| null| null| 1| 2| 3| 4| null|
| null| 1| 2| 3| 4| 5| null|
| 1| 2| 3| 4| 5| 10| null|
+-------+-------+-------+-------+-------+-------+-------+
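The question also asks for the top-level id, which is the highest non-null level in each row of this wide result. The sketch below shows that selection in plain Python on rows represented as dicts, for example after collecting a small result. The helper name `top_level_id` is made up for this example. In Spark itself, `pyspark.sql.functions.coalesce` applied to the level columns from highest to lowest should give the same value, though that variant is not shown or tested here.

```python
from typing import Dict, Optional


def top_level_id(row: Dict[str, Optional[int]]) -> Optional[int]:
    """Return the highest non-null level_N value, ignoring level_0 (the employee)."""
    supervisor_cols = sorted(
        (c for c in row if c.startswith("level_") and c != "level_0"),
        key=lambda c: int(c.split("_")[1]),
        reverse=True,
    )
    for col in supervisor_cols:
        if row[col] is not None:
            return row[col]
    # No supervisor at all: this employee is the top of the hierarchy.
    return None


# Three rows copied from the result table above, written as dicts.
rows = [
    {"level_0": 10, "level_1": 5, "level_2": 4, "level_3": 3,
     "level_4": 2, "level_5": 1, "level_6": None},
    {"level_0": 4, "level_1": 3, "level_2": 2, "level_3": 1,
     "level_4": None, "level_5": None, "level_6": None},
    {"level_0": 1, "level_1": None, "level_2": None, "level_3": None,
     "level_4": None, "level_5": None, "level_6": None},
]

top_levels = {row["level_0"]: top_level_id(row) for row in rows}
print(top_levels)  # {10: 1, 4: 1, 1: None}
```

Employee 1 has no supervisor, so the helper returns None for that row rather than the employee's own id. Whether the CEO should count as their own top level is a choice the question leaves open.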