Group a Spark dataframe and further count based on column values

Time: 2019-01-28 18:03:17

Tags: scala apache-spark apache-spark-sql

I have a dataframe as below:

+-------+-------------+----------+
|manager|employee name|  position|
+-------+-------------+----------+
|      A|           A1| Associate|
|      A|           A2|Contractor|
|      A|           A3| Associate|
|      A|           A4| Associate|
|      B|           B1|Contractor|
|      B|           B2| Associate|
|      B|           B3|Contractor|
+-------+-------------+----------+

I want to find the total number of Associates and Contractors under each manager, so the resulting df would look like:

+-------+---------------+----------------+
|manager|Associate Count|Contractor Count|
+-------+---------------+----------------+
|      A|              3|               1|
|      B|              1|               2|
+-------+---------------+----------------+

2 Answers:

Answer 0 (score: 3)

A simple pivot on the "position" column with count("position") produces the desired result:

import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(
  ("A", "A1", "Associate"),
  ("A", "A2", "Contractor"),
  ("A", "A3", "Associate"),
  ("A", "A4", "Associate"),
  ("B", "B1", "Contractor"),
  ("B", "B2", "Associate"),
  ("B", "B3", "Contractor")
).toDF("manager", "employee", "position")

df.groupBy("manager").pivot("position").agg(count("position")).show
// +-------+---------+----------+
// |manager|Associate|Contractor|
// +-------+---------+----------+
// |      B|        1|         2|
// |      A|        3|         1|
// +-------+---------+----------+
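As an optional variant not in the original answer: if the distinct position values are known up front, they can be passed to pivot explicitly. Spark then skips the extra job it would otherwise run to collect the distinct values, and the output column order is fixed:

// Explicit pivot values: avoids a pass over the data to discover
// distinct positions and pins the column order to Associate, Contractor.
df.groupBy("manager")
  .pivot("position", Seq("Associate", "Contractor"))
  .agg(count("position"))
  .show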

Answer 1 (score: 2)

After grouping by manager, you can pivot on position and count:

df.groupBy($"manager")
  .pivot("position")       // one result column per distinct position value
  .agg(count("position"))  // count of rows per manager/position pair
  .show
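Note that pivot names the result columns after the position values (Associate, Contractor). To match the headers in the desired output exactly, the pivoted columns can be renamed afterwards, for example with withColumnRenamed (a minimal sketch using the df defined above):

df.groupBy($"manager")
  .pivot("position")
  .agg(count("position"))
  .withColumnRenamed("Associate", "Associate Count")    // match desired header
  .withColumnRenamed("Contractor", "Contractor Count")  // match desired header
  .show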