I have a dataframe like the following:
+-------+-------------+----------+
|manager|employee name| position|
+-------+-------------+----------+
| A| A1| Associate|
| A| A2|Contractor|
| A| A3| Associate|
| A| A4| Associate|
| B| B1|Contractor|
| B| B2| Associate|
| B| B3|Contractor|
+-------+-------------+----------+
I want to find the total number of associates and contractors under each manager, so the result df would look like:
+-------+---------------+----------------+
|manager|Associate Count|Contractor Count|
+-------+---------------+----------------+
| A| 3| 1|
| B| 1| 2|
+-------+---------------+----------------+
Answer 0: (score: 3)
Use a simple pivot on the "position" column with count("position"), along with the imports:

import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(
  ("A", "A1", "Associate"),
  ("A", "A2", "Contractor"),
  ("A", "A3", "Associate"),
  ("A", "A4", "Associate"),
  ("B", "B1", "Contractor"),
  ("B", "B2", "Associate"),
  ("B", "B3", "Contractor")
).toDF("manager", "employee", "position")

df.groupBy("manager").pivot("position").agg(count("position")).show
// +-------+---------+----------+
// |manager|Associate|Contractor|
// +-------+---------+----------+
// | B| 1| 2|
// | A| 3| 1|
// +-------+---------+----------+
This produces the desired result.
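Note that the pivot output above keeps the raw position values ("Associate", "Contractor") as column names rather than the "Associate Count" / "Contractor Count" headers the question asked for. As a quick local cross-check of the same aggregation plus the renaming step, here is a pandas sketch; pandas is an assumption here (the question itself uses Spark), used only because it runs without a Spark session:

```python
import pandas as pd

# Same rows as the Spark Seq(...) above.
df = pd.DataFrame(
    [("A", "A1", "Associate"), ("A", "A2", "Contractor"),
     ("A", "A3", "Associate"), ("A", "A4", "Associate"),
     ("B", "B1", "Contractor"), ("B", "B2", "Associate"),
     ("B", "B3", "Contractor")],
    columns=["manager", "employee", "position"],
)

# pivot_table with aggfunc="count" mirrors
# groupBy("manager").pivot("position").agg(count("position")),
# and rename() produces the headers from the desired output.
result = (
    df.pivot_table(index="manager", columns="position",
                   values="employee", aggfunc="count", fill_value=0)
      .rename(columns={"Associate": "Associate Count",
                       "Contractor": "Contractor Count"})
      .reset_index()
)
print(result)
```

In Spark the equivalent renaming step would be chained `withColumnRenamed` calls on the pivoted result.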
Answer 1: (score: 2)
After grouping by manager, you can pivot on position and count:
df.groupBy($"manager")
  .pivot("position")
  .agg(count("position"))
  .show
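For this particular manager-by-position count table, the pivot can also be expressed as a plain cross-tabulation. A minimal pandas sketch (again assuming pandas purely for local illustration, not part of the original Spark answers):

```python
import pandas as pd

# Same rows as in the question.
df = pd.DataFrame(
    [("A", "A1", "Associate"), ("A", "A2", "Contractor"),
     ("A", "A3", "Associate"), ("A", "A4", "Associate"),
     ("B", "B1", "Contractor"), ("B", "B2", "Associate"),
     ("B", "B3", "Contractor")],
    columns=["manager", "employee", "position"],
)

# crosstab counts occurrences of each (manager, position) pair,
# which is exactly what groupBy + pivot + count computes.
counts = pd.crosstab(df["manager"], df["position"])
print(counts)
```

This yields one row per manager and one column per position value, with cell values equal to the counts.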