I have a Spark DataFrame with the following schema:
root
|-- employeeName: string (nullable = true)
|-- employeeId: string (nullable = true)
|-- employeeEmail: string (nullable = true)
|-- company: struct (nullable = true)
| |-- companyName: string (nullable = true)
| |-- companyId: string (nullable = true)
| |-- details: struct (nullable = true)
| | |-- founded: string (nullable = true)
| | |-- address: string (nullable = true)
| | |-- industry: string (nullable = true)
What I want to do is group by companyId and get an array of employees for each company, like this:
root
|-- company: struct (nullable = true)
| |-- companyName: string (nullable = true)
| |-- companyId: string (nullable = true)
| |-- details: struct (nullable = true)
| | |-- founded: string (nullable = true)
| | |-- address: string (nullable = true)
| | |-- industry: string (nullable = true)
|-- employees: array (nullable = true)
| |-- employee: struct (nullable = true)
| | |-- employeeName: string (nullable = true)
| | |-- employeeId: string (nullable = true)
| | |-- employeeEmail: string (nullable = true)
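Of course, if I only had (company, employee) pairs of type (String, String), I could do this easily with map and reduceByKey. But with all the different nested information, I am not sure which approach to take.
Should I try to flatten everything? Any example of doing something similar would be very helpful.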
Answer 0 (score: 1)
You can do the following:
// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)
// setting up example data
val e1 = Employee("n1", "1", "n1@c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3@c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", "5", "n5@c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6@c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7@c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8@c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
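// toDS relies on the session implicits; spark-shell imports them automatically
import spark.implicits._
val ds = sc.parallelize(employees).toDS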
// actual query to achieve what is mentioned in the question
val result = ds.groupByKey(e => e.company).mapGroups((k, itr) => (k, itr.toList))
result.collect
Result:
Array(
(Company(c1,1,d1),WrappedArray(Employee(n1,1,n1@c1.com,Company(c1,1,d1)), Employee(n2,2,n2@c1.com,Company(c1,1,d1)), Employee(n3,3,n3@c1.com,Company(c1,1,d1)))),
(Company(c2,2,d2),WrappedArray(Employee(n4,4,n4@c2.com,Company(c2,2,d2)), Employee(n5,5,n5@c2.com,Company(c2,2,d2)), Employee(n6,6,n6@c2.com,Company(c2,2,d2)))),
(Company(c3,3,d3),WrappedArray(Employee(n7,7,n7@c3.com,Company(c3,3,d3)), Employee(n8,8,n8@c3.com,Company(c3,3,d3)))))
The important part: you can pass any function you like to mapGroups, so each group can be shaped exactly the way you need.
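As an illustration of that point, here is a minimal sketch that maps each group into a shape closer to the schema asked for in the question, with the company once and its employees stripped of the repeated company field. The EmployeeInfo and CompanyEmployees case classes, the sample rows, and the local SparkSession setup are assumptions added for this example and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

// Same data types as in the answer above
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// Hypothetical result types, introduced only for this sketch
case class EmployeeInfo(name: String, id: String, email: String)
case class CompanyEmployees(company: Company, employees: Seq[EmployeeInfo])

// Assumption: a local session; in spark-shell the existing `spark` is reused
val spark = SparkSession.builder().master("local[*]").appName("map-groups-sketch").getOrCreate()
import spark.implicits._

val ds = Seq(
  Employee("n1", "1", "n1@c1.com", Company("c1", "1", "d1")),
  Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1")),
  Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
).toDS()

// Group by the whole company value, then build one record per company
val grouped = ds
  .groupByKey(_.company)
  .mapGroups { (company, members) =>
    // `members` is an Iterator that can be traversed only once, so materialize it here
    CompanyEmployees(company, members.map(e => EmployeeInfo(e.name, e.id, e.email)).toList)
  }

grouped.printSchema()
```

Note that mapGroups hands over every row of a group at once, so a very large group has to fit in memory on a single executor.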
Hope this helps.
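As a side note for the untyped case: the question starts from a DataFrame rather than a typed Dataset, and there is no need to flatten it first. A minimal sketch using groupBy on the struct column together with collect_list and struct from org.apache.spark.sql.functions might look like the following. The case classes here exist only to build sample data matching the question's schema, and all sample values are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, struct}

// Hypothetical case classes mirroring the schema in the question
case class Details(founded: String, address: String, industry: String)
case class CompanyRow(companyName: String, companyId: String, details: Details)
case class EmployeeRow(employeeName: String, employeeId: String,
                       employeeEmail: String, company: CompanyRow)

// Assumption: a local session; in spark-shell the existing `spark` is reused
val spark = SparkSession.builder().master("local[*]").appName("group-by-struct-sketch").getOrCreate()
import spark.implicits._

val c1 = CompanyRow("c1", "1", Details("2001", "addr1", "tech"))
val c2 = CompanyRow("c2", "2", Details("2002", "addr2", "retail"))

val df = Seq(
  EmployeeRow("n1", "1", "n1@c1.com", c1),
  EmployeeRow("n2", "2", "n2@c1.com", c1),
  EmployeeRow("n3", "3", "n3@c2.com", c2)
).toDF()

// Group by the nested company struct and collect the employee fields
// of each group into an array of structs
val grouped = df
  .groupBy(col("company"))
  .agg(
    collect_list(
      struct(col("employeeName"), col("employeeId"), col("employeeEmail"))
    ).as("employees")
  )

grouped.printSchema()
```

Grouping by col("company.companyId") instead is also possible if the id alone identifies a company, but the remaining company fields would then have to be carried through with an aggregate such as first.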