Question

I have a Spark DataFrame that looks like this:

root
|-- employeeName: string (nullable = true)
|-- employeeId: string (nullable = true)
|-- employeeEmail: string (nullable = true)
|-- company: struct (nullable = true)
|    |-- companyName: string (nullable = true)
|    |-- companyId: string (nullable = true)
|    |-- details: struct (nullable = true)
|    |    |-- founded: string (nullable = true)
|    |    |-- address: string (nullable = true)
|    |    |-- industry: string (nullable = true)

What I want to do is group by companyId and get an array of employees per company, like this:

root
|-- company: struct (nullable = true)
|    |-- companyName: string (nullable = true)
|    |-- companyId: string (nullable = true)
|    |-- details: struct (nullable = true)
|    |    |-- founded: string (nullable = true)
|    |    |-- address: string (nullable = true)
|    |    |-- industry: string (nullable = true)
|-- employees: array (nullable = true)     
|    |-- employee: struct (nullable = true)           
|    |    |-- employeeName: string (nullable = true)
|    |    |-- employeeId: string (nullable = true)
|    |    |-- employeeEmail: string (nullable = true)

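Of course, if I only had (company, employee) pairs of type (String, String), I could do this easily with map and reduceByKey. But with all the nested information, I'm not sure which approach to take.

Should I try to flatten everything? Any example of doing something similar would be very helpful.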

Answer 1

You can do the following:

// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// setting up example data
val e1 = Employee("n1", "1", "n1@c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2@c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3@c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4@c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", "5", "n5@c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6@c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7@c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8@c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
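// .toDS needs the implicit encoders; spark-shell imports them automatically,
// elsewhere add: import spark.implicits._
val ds = sc.parallelize(employees).toDS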

// actual query to achieve what is mentioned in the question
val result = ds.groupByKey(e => e.company).mapGroups((k, itr) => (k, itr.toList))
result.collect

Result:

Array(
(Company(c1,1,d1),WrappedArray(Employee(n1,1,n1@c1.com,Company(c1,1,d1)), Employee(n2,2,n2@c1.com,Company(c1,1,d1)), Employee(n3,3,n3@c1.com,Company(c1,1,d1)))),
(Company(c2,2,d2),WrappedArray(Employee(n4,4,n4@c2.com,Company(c2,2,d2)), Employee(n5,5,n5@c2.com,Company(c2,2,d2)), Employee(n6,6,n6@c2.com,Company(c2,2,d2)))),
(Company(c3,3,d3),WrappedArray(Employee(n7,7,n7@c3.com,Company(c3,3,d3)), Employee(n8,8,n8@c3.com,Company(c3,3,d3)))))

The important point: you can pass any function you like to mapGroups, so each group can be shaped however you need.
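As a rough sketch of that idea, the function passed to mapGroups can drop the repeated company field from each employee, so the output is closer to the (company, employees) shape asked for in the question. The names EmployeeInfo, CompanyEmployees and groupByCompany are hypothetical and exist only for this example. Company keeps the simplified String details field used above, and Spark 2.x or later with spark-sql on the classpath is assumed.

// Same case classes as above
import org.apache.spark.sql.{Dataset, SparkSession}

case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)

// Hypothetical output types, introduced only for this sketch
case class EmployeeInfo(name: String, id: String, email: String)
case class CompanyEmployees(company: Company, employees: Seq[EmployeeInfo])

object GroupEmployees {
  // Group employees by their company and keep only the employee-level fields
  def groupByCompany(ds: Dataset[Employee]): Dataset[CompanyEmployees] = {
    val spark = ds.sparkSession
    import spark.implicits._ // encoders for the case classes

    ds.groupByKey(e => e.company)
      .mapGroups { (company: Company, employees: Iterator[Employee]) =>
        // The iterator is only valid inside this function, so materialize it here
        val infos = employees.map(e => EmployeeInfo(e.name, e.id, e.email)).toList
        CompanyEmployees(company, infos)
      }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("group-employees")
      .getOrCreate()
    import spark.implicits._

    val c1 = Company("c1", "1", "d1")
    val c2 = Company("c2", "2", "d2")
    val ds = Seq(
      Employee("n1", "1", "n1@c1.com", c1),
      Employee("n2", "2", "n2@c1.com", c1),
      Employee("n4", "4", "n4@c2.com", c2)
    ).toDS()

    val grouped = groupByCompany(ds)
    grouped.printSchema()          // a company struct plus an employees array of structs
    grouped.show(truncate = false)
    spark.stop()
  }
}

Because mapGroups materializes each group inside a single task, this is only reasonable when no single company has more employees than fit comfortably in memory.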

Hope this helps.
