Using Spark 1.6 (CDH 5.9.2)... note this code:
println("r1:")
r1.show(20,false)
r1.printSchema()
val r2 = r1.groupBy('cid, 'leadR).agg(collect_set('follower) as "followRz")
println("r2:")
r2.show(20,false)
r2.printSchema()
With this case class (used for the middle and last columns of r1):
case class UuidWrapper(
  id : java.lang.String,
  lastSeenDate : java.lang.Long,
  firstSeenDate : java.lang.Long
)
The intent is to end up with the schema of this case class (for all of r2), using different column names:
case class UuidRelationships(
  clientId : java.lang.Long,
  leader : UuidWrapper,
  followers : Array[UuidWrapper]
)
The preceding code produces the following output:
r1:
+---+------------+-----------+
|cid|leadR       |follower   |
+---+------------+-----------+
|5  |[55,555,555]|[2,555,555]|
|5  |[55,555,555]|[3,555,555]|
|7  |[66,777,777]|[5,777,777]|
|7  |[66,777,777]|[6,777,777]|
+---+------------+-----------+
root
 |-- cid: long (nullable = true)
 |-- leadR: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- lastSeenDate: long (nullable = true)
 |    |-- firstSeenDate: long (nullable = true)
 |-- follower: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- lastSeenDate: long (nullable = true)
 |    |-- firstSeenDate: long (nullable = true)
r2:
+---+------------+--------------------------+
|cid|leadR       |followRz                  |
+---+------------+--------------------------+
|7  |[66,777,777]|[[5,777,777], [6,777,777]]|
|5  |[55,555,555]|[[2,555,555], [3,555,555]]|
+---+------------+--------------------------+
root
 |-- cid: long (nullable = true)
 |-- leadR: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- lastSeenDate: long (nullable = true)
 |    |-- firstSeenDate: long (nullable = true)
 |-- followRz: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- lastseendate: long (nullable = true)
 |    |    |-- firstseendate: long (nullable = true)
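Why are lastSeenDate and firstSeenDate changed to all lowercase (lastseendate and firstseendate) when collect_set is used? Only the struct fields inside followRz are affected; leadR, which is just a grouping column, keeps its camelCase names.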
Note: my workaround is to use lowercase field names... but why does this happen?
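For comparison, an alternative to lowercasing the names might be to cast the aggregated column back to a type that carries the original camelCase field names. The following is only a minimal sketch under stated assumptions: the helper `restoreFieldNames` and the value `wrapperType` are hypothetical names introduced here, and the approach relies on Spark SQL allowing a cast between struct types that differ only in field names. It has not been verified on CDH 5.9.2.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, LongType, StringType, StructField, StructType}

// Struct layout matching UuidWrapper, using the camelCase names shown in the r1 schema.
// (wrapperType is a hypothetical name, not part of the original code.)
val wrapperType = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("lastSeenDate", LongType, nullable = true),
  StructField("firstSeenDate", LongType, nullable = true)
))

// Hypothetical helper: replaces followRz with the same data cast to an array of
// wrapperType, so the element fields are relabelled with the camelCase names.
// Assumption: the cast is accepted because only the field names differ.
def restoreFieldNames(df: DataFrame): DataFrame =
  df.withColumn("followRz", col("followRz").cast(ArrayType(wrapperType, containsNull = true)))
```

If the cast is accepted, `restoreFieldNames(r2).printSchema()` should be the quickest way to check whether the element fields come back as lastSeenDate and firstSeenDate. This only relabels the fields; it does not explain why they were lowercased in the first place.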