Spark DataFrame groupBy with agg(collect_set(...)) produces lowercase struct member names

Time: 2018-03-17 00:09:23

Tags: scala apache-spark spark-dataframe

Using Spark 1.6 (CDH 5.9.2)... note this code:

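import org.apache.spark.sql.functions.collect_set
// The 'cid symbol syntax also needs the implicits of the active SQLContext/HiveContext in scope.

// Input: one row per (cid, leadR, follower).
println("r1:")
r1.show(20, false)
r1.printSchema()

// Collect the distinct followers of each (cid, leadR) pair into one array column.
val r2 = r1.groupBy('cid, 'leadR).agg(collect_set('follower) as "followRz")

println("r2:")
r2.show(20, false)
r2.printSchema()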

With this case class (used for the middle and last columns of r1):

case class UuidWrapper(
                      id : java.lang.String,
                      lastSeenDate : java.lang.Long,
                      firstSeenDate : java.lang.Long
                    )

The intent is to end up with the schema of this case class, which uses different column names (for the whole of r2):

case class UuidRelationships(
                            clientId : java.lang.Long,
                            leader : UuidWrapper,
                            followers : Array[UuidWrapper]
                          )
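The renaming step described above might look like the following minimal sketch. The object and method names (RenameForCaseClass, toRelationshipColumns) are hypothetical and not part of the original code; the column names cid, leadR and followRz are taken from the snippet above. It only relabels the top-level columns and leaves the nested struct field names untouched.

```scala
import org.apache.spark.sql.DataFrame

object RenameForCaseClass {
  // Hypothetical helper: relabel the top-level columns of r2 so they match
  // the field names of UuidRelationships (clientId, leader, followers).
  // Nested struct field names are not changed by this select.
  def toRelationshipColumns(r2: DataFrame): DataFrame =
    r2.select(
      r2("cid").as("clientId"),
      r2("leadR").as("leader"),
      r2("followRz").as("followers")
    )
}
```

Calling RenameForCaseClass.toRelationshipColumns(r2) should return a DataFrame with the same rows as r2 under the new column names.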

The preceding code produces the following output:

r1:
+---+------------+-----------+
|cid|leadR       |follower   |
+---+------------+-----------+
|5  |[55,555,555]|[2,555,555]|
|5  |[55,555,555]|[3,555,555]|
|7  |[66,777,777]|[5,777,777]|
|7  |[66,777,777]|[6,777,777]|
+---+------------+-----------+

root
 |-- cid: long (nullable = true)
 |-- leadR: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- lastSeenDate: long (nullable = true)
 |    |-- firstSeenDate: long (nullable = true)
 |-- follower: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- lastSeenDate: long (nullable = true)
 |    |-- firstSeenDate: long (nullable = true)

r2:
+---+------------+--------------------------+
|cid|leadR       |followRz                  |
+---+------------+--------------------------+
|7  |[66,777,777]|[[5,777,777], [6,777,777]]|
|5  |[55,555,555]|[[2,555,555], [3,555,555]]|
+---+------------+--------------------------+

root
 |-- cid: long (nullable = true)
 |-- leadR: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- lastSeenDate: long (nullable = true)
 |    |-- firstSeenDate: long (nullable = true)
 |-- followRz: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- lastseendate: long (nullable = true)
 |    |    |-- firstseendate: long (nullable = true)

Why are lastSeenDate and firstSeenDate changed to all lowercase (lastseendate and firstseendate) when collect_set is used?

Note: my workaround is to use lowercase names... but why?
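A likely explanation, offered as an assumption rather than something confirmed in this thread: in Spark 1.6, collect_set is documented as an alias for the Hive UDAF of the same name, and Hive handles struct field names case-insensitively, storing them in lowercase. If that is the cause, one alternative to lowercasing everything is to cast the collected column back to a struct type with the intended names. The sketch below assumes that Spark's cast between struct types matches fields by position and ignores their names; the object and method names (FieldCaseFix, restoreFieldCase) are hypothetical, and this has not been checked against CDH 5.9.2.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

object FieldCaseFix {
  // Target struct layout, mirroring UuidWrapper from the question.
  val wrapperType: StructType = StructType(Seq(
    StructField("id", StringType, nullable = true),
    StructField("lastSeenDate", LongType, nullable = true),
    StructField("firstSeenDate", LongType, nullable = true)
  ))

  // Hypothetical helper: replace followRz with a copy cast to
  // array<struct<id, lastSeenDate, firstSeenDate>>. Assumes the cast
  // pairs struct fields by position, so only the labels change.
  def restoreFieldCase(df: DataFrame): DataFrame =
    df.withColumn(
      "followRz",
      df("followRz").cast(ArrayType(wrapperType, containsNull = true))
    )
}
```

If the assumption holds, FieldCaseFix.restoreFieldCase(r2).printSchema() would list lastSeenDate and firstSeenDate under followRz again.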

0 answers:

No answers yet