Creating key-value pairs in Spark Scala

Date: 2017-09-07 04:21:44

Tags: scala apache-spark

My schema looks like this:

       root
      |-- id: string (nullable = true)
      |-- info: array (nullable = true)
      |    |-- element: struct (containsNull = true)
      |    |    |-- _1: string (nullable = true)
      |    |    |-- _2: long (nullable = false)
      |    |    |-- _3: string (nullable = true)

info is an array of structs. I want to use info._1 as the key and (info._2, info._3) as the value, after grouping by id. So the output should look like this:

id, [[info[0]._1: {info[0]._2, info[0]._3}], [info[1]._1: {info[1]._2, info[1]._3}], ...]

Please help.

1 Answer:

Answer 0 (score: 0)

This should get you started (a UDF approach):

// assumes a SparkSession in scope: import spark.implicits._
val df = Seq(
  ("1", Seq(("a", 1L, "b"), ("c", 2L, "d")))
).toDF("id", "info")


df.show()

+---+------------------+
| id|              info|
+---+------------------+
|  1|[[a,1,b], [c,2,d]]|
+---+------------------+


import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Turn each array of (String, Long, String) structs into a Map keyed
// by the first field, with the remaining two fields as the value.
val transformStructToMap = udf((structarray: Seq[Row]) =>
  structarray.map(r =>
    r.getString(0) ->                 // key
      (r.getLong(1), r.getString(2))  // value
  ).toMap
)

df.select(
   $"id",
   transformStructToMap($"info").as("info")
 ).show()

+---+---------------------------+
|id |info                       |
+---+---------------------------+
|1  |Map(a -> [1,b], c -> [2,d])|
+---+---------------------------+

I don't really understand what you mean by "after grouping". If you want to concatenate the arrays after grouping by id, you need `collect_list`, and then a UDF to first concatenate (and flatten) the arrays.
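The flatten-then-map step hinted at above can be sketched on plain Scala collections. Here the nested `Seq` stands in for what `collect_list` would produce per id after `groupBy("id")`; the names `collected` and `flattened` are illustrative, not Spark API:

```scala
// Stand-in for the grouped data: one row per id, holding the list of
// info arrays that collect_list would have gathered for that id.
val collected: Seq[(String, Seq[Seq[(String, Long, String)]])] = Seq(
  ("1", Seq(Seq(("a", 1L, "b")), Seq(("c", 2L, "d"))))
)

// Concatenate (flatten) the arrays, then build the key-value Map:
// first struct field -> (second field, third field).
val flattened: Seq[(String, Map[String, (Long, String)])] =
  collected.map { case (id, arrays) =>
    id -> arrays.flatten.map { case (k, v1, v2) => (k, (v1, v2)) }.toMap
  }
// flattened == Seq(("1", Map("a" -> (1L, "b"), "c" -> (2L, "d"))))
```

In Spark itself, the same logic would live inside a UDF applied to the `collect_list` column, since each element of that column is a `Seq` of struct rows rather than a plain tuple.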