Converting data from columns to rows in Spark Scala

Date: 2017-07-28 11:17:19

Tags: scala apache-spark dataframe transform

Input dataset:

CustomerID CustomerName Sun Mon Tue
   1          ABC        0   12  10
   2          DEF        10   0   0

Required output dataset:

CustomerID CustomerName Day  Value
    1         ABC       Sun   0
    1         ABC       Mon   12
    1         ABC       Tue   10
    2         DEF       Sun   10
    2         DEF       Mon   0
    2         DEF       Tue   0

Please note that my dataset has 82 of these day columns ("Sun", "Mon", "Tue", ...)!

1 Answer:

Answer 0 (score: 2)

Assuming your input dataset was generated using a case class:

case class infos(CustomerID: Int, CustomerName: String, Sun: Int, Mon: Int, Tue: Int)

For testing purposes, I am creating a dataset:

import sqlContext.implicits._ // brings the .toDS conversion and encoders into scope
val ds = Seq(
  infos(1, "ABC", 0, 12, 10),
  infos(2, "DEF", 10, 0, 0)
).toDS

Printing it with ds.show(false) should give your input dataset:

+----------+------------+---+---+---+
|CustomerID|CustomerName|Sun|Mon|Tue|
+----------+------------+---+---+---+
|1         |ABC         |0  |12 |10 |
|2         |DEF         |10 |0  |0  |
+----------+------------+---+---+---+

Getting the final required dataset requires creating another case class:

case class finalInfos(CustomerID: Int, CustomerName: String, Day: String, Value: Int)

The final desired dataset can then be produced with:

val names = ds.schema.fieldNames // Array("CustomerID", "CustomerName", "Sun", "Mon", "Tue")

// emit one output row per day column, pairing each column name with its value
ds.flatMap(row => Array(finalInfos(row.CustomerID, row.CustomerName, names(2), row.Sun),
  finalInfos(row.CustomerID, row.CustomerName, names(3), row.Mon),
  finalInfos(row.CustomerID, row.CustomerName, names(4), row.Tue)))

which should give the dataset:

+----------+------------+---+-----+
|CustomerID|CustomerName|Day|Value|
+----------+------------+---+-----+
|1         |ABC         |Sun|0    |
|1         |ABC         |Mon|12   |
|1         |ABC         |Tue|10   |
|2         |DEF         |Sun|10   |
|2         |DEF         |Mon|0    |
|2         |DEF         |Tue|0    |
+----------+------------+---+-----+
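
Since the question mentions 82 day columns, writing one finalInfos(...) entry per column will not scale. As a minimal generic sketch (my generalization of the above, not from the original answer), assuming every field after CustomerID and CustomerName is an Int day column:

val dayNames = ds.schema.fieldNames.drop(2) // the day column names, in field order

val melted = ds.flatMap { row =>
  // productIterator yields the case class fields in declaration order,
  // so dropping the first two leaves only the day values
  val dayValues = row.productIterator.drop(2).map(_.asInstanceOf[Int])
  dayNames.iterator.zip(dayValues).map { case (day, value) =>
    finalInfos(row.CustomerID, row.CustomerName, day, value)
  }
}

Alternatively, the same unpivot can be sketched in the untyped DataFrame API by exploding an array of (Day, Value) structs, which avoids hard-coding anything beyond the two key columns:

import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

val dayCols = ds.columns.drop(2)
val meltedDf = ds.toDF()
  .withColumn("dv", explode(array(dayCols.map(d =>
    struct(lit(d).as("Day"), col(d).as("Value"))): _*)))
  .select(col("CustomerID"), col("CustomerName"), col("dv.Day"), col("dv.Value"))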