I need to collapse the rows and merge the wrapped per-department values. Here are the original data and the expected result. This needs to be done in Spark Scala.
Original data:
Column1 Column2 Units UnitsByDept
ABC BCD 3 [Dept1:1,Dept2:2]
ABC BCD 13 [Dept1:5,Dept3:8]
Expected result:
ABC BCD 16 [Dept1:6,Dept2:2,Dept3:8]
Answer:
Your requirement is probably better served by the DataFrame API (a DataFrame-only sketch is included at the end of this answer). If you would rather use row-based functions such as reduceByKey, one approach is to use reduceByKey to sum Units and to merge the per-department UnitsByDept entries, as in the sample code below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
import spark.implicits._  // needed for toDF (assumes a SparkSession named spark)
val df = Seq(
("ABC", "BCD", 3, Seq("Dept1:1", "Dept2:2")),
("ABC", "BCD", 13, Seq("Dept1:5", "Dept3:8"))
).toDF("Column1", "Column2", "Units", "UnitsByDept")
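// Key each row by (Column1, Column2), sum the Units and concatenate the
// Dept:count lists, then re-aggregate the counts per department within each key.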
val rdd = df.rdd.
  map{ case Row(c1: String, c2: String, u: Int, ubd: Seq[String]) =>
    ((c1, c2), (u, ubd))
  }.
  reduceByKey( (acc, t) => (acc._1 + t._1, acc._2 ++ t._2) ).
  map{ case ((c1, c2), (u, ubd)) =>
    val aggUBD = ubd.map(_.split(":")).map(arr => (arr(0), arr(1).toInt)).
      groupBy(_._1).mapValues(_.map(_._2).sum).
      map{ case (d, u) => d + ":" + u }
    (c1, c2, u, aggUBD)
  }
rdd.collect
// res1: Array[(String, String, Int, scala.collection.immutable.Iterable[String])] =
// Array((ABC,BCD,16,List(Dept3:8, Dept2:2, Dept1:6)))
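// Convert the aggregated tuples back into Rows so the original schema can be reused.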
val rowRDD = rdd.map{ case (c1: String, c2: String, u: Int, ubd: Iterable[String]) =>
  Row(c1, c2, u, ubd.toSeq)  // ArrayType columns expect a Seq, not a bare Iterable
}
val dfResult = spark.createDataFrame(rowRDD, df.schema)
dfResult.show(false)
// +-------+-------+-----+---------------------------+
// |Column1|Column2|Units|UnitsByDept |
// +-------+-------+-----+---------------------------+
// |ABC |BCD |16 |[Dept3:8, Dept2:2, Dept1:6]|
// +-------+-------+-----+---------------------------+
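For completeness, below is a minimal DataFrame-only sketch of the same aggregation, assuming the spark session and the df defined above; the intermediate column names (dept, deptName, deptUnits, deptTotal) are illustrative, not part of the original code. It explodes UnitsByDept, splits each Dept:count entry, sums the counts per department, and collects the totals back into a list:

import org.apache.spark.sql.functions._
import spark.implicits._

// Sum Units per (Column1, Column2) key.
val unitsAgg = df.groupBy($"Column1", $"Column2").
  agg(sum($"Units").as("Units"))

// Explode the Dept:count strings, split them, sum per department,
// then collect the department totals back into a list of strings.
val deptAgg = df.
  withColumn("dept", explode($"UnitsByDept")).
  withColumn("deptName", split($"dept", ":")(0)).
  withColumn("deptUnits", split($"dept", ":")(1).cast("int")).
  groupBy($"Column1", $"Column2", $"deptName").
  agg(sum($"deptUnits").as("deptTotal")).
  groupBy($"Column1", $"Column2").
  agg(collect_list(concat($"deptName", lit(":"), $"deptTotal".cast("string"))).as("UnitsByDept"))

val dfResult2 = unitsAgg.join(deptAgg, Seq("Column1", "Column2"))
dfResult2.show(false)

Note that the order of the entries inside UnitsByDept is not guaranteed with collect_list, just as it is not in the RDD version above.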