Using flatMap or reduceByKey

Date: 2018-07-12 18:29:14

Tags: scala apache-spark

I need to collapse the rows and roll up the units. Below are the original data and the expected result. This needs to be done in Spark with Scala.

Original data:

Column1   Column2   Units   UnitsByDept
ABC       BCD       3       [Dept1:1,Dept2:2]
ABC       BCD       13      [Dept1:5,Dept3:8]

Expected result:

ABC       BCD       16       [Dept1:6,Dept2:2,Dept3:8]

1 Answer:

Answer 0 (score: 0)

Your requirement would be better served by the DataFrame API (a sketch of that approach is shown at the end of this answer). If you prefer to use row-based functions such as reduceByKey, here is one way to do it:

  1. Convert the DataFrame into a PairRDD
  2. Apply reduceByKey to sum Units and aggregate UnitsByDept per department
  3. Convert the resulting RDD back to a DataFrame:

Sample code below:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row

val df = Seq(
  ("ABC", "BCD", 3, Seq("Dept1:1", "Dept2:2")),
  ("ABC", "BCD", 13, Seq("Dept1:5", "Dept3:8"))
).toDF("Column1", "Column2", "Units", "UnitsByDept")

val rdd = df.rdd.
  // Key by (Column1, Column2) and keep (Units, UnitsByDept) as the value
  map{ case Row(c1: String, c2: String, u: Int, ubd: Seq[String]) =>
    ((c1, c2), (u, ubd))
  }.
  // Sum the units and concatenate the department lists per key
  reduceByKey( (acc, t) => (acc._1 + t._1, acc._2 ++ t._2) ).
  map{ case ((c1, c2), (u, ubd)) =>
    // Parse the "Dept:count" strings, sum the counts per department,
    // and render them back into "Dept:total" strings
    val aggUBD = ubd.map(_.split(":")).map(arr => (arr(0), arr(1).toInt)).
      groupBy(_._1).mapValues(_.map(_._2).sum).
      map{ case (d, n) => d + ":" + n }
    ( c1, c2, u, aggUBD)
  }

rdd.collect
// res1: Array[(String, String, Int, scala.collection.immutable.Iterable[String])] =
//   Array((ABC,BCD,16,List(Dept3:8, Dept2:2, Dept1:6)))  

val rowRDD = rdd.map{ case (c1: String, c2: String, u: Int, ubd: Iterable[String]) =>
  // The aggregated UnitsByDept is an Iterable; convert it to a Seq so it
  // maps onto the ArrayType column of the original schema
  Row(c1, c2, u, ubd.toSeq)
}

val dfResult = spark.createDataFrame(rowRDD, df.schema)

dfResult.show(false)
// +-------+-------+-----+---------------------------+
// |Column1|Column2|Units|UnitsByDept                |
// +-------+-------+-----+---------------------------+
// |ABC    |BCD    |16   |[Dept3:8, Dept2:2, Dept1:6]|
// +-------+-------+-----+---------------------------+
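
For reference, the DataFrame API approach mentioned at the top could look roughly like the sketch below. It explodes the "Dept:count" strings, sums the counts per department, and collects them back into an array; the intermediate names (entry, dept, deptUnits, deptTotal) are illustrative, not from the original question:

import org.apache.spark.sql.functions._

// Total units per (Column1, Column2)
val unitsDF = df.groupBy($"Column1", $"Column2").agg(sum($"Units").as("Units"))

// Explode the "Dept:count" entries, sum the counts per department,
// then collect the per-department totals back into an array
val deptDF = df.
  select($"Column1", $"Column2", explode($"UnitsByDept").as("entry")).
  withColumn("dept", split($"entry", ":")(0)).
  withColumn("deptUnits", split($"entry", ":")(1).cast("int")).
  groupBy($"Column1", $"Column2", $"dept").
  agg(sum($"deptUnits").as("deptTotal")).
  groupBy($"Column1", $"Column2").
  agg(collect_list(concat($"dept", lit(":"), $"deptTotal".cast("string"))).as("UnitsByDept"))

unitsDF.join(deptDF, Seq("Column1", "Column2")).show(false)
// Should yield the same single row: ABC, BCD, 16, [Dept1:6, Dept2:2, Dept3:8]
// (the order of entries inside UnitsByDept is not guaranteed)

This keeps everything in the DataFrame API and avoids the round trip through an RDD and Row objects.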