Flattening records by key column with Spark

Time: 2017-10-17 20:18:27

Tags: scala apache-spark-sql spark-dataframe rdd

I am trying to implement logic to flatten records using the Spark/Scala API. I am trying to use the map function.

Could you please help me with the simplest approach to solve this problem?

Suppose that, for a given key, I need to have 3 process codes.

Input dataframe ->

Keycol|processcode
John  |1
Mary  |8
John  |2
John  |4
Mary  |1
Mary  |7

==============================

Output dataframe ->

Keycol|processcode1|processcode2|processcode3
John  |1           |2           |4
Mary  |8           |1           |7

2 Answers:

Answer 0 (score: 1):

Assuming the number of rows per Keycol is the same, one approach would be to aggregate processcode into an array per Keycol and then expand it into individual columns:

// collect_list, size, col, etc. are not auto-imported in spark-shell
import org.apache.spark.sql.functions._
// provides toDF and the encoder for .as[Int] (auto-imported in spark-shell)
import spark.implicits._

val df = Seq(
  ("John", 1),
  ("Mary", 8),
  ("John", 2),
  ("John", 4),
  ("Mary", 1),
  ("Mary", 7)
).toDF("Keycol", "processcode")

// collect all processcode values into one array per Keycol
val df2 = df.groupBy("Keycol").agg(collect_list("processcode").as("processcode"))

// number of process codes per key (assumed identical for every key)
val numCols = df2.select( size(col("processcode")) ).as[Int].first
// one column expression per array element
val cols = (0 to numCols - 1).map( i => col("processcode")(i) )

df2.select(col("Keycol") +: cols: _*).show

+------+--------------+--------------+--------------+
|Keycol|processcode[0]|processcode[1]|processcode[2]|
+------+--------------+--------------+--------------+
|  Mary|             8|             1|             7|
|  John|             1|             2|             4|
+------+--------------+--------------+--------------+
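To get column names matching the desired output (processcode1 .. processcode3), the element columns could also be aliased. A minimal sketch, reusing df2 and numCols from above:

// alias each array element as processcodeN to match the desired output
val named = (0 to numCols - 1).map( i => col("processcode")(i).as(s"processcode${i + 1}") )
df2.select(col("Keycol") +: named: _*).show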

Answer 1 (score: 1):

A couple of alternative approaches.

SQL

df.createOrReplaceTempView("tbl")

val q = """
select keycol,
       c[0] processcode1,
       c[1] processcode2,
       c[2] processcode3
  from (select keycol, collect_list(processcode) c
          from tbl
        group by keycol) t0
"""

sql(q).show

Result:

scala> sql(q).show
+------+------------+------------+------------+
|keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
|  Mary|           1|           7|           8|
|  John|           4|           1|           2|
+------+------------+------------+------------+

PairRDDFunctions (groupByKey) + mapPartitions

import org.apache.spark.sql.Row

// pair up (Keycol, processcode), group by key and materialize each group as a List
val my_rdd = df.map{ case Row(a1: String, a2: Int) => (a1, a2)
                   }.rdd.groupByKey().map(t => (t._1, t._2.toList))

// build one Row per key from the first three process codes in its group
def f(iter: Iterator[(String, List[Int])]) : Iterator[Row] = {
  var res = List[Row]();
  while (iter.hasNext) {
    val (keycol: String, c: List[Int]) = iter.next    
    res = res ::: List(Row(keycol, c(0), c(1), c(2)))
  }
  res.iterator
}

import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType}
val schema = new StructType().add(
             StructField("Keycol", StringType, true)).add(
             StructField("processcode1", IntegerType, true)).add(
             StructField("processcode2", IntegerType, true)).add(
             StructField("processcode3", IntegerType, true))

spark.createDataFrame(my_rdd.mapPartitions(f, true), schema).show

Result:

scala> spark.createDataFrame(my_rdd.mapPartitions(f, true), schema).show
+------+------------+------------+------------+
|Keycol|processcode1|processcode2|processcode3|
+------+------------+------------+------------+
|  Mary|           1|           7|           8|
|  John|           4|           1|           2|
+------+------------+------------+------------+

Note that in all cases the order of the values in the processcode columns is not determined unless explicitly specified.
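
For illustration, one way to pin that order is to tag each row with an explicit ordering column and pivot on the per-key row number. The sketch below assumes the original input order is the desired order and approximates it with monotonically_increasing_id(); if a real ordering column (e.g. a timestamp) exists, it should be used instead:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// assumption: the input order is the desired order, approximated via monotonically_increasing_id()
val w = Window.partitionBy("Keycol").orderBy("ord")

df.withColumn("ord", monotonically_increasing_id())
  .withColumn("rn", row_number().over(w))           // 1, 2, 3 within each key
  .groupBy("Keycol")
  .pivot("rn", Seq(1, 2, 3))                        // one column per position
  .agg(first("processcode"))
  .toDF("Keycol", "processcode1", "processcode2", "processcode3")
  .show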