Spark: rows to columns (like transpose or pivot)

Date: 2016-10-01 17:41:14

Tags: scala apache-spark apache-spark-sql pivot user-defined-functions

How can I transpose rows into columns using an RDD or a DataFrame, without using pivot? The input looks like this:

SessionId,date,orig,dest,legind,nbr

1,9/20/16,abc0,xyz0,o,1
1,9/20/16,abc1,xyz1,o,2
1,9/20/16,abc2,xyz2,i,3
1,9/20/16,abc3,xyz3,i,4

From this I want to generate a new schema like:

SessionId,date,orig1,orig2,orig3,orig4,dest1,dest2,dest3,dest4

1,9/20/16,abc0,abc1,null,null,xyz0,xyz1,null,null

The logic is:

  • if nbr is 1 and legind = o, it becomes the orig1 value (taken from row 1)...

  • if nbr is 3 and legind = i, it becomes the dest1 value (taken from row 3); the mapping rule is sketched right after this list
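
Note that the answer below reads nbr as directly indexing the target column (so legind = i, nbr = 3 yields dest3, not dest1 as in the question's sample output). A minimal plain-Scala sketch of that rule (targetColumn is a hypothetical helper, not from the question or the answer):

// Hypothetical helper: derive the target column name from legind and nbr,
// following the interpretation the answer below uses.
def targetColumn(legind: String, nbr: Int): Option[String] = legind match {
  case "o" => Some(s"orig$nbr")  // outbound leg -> origN
  case "i" => Some(s"dest$nbr")  // inbound leg  -> destN
  case _   => None               // anything else produces no column
}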

So how do I transpose these rows into columns?

Any ideas would be greatly appreciated.

I tried the option below, but it just flattens everything into a single row:

val keys = List("SessionId")
// Take the first value of every column that is not part of the grouping key
val selectFirstValueOfNonGroupedColumns =
  df.columns
    .filterNot(keys.toSet)
    .map(_ -> "first").toMap
val grouped =
  df.groupBy(keys.head, keys.tail: _*)
    .agg(selectFirstValueOfNonGroupedColumns)
grouped.show()
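
For reference, that aggregation is equivalent to the sketch below: one independent first() per non-key column. Because each first() is computed separately, the pairing between legind/nbr and orig/dest is lost and each group collapses into one flat row:

import org.apache.spark.sql.functions.first

// One first() per non-grouped column -- each evaluated independently of the others.
df.groupBy("SessionId")
  .agg(first("date"), first("orig"), first("dest"), first("legind"), first("nbr"))
  .show()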

1 Answer:

Answer 0 (score: 1)

This is relatively simple if you use the pivot function. First, let's create a dataset like the one in the question:

import org.apache.spark.sql.functions.{concat, first, lit, when}
import spark.implicits._  // for toDF and the $"..." column syntax

val df = Seq(
  ("1", "9/20/16", "abc0", "xyz0", "o", "1"),
  ("1", "9/20/16", "abc1", "xyz1", "o", "2"),
  ("1", "9/20/16", "abc2", "xyz2", "i", "3"),
  ("1", "9/20/16", "abc3", "xyz3", "i", "4")
).toDF("SessionId", "date", "orig", "dest", "legind", "nbr")

Then define and append the helper columns:

// This will be the column name
val key = when($"legind" === "o", concat(lit("orig"), $"nbr"))
           .when($"legind" === "i", concat(lit("dest"), $"nbr"))

// This will be the value
val value = when($"legind" === "o", $"orig")     // If o take origin
              .when($"legind" === "i", $"dest")  // If i take dest

val withKV = df.withColumn("key", key).withColumn("value", value)

This results in a DataFrame like the following (note that key is built directly from nbr, so row 3's dest value lands in dest3, not dest1 as in the question's sample output):

+---------+-------+----+----+------+---+-----+-----+
|SessionId|   date|orig|dest|legind|nbr|  key|value|
+---------+-------+----+----+------+---+-----+-----+
|        1|9/20/16|abc0|xyz0|     o|  1|orig1| abc0|
|        1|9/20/16|abc1|xyz1|     o|  2|orig2| abc1|
|        1|9/20/16|abc2|xyz2|     i|  3|dest3| xyz2|
|        1|9/20/16|abc3|xyz3|     i|  4|dest4| xyz3|
+---------+-------+----+----+------+---+-----+-----+

Next, let's define the list of possible levels (passing these to pivot explicitly saves Spark a separate pass over the data to compute the distinct keys):

val levels = Seq("orig", "dest").flatMap(x => (1 to 4).map(y => s"$x$y"))

Finally, pivot:

val result = withKV
  .groupBy($"sessionId", $"date")
  .pivot("key", levels)
  .agg(first($"value", true))  // ignoreNulls = true picks the non-null entry per cell

result.show

result is:

+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
|sessionId|   date|orig1|orig2|orig3|orig4|dest1|dest2|dest3|dest4|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
|        1|9/20/16| abc0| abc1| null| null| null| null| xyz2| xyz3|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
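
And since the question asked for a version without pivot: below is a rough RDD-based sketch built on the same withKV and levels values, assuming a SparkSession named spark. It groups the (key, value) pairs per (SessionId, date) and lays them out in the fixed column order given by levels. groupByKey shuffles every value, so treat this as an illustration rather than a recommendation.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Group (key, value) pairs per (SessionId, date) and fill the fixed slots.
val rows = withKV
  .select($"SessionId", $"date", $"key", $"value").rdd
  .map(r => ((r.getString(0), r.getString(1)), (r.getString(2), r.getString(3))))
  .groupByKey()
  .map { case ((sid, date), kvs) =>
    val m = kvs.toMap  // assumes one value per key within a group
    Row.fromSeq(sid +: date +: levels.map(m.getOrElse(_, null)))
  }

val schema = StructType(
  ("SessionId" +: "date" +: levels).map(StructField(_, StringType, nullable = true)))

spark.createDataFrame(rows, schema).show()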