如何使用RDD或没有数据透视的数据框将行转置为列。
SessionId,date,orig, dest, legind, nbr
1 9/20/16,abc0,xyz0,o,1
1 9/20/16,abc1,xyz1,o,2
1 9/20/16,abc2,xyz2,i,3
1 9/20/16,abc3,xyz3,i,4
所以我想生成新的架构,如:
SessionId,date,orig1, orig2, orig3, orig4, dest1, dest2, dest3,dest4
1,9/20/16,abc0,abc1,null, null, xyz0,xyz1, null, null
逻辑是:
nbr为1且legind = o然后是orig1值(从第1行取指)......
nbr是3,legind = i然后是dest1值(从第3行获取)
那么如何将行转置为列......
任何想法都会非常感激。
尝试使用以下选项,但它只是在单行中展平..
val keys = List("SessionId");
val selectFirstValueOfNoneGroupedColumns =
df.columns
.filterNot(keys.toSet)
.map(_ -> "first").toMap
val grouped =
df.groupBy(keys.head, keys.tail: _*)
.agg(selectFirstValueOfNoneGroupedColumns).show()
答案 0 :(得分:1)
如果使用pivot
函数,则相对简单。首先,让我们创建一个类似问题的数据集:
import org.apache.spark.sql.functions.{concat, first, lit, when}
val df = Seq(
("1", "9/20/16", "abc0", "xyz0", "o", "1"),
("1", "9/20/16", "abc1", "xyz1", "o", "2"),
("1", "9/20/16", "abc2", "xyz2", "i", "3"),
("1", "9/20/16", "abc3", "xyz3", "i", "4")
).toDF("SessionId", "date", "orig", "dest", "legind", "nbr")
然后定义并附加帮助列:
// This will be the column name
val key = when($"legind" === "o", concat(lit("orig"), $"nbr"))
.when($"legind" === "i", concat(lit("dest"), $"nbr"))
// This will be the value
val value = when($"legind" === "o", $"orig") // If o take origin
.when($"legind" === "i", $"dest") // If i take dest
val withKV = df.withColumn("key", key).withColumn("value", value)
这将导致DataFrame
这样:
+---------+-------+----+----+------+---+-----+-----+
|SessionId| date|orig|dest|legind|nbr| key|value|
+---------+-------+----+----+------+---+-----+-----+
| 1|9/20/16|abc0|xyz0| o| 1|orig1| abc0|
| 1|9/20/16|abc1|xyz1| o| 2|orig2| abc1|
| 1|9/20/16|abc2|xyz2| i| 3|dest3| xyz2|
| 1|9/20/16|abc3|xyz3| i| 4|dest4| xyz3|
+---------+-------+----+----+------+---+-----+-----+
接下来让我们定义一个可能的级别列表:
val levels = Seq("orig", "dest").flatMap(x => (1 to 4).map(y => s"$x$y"))
最后pivot
val result = withKV
.groupBy($"sessionId", $"date")
.pivot("key", levels)
.agg(first($"value", true)).show
而result
是:
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
|sessionId| date|orig1|orig2|orig3|orig4|dest1|dest2|dest3|dest4|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
| 1|9/20/16| abc0| abc1| null| null| null| null| xyz2| xyz3|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+