How to split a single row into multiple rows in a Spark DataFrame using Java

Asked: 2018-10-21 04:37:42

Tags: java apache-spark apache-spark-sql

I have a table that looks like this:

Table 1


I want to transform it into the table below using Spark (Java or Scala):

Transformed Table 1

3 Answers:

Answer 0 (score: 4)

Provided the column names are unique, you can explode the cast columns and filter out the nulls:

import org.apache.spark.sql.functions._

table
  .select(col("id"), col("movie"), explode(array("cast1", "cast2", "cast3", "cast4")).as("cast"))
  .where(col("cast").isNotNull)

Answer 1 (score: 0)

table.groupBy("ID", "Movie")
  // collect_list takes a single column, so wrap the cast columns in array() and flatten the collected result (Spark 2.4+)
  .agg(flatten(collect_list(array("Cast1", "Cast2", "Cast3", "Cast4"))).as("cast"))
  .withColumn("cast", explode(col("cast")))
  .where(col("cast").isNotNull)

// Note: you should always avoid duplicate column names within the same DataFrame

Answer 2 (score: 0)

Using union:

import org.apache.spark.sql.functions.col
import spark.implicits._ // assumes a SparkSession named `spark` (as in spark-shell); needed for toDF and the $ syntax

val table = List(
  (101, "ABC", "A", "B", "C", "D"),
  (102, "XZY", "G", "J", null, null))
  .toDF("ID", "Movie", "Cast1", "Cast2", "Cast3", "Cast4")

val columnsToUnion = List("Cast1", "Cast2", "Cast3", "Cast4")
val result = columnsToUnion
  .map(name => table.select($"ID", $"Movie", col(name).alias("Cast")).where(col(name).isNotNull))
  .reduce(_ union _)
result.show(false)

Output:

+---+-----+----+
|ID |Movie|Cast|
+---+-----+----+
|101|ABC  |A   |
|102|XZY  |G   |
|101|ABC  |B   |
|102|XZY  |J   |
|101|ABC  |C   |
|101|ABC  |D   |
+---+-----+----+

Note: the table must not contain multiple columns with the same name, and it is assumed that the column names follow the pattern "Cast[i]".
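
If the number of cast columns is not fixed, they could be picked up from the schema using that naming pattern instead of being hard-coded. A minimal sketch in Java (matching the question's language; the regex "Cast\d+" is an assumption based on the note above):

import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.functions;

// Hypothetical helper: collect every column whose name matches the "Cast<i>" pattern;
// the resulting Column[] can then be passed to array()/explode() or looped over for unions.
Column[] castColumns = Arrays.stream(table.columns())
        .filter(name -> name.matches("Cast\\d+"))
        .map(functions::col)
        .toArray(Column[]::new);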