我有一个具有以下架构的数据框
root
|-- name: string (nullable = true)
|-- roll: string (nullable = true)
|-- subjectID: string (nullable = true)
数据框中的值如下
+-------------------+---------+--------------------+
| name| roll| SubjectID|
+-------------------+---------+--------------------+
| sam|ta1i3dfk4| xy|av|mm|
| royc|rfhqdbnb3| a|
| alcaly|ta1i3dfk4| xx|zz|
+-------------------+---------+--------------------+
我需要通过flattenig主题ID导出数据帧,如下所示。 请注意:SubjectID也是字符串
+-------------------+---------+--------------------+
| name| roll| SubjectID|
+-------------------+---------+--------------------+
| sam|ta1i3dfk4| xy|
| sam|ta1i3dfk4| av|
| sam|ta1i3dfk4| mm|
| royc|rfhqdbnb3| a|
| alcaly|ta1i3dfk4| xx|
| alcaly|ta1i3dfk4| zz|
+-------------------+---------+--------------------+
任何建议
答案 0 :(得分:2)
您可以使用explode
函数进行展平。
例如:
val inputDF = Seq(
("sam", "ta1i3dfk4", "xy|av|mm"),
("royc", "rfhqdbnb3", "a"),
("alcaly", "rfhqdbnb3", "xx|zz")
).toDF("name", "roll", "subjectIDs")
//split and explode `subjectIDs`
val result = input.withColumn("subjectIDs",
split(col("subjectIDs"), "\\|"))
.withColumn("subjectIDs", explode($"subjectIDs"))
resultDF.show()
+------+---------+----------+
| name| roll|subjectIDs|
+------+---------+----------+
| sam|ta1i3dfk4| xy|
| sam|ta1i3dfk4| av|
| sam|ta1i3dfk4| mm|
| royc|rfhqdbnb3| a|
|alcaly|rfhqdbnb3| xx|
|alcaly|rfhqdbnb3| zz|
+------+---------+----------+
答案 1 :(得分:1)
您可以在数据集上使用flatMap
。完整的可执行代码:
package main
import org.apache.spark.sql.{Dataset, SparkSession}
object Main extends App {
case class Roll(name: Option[String], roll: Option[String], subjectID: Option[String])
val mySpark = SparkSession
.builder()
.master("local[2]")
.appName("Spark SQL basic example")
.getOrCreate()
import mySpark.implicits._
val inputDF: Dataset[Roll] = Seq(
("sam", "ta1i3dfk4", "xy|av|mm"),
("royc", "rfhqdbnb3", "a"),
("alcaly", "rfhqdbnb3", "xx|zz")
).toDF("name", "roll", "subjectID").as[Roll]
val out: Dataset[Roll] = inputDF.flatMap {
case Roll(n, r, Some(ids)) if ids.nonEmpty =>
ids.split("\\|").map(id => Roll(n, r, Some(id)))
case x => Some(x)
}
out.show()
}
注意:
split('|')
代替split("\\|")