Need to flatten a dataframe based on one column in Scala

Asked: 2019-07-10 18:13:16

Tags: scala dataframe apache-spark-sql

I have a dataframe with the following schema:

 root
 |-- name: string (nullable = true)
 |-- roll: string (nullable = true)
 |-- subjectID: string (nullable = true)

The values in the dataframe are as follows:

+-------------------+---------+--------------------+
|               name|     roll|           SubjectID|
+-------------------+---------+--------------------+
|                sam|ta1i3dfk4|            xy|av|mm|
|               royc|rfhqdbnb3|                   a|
|             alcaly|ta1i3dfk4|               xx|zz|
+-------------------+---------+--------------------+

I need to derive a dataframe by flattening the SubjectID column, as shown below. Please note: SubjectID is also a string.

+-------------------+---------+--------------------+
|               name|     roll|           SubjectID|
+-------------------+---------+--------------------+
|                sam|ta1i3dfk4|                  xy|
|                sam|ta1i3dfk4|                  av|
|                sam|ta1i3dfk4|                  mm|
|               royc|rfhqdbnb3|                   a|
|             alcaly|ta1i3dfk4|                  xx|
|             alcaly|ta1i3dfk4|                  zz|
+-------------------+---------+--------------------+

Any suggestions?

2 answers:

Answer 0 (score: 2)

You can use the explode function to flatten it. For example:

    import org.apache.spark.sql.functions.{col, explode, split}
    import spark.implicits._ // assumes an active SparkSession named `spark`

    val inputDF = Seq(
      ("sam", "ta1i3dfk4", "xy|av|mm"),
      ("royc", "rfhqdbnb3", "a"),
      ("alcaly", "rfhqdbnb3", "xx|zz")
    ).toDF("name", "roll", "subjectIDs")

    // split `subjectIDs` on "|" (escaped, since split takes a regex), then explode
    val resultDF = inputDF
      .withColumn("subjectIDs", split(col("subjectIDs"), "\\|"))
      .withColumn("subjectIDs", explode($"subjectIDs"))

    resultDF.show()

    +------+---------+----------+ 
    |  name|     roll|subjectIDs|
    +------+---------+----------+
    |   sam|ta1i3dfk4|        xy|
    |   sam|ta1i3dfk4|        av|
    |   sam|ta1i3dfk4|        mm|
    |  royc|rfhqdbnb3|         a|
    |alcaly|rfhqdbnb3|        xx|
    |alcaly|rfhqdbnb3|        zz|
    +------+---------+----------+
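
As a side note, the split and explode steps can also be collapsed into a single select. A minimal sketch, assuming the same inputDF and imports as above (the name resultDF2 is just illustrative):

    // equivalent one-step version: explode the split array directly
    val resultDF2 = inputDF.select(
      col("name"),
      col("roll"),
      explode(split(col("subjectIDs"), "\\|")).as("subjectIDs")
    )
    resultDF2.show()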

Answer 1 (score: 1)

You can use flatMap on a Dataset. Complete runnable code:

package main

import org.apache.spark.sql.{Dataset, SparkSession}

object Main extends App {
  case class Roll(name: Option[String], roll: Option[String], subjectID: Option[String])

  val mySpark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Spark SQL basic example")
    .getOrCreate()
  import mySpark.implicits._

  val inputDF: Dataset[Roll] = Seq(
    ("sam", "ta1i3dfk4", "xy|av|mm"),
    ("royc", "rfhqdbnb3", "a"),
    ("alcaly", "rfhqdbnb3", "xx|zz")
  ).toDF("name", "roll", "subjectID").as[Roll]

  // one output row per "|"-separated ID
  val out: Dataset[Roll] = inputDF.flatMap {
    case Roll(n, r, Some(ids)) if ids.nonEmpty =>
      ids.split("\\|").map(id => Roll(n, r, Some(id)))
    case x => Some(x) // rows with a missing or empty subjectID pass through unchanged
  }
  out.show()
}
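
For reference, with the sample data above, out.show() should print something like:

    +------+---------+---------+
    |  name|     roll|subjectID|
    +------+---------+---------+
    |   sam|ta1i3dfk4|       xy|
    |   sam|ta1i3dfk4|       av|
    |   sam|ta1i3dfk4|       mm|
    |  royc|rfhqdbnb3|        a|
    |alcaly|rfhqdbnb3|       xx|
    |alcaly|rfhqdbnb3|       zz|
    +------+---------+---------+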

Notes:

  1. You can use split('|') (the Char overload of String.split) instead of split("\\|").
  2. You can change the handling of the default case if the IDs must be non-empty (see the sketch below).
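
Combining both notes, a minimal sketch of the stricter variant, to be placed inside Main above (dropping rows whose subjectID is missing or empty instead of passing them through; the name strictOut is illustrative):

  // variant of the flatMap above: empty/missing IDs are dropped, not kept
  val strictOut: Dataset[Roll] = inputDF.flatMap {
    case Roll(n, r, Some(ids)) if ids.nonEmpty =>
      ids.split('|').map(id => Roll(n, r, Some(id))) // note 1: Char overload
    case _ => Nil // note 2: drop the row instead of keeping it
  }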