How to split multiple comma-separated columns into rows?

Time: 2019-06-28 12:25:43

Tags: scala apache-spark apache-spark-dataset

I have a dataframe with N fields, as described below. The number of columns and the length of the values will vary.

Input table:

+--------------+-----------+-----------------------+
|Date          |Amount     |Status                 |
+--------------+-----------+-----------------------+
|2019,2018,2017|100,200,300|IN,PRE,POST            |
|2018          |73         |IN                     |
|2018,2017     |56,89      |IN,PRE                 |
+--------------+-----------+-----------------------+

I need to convert it into the following format, with an added sequence column.

Expected output table:

+------+------+------+--------+
|Date  |Amount|Status|Sequence|
+------+------+------+--------+
|2019  |100   |IN    |1       |
|2018  |200   |PRE   |2       |
|2017  |300   |POST  |3       |
|2018  |73    |IN    |1       |
|2018  |56    |IN    |1       |
|2017  |89    |PRE   |2       |
+------+------+------+--------+

I have tried using explode, but it can only explode one array at a time.

var df = dataRefined.withColumn("TOT_OVRDUE_TYPE", explode(split($"TOT_OVRDUE_TYPE", "\\,"))).toDF

var df1 = df.withColumn("TOT_OD_TYPE_AMT", explode(split($"TOT_OD_TYPE_AMT", "\\,"))).show

Does anyone know how I can do this? Thank you for your help.

6 answers:

Answer 0 (score: 1)

[The body of this answer was lost in extraction; only references to explode and flatMap remain.]
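As a hedged reconstruction of the flatMap idea referenced above (not the original author's code; the case class Rec and the spark-shell style setup are introduced only for this sketch): split each comma-separated column, zip the pieces by position, and emit one typed record per element.

import spark.implicits._   // already in scope in spark-shell

case class Rec(Date: String, Amount: String, Status: String, Sequence: Int)

val df = Seq(
  ("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
  ("2018", "73", "IN"),
  ("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")

// Split every comma-separated column and emit one Rec per position (1-based Sequence)
val result = df.as[(String, String, String)].flatMap { case (d, a, s) =>
  val dates    = d.split(",")
  val amounts  = a.split(",")
  val statuses = s.split(",")
  dates.indices.map(i => Rec(dates(i), amounts(i), statuses(i), i + 1))
}

result.show(false)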

Answer 1 (score: 1)

Here is another approach: apply posexplode to each column and join all the resulting dataframes into one:

import org.apache.spark.sql.functions.{posexplode, monotonically_increasing_id, col}

val df = Seq(
  (Seq("2019", "2018", "2017"), Seq("100", "200", "300"), Seq("IN", "PRE", "POST")),
  (Seq("2018"), Seq("73"), Seq("IN")),
  (Seq("2018", "2017"), Seq("56", "89"), Seq("IN", "PRE")))
.toDF("Date","Amount", "Status")
.withColumn("idx", monotonically_increasing_id)

df.columns.filter(_ != "idx").map{
  c => df.select($"idx", posexplode(col(c))).withColumnRenamed("col", c)
}
.reduce((ds1, ds2) => ds1.join(ds2, Seq("idx", "pos")))
.select($"Date", $"Amount", $"Status", $"pos".plus(1).as("Sequence"))
.show

Output:

+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|   100|    IN|       1|
|2018|   200|   PRE|       2|
|2017|   300|  POST|       3|
|2018|    73|    IN|       1|
|2018|    56|    IN|       1|
|2017|    89|   PRE|       2|
+----+------+------+--------+

Answer 2 (score: 1)

You can achieve this with the DataFrame built-in functions arrays_zip, split and posexplode (note that arrays_zip requires Spark 2.4+).

Explanation:

scala> val df = Seq((("2019,2018,2017"),("100,200,300"),("IN,PRE,POST")),(("2018"),("73"),("IN")),(("2018,2017"),("56,89"),("IN,PRE"))).toDF("date","amount","status")

scala> :paste
df.selectExpr("""posexplode(
                            arrays_zip(
                                        split(date,","), -- split the date string on ',' to create an array
                                        split(amount,","),
                                        split(status,","))) -- zip the three arrays element-wise
                            as (p,colum) -- posexplode on the zipped array gives a position and a struct value
            """)
    .selectExpr("colum.`0` as Date", //get field 0 of the struct as Date
                "colum.`1` as Amount",
                "colum.`2` as Status",
                "p+1 as Sequence") //add 1 to the position value
    .show()

Result:

+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|   100|    IN|       1|
|2018|   200|   PRE|       2|
|2017|   300|  POST|       3|
|2018|    73|    IN|       1|
|2018|    56|    IN|       1|
|2017|    89|   PRE|       2|
+----+------+------+--------+

Answer 3 (score: 0)

Assuming that each column in every row has the same number of data elements:

First, I recreated your DataFrame:

import org.apache.spark.sql._
import scala.collection.mutable.ListBuffer

val df = Seq(("2019,2018,2017", "100,200,300", "IN,PRE,POST"), ("2018", "73", "IN"),
  ("2018,2017", "56,89", "IN,PRE")).toDF("Date", "Amount", "Status")

Next, I split the rows, add a sequence value, and convert back to a DF:

val exploded = df.rdd.flatMap(row => {
  val buffer = new ListBuffer[(String, String, String, Int)]
  // Split each comma-separated column into an array of values
  val dateSplit = row(0).toString.split("\\,", -1)
  val amountSplit = row(1).toString.split("\\,", -1)
  val statusSplit = row(2).toString.split("\\,", -1)
  val seqSize = dateSplit.size
  // Emit one tuple per position, with a 1-based sequence number
  for(i <- 0 to seqSize-1)
    buffer += Tuple4(dateSplit(i), amountSplit(i), statusSplit(i), i+1)
  buffer.toList
}).toDF((df.columns:+"Sequence"): _*)
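To display the result (a usage line added here as a sketch, assuming the same spark-shell session as above):

exploded.show(false)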

I'm sure there are other ways to do this without converting the DF to an RDD first, but this still results in a DF with the correct answer.

Let me know if you have any questions.

Answer 4 (score: 0)

I used transpose to zip all the sequences by position and then applied posexplode. The selects on the dataFrame are dynamic so that, as required by the question, the number of columns and the length of the values can vary.

import org.apache.spark.sql.functions._


val df = Seq(
  ("2019,2018,2017", "100,200,300", "IN,PRE,POST"),
  ("2018", "73", "IN"),
  ("2018,2017", "56,89", "IN,PRE")
).toDF("Date", "Amount", "Status")
df: org.apache.spark.sql.DataFrame = [Date: string, Amount: string ... 1 more field]

scala> df.show(false)
+--------------+-----------+-----------+
|Date          |Amount     |Status     |
+--------------+-----------+-----------+
|2019,2018,2017|100,200,300|IN,PRE,POST|
|2018          |73         |IN         |
|2018,2017     |56,89      |IN,PRE     |
+--------------+-----------+-----------+


scala> def transposeSeqOfSeq[S](x:Seq[Seq[S]]): Seq[Seq[S]] = { x.transpose }
transposeSeqOfSeq: [S](x: Seq[Seq[S]])Seq[Seq[S]]

scala> val myUdf = udf { transposeSeqOfSeq[String] _}
myUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true))))

scala> val df2 = df.select(df.columns.map(c => split(col(c), ",") as c): _*)
df2: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 1 more field]

scala> df2.show(false)
+------------------+---------------+---------------+
|Date              |Amount         |Status         |
+------------------+---------------+---------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|
|[2018]            |[73]           |[IN]           |
|[2018, 2017]      |[56, 89]       |[IN, PRE]      |
+------------------+---------------+---------------+


scala> val df3 = df2.withColumn("allcols", array(df.columns.map(c => col(c)): _*))
df3: org.apache.spark.sql.DataFrame = [Date: array<string>, Amount: array<string> ... 2 more fields]

scala> df3.show(false)
+------------------+---------------+---------------+------------------------------------------------------+
|Date              |Amount         |Status         |allcols                                               |
+------------------+---------------+---------------+------------------------------------------------------+
|[2019, 2018, 2017]|[100, 200, 300]|[IN, PRE, POST]|[[2019, 2018, 2017], [100, 200, 300], [IN, PRE, POST]]|
|[2018]            |[73]           |[IN]           |[[2018], [73], [IN]]                                  |
|[2018, 2017]      |[56, 89]       |[IN, PRE]      |[[2018, 2017], [56, 89], [IN, PRE]]                   |
+------------------+---------------+---------------+------------------------------------------------------+


scala> val df4 = df3.withColumn("ab", myUdf($"allcols")).select($"ab", posexplode($"ab"))
df4: org.apache.spark.sql.DataFrame = [ab: array<array<string>>, pos: int ... 1 more field]

scala> df4.show(false)
+------------------------------------------------------+---+-----------------+
|ab                                                    |pos|col              |
+------------------------------------------------------+---+-----------------+
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|0  |[2019, 100, IN]  |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|1  |[2018, 200, PRE] |
|[[2019, 100, IN], [2018, 200, PRE], [2017, 300, POST]]|2  |[2017, 300, POST]|
|[[2018, 73, IN]]                                      |0  |[2018, 73, IN]   |
|[[2018, 56, IN], [2017, 89, PRE]]                     |0  |[2018, 56, IN]   |
|[[2018, 56, IN], [2017, 89, PRE]]                     |1  |[2017, 89, PRE]  |
+------------------------------------------------------+---+-----------------+

scala> val selCols = (0 until df.columns.length).map(i => $"col".getItem(i).as(df.columns(i))) :+ ($"pos"+1).as("Sequence")
selCols: scala.collection.immutable.IndexedSeq[org.apache.spark.sql.Column] = Vector(col[0] AS `Date`, col[1] AS `Amount`, col[2] AS `Status`, (pos + 1) AS `Sequence`)

scala> df4.select(selCols:_*).show(false)
+----+------+------+--------+
|Date|Amount|Status|Sequence|
+----+------+------+--------+
|2019|100   |IN    |1       |
|2018|200   |PRE   |2       |
|2017|300   |POST  |3       |
|2018|73    |IN    |1       |
|2018|56    |IN    |1       |
|2017|89    |PRE   |2       |
+----+------+------+--------+

Answer 5 (score: 0)

This is why I love the Spark core API. With the help of map and flatMap you can handle many problems. Just pass your df and a SQLContext instance to the method below and it will give the desired result -

import org.apache.spark.sql.{DataFrame, Row, SQLContext}

def reShapeDf(df:DataFrame, sqlContext: SQLContext): DataFrame ={

    // Pull the three comma-separated string columns out of each Row
    val rdd = df.rdd.map(m => (m.getAs[String](0),m.getAs[String](1),m.getAs[String](2)))

    // Split each column on ',' and zip the pieces element-wise, one record per position
    val rdd1 = rdd.flatMap(a => a._1.split(",").zip(a._2.split(",")).zip(a._3.split(",")))
    val rdd2 = rdd1.map{
      case ((a,b),c) => (a,b,c)
    }

    // Rebuild a DataFrame with the original schema (Date, Amount, Status)
    sqlContext.createDataFrame(rdd2.map(m => Row.fromTuple(m)),df.schema)
}
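A minimal usage sketch (assuming a spark-shell session where df is the question's input DataFrame and a SQLContext instance named sqlContext is in scope; both names are assumptions here). Note that this method reproduces only the three original columns and does not add the Sequence column:

val reshaped = reShapeDf(df, sqlContext)
// One row per comma-separated element, e.g. (2019, 100, IN), (2018, 200, PRE), ...
reshaped.show(false)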