How to write a dynamic explode function in Scala (exploding multiple columns)

Date: 2020-07-21 14:42:58

Tags: scala apache-spark apache-spark-sql explode

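I need to write a dynamic Scala class. It takes three parameters as input: input_dataframe, a list of columns to explode, and a delimiter. Suppose I have the DataFrame below.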

DataBase     TableName       Value
dbdev        table1_name     Value1#Value2#Value3

After exploding, I expect the following result:

DataBase     TableName       Value                   Value_Exploded
dbdev        table1_name     Value1#Value2#Value3    Value1
dbdev        table1_name     Value1#Value2#Value3    Value2
dbdev        table1_name     Value1#Value2#Value3    Value3

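So my question is how to write a Scala class that achieves this. The constraint is that it must be generic: it may receive different DataFrames, and the columns to explode (possibly several) must be passed in as arguments.

When I only need to explode a single column, I can do it as shown below: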

import org.apache.spark.sql.functions.{col, explode, split} // needed outside spark-shell

val explodeColumnName = "Value" // column to explode
val explodeColumnBy = "#"       // delimiter

val explodeDF = df.select(df.col("*"),
  explode(split(col(explodeColumnName), explodeColumnBy)).as(explodeColumnName + "_Exploded"))
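
One caveat for a generic delimiter: Spark's `split` treats its second argument as a Java regular expression, so a delimiter such as `|` or `.` is not matched literally. A minimal sketch, assuming the delimiter should always be taken literally, is to quote it first. The names `DelimiterDemo` and `literalDelimiter` are hypothetical, and the demonstration uses plain `String.split`, which follows the same regex rules.

```scala
import java.util.regex.Pattern

object DelimiterDemo {
  // Hypothetical helper: escape a delimiter so regex metacharacters are matched literally.
  def literalDelimiter(delimiter: String): String = Pattern.quote(delimiter)

  // "|" means alternation in a regex, so the unquoted split does not yield the three values.
  val unquoted: List[String] = "Value1|Value2|Value3".split("|").toList

  // Quoting the delimiter gives the intended three parts.
  val quoted: List[String] = "Value1|Value2|Value3".split(literalDelimiter("|")).toList
}
```

Under the same assumption, the quoted pattern could be passed to Spark in place of the raw delimiter, for example `split(col(explodeColumnName), DelimiterDemo.literalDelimiter(explodeColumnBy))`.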

But I fail when I need to explode multiple columns dynamically. For example, suppose I need to explode four columns of the DataFrame df.

Any help or suggestions would be much appreciated.

Thanks!

1 Answer:

Answer 0: (score: 0)

Check the code below.

scala> val df = Seq(
         (
           "dbdev",
           "table1_name",
           "Value1#Value2#Value3",
           "Sample1#Sample2#Sample3"
         )
       ).toDF("database", "tablename", "value", "sample")
scala> df.show(false)
+--------+-----------+--------------------+-----------------------+
|database|tablename  |value               |sample                 |
+--------+-----------+--------------------+-----------------------+
|dbdev   |table1_name|Value1#Value2#Value3|Sample1#Sample2#Sample3|
+--------+-----------+--------------------+-----------------------+

Import the required libraries.

scala> import org.apache.spark.sql.{Column,DataFrame}
import org.apache.spark.sql.{Column, DataFrame}

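Define the DFHelper class.

Note: do not use explode as the method name in the DFHelper class. explode is already available among the built-in functions, so I named the method explodeM.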

scala> implicit class DFHelper(inDF: DataFrame) {
           // Splits each given column on `delimiter` (interpreted as a regex by split)
           // and explodes it in place, one column after another.
           def explodeM(delimiter: String, columns: Column*): DataFrame = {
             columns.foldLeft(inDF) { (df, column) =>
               // column.toString yields the column name for simple references such as $"value"
               df.withColumn(column.toString, explode(split(column, delimiter)))
             }
           }
       }

scala> df.explodeM("#",$"value").show(false) // one column exploding
+--------+-----------+------+-----------------------+
|database|tablename  |value |sample                 |
+--------+-----------+------+-----------------------+
|dbdev   |table1_name|Value1|Sample1#Sample2#Sample3|
|dbdev   |table1_name|Value2|Sample1#Sample2#Sample3|
|dbdev   |table1_name|Value3|Sample1#Sample2#Sample3|
+--------+-----------+------+-----------------------+
scala> df.explodeM("#",$"value",$"sample").show(false) // two columns exploding
+--------+-----------+------+-------+
|database|tablename  |value |sample |
+--------+-----------+------+-------+
|dbdev   |table1_name|Value1|Sample1|
|dbdev   |table1_name|Value1|Sample2|
|dbdev   |table1_name|Value1|Sample3|
|dbdev   |table1_name|Value2|Sample1|
|dbdev   |table1_name|Value2|Sample2|
|dbdev   |table1_name|Value2|Sample3|
|dbdev   |table1_name|Value3|Sample1|
|dbdev   |table1_name|Value3|Sample2|
|dbdev   |table1_name|Value3|Sample3|
+--------+-----------+------+-------+
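
The answer above overwrites each column in place, whereas the question asked to keep the original column and add a `<name>_Exploded` column next to it. Below is a minimal sketch of that variant, assuming columns are passed by name and the delimiter should be matched literally. The names `ExplodeSyntax`, `ExplodeOps` and `explodeKeeping` are hypothetical and not part of Spark.

```scala
import java.util.regex.Pattern
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, split}

object ExplodeSyntax {
  implicit class ExplodeOps(df: DataFrame) {
    // For every named column, add "<name>_Exploded" holding one split value per row.
    // The original column is left untouched.
    def explodeKeeping(delimiter: String, columnNames: Seq[String]): DataFrame = {
      val pattern = Pattern.quote(delimiter) // split() expects a regex, so quote the delimiter
      columnNames.foldLeft(df) { (acc, name) =>
        acc.withColumn(s"${name}_Exploded", explode(split(col(name), pattern)))
      }
    }
  }
}
```

With `import ExplodeSyntax._` in scope, a call such as `df.explodeKeeping("#", Seq("value", "sample"))` should keep `value` and `sample` and append `value_Exploded` and `sample_Exploded`. As with `explodeM`, exploding several columns multiplies the rows, so two three-element columns give nine rows, matching the cross product in the output above.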