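I need to write a generic Scala class. It takes three parameters as input: input_dataframe, the list of columns to explode, and a delimiter. Suppose I have the DataFrame below.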
DataBase TableName Value
dbdev table1_name Value1#Value2#Value3
After exploding, I expect the following result:
DataBase TableName Value Value_Exploded
dbdev table1_name Value1#Value2#Value3 Value1
dbdev table1_name Value1#Value2#Value3 Value2
dbdev table1_name Value1#Value2#Value3 Value3
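So my question is how to write a Scala class that achieves the above. The constraint is that it must be generic: it may receive different DataFrames, and the columns to explode (possibly several) must be passed in as arguments.
I was able to achieve this when only one column needs to be exploded. Please see below: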
import org.apache.spark.sql.functions.{col, explode, split}
val explodeColumnName = "Value" // column to explode
val explodeColumnBy = "#" // delimiter (split treats it as a regular expression)
val explodeDF = df.select(df.col("*"), explode(split(col(explodeColumnName), explodeColumnBy)).as(explodeColumnName + "_Exploded"))
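But I fail when I need to explode multiple columns dynamically. For example, suppose I need to explode 4 columns of DataFrame df.
Any help or suggestions would be greatly appreciated.
Thanks!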
Answer 0 (score: 0)
Check the code below.
scala> val df = Seq(
(
"dbdev",
"table1_name",
"Value1#Value2#Value3",
"Sample1#Sample2#Sample3"
)
)
.toDF("database","tablename","value","sample")
scala> df.show(false)
+--------+-----------+--------------------+-----------------------+
|database|tablename |value |sample |
+--------+-----------+--------------------+-----------------------+
|dbdev |table1_name|Value1#Value2#Value3|Sample1#Sample2#Sample3|
+--------+-----------+--------------------+-----------------------+
Import the required libraries.
scala> import org.apache.spark.sql.{Column,DataFrame}
import org.apache.spark.sql.{Column, DataFrame}
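Define the DFHelper class.
Note: do not use explode as the function name in the DFHelper class. explode is already available as a built-in function, so I used explodeM as the function name.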
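scala> implicit class DFHelper(inDF: DataFrame) {
  import inDF.sparkSession.implicits._
  // Splits each given column on `delimiter` (treated as a regex) and explodes it in place.
  def explodeM(delimiter:String,columns:Column*): DataFrame = {
    // Fold over the columns so that each one is split and exploded in turn.
    columns.foldLeft(inDF)((indf,column) => indf
      .withColumn(column.toString,split(column,delimiter)) // string -> array
      .withColumn(column.toString,explode(column)) // array -> one row per element
    )
  }
}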
scala> df.explodeM("#",$"value").show(false) // one column exploding
+--------+-----------+------+-----------------------+
|database|tablename |value |sample |
+--------+-----------+------+-----------------------+
|dbdev |table1_name|Value1|Sample1#Sample2#Sample3|
|dbdev |table1_name|Value2|Sample1#Sample2#Sample3|
|dbdev |table1_name|Value3|Sample1#Sample2#Sample3|
+--------+-----------+------+-----------------------+
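Note that the output above overwrites value in place, whereas the question's expected output keeps the original column and adds Value_Exploded next to it. A minimal sketch of that variant follows. ExplodeHelper and explodeKeepingOriginal are hypothetical names chosen for illustration and are not part of Spark; the sketch also assumes the delimiter should be matched literally, so it is quoted before being passed to split, which expects a regular expression.

```scala
import java.util.regex.Pattern
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, split}

object ExplodeHelper {
  // Hypothetical helper (not a Spark API): for every name in `columns`,
  // keep the original column and append "<name>_Exploded".
  def explodeKeepingOriginal(
      df: DataFrame,
      columns: Seq[String],
      delimiter: String): DataFrame =
    columns.foldLeft(df) { (acc, name) =>
      // Assumption: the delimiter is a literal string, so quote it for the regex-based split.
      val parts = split(col(name), Pattern.quote(delimiter))
      acc.withColumn(s"${name}_Exploded", explode(parts))
    }
}
```

It could be called as ExplodeHelper.explodeKeepingOriginal(df, Seq("value", "sample"), "#"). Taking the column names as a Seq[String] keeps the call generic across DataFrames, which is the constraint stated in the question.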
scala> df.explodeM("#",$"value",$"sample").show(false) // two columns exploding
+--------+-----------+------+-------+
|database|tablename |value |sample |
+--------+-----------+------+-------+
|dbdev |table1_name|Value1|Sample1|
|dbdev |table1_name|Value1|Sample2|
|dbdev |table1_name|Value1|Sample3|
|dbdev |table1_name|Value2|Sample1|
|dbdev |table1_name|Value2|Sample2|
|dbdev |table1_name|Value2|Sample3|
|dbdev |table1_name|Value3|Sample1|
|dbdev |table1_name|Value3|Sample2|
|dbdev |table1_name|Value3|Sample3|
+--------+-----------+------+-------+
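The nine rows above come from applying one explode per column: each explode multiplies the row count, so two columns of three values each give 3 x 3 combinations. The plain-Scala model below, which uses no Spark at all, is only a sketch of that behaviour. ExplodeModel, Row, explodeOne and explodeMany are hypothetical names, and rows are modelled as simple Map[String, String] values.

```scala
import java.util.regex.Pattern

object ExplodeModel {
  // A row is modelled as column name -> string value (an assumption for illustration).
  type Row = Map[String, String]

  // Explode one column: each input row becomes one row per delimited value.
  def explodeOne(rows: Seq[Row], column: String, delimiter: String): Seq[Row] =
    rows.flatMap { row =>
      row(column)
        .split(Pattern.quote(delimiter)) // match the delimiter literally
        .toSeq
        .map(value => row.updated(column, value))
    }

  // Explode several columns in turn, mirroring the foldLeft in explodeM.
  def explodeMany(rows: Seq[Row], delimiter: String, columns: String*): Seq[Row] =
    columns.foldLeft(rows)((acc, column) => explodeOne(acc, column, delimiter))
}
```

If the values in the two columns are meant to be paired by position rather than combined, a cross product like this is not the desired result, and a different approach would be needed.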