Input DF:
+-------------------+---------+
|VALUES |Delimiter|
+-------------------+---------+
|50000.0#0#0# |# |
|0@1000.0@ |@ |
|1$ |$ |
|1000.00^Test_string|^ |
+-------------------+---------+
Expected Output DF:
+-------------------+---------+----------------------+
|VALUES |Delimiter|SPLITED_VALUES |
+-------------------+---------+----------------------+
|50000.0#0#0# |# |[50000.0, 0, 0] |
|0@1000.0@ |@ |[0, 1000.0] |
|1$ |$ |[1] |
|1000.00^Test_string|^ |[1000.00, Test_string]|
+-------------------+---------+----------------------+
Code:
import sparkSession.sqlContext.implicits._
val dept = Seq(("50000.0#0#0#", "#"),("0@1000.0@", "@"),("1$", "$"),("1000.00^Test_string", "^")).toDF("VALUES", "Delimiter")
I'm just getting started, and I'm trying to split the values of the "VALUES" column using the Delimiter from the other column.
I tried to use Spark's split function like this:
val dept2 = dept.withColumn("SPLITED_VALUES", split(col("VALUES"), "#"))
But the split function takes the delimiter as a constant value, so I cannot pass it a column like this:
val dept2 = dept.withColumn("SPLITED_VALUES", split(col("VALUES"), col("Delimiter")))
Can anyone suggest a better approach for this?
Answer 0 (score: 4)
Check the following code. A backslash is prepended because split treats its pattern as a regular expression, and delimiters such as $ and ^ are regex metacharacters (escaping # and @ is harmless); expr is used so the pattern can come from a column.
scala> df
.withColumn("delimiter",concat(lit("\\"),$"delimiter"))
.withColumn("split_values",expr("split(values,delimiter)"))
.show(false)
+-------------------+---------+----------------------+
|values             |delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0# |\# |[50000.0, 0, 0, ] |
|0@1000.0@ |\@ |[0, 1000.0, ] |
|1$ |\$ |[1, ] |
|1000.00^Test_string|\^ |[1000.00, Test_string]|
+-------------------+---------+----------------------+
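The trailing empty element appears because Spark's split keeps trailing empty strings (it behaves like Java's String.split with limit -1), so a value that ends with its delimiter produces a final "" entry. A quick way to see this, assuming a spark-shell session with spark in scope:

scala> spark.sql("SELECT split('50000.0#0#0#', '#') AS parts").show(false)
+-----------------+
|parts            |
+-----------------+
|[50000.0, 0, 0, ]|
+-----------------+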
Updated to strip those trailing empty elements with array_remove (trim first removes surrounding whitespace):
scala> df
.withColumn("delimiter",concat(lit("\\"),$"delimiter"))
.withColumn("data",expr("array_remove(split(trim(values),delimiter),'')"))
.show(false)
+-------------------+---------+----------------------+
|values |delimiter|data |
+-------------------+---------+----------------------+
|50000.0#0#0# |\# |[50000.0, 0, 0] |
|0@1000.0@ |\@ |[0, 1000.0] |
|1$ |\$ |[1] |
|1000.00^Test_string|\^ |[1000.00, Test_string]|
+-------------------+---------+----------------------+
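As an alternative sketch (not part of the answer above), a UDF can quote the delimiter with Pattern.quote so regex metacharacters need no manual escaping; splitByDelimiter is a hypothetical name used here for illustration:

import java.util.regex.Pattern
import org.apache.spark.sql.functions.{col, udf}

// Treat the delimiter as a literal string (Pattern.quote escapes any
// regex metacharacters) and drop the empty fragments produced by a
// trailing delimiter.
val splitByDelimiter = udf { (value: String, delim: String) =>
  value.split(Pattern.quote(delim), -1).filter(_.nonEmpty)
}

val dept2 = dept.withColumn("SPLITED_VALUES",
  splitByDelimiter(col("VALUES"), col("Delimiter")))

The UDF route trades the concat/expr trick for plain Scala, at the usual cost of opting out of Catalyst optimizations for that column.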