How do I split each row's value using the delimiter stored in another column?

Asked: 2020-07-13 11:17:29

Tags: apache-spark apache-spark-sql

Input DF:
+-------------------+---------+
|VALUES             |Delimiter|
+-------------------+---------+
|50000.0#0#0#       |#        |
|0@1000.0@          |@        |
|1$                 |$        |
|1000.00^Test_string|^        |
+-------------------+---------+

Expected Output DF:
+-------------------+---------+----------------------+
|VALUES             |Delimiter|SPLITED_VALUES        |
+-------------------+---------+----------------------+
|50000.0#0#0#       |#        |[50000.0, 0, 0]       |
|0@1000.0@          |@        |[0, 1000.0]           |
|1$                 |$        |[1]                   |
|1000.00^Test_string|^        |[1000.00, Test_string]|
+-------------------+---------+----------------------+

  

Code:

import org.apache.spark.sql.functions.{col, split}
import sparkSession.sqlContext.implicits._

val dept = Seq(("50000.0#0#0#", "#"), ("0@1000.0@", "@"), ("1$", "$"), ("1000.00^Test_string", "^")).toDF("VALUES", "Delimiter")

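I'm just getting started and am trying to split the values in the "VALUES" column using the delimiter held in the "Delimiter" column of the same row.

I tried using Spark's split function like this: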

val dept2 = dept.withColumn("SPLITED_VALUES", split(col("VALUES"), "#"))

But here split takes the delimiter as a constant string, so I cannot pass it a column like this:

val dept2 = dept.withColumn("SPLITED_VALUES", split(col("VALUES"), col("Delimiter")))

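Can anyone suggest a better way to do this?

1 Answer:

Answer 0 (score: 4)

Check the code below. It prefixes each delimiter with a backslash so that the regex treats it as a literal character, then calls split through expr so that the pattern can come from a column.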

scala> df
.withColumn("delimiter",concat(lit("\\"),$"delimiter"))
.withColumn("split_values",expr("split(values,delimiter)"))
.show(false)
+-------------------+---------+----------------------+
|values             |delimiter|split_values          |
+-------------------+---------+----------------------+
|50000.0#0#0#       |\#       |[50000.0, 0, 0, ]     |
|0@1000.0@          |\@       |[0, 1000.0, ]         |
|1$                 |\$       |[1, ]                 |
|1000.00^Test_string|\^       |[1000.00, Test_string]|
+-------------------+---------+----------------------+
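
The two details in that output, the backslash prefix and the trailing empty element, both follow from regular-expression splitting. The sketch below reproduces them in plain Scala with String.split, with no Spark involved. It assumes that Spark's two-argument split behaves like Java's String.split with a negative limit, which the trailing empty strings above suggest; the object name RegexSplitDemo is made up for illustration.

```scala
object RegexSplitDemo {
  // An unescaped "^" is a start-of-input anchor, not a literal character,
  // so the string comes back in one piece.
  val unescaped: Seq[String] = "1000.00^Test_string".split("^", -1).toSeq

  // Escaped with a backslash, "^" matches the literal character.
  val escaped: Seq[String] = "1000.00^Test_string".split("\\^", -1).toSeq

  // A negative limit keeps the empty string after the last delimiter.
  val keepsTrailing: Seq[String] = "0@1000.0@".split("@", -1).toSeq

  // The default limit (0) drops trailing empty strings.
  val dropsTrailing: Seq[String] = "0@1000.0@".split("@").toSeq
}
```

Of the four delimiters in the question, only "$" and "^" are regex metacharacters, but escaping all of them uniformly is harmless for these single punctuation characters.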

Updated: to drop the empty strings left after a trailing delimiter, wrap the split in array_remove.

scala> df
.withColumn("delimiter",concat(lit("\\"),$"delimiter"))
.withColumn("data",expr("array_remove(split(trim(values),delimiter),'')"))
.show(false)

+-------------------+---------+----------------------+
|values             |delimiter|data                  |
+-------------------+---------+----------------------+
|50000.0#0#0#       |\#       |[50000.0, 0, 0]       |
|0@1000.0@          |\@       |[0, 1000.0]           |
|1$                 |\$       |[1]                   |
|1000.00^Test_string|\^       |[1000.00, Test_string]|
+-------------------+---------+----------------------+
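
To check the per-row logic without a Spark session, here is a minimal plain-Scala sketch of the same three steps (trim, split on the literal delimiter, remove empty pieces). The helper splitValues and the object name are hypothetical and not part of the answer. It uses Pattern.quote instead of a backslash prefix, an alternative that should also cope with delimiters longer than one character.

```scala
import java.util.regex.Pattern

object SplitByDelimiter {
  // Hypothetical helper mirroring array_remove(split(trim(values), delimiter), ''):
  // trim, split on the literal delimiter, then drop every empty piece.
  def splitValues(value: String, delimiter: String): Seq[String] =
    value.trim
      .split(Pattern.quote(delimiter), -1)
      .toSeq
      .filter(_.nonEmpty)

  // The four rows from the question as (VALUES, Delimiter) pairs.
  val rows: Seq[(String, String)] = Seq(
    ("50000.0#0#0#", "#"),
    ("0@1000.0@", "@"),
    ("1$", "$"),
    ("1000.00^Test_string", "^")
  )

  // One split result per row, in the same order as the input.
  val result: Seq[Seq[String]] = rows.map { case (v, d) => splitValues(v, d) }
}
```

Like array_remove with an empty string, the filter drops every empty element, not only the trailing one, so a value with two consecutive delimiters loses the empty piece between them.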