Java String in a UDF, for beginners

Time: 2018-06-26 19:46:46

Tags: scala apache-spark user-defined-functions

I have done the following and have read up on companion objects, but I cannot say that I follow it well enough to explain it to others (as of 2018):

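import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark (spark-shell imports this already)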

val myUDf = udf((s:String) => Array(s.trim.replaceAll(" +", " "))) // <----

//error: object java.lang.String is not a value --> use Array

val data = List("i  like    cheese", "  the dog runs   ", "text111111   text2222222")
val df = data.toDF("val")
df.show()

val new_df = df
  .withColumn("udfResult", myUDf(col("val")))
  .withColumn("new_val", col("udfResult")(0)) // <----
  .drop("udfResult")                          // <----

new_df.show()

Is there a more elegant way to get rid of the Array and somehow use a String directly?

1 Answer:

Answer 0 (score: 1)

The problem is the definition of myUDf itself. There is no need to wrap the string in an array:

val myUDf = udf((s: String) => s.trim.replaceAll("\\s+", " ")) // <-- no Array(...)

With that, there is no need for the extra column juggling:

val new_df = df.withColumn("new_val", myUDf(col("val")))
new_df.show()

+--------------------+--------------------+
|                 val|             new_val|
+--------------------+--------------------+
|   i  like    cheese|       i like cheese|
|     the dog runs   |        the dog runs|
|text111111   text...|text111111 text22...|
+--------------------+--------------------+
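
As a side note, the string logic inside the UDF can be checked without Spark at all. The following is only a minimal sketch: the object name `Whitespace` and the method `normalize` are hypothetical and not part of the original code, and the null check is an added assumption, since a null column value would otherwise make `s.trim` throw a NullPointerException.

```scala
// Hypothetical helper, not from the original post: the same
// trim + collapse-whitespace logic as the UDF body, as a plain function.
object Whitespace {
  // Assumption: null input is passed through as null rather than throwing.
  def normalize(s: String): String =
    if (s == null) null
    else s.trim.replaceAll("\\s+", " ") // collapse each whitespace run to one space
}
```

Assuming the import of org.apache.spark.sql.functions._ from the question is in scope, this should be usable as `val myUDf = udf(Whitespace.normalize _)`, which keeps the UDF's return type a plain String.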