I got the code below to work, and I have read up on companion objects, but I can't say that in 2018 I follow it well enough to explain it to others:
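import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; spark-shell imports this automatically (assumes a SparkSession named spark)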
val myUDf = udf((s:String) => Array(s.trim.replaceAll(" +", " "))) // <----
//error: object java.lang.String is not a value --> use Array
val data = List("i like cheese", " the dog runs ", "text111111 text2222222")
val df = data.toDF("val")
df.show()
val new_df = df
  .withColumn("udfResult", myUDf(col("val")))
  .withColumn("new_val", col("udfResult")(0)) // <----
  .drop("udfResult") // <----
new_df.show
Is there a more elegant way to get rid of the Array and somehow work with a String directly?
Answer 0 (score: 1)
The problem is the definition of myUDf itself. There is no need to wrap the string in an array:
val myUDf = udf((s: String) => s.trim.replaceAll("\\s+", " ")) // <-- no Array(...)
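Then there is no need for the extra column juggling (the temporary udfResult column, the (0) lookup, and the drop):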
val new_df = df.withColumn("new_val", myUDf(col("val")))
+--------------------+--------------------+
| val| new_val|
+--------------------+--------------------+
| i like cheese| i like cheese|
| the dog runs | the dog runs|
|text111111 text...|text111111 text22...|
+--------------------+--------------------+
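
To see why no Array is needed, it may help to run the function the UDF wraps on its own, without Spark. The following is a minimal sketch in plain Scala. The names CleanWhitespace and clean are chosen here for illustration, and the inputs with repeated spaces are assumed sample values; neither comes from the original post.

```scala
object CleanWhitespace {
  // The same lambda that is passed to udf(...) in the answer.
  // It takes a String and returns a String, so there is nothing to unwrap.
  val clean: String => String = s => s.trim.replaceAll("\\s+", " ")

  def main(args: Array[String]): Unit = {
    // Assumed sample inputs with repeated and leading/trailing spaces.
    val samples = List("i   like  cheese", "  the dog   runs  ", "text111111   text2222222")

    // Each element of the result is a plain String.
    val cleaned: List[String] = samples.map(clean)
    cleaned.foreach(s => println(s"[$s]"))
  }
}
```

Because the lambda already returns a String, udf can infer a string column from it, which is why myUDf(col("val")) can be assigned to new_val directly. Note that, as written, the lambda would throw a NullPointerException on a null input, so a column containing nulls would need extra handling.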