Replace all ":" with "_" in a Spark DataFrame

Time: 2016-09-03 16:16:02

Tags: scala apache-spark user-defined-functions spark-dataframe

I am trying to replace every instance of ":" with "_" in a single column of a Spark DataFrame. This is what I tried:

val url_cleaner = (s:String) => {
   s.replaceAll(":","_")
}
val url_cleaner_udf = udf(url_cleaner)
val df = old_df.withColumn("newCol", url_cleaner_udf(old_df("oldCol")) )

But I keep getting this error:

 SparkException: Job aborted due to stage failure: Task 0 in stage 25.0 failed 4 times, most recent failure: Lost task 0.3 in stage 25.0 (TID 692, ip-10-81-194-29.ec2.internal): java.lang.NullPointerException

Where am I going wrong in the UDF?

1 answer:

Answer 0 (score: 12)

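You probably have some null values in that column. Calling s.replaceAll on a null String throws a NullPointerException inside the UDF, which is what fails the task.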
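A small sketch, in plain Scala without Spark, of why a null input would produce this exception. The name urlCleaner mirrors the question's function, and Try is used here only to capture the exception for illustration; it is not part of the original code.

```scala
import scala.util.Try

// Same body as the question's url_cleaner: no null handling.
val urlCleaner: String => String = s => s.replaceAll(":", "_")

// A non-null value is cleaned as expected.
val ok = urlCleaner("a:b:c")

// A null value makes s.replaceAll dereference null and throw.
val failed = Try(urlCleaner(null))
```

Inside a Spark job the same exception surfaces as the "Job aborted due to stage failure" message, because the UDF runs once per row on the executors.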

Try:

val urlCleaner = (s:String) => {
   if (s == null) null else s.replaceAll(":","_")
}

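You can also use the built-in regexp_replace(col("oldCol"), ":", "_") instead of your own function. It returns null for null input, so no explicit null check is needed.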
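A minimal sketch of the regexp_replace approach, assuming Spark 2.0 or later (for SparkSession) and a local master. The sample data and the names oldDf and cleaned are made up for illustration; only oldCol and newCol come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

// Assumption: a local session purely for demonstration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("replace-colons")
  .getOrCreate()
import spark.implicits._

// Hypothetical input with one null row, the case that breaks the plain UDF.
val oldDf = Seq("http://host:8080", null).toDF("oldCol")

// regexp_replace is null-safe: a null input yields a null output.
val df = oldDf.withColumn("newCol", regexp_replace(col("oldCol"), ":", "_"))

// Collect the new column to the driver for inspection.
val cleaned = df.select("newCol").collect().map(_.getString(0)).toSeq

spark.stop()
```

The second argument is a Java regular expression; ":" has no special meaning in a regex, so it can be passed as-is. Using the built-in function also keeps the operation visible to Spark's optimizer, which an opaque UDF does not.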