Exception when using a UDF on a Spark DataFrame

Posted: 2019-03-07 21:58:35

Tags: scala apache-spark dataframe user-defined-functions

In Spark version 2.4.0, I am trying to execute the following code on a given DataFrame, unfoldedDF: org.apache.spark.sql.DataFrame, with schema:
movieid: integer
words: array (element: string)
tokens: string

val tokensWithDf = unfoldedDF.groupBy("tokens").agg(countDistinct("movieid") as "df")
tokensWithDf.show()

The resulting DataFrame is tokensWithDf: org.apache.spark.sql.DataFrame, with schema:
tokens: string
df: long
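
For reference, here is a minimal, self-contained sketch that builds a toy unfoldedDF with the same schema and reproduces the aggregation above (the data values are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("idf-repro").master("local[*]").getOrCreate()
import spark.implicits._

// Toy rows with the same shape: movieid, words, tokens (values are invented)
val unfoldedDF = Seq(
  (1, Seq("space", "odyssey"), "space"),
  (1, Seq("space", "odyssey"), "odyssey"),
  (2, Seq("space", "western"), "space")
).toDF("movieid", "words", "tokens")

// Document frequency: the number of distinct movies each token appears in
val tokensWithDf = unfoldedDF.groupBy("tokens").agg(countDistinct("movieid") as "df")
tokensWithDf.show()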

Then I perform the following operation on it:

def findIdf(x : Long) : Double = {scala.math.log10((42306).toDouble/x)}
val sqlfunc = udf(findIdf _)
tokensWithDf.withColumn("idf", sqlfunc(col("df"))).show()

It fails with the following exception:

org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2519)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:866)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:865)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:379)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:865)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:616)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:143)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:183)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:180)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:66)
    at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:75)
    at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:497)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:48)
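
As an aside, the idf value itself can also be expressed with built-in column functions instead of a Scala UDF, which avoids having to serialize any closure at all; a minimal sketch (totalDocs is just the 42306 constant from the findIdf function above):

import org.apache.spark.sql.functions.{col, lit, log10}

// Same formula, log10(totalDocs / df), written as a column expression
val totalDocs = 42306.0
tokensWithDf.withColumn("idf", log10(lit(totalDocs) / col("df"))).show()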

0 Answers:

No answers yet.