How to create a DataFrame inside a UDF

Date: 2019-11-21 03:40:49

Tags: apache-spark apache-spark-sql

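I want to create a DataFrame inside a UDF and use my model to transform it into another one, but I get the exception shown below. Is something wrong with my Spark conf? I can't tell. Can anyone help me solve this?

Code: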

val model = PipelineModel.load("/user/abel/model/pipeline_model")
val modelBroad = spark.sparkContext.broadcast(model)

def model_predict(id:Long, text:String):Double = {
  val modelLoaded = modelBroad.value
  val sparkss = SparkSession.builder.master("local[*]").getOrCreate()
  val dataDF = sparkss.createDataFrame(Seq((id,text))).toDF("id","text")
  val result = modelLoaded.transform(dataDF).select("prediction").collect().apply(0).getDouble(0)
  println(f"The prediction of $id and $text is $result")
  result
}

val udf_func = udf(model_predict _)
test.withColumn("prediction",udf_func($"id",$"text")).show()

Exception:

Caused by: java.lang.NullPointerException
        at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:56)
        at org.apache.spark.sql.execution.LocalTableScanExec.metrics$lzycompute(LocalTableScanExec.scala:37)
        at org.apache.spark.sql.execution.LocalTableScanExec.metrics(LocalTableScanExec.scala:36)
        at org.apache.spark.sql.execution.SparkPlan.resetMetrics(SparkPlan.scala:85)
        at org.apache.spark.sql.Dataset$$anonfun$withAction$1.apply(Dataset.scala:3366)
        at org.apache.spark.sql.Dataset$$anonfun$withAction$1.apply(Dataset.scala:3365)
        at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3365)
        at org.apache.spark.sql.Dataset.collect(Dataset.scala:2788)
        at com.zamplus.mine.SparkSubmit$.com$zamplus$mine$SparkSubmit$$model_predict$1(SparkSubmit.scala:21)
        at com.zamplus.mine.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:40)
        at com.zamplus.mine.SparkSubmit$$anonfun$1.apply(SparkSubmit.scala:40)
        ... 22 more

1 answer:

Answer 0: (score: 0)

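The problem is in your UDF. A UDF runs on the executors, in many instances at once, and every variable it references is captured and shipped along with it. Anything it touches therefore has to be usable on an executor. A broadcast variable such as modelBroad is, but a SparkSession is not: building a DataFrame inside the UDF and calling collect() on it is what the NullPointerException at SparkPlan.sparkContext in your trace points to.

There are a few more good practices your UDF does not follow. Some of them:

  1. Do not create a Spark session inside the UDF. The session belongs to the driver, and it cannot be handed to the UDF as a column with lit() either, so all DataFrame work has to stay outside the UDF.
  2. Remove the unnecessary println from the UDF. It runs once for every row and only adds overhead.

I have changed your code for reference. It is only a sketch, so adapt it as needed. Because the pipeline model already operates on whole DataFrames, it applies the model to test directly instead of wrapping it in a UDF.

import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
val model = PipelineModel.load("/user/abel/model/pipeline_model")

// transform scores every row of `test` in one distributed pass and appends
// the "prediction" column, so no per-row DataFrame or UDF is needed.
val predictions = model.transform(test)
predictions.select("id", "text", "prediction").show()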
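Scoring a whole DataFrame with PipelineModel.transform, rather than one row at a time inside a UDF, can be tried end to end with the sketch below. Since the saved model from the question is not available, it fits a tiny stand-in pipeline first. The object name BatchPredict, the helper names fitToyModel and predict, the training rows, and the choice of Tokenizer, HashingTF, and LogisticRegression stages are all assumptions made for illustration, not the asker's actual pipeline.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object BatchPredict {

  // Hypothetical stand-in for the pipeline saved at /user/abel/model/pipeline_model.
  def fitToyModel(spark: SparkSession): PipelineModel = {
    import spark.implicits._
    val training = Seq(
      (0L, "spark is great", 1.0),
      (1L, "hadoop mapreduce", 0.0),
      (2L, "spark sql rocks", 1.0),
      (3L, "plain old batch", 0.0)
    ).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  // Runs on the driver: transform plans one distributed job over all rows,
  // so nothing inside it needs a SparkSession on the executors.
  def predict(model: PipelineModel, input: DataFrame): DataFrame =
    model.transform(input).select("id", "text", "prediction")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("batch-predict")
      .getOrCreate()
    import spark.implicits._

    val model = fitToyModel(spark)
    val test = Seq((10L, "spark streaming"), (11L, "old batch job")).toDF("id", "text")

    predict(model, test).show()
    spark.stop()
  }
}
```

With the real model, the fitToyModel call would be replaced by PipelineModel.load on the saved path, and predict would stay the same as long as the input has the columns the pipeline expects.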