Spark - Applying a function to a column without a UDF

Asked: 2018-06-18 19:29:26

Tags: scala, apache-spark

I know how to apply a function to a DataFrame column using a UDF. However, the function I want to apply accesses an external service: it submits text over a REST API and then adds the response as a new column. Using a UDF is not recommended when calling external services, and indeed I get sporadic errors when I try to do this with a UDF.

What is the best practice for achieving this? I don't think my problem is code-specific, but I'll add some examples below:

val entityDf = df.select(col("Text"),
                         col("coordinates"),
                         col("LocX"),
                         col("LocY"))
                 .withColumn("TextClass", functionUdf(col("Text")))

I should also include the error I get when trying to use this function as a UDF, in case fixing the error itself turns out to be the solution:

Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (string) => string)

import com.google.gson.Gson
import org.apache.spark.sql.functions.udf
import scalaj.http.{Http, HttpOptions}

def testFunc(text: String): String = {
  val gson = new Gson()

  // POST the text to the external service (URL and body are placeholders)
  val result = Http("url")
    .postData(f"""postdata""")
    .header("Content-Type", "application/json")
    .header("Charset", "UTF-8")
    .option(HttpOptions.readTimeout(10000))
    .asString

  // Deserialize the response; rootNerJson is the response model (definition not shown)
  val rootJson = gson.fromJson(result.body, classOf[rootNerJson])

  if (rootJson.classes.length > 0) rootJson.classes(0).label
  else "Null"
}

val functionUdf = udf[String, String](testFunc)
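For context, a common non-UDF pattern for this kind of per-row service call is to drop to the typed Dataset API and use mapPartitions, so that expensive setup happens once per partition rather than once per row and failures can be handled explicitly. The following is only a minimal sketch under assumptions: TextRow and ClassifiedRow are hypothetical case classes I introduce here (all columns assumed String for simplicity; adjust to the real schema), `spark` is assumed to be the active SparkSession, and testFunc is the function defined above.

import spark.implicits._ // assumption: `spark` is the active SparkSession

// Hypothetical row shapes mirroring the selected columns; real types may differ.
case class TextRow(Text: String, coordinates: String, LocX: String, LocY: String)
case class ClassifiedRow(Text: String, coordinates: String, LocX: String,
                         LocY: String, TextClass: String)

val classifiedDs = df
  .select(col("Text"), col("coordinates"), col("LocX"), col("LocY"))
  .as[TextRow]
  .mapPartitions { rows =>
    // Per-partition setup goes here: anything expensive (an HTTP connection
    // pool, auth tokens, a Gson instance) is built once per partition rather
    // than once per row.
    rows.map { row =>
      val label = testFunc(row.Text) // the same REST call as above
      ClassifiedRow(row.Text, row.coordinates, row.LocX, row.LocY, label)
    }
  }

Compared with the UDF, this keeps the service call out of Catalyst expression evaluation and makes error handling explicit; see also the note after the stack trace below.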

Edit: Stack trace:

Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (string) => string)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
    at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:144)
    at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
    at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$22.applyOrElse(Optimizer.scala:1147)
    at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$22.applyOrElse(Optimizer.scala:1142)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
    at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$.apply(Optimizer.scala:1142)
    at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$.apply(Optimizer.scala:1141)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
    at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
    at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
    at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
    at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
    at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2832)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2153)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2366)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:644)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:603)
    at org.apache.spark.sql.Dataset.show(Dataset.scala:612)
    at OCRJson$$anonfun$main$1.apply(OCRJson.scala:112)
    at OCRJson$$anonfun$main$1.apply(OCRJson.scala:24)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at OCRJson$.main(OCRJson.scala:24)
    at OCRJson.main(OCRJson.scala)
Caused by: com.google.gson.JsonSyntaxException: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was STRING at line 1 column 1 path $
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:226)
    at com.google.gson.Gson.fromJson(Gson.java:922)
    at com.google.gson.Gson.fromJson(Gson.java:887)
    at com.google.gson.Gson.fromJson(Gson.java:836)
    at com.google.gson.Gson.fromJson(Gson.java:808)
    at OCRJson$$anonfun$main$1.OCRJson$$anonfun$$ner$1(OCRJson.scala:69)
    at OCRJson$$anonfun$main$1$$anonfun$5.apply(OCRJson.scala:78)
    at OCRJson$$anonfun$main$1$$anonfun$5.apply(OCRJson.scala:78)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:92)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:91)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072)
    ... 87 more
Caused by: java.lang.IllegalStateException: Expected BEGIN_OBJECT but was STRING at line 1 column 1 path $
    at com.google.gson.stream.JsonReader.beginObject(JsonReader.java:385)
    at com.google.gson.internal.bind.ReflectiveTypeAdapterFactory$Adapter.read(ReflectiveTypeAdapterFactory.java:215)
    ... 97 more
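Two things stand out in this trace. First, the exception is thrown in thread "main" from the optimizer's ConvertToLocalRelation rule, which suggests the UDF is being evaluated on the driver during query planning, not only on executors. Second, the Caused by lines show the real failure is JSON parsing, not Spark: Gson expected a JSON object but the service returned a bare string at line 1 column 1, i.e. on some requests the API responds with something other than the expected payload (an error message, a throttling response, etc.), which would explain the sporadic errors. A minimal defensive variant of the response handling, assuming the rootNerJson model from the question (the Try wrapper and fallback are additions for illustration, not from the original code):

import scala.util.Try

// Defensive parse: any body that is not the expected JSON object (e.g. a bare
// error string) maps to "Null" instead of throwing JsonSyntaxException.
def parseLabel(gson: Gson, body: String): String =
  Try(gson.fromJson(body, classOf[rootNerJson]))
    .toOption
    .filter(json => json != null && json.classes != null && json.classes.length > 0)
    .map(_.classes(0).label)
    .getOrElse("Null")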

0 Answers:

There are no answers yet.