I am using Java Spark Structured Streaming with a UDF that returns a complex object. The program is very simple, but it fails with the following exception:
2019-07-23 17:43:40 WARN TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$33: (binary, binary, string, int, bigint, timestamp) => struct<>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.foreach(WholeStageCodegenExec.scala:612)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$runContinuous$1.apply$mcV$sp(WriteToDataSourceV2.scala:158)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$runContinuous$1.apply(WriteToDataSourceV2.scala:157)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$runContinuous$1.apply(WriteToDataSourceV2.scala:157)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.runContinuous(WriteToDataSourceV2.scala:170)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$1.apply(WriteToDataSourceV2.scala:76)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$1.apply(WriteToDataSourceV2.scala:75)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: scala.MatchError: com.my.Result@15cacb4e (of class com.my.Result)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:236)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:231)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
The error is a scala.MatchError on the returned com.my.Result instance, thrown from CatalystTypeConverters, so it looks like Spark cannot convert the bean into the struct type the UDF was registered with.
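For reference, Result is a plain Java bean along these lines (a sketch only; the real class is not shown here, and the fields are inferred from the UDF body below):

package com.my;

import java.io.Serializable;

// Bean whose schema is derived via Encoders.bean(Result.class).
public class Result implements Serializable {
    public String value;
    public String trace;

    public String getValue() { return value; }
    public void setValue(String value) { this.value = value; }
    public String getTrace() { return trace; }
    public void setTrace(String trace) { this.trace = trace; }
}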
Here is the source code; it just reads from Kafka and runs a UDF that returns a Result:
String inputServers = "localhost:9092";
String inputTopic = "mytopic";
SparkSession spark = SparkSession
.builder()
.appName("Spark_Java")
.getOrCreate();
// UDF over the six Kafka source columns:
// (key: binary, value: binary, topic: string, partition: int, offset: bigint, timestamp: timestamp) -> Result
UDF6<byte[], byte[], String, Integer, Long, Timestamp, Result> verifyUdf =
    (key, value, topic, partition, offset, timestamp) -> {
        Result result = new Result();
        result.value = "value from topic: " + topic;
        result.trace = "trace: partition/offset=[" + partition + "][" + offset + "]";
        return result;
    };
// Derive the struct schema from the Result bean and register the UDF with it.
StructType schema = Encoders.bean(Result.class).schema();
spark.udf().register("verify", verifyUdf, schema);
// Read from Kafka; the source exposes key, value, topic, partition, offset and timestamp columns.
Dataset<Row> lines = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", inputServers)
    .option("subscribe", inputTopic)
    .load();
// Add a "result" struct column produced by the UDF.
Dataset<Row> verify = lines.withColumn(
    "result",
    functions.callUDF("verify",
        lines.col("key"),
        lines.col("value"),
        lines.col("topic"),
        lines.col("partition"),
        lines.col("offset"),
        lines.col("timestamp")));
// Write to the console sink with a continuous trigger.
StreamingQuery query = verify.writeStream()
    .format("console")
    .option("checkpointLocation", "/temp/testcheckpoint_java2/")
    .trigger(Trigger.Continuous(1000))
    .outputMode("update")
    .start();
query.awaitTermination();
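My guess from the MatchError in CatalystTypeConverters is that a Java UDF declared with a struct return type has to return a generic Row rather than an arbitrary bean. A workaround along these lines seems plausible (a sketch only, untested; it builds the same struct by hand with an explicit schema and RowFactory):

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Declare the struct type explicitly instead of deriving it from the bean.
StructType resultSchema = new StructType()
    .add("value", DataTypes.StringType)
    .add("trace", DataTypes.StringType);

// Same logic as verifyUdf, but returning a Row whose fields match resultSchema.
UDF6<byte[], byte[], String, Integer, Long, Timestamp, Row> verifyRowUdf =
    (key, value, topic, partition, offset, timestamp) ->
        RowFactory.create(
            "value from topic: " + topic,
            "trace: partition/offset=[" + partition + "][" + offset + "]");

spark.udf().register("verify", verifyRowUdf, resultSchema);

Is returning a Row the intended way to produce a struct column from a Java UDF, or should Encoders.bean(Result.class).schema() work with the bean directly?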