Spark UDF returning a struct: scala.MatchError

Time: 2019-07-23 11:24:05

Tags: java apache-spark apache-spark-sql

I am using Java Spark Structured Streaming with a UDF that returns a complex object. The program is quite simple, but it fails with the following exception:

2019-07-23 17:43:40 WARN  TaskSetManager:66 - Lost task 0.0 in stage 0.0 (TID 0, localhost, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$33: (binary, binary, string, int, bigint, timestamp) => struct<>)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
        at scala.collection.Iterator$class.foreach(Iterator.scala:893)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.foreach(WholeStageCodegenExec.scala:612)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$runContinuous$1.apply$mcV$sp(WriteToDataSourceV2.scala:158)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$runContinuous$1.apply(WriteToDataSourceV2.scala:157)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$runContinuous$1.apply(WriteToDataSourceV2.scala:157)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1414)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.runContinuous(WriteToDataSourceV2.scala:170)
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$1.apply(WriteToDataSourceV2.scala:76)
        at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$1.apply(WriteToDataSourceV2.scala:75)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: scala.MatchError: com.my.Result@15cacb4e (of class com.my.Result)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:236)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:231)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)

The error is a scala.MatchError. Judging from the stack trace, the Catalyst backend (CatalystTypeConverters$StructConverter) does not know how to convert the returned object into the declared struct type.
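
For reference, here is a hypothetical reconstruction of the com.my.Result bean implied by the UDF body below (the actual class is not shown in the question). Note that it is a plain Java bean, not an org.apache.spark.sql.Row or a Scala Product:

    // Hypothetical sketch only: inferred from the UDF body, which assigns
    // result.value and result.trace. The real com.my.Result may differ.
    public class Result implements java.io.Serializable {
        public String value;
        public String trace;

        // Bean-style accessors so Encoders.bean(Result.class) can derive a schema.
        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
        public String getTrace() { return trace; }
        public void setTrace(String trace) { this.trace = trace; }
    }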

Here is the source code. It simply reads from Kafka and calls a UDF that returns a Result object.

    String inputServers = "localhost:9092";
    String inputKafka = "mytopic";

    SparkSession spark = SparkSession
            .builder()
            .appName("Spark_Java")
            .getOrCreate();

    // UDF over the Kafka record columns (key, value, topic, partition,
    // offset, timestamp), returning a Result bean.
    UDF6<byte[], byte[], String, Integer, Long, Timestamp, Result> verifyUdf =
            (_0, _1, _2, _3, _4, _5) -> {
        Result result = new Result();
        result.value = "value from topic: " + _2;
        result.trace = "trace: partition/offset=[" + _3 + "][" + _4 + "]";
        return result;
    };

    // Derive the struct return type from the Result bean and register the UDF with it.
    StructType schema = Encoders.bean(Result.class).schema();
    spark.udf().register("verify", verifyUdf, schema);

    Dataset<Row> lines = spark
            .readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", inputServers)
            .option("subscribe", inputKafka)
            .load();

    Dataset<Row> verify =
            lines.withColumn(
                    "result",
                    functions.callUDF("verify",
                    lines.col("key"),
                    lines.col("value"),
                    lines.col("topic"),
                    lines.col("partition"),
                    lines.col("offset"),
                    lines.col("timestamp"))
            );

    StreamingQuery query = verify.writeStream()
                    .format("console")
                    .option("checkpointLocation", "/temp/testcheckpoint_java2/")
                    .trigger(Trigger.Continuous(1000))
                    .outputMode("Update")
                    .start();

    query.awaitTermination();
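
Based on the stack trace, CatalystTypeConverters$StructConverter appears to pattern-match only Row and Scala Product values, so a Java bean returned from a UDF registered with an explicit StructType triggers the MatchError. A minimal, unverified sketch of a workaround is to return an org.apache.spark.sql.Row built with RowFactory.create instead of the bean, assuming Encoders.bean sorts the bean's properties alphabetically (trace before value):

    // Sketch only: same logic, but returning a Row so Catalyst can convert it.
    // Assumes imports of org.apache.spark.sql.Row and org.apache.spark.sql.RowFactory.
    UDF6<byte[], byte[], String, Integer, Long, Timestamp, Row> verifyRowUdf =
            (_0, _1, _2, _3, _4, _5) -> RowFactory.create(
                    "trace: partition/offset=[" + _3 + "][" + _4 + "]", // trace
                    "value from topic: " + _2);                         // value

    // Register against the same bean-derived schema as before.
    spark.udf().register("verify", verifyRowUdf, Encoders.bean(Result.class).schema());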

0 Answers:

No answers yet