Apache Beam pipeline with PubsubIO fails on the Spark Runner: PubsubUnboundedSource$PubsubReader.getWatermark(PubsubUnboundedSource.java:1030)

Date: 2018-01-02 23:02:21

Tags: apache-spark apache-beam google-cloud-pubsub

A Beam pipeline that uses PubsubIO runs fine on the Direct Runner and the Dataflow runner, but when I run it on the Spark Runner (a standalone Spark instance) I get a PubsubUnboundedSource error.

Here is the snippet that reads from a GCP Pub/Sub subscription, parses the contents of each Pub/Sub message into an object with a DoFn, extracts the event time from the object, and windows the resulting PCollection into 20-second windows:

    // Take input from Pub/Sub and make a PCollection of TweetObjs
    PCollection<TweetObj> pubSub_input = pipeline
            .apply(PubsubIO.readStrings().fromTopic(options.getPubsubTopic()))
            .apply("ParseTweetFromPubSub", ParDo.of(new ProcessEachElement()))
            .apply("AddEventTimestamps", WithTimestamps.of((TweetObj i) -> new Instant(i.getTimestamp()))
                    .withAllowedTimestampSkew(new Duration(Long.MAX_VALUE)))
            .apply("WindowTweetIntoSeconds",
                    Window.<TweetObj>into(FixedWindows.of(Duration.standardSeconds(20)))
                            .triggering(AfterWatermark.pastEndOfWindow()
                                    .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                                            .plusDelayOf(Duration.standardSeconds(5)))
                                    .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                                            .plusDelayOf(Duration.standardSeconds(5))))
                            .withAllowedLateness(Duration.millis(500))
                            .discardingFiredPanes());
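For completeness, ProcessEachElement and TweetObj are not shown above. A minimal sketch of what the parsing DoFn could look like is below; the TweetObj.fromJson helper is a hypothetical placeholder, not the actual implementation:

    // Hypothetical sketch only -- the real ProcessEachElement is not shown in the question.
    static class ProcessEachElement extends DoFn<String, TweetObj> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // Parse the raw Pub/Sub message payload into a TweetObj (parsing logic assumed).
            TweetObj tweet = TweetObj.fromJson(c.element());
            c.output(tweet);
        }
    }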

I have cross-referenced the Beam runner capability matrix and found no conflicts there.
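For context, the pipeline is submitted to the standalone Spark instance roughly as sketched below; the master URL is a placeholder, and the exact options in my setup may differ:

    // Sketch of the runner setup (the master URL is a placeholder).
    SparkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    options.setSparkMaster("spark://localhost:7077"); // standalone Spark master
    options.setStreaming(true); // PubsubIO is an unbounded source
    Pipeline pipeline = Pipeline.create(options);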

Here is the error I get when running this Beam pipeline on the Spark Runner (it works with Dataflow and the DirectRunner). According to https://beam.apache.org/documentation/runners/capability-matrix/ the Spark Runner supports event-time triggers, which is what I am using.

18/01/02 14:53:25 ERROR JobScheduler: Error generating jobs for time 1514933587500 ms
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 8.0 failed 1 times, most recent failure: Lost task 5.0 in stage 8.0 (TID 18, localhost, executor driver): java.lang.NullPointerException
        at org.apache.beam.sdk.io.gcp.pubsub.PubsubUnboundedSource$PubsubReader.getWatermark(PubsubUnboundedSource.java:1030)
        at org.apache.beam.runners.spark.io.MicrobatchSource$Reader.getWatermark(MicrobatchSource.java:292)
        at org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:180)
        at org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:105)
        at org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:181)
        at org.apache.spark.streaming.StateSpec$$anonfun$1.apply(StateSpec.scala:180)
        at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:57)
        at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
        at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:159)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
        at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:153)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
        at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
        at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
        at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
        at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
        at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:108)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

0 Answers:

There are no answers yet.