Spark Streaming foreachRDD error: Comparison method violates its general contract!

Date: 2015-07-17 13:36:28

Tags: java apache-spark spark-streaming

I am testing the Spark Streaming API. The application is deployed on an Amazon EMR cluster running Spark 1.4.0. I sort the data and save the resulting files to S3.

The pipeline code (excluding the sorting algorithm) is detailed below:

public KinesisPreProcessPipeline(JavaStreamingContext jssc, final KinesisPreProcessModuleConfiguration moduleConfiguration) {
    // Receive raw records from the Kinesis stream, batched every 5 seconds.
    JavaReceiverInputDStream<byte[]> inputDStream = KinesisUtils.createStream(jssc, moduleConfiguration.getAppName(), moduleConfiguration.getStreamName(),
            "kinesis." + moduleConfiguration.getRegion() + ".amazonaws.com", moduleConfiguration.getRegion(), InitialPositionInStream.LATEST,
            Durations.seconds(5), StorageLevel.MEMORY_AND_DISK_SER());

    // Deserialize each record from JSON (jsonParser is an instance field, not shown here).
    JavaDStream<StreamingMessage> messageJavaDStream = inputDStream.map(new Function<byte[], StreamingMessage>() {
        @Override
        public StreamingMessage call(byte[] bytes) throws Exception {
            return jsonParser.fromJson(new String(bytes), StreamingMessage.class);
        }
    });

    final String destinationFolder = moduleConfiguration.getDestinationFolder();

    // Run the sort step over the stream (SortPreProcess is not shown here;
    // see the sketch after this block for where the sort actually runs).
    StreamingPreProcessPipeline pipeline = new StreamingPreProcessPipeline().withInputDStream(messageJavaDStream)
            .withPreProcessStep(new SortPreProcess());

    JavaDStream<StreamingMessage> output = pipeline.execute();

    // Checkpoint at twice the batch interval to truncate the lineage.
    output.checkpoint(Durations.seconds(moduleConfiguration.getBatchInterval() * 2));

    // Serialize the sorted messages back to JSON.
    JavaDStream<String> messagesAsJson = output.map(new Function<StreamingMessage, String>() {
        @Override
        public String call(StreamingMessage message) throws Exception {
            return jsonParser.toJson(message);
        }
    });

    // Write each batch to S3; saveAsTextFile is the action that triggers execution
    // (dateFormat is an instance field, not shown here).
    messagesAsJson.foreachRDD(new Function<JavaRDD<String>, Void>() {
        @Override
        public Void call(JavaRDD<String> rdd) throws Exception {
            rdd.saveAsTextFile(destinationFolder + "/" + dateFormat.print(new DateTime()) + "-" + rdd.id());
            return null;
        }
    });
}
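The SortPreProcess implementation is not shown in the question. As a rough sketch of where Spark's TimSort comes into play, a sort step over the stream might transform each batch with sortBy, assuming StreamingMessage implements Comparable and serves as its own sort key (hypothetical code, not the actual SortPreProcess):

// Hypothetical sketch of a sort step; not the actual SortPreProcess from the question.
JavaDStream<StreamingMessage> sorted = messageJavaDStream.transform(
        new Function<JavaRDD<StreamingMessage>, JavaRDD<StreamingMessage>>() {
            @Override
            public JavaRDD<StreamingMessage> call(JavaRDD<StreamingMessage> rdd) throws Exception {
                // sortBy shuffles the batch; on the read side Spark's ExternalSorter
                // orders the records with TimSort, which is exactly where an
                // inconsistent compareTo fails (see the stack trace below).
                return rdd.sortBy(new Function<StreamingMessage, StreamingMessage>() {
                    @Override
                    public StreamingMessage call(StreamingMessage m) throws Exception {
                        return m;
                    }
                }, true, rdd.partitions().size());
            }
        });

Because the transformation is lazy, this sort only executes when an action such as saveAsTextFile runs inside foreachRDD.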

When the application runs on the cluster, it fails quickly with the following error:


15/07/17 13:17:36 ERROR executor.Executor: Exception in task 0.1 in stage 8.0 (TID 90)
java.lang.IllegalArgumentException: Comparison method violates its general contract!
        at org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:776)
        at org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:507)
        at org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:435)
        at org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:307)
        at org.apache.spark.util.collection.TimSort.sort(TimSort.java:135)
        at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
        at org.apache.spark.util.collection.PartitionedPairBuffer.partitionedDestructiveSortedIterator(PartitionedPairBuffer.scala:70)
        at org.apache.spark.util.collection.ExternalSorter.partitionedIterator(ExternalSorter.scala:690)
        at org.apache.spark.util.collection.ExternalSorter.iterator(ExternalSorter.scala:708)
        at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:64)
        at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
        at org.apache.spark.scheduler.Task.run(Task.scala:70)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

The error surfaces at the foreachRDD step, but I am still searching for the reason it fails...

1 Answer:

Answer 0 (score: 2):

The class used for sorting had a bug in its compareTo implementation. The javadoc for Comparable recommends implementing compareTo consistently with equals(). After fixing this bug, the Spark job worked as expected.
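For illustration, here is a minimal sketch of the kind of compareTo bug that produces this exception. The timestamp field is hypothetical (the real StreamingMessage implementation is not shown in the question); the point is that a difference-based comparison can overflow int, producing an inconsistent ordering that TimSort detects and reports as a contract violation.

// Hypothetical sketch; the real StreamingMessage fields are not shown in the question.
public class StreamingMessage implements Comparable<StreamingMessage> {
    private long timestamp;

    // BROKEN: (int) (a - b) can overflow. For timestamps that differ by
    // more than Integer.MAX_VALUE, both a.compareTo(b) < 0 and
    // b.compareTo(a) < 0 can hold, which violates the Comparable contract
    // and makes TimSort throw the exception above.
    @Override
    public int compareTo(StreamingMessage other) {
        return (int) (this.timestamp - other.timestamp);
    }
}

The fix is an overflow-safe comparison that defines a consistent total order:

    // FIXED: Long.compare cannot overflow and is antisymmetric and transitive.
    @Override
    public int compareTo(StreamingMessage other) {
        return Long.compare(this.timestamp, other.timestamp);
    }

As the Comparable javadoc advises, also keep compareTo consistent with equals() (compareTo returns 0 exactly when equals() returns true) so the class behaves predictably in sorted collections.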