I am testing the Spark Streaming API. The application is deployed on an Amazon EMR cluster running Spark 1.4.0. I am sorting data and saving the resulting files to S3.

The pipeline code (excluding the sorting algorithm) is detailed below:
public KinesisPreProcessPipeline(JavaStreamingContext jssc, final KinesisPreProcessModuleConfiguration moduleConfiguration) {
    JavaReceiverInputDStream<byte[]> inputDStream = KinesisUtils.createStream(jssc,
            moduleConfiguration.getAppName(), moduleConfiguration.getStreamName(),
            "kinesis." + moduleConfiguration.getRegion() + ".amazonaws.com",
            moduleConfiguration.getRegion(), InitialPositionInStream.LATEST,
            Durations.seconds(5), StorageLevel.MEMORY_AND_DISK_SER());

    JavaDStream<StreamingMessage> messageJavaDStream = inputDStream.map(new Function<byte[], StreamingMessage>() {
        @Override
        public StreamingMessage call(byte[] bytes) throws Exception {
            return jsonParser.fromJson(new String(bytes), StreamingMessage.class);
        }
    });

    final String destinationFolder = moduleConfiguration.getDestinationFolder();

    StreamingPreProcessPipeline pipeline = new StreamingPreProcessPipeline()
            .withInputDStream(messageJavaDStream)
            .withPreProcessStep(new SortPreProcess());

    JavaDStream<StreamingMessage> output = pipeline.execute();
    output.checkpoint(Durations.seconds(moduleConfiguration.getBatchInterval() * 2));

    JavaDStream<String> messagesAsJson = output.map(new Function<StreamingMessage, String>() {
        @Override
        public String call(StreamingMessage message) throws Exception {
            return jsonParser.toJson(message);
        }
    });

    messagesAsJson.foreachRDD(new Function<JavaRDD<String>, Void>() {
        @Override
        public Void call(JavaRDD<String> rdd) throws Exception {
            rdd.saveAsTextFile(destinationFolder + "/" + dateFormat.print(new DateTime()) + "-" + rdd.id());
            return null;
        }
    });
}
When the application runs on the cluster, it fails quickly with the following error:

15/07/17 13:17:36 ERROR executor.Executor: Exception in task 0.1 in stage 8.0 (TID 90)
java.lang.IllegalArgumentException: Comparison method violates its general contract!
    at org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:776)
    at org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:507)
    at org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:435)
    at org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:307)
    at org.apache.spark.util.collection.TimSort.sort(TimSort.java:135)
    at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
    at org.apache.spark.util.collection.PartitionedPairBuffer.partitionedDestructiveSortedIterator(PartitionedPairBuffer.scala:70)
    at org.apache.spark.util.collection.ExternalSorter.partitionedIterator(ExternalSorter.scala:690)
    at org.apache.spark.util.collection.ExternalSorter.iterator(ExternalSorter.scala:708)
    at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:64)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
The error occurs in the foreachRDD step, but I am still trying to work out why it fails...
Answer (score: 2)
The class used for sorting had a bug in its compareTo implementation. The javadoc of Comparable recommends implementing compareTo in a way that is consistent with equals(). After fixing this bug, the Spark job worked as expected.
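The question does not show the actual StreamingMessage class, so the following is only an illustrative sketch of the kind of compareTo bug that makes TimSort throw "Comparison method violates its general contract!". A common culprit is comparing numeric fields by subtraction, which can overflow and break transitivity; the hypothetical Message class below (with an assumed long timestamp field) shows the broken pattern in a comment and the safe fix using Long.compare:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CompareToDemo {
    // Hypothetical message type standing in for StreamingMessage.
    static class Message implements Comparable<Message> {
        final long timestamp;

        Message(long timestamp) {
            this.timestamp = timestamp;
        }

        // Broken version (do NOT do this): the subtraction can overflow,
        // so a < b and b < c does not guarantee a < c, and TimSort may
        // fail with "Comparison method violates its general contract!":
        //
        //   public int compareTo(Message o) {
        //       return (int) (timestamp - o.timestamp);
        //   }

        // Fixed version: Long.compare never overflows, so the ordering
        // is total, transitive, and consistent for all inputs.
        @Override
        public int compareTo(Message o) {
            return Long.compare(timestamp, o.timestamp);
        }
    }

    public static void main(String[] args) {
        List<Message> msgs = new ArrayList<>();
        msgs.add(new Message(3L));
        msgs.add(new Message(Long.MIN_VALUE)); // extreme value that would
                                               // overflow the subtraction
        msgs.add(new Message(1L));

        Collections.sort(msgs); // safe with the fixed compareTo
        for (Message m : msgs) {
            System.out.println(m.timestamp);
        }
    }
}
```

The same rule applies to any Comparator passed to Spark's sortBy/sortByKey: the comparison must be transitive and stable for every pair of values the job can produce, or the shuffle-side TimSort will fail exactly as in the stack trace above.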