To increase parallelism as suggested in the Spark Streaming Programming Guide, I set up multiple receivers and am trying to union them together. This code works as expected:
private JavaDStream<SparkFlumeEvent> getEventsWorking(JavaStreamingContext jssc, List<String> hosts, List<String> ports) {
    List<JavaReceiverInputDStream<SparkFlumeEvent>> receivers = new ArrayList<>();
    for (String host : hosts) {
        for (String port : ports) {
            receivers.add(FlumeUtils.createStream(jssc, host, Integer.parseInt(port)));
        }
    }
    JavaDStream<SparkFlumeEvent> unionStreams = receivers.get(0)
            .union(receivers.get(1))
            .union(receivers.get(2))
            .union(receivers.get(3))
            .union(receivers.get(4))
            .union(receivers.get(5));
    return unionStreams;
}
But I don't actually know how many receivers my cluster will have at runtime. When I try to build the union in a loop instead, I get an NPE:
private JavaDStream<SparkFlumeEvent> getEventsNotWorking(JavaStreamingContext jssc, List<String> hosts, List<String> ports) {
    List<JavaReceiverInputDStream<SparkFlumeEvent>> receivers = new ArrayList<>();
    for (String host : hosts) {
        for (String port : ports) {
            receivers.add(FlumeUtils.createStream(jssc, host, Integer.parseInt(port)));
        }
    }
    JavaDStream<SparkFlumeEvent> unionStreams = null;
    for (JavaReceiverInputDStream<SparkFlumeEvent> receiver : receivers) {
        if (unionStreams == null) {
            unionStreams = receiver;
        } else {
            unionStreams.union(receiver);
        }
    }
    return unionStreams;
}
ERROR:
16/09/15 17:05:25 ERROR JobScheduler: Error in job generator
java.lang.NullPointerException
    at org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
    at scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
    at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
    at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
    at scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
    at org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
    at org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:270)
    at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:182)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
16/09/15 17:05:25 INFO MemoryStore: ensureFreeSpace(15128) called with curMem=520144, maxMem=555755765
16/09/15 17:05:25 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 14.8 KB, free 529.5 MB)
Exception in thread "main" java.lang.NullPointerException
    at org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:172)
    at scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
    at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
    at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
    at scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
    at org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:172)
    at org.apache.spark.streaming.scheduler.JobGenerator.clearMetadata(JobGenerator.scala:270)
    at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:182)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:87)
    at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:86)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
What is the correct way to do this?
Answer 0 (score: 0)
Try the code below; it should solve your problem. The loop version fails because DStream.union() returns a new DStream rather than modifying the stream it is called on, so unionStreams.union(receiver) silently discards its result: only the first receiver ends up wired into an output, and the remaining input streams are left uninitialized in the DStream graph, which is the likely source of the NullPointerException in getMaxInputStreamRememberDuration. JavaStreamingContext.union() combines all the streams in one call:
private JavaDStream<SparkFlumeEvent> getEventsNotWorking(JavaStreamingContext jssc, List<String> hosts, List<String> ports) {
    List<JavaDStream<SparkFlumeEvent>> receivers = new ArrayList<>();
    for (String host : hosts) {
        for (String port : ports) {
            receivers.add(FlumeUtils.createStream(jssc, host, Integer.parseInt(port)));
        }
    }
    // Union the first stream with all the rest in a single call.
    // Assumes hosts and ports are non-empty, so receivers has at least one element.
    return jssc.union(receivers.get(0), receivers.subList(1, receivers.size()));
}
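For completeness, the loop from the question can also be made to work by reassigning the result of each union() call. Below is a minimal sketch of that variant, assuming the same imports and non-empty hosts and ports lists as above (the method name getEventsLoopFixed is illustrative, not from the original post):

// Sketch only: fixes the question's loop by keeping the result of union().
private JavaDStream<SparkFlumeEvent> getEventsLoopFixed(JavaStreamingContext jssc, List<String> hosts, List<String> ports) {
    List<JavaReceiverInputDStream<SparkFlumeEvent>> receivers = new ArrayList<>();
    for (String host : hosts) {
        for (String port : ports) {
            receivers.add(FlumeUtils.createStream(jssc, host, Integer.parseInt(port)));
        }
    }
    // Start from the first receiver and fold the rest in, reassigning each time;
    // union() returns a new DStream and does not modify the one it is called on.
    JavaDStream<SparkFlumeEvent> unionStreams = receivers.get(0);
    for (int i = 1; i < receivers.size(); i++) {
        unionStreams = unionStreams.union(receivers.get(i));
    }
    return unionStreams;
}

Either way, the essential fix is the same: every DStream transformation returns a new stream, so its result has to be kept.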