Double counting when using accumulators with Spark Streaming

Asked: 2016-01-05 16:22:09

Tags: java apache-spark spark-streaming

I have a very simple program. I read input from Kafka, save it to a file, and count the received records with an accumulator.

My code looks like this. When I run it, the accumulator counts every input record twice.

HashMap<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
kafkaParams.put("zookeeper.connect", "localhost:2181");

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topicsSet);

final Accumulator<Integer> accum = jssc.sparkContext().accumulator(0);

JavaDStream<String> lines = messages.map(
        new Function<Tuple2<String, String>, String>() {
            public String call(Tuple2<String, String> tuple2) {
                accum.add(1);
                return tuple2._2();
            }
        });

lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    public Void call(JavaRDD<String> rdd) throws Exception {
        if (!rdd.isEmpty() || !rdd.partitions().isEmpty()) {
            rdd.saveAsTextFile("hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
        }
        System.out.println(" &&&&&&&&&&&&&&&&&&&&& COUNT OF ACCUMULATOR IS " + accum.value());
        return null;
    }
});

jssc.start();

If I remove the saveAsTextFile call, I get the correct count for each record; with it, every record is counted twice.

Here is the log trace with the saveAsTextFile statement. Please see the double counting below:

&&&&&&&&&&&&&&&&&&&&&&& BEFORE COUNT OF ACCUMULATOR IS &&&&&&&&&&&&&&& 0
INFO : org.apache.spark.SparkContext - Starting job: foreachRDD at KafkaURLStreaming.java:90
INFO : org.apache.spark.scheduler.DAGScheduler - Got job 0 (foreachRDD at KafkaURLStreaming.java:90) with 1 output partitions
INFO : org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 0(foreachRDD at KafkaURLStreaming.java:90)
INFO : org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
INFO : org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
INFO : org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 0 (MapPartitionsRDD[1] at map at KafkaURLStreaming.java:83), which has no missing parents
INFO : org.apache.spark.storage.MemoryStore - ensureFreeSpace(3856) called with curMem=0, maxMem=1893865881
INFO : org.apache.spark.storage.MemoryStore - Block broadcast_0 stored as values in memory (estimated size 3.8 KB, free 1806.1 MB)
INFO : org.apache.spark.storage.MemoryStore - ensureFreeSpace(2225) called with curMem=3856, maxMem=1893865881
INFO : org.apache.spark.storage.MemoryStore - Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.2 KB, free 1806.1 MB)
INFO : org.apache.spark.storage.BlockManagerInfo - Added broadcast_0_piece0 in memory on localhost:51637 (size: 2.2 KB, free: 1806.1 MB)
INFO : org.apache.spark.SparkContext - Created broadcast 0 from broadcast at DAGScheduler.scala:861
INFO : org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at KafkaURLStreaming.java:83)
INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 0.0 with 1 tasks
INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 0.0 (TID 0, localhost, ANY, 2026 bytes)
INFO : org.apache.spark.executor.Executor - Running task 0.0 in stage 0.0 (TID 0)
INFO : org.apache.spark.streaming.kafka.KafkaRDD - Computing topic test11, partition 0 offsets 36 -> 37
INFO : kafka.utils.VerifiableProperties - Verifying properties
INFO : kafka.utils.VerifiableProperties - Property fetch.message.max.bytes is overridden to 1073741824
INFO : kafka.utils.VerifiableProperties - Property group.id is overridden to 
INFO : kafka.utils.VerifiableProperties - Property zookeeper.connect is overridden to localhost:2181
INFO : com.markmonitor.antifraud.ce.KafkaURLStreaming - #################  Input json stream data  ################# one test message
INFO : org.apache.spark.executor.Executor - Finished task 0.0 in stage 0.0 (TID 0). 972 bytes result sent to driver
INFO : org.apache.spark.scheduler.DAGScheduler - ResultStage 0 (foreachRDD at KafkaURLStreaming.java:90) finished in 0.133 s
INFO : org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 0.0 (TID 0) in 116 ms on localhost (1/1)
INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 0.0, whose tasks have all completed, from pool 
INFO : org.apache.spark.scheduler.DAGScheduler - Job 0 finished: foreachRDD at KafkaURLStreaming.java:90, took 0.496657 s
INFO : org.apache.spark.ContextCleaner - Cleaned accumulator 2
INFO : org.apache.spark.storage.BlockManagerInfo - Removed broadcast_0_piece0 on localhost:51637 in memory (size: 2.2 KB, free: 1806.1 MB)
INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.tip.id is deprecated. Instead, use mapreduce.task.id
INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.job.id is deprecated. Instead, use mapreduce.job.id
INFO : org.apache.spark.SparkContext - Starting job: foreachRDD at KafkaURLStreaming.java:90
INFO : org.apache.spark.scheduler.DAGScheduler - Got job 1 (foreachRDD at KafkaURLStreaming.java:90) with 1 output partitions
INFO : org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 1(foreachRDD at KafkaURLStreaming.java:90)
INFO : org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List()
INFO : org.apache.spark.scheduler.DAGScheduler - Missing parents: List()
INFO : org.apache.spark.scheduler.DAGScheduler - Submitting ResultStage 1 (MapPartitionsRDD[2] at foreachRDD at KafkaURLStreaming.java:90), which has no missing parents
INFO : org.apache.spark.storage.MemoryStore - ensureFreeSpace(97104) called with curMem=0, maxMem=1893865881
INFO : org.apache.spark.storage.MemoryStore - Block broadcast_1 stored as values in memory (estimated size 94.8 KB, free 1806.0 MB)
INFO : org.apache.spark.storage.MemoryStore - ensureFreeSpace(32204) called with curMem=97104, maxMem=1893865881
INFO : org.apache.spark.storage.MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 31.4 KB, free 1806.0 MB)
INFO : org.apache.spark.storage.BlockManagerInfo - Added broadcast_1_piece0 in memory on localhost:51637 (size: 31.4 KB, free: 1806.1 MB)
INFO : org.apache.spark.SparkContext - Created broadcast 1 from broadcast at DAGScheduler.scala:861
INFO : org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[2] at foreachRDD at KafkaURLStreaming.java:90)
INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 1.0 with 1 tasks
INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 1.0 (TID 1, localhost, ANY, 2026 bytes)
INFO : org.apache.spark.executor.Executor - Running task 0.0 in stage 1.0 (TID 1)
INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
INFO : org.apache.hadoop.conf.Configuration.deprecation - mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
INFO : org.apache.spark.streaming.kafka.KafkaRDD - Computing topic test11, partition 0 offsets 36 -> 37
INFO : kafka.utils.VerifiableProperties - Verifying properties
INFO : kafka.utils.VerifiableProperties - Property fetch.message.max.bytes is overridden to 1073741824
INFO : kafka.utils.VerifiableProperties - Property group.id is overridden to 
INFO : kafka.utils.VerifiableProperties - Property zookeeper.connect is overridden to localhost:2181
INFO : com.markmonitor.antifraud.ce.KafkaURLStreaming - #################  Input json stream data  ################# one test message
INFO : org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_201601050824_0001_m_000000_1' to hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text/_temporary/0/task_201601050824_0001_m_000000
INFO : org.apache.spark.mapred.SparkHadoopMapRedUtil - attempt_201601050824_0001_m_000000_1: Committed
INFO : org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 933 bytes result sent to driver
INFO : org.apache.spark.scheduler.DAGScheduler - ResultStage 1 (foreachRDD at KafkaURLStreaming.java:90) finished in 0.758 s
INFO : org.apache.spark.scheduler.DAGScheduler - Job 1 finished: foreachRDD at KafkaURLStreaming.java:90, took 0.888585 s
INFO : org.apache.spark.scheduler.TaskSetManager - Finished task 0.0 in stage 1.0 (TID 1) in 760 ms on localhost (1/1)
INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Removed TaskSet 1.0, whose tasks have all completed, from pool 
 &&&&&&&&&&&&&&&&&&&&& AFTER COUNT OF ACCUMULATOR IS 2

But if I comment out the saveAsTextFile call, I get the correct count of one for each input:

INFO : org.apache.spark.storage.MemoryStore - ensureFreeSpace(2227) called with curMem=9937, maxMem=1893865881
INFO : org.apache.spark.storage.MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 1806.1 MB)
INFO : org.apache.spark.storage.BlockManagerInfo - Added broadcast_1_piece0 in memory on localhost:58397 (size: 2.2 KB, free: 1806.1 MB)
INFO : org.apache.spark.SparkContext - Created broadcast 1 from broadcast at DAGScheduler.scala:861
INFO : org.apache.spark.scheduler.DAGScheduler - Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[3] at map at KafkaURLStreaming.java:83)
INFO : org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 1.0 with 1 tasks
INFO : org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 1.0 (TID 1, localhost, ANY, 2026 bytes)
INFO : org.apache.spark.executor.Executor - Running task 0.0 in stage 1.0 (TID 1)
INFO : org.apache.spark.streaming.kafka.KafkaRDD - Computing topic test11, partition 0 offsets 37 -> 38
INFO : kafka.utils.VerifiableProperties - Verifying properties
INFO : kafka.utils.VerifiableProperties - Property fetch.message.max.bytes is overridden to 1073741824
INFO : kafka.utils.VerifiableProperties - Property group.id is overridden to 
INFO : kafka.utils.VerifiableProperties - Property zookeeper.connect is overridden to localhost:2181
INFO : com.markmonitor.antifraud.ce.KafkaURLStreaming - #################  Input json stream data  ################# one test message without saveAs
INFO : org.apache.spark.executor.Executor - Finished task 0.0 in stage 1.0 (TID 1). 987 bytes result sent to driver
INFO : org.apache.spark.scheduler.DAGScheduler - ResultStage 1 (foreachRDD at KafkaURLStreaming.java:90) finished in 0.103 s
INFO : org.apache.spark.scheduler.DAGScheduler - Job 1 finished: foreachRDD at KafkaURLStreaming.java:90, took 0.151210 s
&&&&&&&&&&&&&&&&&&&&& AFTER COUNT OF ACCUMULATOR IS 1

1 Answer:

Answer 0 (score: 0)

There is no guarantee in a Spark transformation that the accumulator update will be applied only once. In your code the rdd.isEmpty() check and saveAsTextFile each trigger their own job (Job 0 and Job 1 in the log above), and because the mapped RDD is not cached, the map function, and with it accum.add(1), is executed once per job, so every record is counted twice.
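Not part of the original answer, but a sketch of one workaround based on the code in the question: caching the batch RDD inside foreachRDD lets the isEmpty() job and the saveAsTextFile job share a single computation of the map, so accum.add(1) fires only once per record. It still does not make the accumulator exactly-once under task retries (see the guide excerpt below).

// Sketch only -- assumes the same jssc, lines and accum as in the question.
// Caching the batch RDD means the map (and accum.add(1)) runs once, and both
// the isEmpty() job and the save job read the cached result.
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    public Void call(JavaRDD<String> rdd) throws Exception {
        rdd.cache();                     // keep the mapped data for both jobs
        if (!rdd.isEmpty()) {            // first job: computes and caches the batch
            rdd.saveAsTextFile(          // second job: reuses the cached batch
                "hdfs://quickstart.cloudera:8020/user/cloudera/testDirJan4/test1.text");
        }
        System.out.println("COUNT OF ACCUMULATOR IS " + accum.value());
        rdd.unpersist();                 // free the cached batch
        return null;
    }
});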

For a more in-depth answer, read the Spark programming guide:

http://spark.apache.org/docs/latest/programming-guide.html#accumulators-a-nameaccumlinka

An excerpt from the guide:

For accumulator updates performed inside actions only, Spark guarantees that each task's update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware that each task's update may be applied more than once if tasks or job stages are re-executed.
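A second sketch (my own illustration, not from the original answer): if the count has to be reliable, drop the accum.add(1) from the map and update the accumulator inside an action such as foreach, which is covered by the guarantee quoted above. This assumes the same jssc, lines and accum as in the question and adds one extra job per batch.

// Sketch only -- accum.add(1) has been removed from the map in this variant.
// foreach is an action, so each task's accumulator update is applied only once.
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    public Void call(JavaRDD<String> rdd) throws Exception {
        rdd.foreach(new VoidFunction<String>() {   // org.apache.spark.api.java.function.VoidFunction
            public void call(String line) {
                accum.add(1);                      // counted once per record
            }
        });
        System.out.println("COUNT OF ACCUMULATOR IS " + accum.value());
        return null;
    }
});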

Hope it helps.