Spark Streaming: are mapPartitions / map functions re-run?

Time: 2017-06-27 04:51:23

Tags: java apache-spark spark-streaming yarn

I built a YARN cluster (on Docker) to run a Java-based Spark Streaming application.

I don't understand why Spark re-runs the same mapPartitions function in my KafkaStreaming job.

Executor 1 finds the block locally.
Executor 2, however, fetches the data remotely twice.

Do I need to configure something so that Spark does not re-run the same function?

```

JavaPairInputDStream<String, String> inputDStream = KafkaUtils.createStream(jssc,String.class,String.class,StringDecoder.class,StringDecoder.class,kafkaConfig,topic,StorageLevel.MEMORY_ONLY_SER_2());
...
JavaDStream<Data> mapPartitions = inputDStream.mapPartitions(new FlatMapFunction<Iterator<Tuple2<String, String>>, Data>() {
                private static final long serialVersionUID = -640088436146512943L;

                @Override
                public Iterator<Data> call(Iterator<Tuple2<String, String>> t) throws Exception {
                    List<Data> result = new ArrayList<>();
                    Logger log = Logger.getLogger(this.getClass());
                    while (t.hasNext()) {
                        Tuple2<String,String> tuple = t.next();
                        log.info("key="+tuple._1());
                        Data d = new Data();
                        String[] arr =tuple._2().split(",");
                        d.setKey(tuple._1());
                        d.setUser(arr[0]);
                    ......// do something
                        result.add(d);
                    }
                    return result.iterator();

                }

            });

```

I found the same stderr log line printed on different executors, which means the function ran more than once.

Driver log; search for the block ID

"input-0-1498129808200"

19:10:08.004 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3057
19:10:08.004 INFO org.apache.spark.rdd.MapPartitionsRDD:54 Removing RDD 3056 from persistence list
19:10:08.004 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3056
19:10:08.004 INFO org.apache.spark.rdd.BlockRDD:54 Removing RDD 3055 from persistence list
19:10:08.004 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3055
19:10:08.004 INFO org.apache.spark.streaming.kafka.KafkaInputDStream:54 Removing blocks of RDD BlockRDD[3055] at createStream at KafkaStreaming.java:146 of time 1498129808000 ms
19:10:08.005 INFO org.apache.spark.streaming.scheduler.ReceivedBlockTracker:54 Deleting batches: 1498129807000 ms
19:10:08.005 INFO org.apache.spark.streaming.scheduler.InputInfoTracker:54 remove old batch metadata: 1498129807000 ms
19:10:08.403 INFO org.apache.spark.storage.BlockManagerInfo:54 Added input-0-1498129808200 in memory on slave2:50830 (size: 245.0 B, free: 983.0 MB)
19:10:08.414 INFO org.apache.spark.storage.BlockManagerInfo:54 Added input-0-1498129808200 in memory on slave1:41063 (size: 245.0 B, free: 983.1 MB)
19:10:08.501 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Added jobs for time 1498129808500 ms
19:10:08.501 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Starting job streaming job 1498129808500 ms.0 from job set of time 1498129808500 ms
19:10:08.502 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Finished job streaming job 1498129808500 ms.0 from job set of time 1498129808500 ms
19:10:08.502 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Starting job streaming job 1498129808500 ms.1 from job set of time 1498129808500 ms
19:10:08.502 DEBUG example.spark.streaming.KafkaStreaming$VoidFunctionImpl:201 run foreachMapPartitionsRDD[3062] at mapPartitions at KafkaStreaming.java:170
19:10:08.502 INFO org.apache.spark.scheduler.DAGScheduler:54 Got job 2044 (foreachPartitionAsync at KafkaStreaming.java:239) with 1 output partitions
19:10:08.503 INFO org.apache.spark.scheduler.DAGScheduler:54 Final stage: ResultStage 15 (foreachPartitionAsync at KafkaStreaming.java:239)
19:10:08.503 INFO org.apache.spark.scheduler.DAGScheduler:54 Parents of final stage: List()
19:10:08.503 INFO org.apache.spark.scheduler.DAGScheduler:54 Missing parents: List()
19:10:08.503 INFO org.apache.spark.scheduler.DAGScheduler:54 Submitting ResultStage 15 (MapPartitionsRDD[3063] at mapPartitions at KafkaStreaming.java:172), which has no missing parents
19:10:08.503 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Finished job streaming job 1498129808500 ms.1 from job set of time 1498129808500 ms
19:10:08.503 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Starting job streaming job 1498129808500 ms.2 from job set of time 1498129808500 ms
19:10:08.505 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_15 stored as values in memory (estimated size 3.4 KB, free 1105.8 MB)
19:10:08.506 INFO org.apache.spark.SparkContext:54 Starting job: print at KafkaStreaming.java:177
19:10:08.508 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_15_piece0 stored as bytes in memory (estimated size 2.2 KB, free 1105.8 MB)
19:10:08.508 INFO org.apache.spark.storage.BlockManagerInfo:54 Added broadcast_15_piece0 in memory on slave2:53474 (size: 2.2 KB, free: 1105.9 MB)
19:10:08.508 INFO org.apache.spark.SparkContext:54 Created broadcast 15 from broadcast at DAGScheduler.scala:996
19:10:08.509 INFO org.apache.spark.scheduler.DAGScheduler:54 Submitting 1 missing tasks from ResultStage 15 (MapPartitionsRDD[3063] at mapPartitions at KafkaStreaming.java:172)
19:10:08.509 INFO org.apache.spark.scheduler.cluster.YarnClusterScheduler:54 Adding task set 15.0 with 1 tasks
19:10:08.509 INFO org.apache.spark.scheduler.FairSchedulableBuilder:54 Added task set TaskSet_15.0 tasks to pool default
19:10:08.510 INFO org.apache.spark.scheduler.DAGScheduler:54 Got job 2045 (foreachPartitionAsync at KafkaStreaming.java:202) with 1 output partitions
19:10:08.510 INFO org.apache.spark.scheduler.DAGScheduler:54 Final stage: ResultStage 16 (foreachPartitionAsync at KafkaStreaming.java:202)
19:10:08.510 INFO org.apache.spark.scheduler.DAGScheduler:54 Parents of final stage: List()
19:10:08.510 INFO org.apache.spark.scheduler.TaskSetManager:54 Starting task 0.0 in stage 15.0 (TID 83, slave2, executor 1, partition 0, NODE_LOCAL, 6301 bytes)
19:10:08.510 INFO org.apache.spark.scheduler.DAGScheduler:54 Missing parents: List()
19:10:08.510 INFO org.apache.spark.scheduler.DAGScheduler:54 Submitting ResultStage 16 (MapPartitionsRDD[3062] at mapPartitions at KafkaStreaming.java:170), which has no missing parents
19:10:08.511 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_16 stored as values in memory (estimated size 3.0 KB, free 1105.8 MB)
19:10:08.514 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_16_piece0 stored as bytes in memory (estimated size 2.0 KB, free 1105.8 MB)
19:10:08.514 INFO org.apache.spark.storage.BlockManagerInfo:54 Added broadcast_16_piece0 in memory on slave2:53474 (size: 2.0 KB, free: 1105.9 MB)
19:10:08.515 INFO org.apache.spark.SparkContext:54 Created broadcast 16 from broadcast at DAGScheduler.scala:996
19:10:08.515 INFO org.apache.spark.scheduler.DAGScheduler:54 Submitting 1 missing tasks from ResultStage 16 (MapPartitionsRDD[3062] at mapPartitions at KafkaStreaming.java:170)
19:10:08.515 INFO org.apache.spark.scheduler.cluster.YarnClusterScheduler:54 Adding task set 16.0 with 1 tasks
19:10:08.515 INFO org.apache.spark.scheduler.FairSchedulableBuilder:54 Added task set TaskSet_16.0 tasks to pool default
19:10:08.516 INFO org.apache.spark.scheduler.DAGScheduler:54 Got job 2046 (print at KafkaStreaming.java:177) with 1 output partitions
19:10:08.516 INFO org.apache.spark.scheduler.DAGScheduler:54 Final stage: ResultStage 17 (print at KafkaStreaming.java:177)
19:10:08.517 INFO org.apache.spark.storage.BlockManagerInfo:54 Added broadcast_15_piece0 in memory on slave2:50830 (size: 2.2 KB, free: 983.0 MB)
19:10:08.516 INFO org.apache.spark.scheduler.TaskSetManager:54 Starting task 0.0 in stage 16.0 (TID 84, slave2, executor 2, partition 0, NODE_LOCAL, 6301 bytes)
19:10:08.517 INFO org.apache.spark.scheduler.DAGScheduler:54 Parents of final stage: List()
19:10:08.518 INFO org.apache.spark.scheduler.DAGScheduler:54 Missing parents: List()
19:10:08.518 INFO org.apache.spark.scheduler.DAGScheduler:54 Submitting ResultStage 17 (MapPartitionsRDD[3063] at mapPartitions at KafkaStreaming.java:172), which has no missing parents
19:10:08.519 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_17 stored as values in memory (estimated size 3.2 KB, free 1105.8 MB)
19:10:08.522 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_17_piece0 stored as bytes in memory (estimated size 2.1 KB, free 1105.8 MB)
19:10:08.522 INFO org.apache.spark.storage.BlockManagerInfo:54 Added broadcast_17_piece0 in memory on slave2:53474 (size: 2.1 KB, free: 1105.9 MB)
19:10:08.523 INFO org.apache.spark.SparkContext:54 Created broadcast 17 from broadcast at DAGScheduler.scala:996
19:10:08.523 INFO org.apache.spark.scheduler.DAGScheduler:54 Submitting 1 missing tasks from ResultStage 17 (MapPartitionsRDD[3063] at mapPartitions at KafkaStreaming.java:172)
19:10:08.523 INFO org.apache.spark.scheduler.cluster.YarnClusterScheduler:54 Adding task set 17.0 with 1 tasks
19:10:08.523 INFO org.apache.spark.scheduler.FairSchedulableBuilder:54 Added task set TaskSet_17.0 tasks to pool default
19:10:08.524 INFO org.apache.spark.scheduler.TaskSetManager:54 Starting task 0.0 in stage 17.0 (TID 85, slave2, executor 2, partition 0, NODE_LOCAL, 6904 bytes)
19:10:08.526 INFO org.apache.spark.storage.BlockManagerInfo:54 Added rdd_3061_0 in memory on slave2:50830 (size: 245.0 B, free: 983.0 MB)
19:10:08.528 INFO org.apache.spark.storage.BlockManagerInfo:54 Added broadcast_16_piece0 in memory on slave1:41063 (size: 2.0 KB, free: 983.1 MB)
19:10:08.534 INFO org.apache.spark.storage.BlockManagerInfo:54 Added broadcast_17_piece0 in memory on slave1:41063 (size: 2.1 KB, free: 983.1 MB)
19:10:08.547 INFO org.apache.spark.scheduler.TaskSetManager:54 Finished task 0.0 in stage 17.0 (TID 85) in 23 ms on slave2 (executor 2) (1/1)
19:10:08.547 INFO org.apache.spark.scheduler.cluster.YarnClusterScheduler:54 Removed TaskSet 17.0, whose tasks have all completed, from pool default
19:10:08.547 INFO org.apache.spark.scheduler.DAGScheduler:54 ResultStage 17 (print at KafkaStreaming.java:177) finished in 0.023 s
19:10:08.547 INFO org.apache.spark.scheduler.DAGScheduler:54 Job 2046 finished: print at KafkaStreaming.java:177, took 0.041731 s
19:10:08.548 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Finished job streaming job 1498129808500 ms.2 from job set of time 1498129808500 ms
19:10:08.548 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Total delay: 0.048 s for time 1498129808500 ms (execution: 0.047 s)
19:10:08.548 INFO org.apache.spark.rdd.MapPartitionsRDD:54 Removing RDD 3060 from persistence list
19:10:08.548 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3060
19:10:08.549 INFO org.apache.spark.rdd.MapPartitionsRDD:54 Removing RDD 3059 from persistence list
19:10:08.549 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3059
19:10:08.549 INFO org.apache.spark.rdd.BlockRDD:54 Removing RDD 3058 from persistence list
19:10:08.549 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3058
19:10:08.549 INFO org.apache.spark.streaming.kafka.KafkaInputDStream:54 Removing blocks of RDD BlockRDD[3058] at createStream at KafkaStreaming.java:146 of time 1498129808500 ms
19:10:08.550 INFO org.apache.spark.streaming.scheduler.ReceivedBlockTracker:54 Deleting batches: 1498129807500 ms
19:10:08.550 INFO org.apache.spark.streaming.scheduler.InputInfoTracker:54 remove old batch metadata: 1498129807500 ms
19:10:08.588 INFO org.apache.spark.scheduler.TaskSetManager:54 Finished task 0.0 in stage 16.0 (TID 84) in 72 ms on slave1 (executor 2) (1/1)
19:10:08.588 INFO org.apache.spark.scheduler.cluster.YarnClusterScheduler:54 Removed TaskSet 16.0, whose tasks have all completed, from pool default
19:10:08.588 INFO org.apache.spark.scheduler.DAGScheduler:54 ResultStage 16 (foreachPartitionAsync at KafkaStreaming.java:202) finished in 0.073 s
19:10:08.620 INFO org.apache.spark.scheduler.TaskSetManager:54 Finished task 0.0 in stage 15.0 (TID 83) in 110 ms on slave2 (executor 1) (1/1)
19:10:08.620 INFO org.apache.spark.scheduler.cluster.YarnClusterScheduler:54 Removed TaskSet 15.0, whose tasks have all completed, from pool default
19:10:08.620 INFO org.apache.spark.scheduler.DAGScheduler:54 ResultStage 15 (foreachPartitionAsync at KafkaStreaming.java:239) finished in 0.111 s
19:10:09.002 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Added jobs for time 1498129809000 ms
19:10:09.002 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Starting job streaming job 1498129809000 ms.0 from job set of time 1498129809000 ms
19:10:09.003 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Finished job streaming job 1498129809000 ms.0 from job set of time 1498129809000 ms
19:10:09.003 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Starting job streaming job 1498129809000 ms.1 from job set of time 1498129809000 ms
19:10:09.003 DEBUG example.spark.streaming.KafkaStreaming$VoidFunctionImpl:201 run foreachMapPartitionsRDD[3065] at mapPartitions at KafkaStreaming.java:170
19:10:09.003 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Finished job streaming job 1498129809000 ms.1 from job set of time 1498129809000 ms
19:10:09.004 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Starting job streaming job 1498129809000 ms.2 from job set of time 1498129809000 ms
19:10:09.004 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Finished job streaming job 1498129809000 ms.2 from job set of time 1498129809000 ms
19:10:09.004 INFO org.apache.spark.rdd.MapPartitionsRDD:54 Removing RDD 3063 from persistence list
19:10:09.004 INFO org.apache.spark.streaming.scheduler.JobScheduler:54 Total delay: 0.004 s for time 1498129809000 ms (execution: 0.002 s)
19:10:09.004 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3063
19:10:09.004 INFO org.apache.spark.rdd.MapPartitionsRDD:54 Removing RDD 3062 from persistence list
19:10:09.004 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3062
19:10:09.004 INFO org.apache.spark.rdd.BlockRDD:54 Removing RDD 3061 from persistence list
19:10:09.005 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3061
19:10:09.005 INFO org.apache.spark.streaming.kafka.KafkaInputDStream:54 Removing blocks of RDD BlockRDD[3061] at createStream at KafkaStreaming.java:146 of time 1498129809000 ms
19:10:09.005 INFO org.apache.spark.streaming.scheduler.ReceivedBlockTracker:54 Deleting batches: 1498129808000 ms
19:10:09.005 INFO org.apache.spark.streaming.scheduler.InputInfoTracker:54 remove old batch metadata: 1498129808000 ms
19:10:09.006 INFO org.apache.spark.storage.BlockManagerInfo:54 Removed input-0-1498129808200 on slave2:50830 in memory (size: 245.0 B, free: 983.0 MB)
19:10:09.006 INFO org.apache.spark.storage.BlockManagerInfo:54 Removed input-0-1498129808200 on slave1:41063 in memory (size: 245.0 B, free: 983.1 MB)
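
To see how many Spark jobs the driver submitted for a batch, the `Got job` lines from `DAGScheduler` can be pulled out of the log. Below is a minimal sketch in plain Java; the class name `JobLines`, the method `extractJobs`, and the regular expression are assumptions written for this log format only, not part of Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Spark: lists the jobs announced in a driver log.
final class JobLines {
    // Matches DAGScheduler lines of the form
    // "... Got job <id> (<action> at <file>:<line>) with N output partitions"
    private static final Pattern GOT_JOB =
            Pattern.compile("Got job (\\d+) \\((.+?) at (\\S+)\\)");

    /** Returns one "jobId action callSite" string per matching log line. */
    static List<String> extractJobs(List<String> logLines) {
        List<String> jobs = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = GOT_JOB.matcher(line);
            if (m.find()) {
                jobs.add(m.group(1) + " " + m.group(2) + " " + m.group(3));
            }
        }
        return jobs;
    }

    public static void main(String[] args) {
        // Three lines copied from the driver log above.
        List<String> sample = Arrays.asList(
            "19:10:08.502 INFO org.apache.spark.scheduler.DAGScheduler:54 Got job 2044 (foreachPartitionAsync at KafkaStreaming.java:239) with 1 output partitions",
            "19:10:08.510 INFO org.apache.spark.scheduler.DAGScheduler:54 Got job 2045 (foreachPartitionAsync at KafkaStreaming.java:202) with 1 output partitions",
            "19:10:08.516 INFO org.apache.spark.scheduler.DAGScheduler:54 Got job 2046 (print at KafkaStreaming.java:177) with 1 output partitions");
        for (String job : extractJobs(sample)) {
            System.out.println(job);
        }
    }
}
```

Run against the driver log above, this lists three jobs for batch 1498129808500 ms: two `foreachPartitionAsync` calls and one `print`.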

Executor 1 log; keywords:

Found block input-0-1498129808200 locally
key=170622191007573

19:10:07.506 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3053
19:10:07.506 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3052
19:10:08.005 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3057
19:10:08.006 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3056
19:10:08.006 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3055
19:10:08.402 INFO org.apache.spark.storage.memory.MemoryStore:54 Block input-0-1498129808200 stored as bytes in memory (estimated size 245.0 B, free 983.0 MB)
19:10:08.416 INFO org.apache.spark.streaming.receiver.BlockGenerator:54 Pushed block input-0-1498129808200
19:10:08.511 INFO org.apache.spark.executor.CoarseGrainedExecutorBackend:54 Got assigned task 83
19:10:08.512 INFO org.apache.spark.executor.Executor:54 Running task 0.0 in stage 15.0 (TID 83)
19:10:08.513 INFO org.apache.spark.broadcast.TorrentBroadcast:54 Started reading broadcast variable 15
19:10:08.515 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_15_piece0 stored as bytes in memory (estimated size 2.2 KB, free 983.0 MB)
19:10:08.518 INFO org.apache.spark.broadcast.TorrentBroadcast:54 Reading broadcast variable 15 took 5 ms
19:10:08.520 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_15 stored as values in memory (estimated size 3.4 KB, free 983.0 MB)
19:10:08.523 INFO org.apache.spark.storage.BlockManager:54 Found block input-0-1498129808200 locally
19:10:08.525 INFO org.apache.spark.storage.memory.MemoryStore:54 Block rdd_3061_0 stored as bytes in memory (estimated size 245.0 B, free 983.0 MB)
19:10:08.528 DEBUG example.spark.streaming.KafkaStreaming$FlatMapFunctionImpl:284 key=170622191007573
19:10:08.549 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3060
19:10:08.549 INFO org.apache.spark.streaming.receiver.ReceiverSupervisorImpl:54 Received a new rate limit: 100.
19:10:08.549 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3059
19:10:08.549 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3058

Executor 2 log; the keyword is printed twice:

Found block rdd_3061_0 remotely
key=170622191007573

19:10:07.506 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3053
19:10:07.506 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3052
19:10:08.005 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3057
19:10:08.006 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3056
19:10:08.006 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3055
19:10:08.407 INFO org.apache.spark.storage.memory.MemoryStore:54 Block input-0-1498129808200 stored as bytes in memory (estimated size 245.0 B, free 983.1 MB)
19:10:08.519 INFO org.apache.spark.executor.CoarseGrainedExecutorBackend:54 Got assigned task 84
19:10:08.520 INFO org.apache.spark.executor.Executor:54 Running task 0.0 in stage 16.0 (TID 84)
19:10:08.521 INFO org.apache.spark.broadcast.TorrentBroadcast:54 Started reading broadcast variable 16
19:10:08.526 INFO org.apache.spark.executor.CoarseGrainedExecutorBackend:54 Got assigned task 85
19:10:08.527 INFO org.apache.spark.executor.Executor:54 Running task 0.0 in stage 17.0 (TID 85)
19:10:08.528 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_16_piece0 stored as bytes in memory (estimated size 2.0 KB, free 983.1 MB)
19:10:08.530 INFO org.apache.spark.broadcast.TorrentBroadcast:54 Reading broadcast variable 16 took 9 ms
19:10:08.532 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_16 stored as values in memory (estimated size 3.0 KB, free 983.1 MB)
19:10:08.532 INFO org.apache.spark.broadcast.TorrentBroadcast:54 Started reading broadcast variable 17
19:10:08.534 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_17_piece0 stored as bytes in memory (estimated size 2.1 KB, free 983.1 MB)
19:10:08.536 INFO org.apache.spark.broadcast.TorrentBroadcast:54 Reading broadcast variable 17 took 4 ms
19:10:08.536 INFO org.apache.spark.storage.BlockManager:54 Found block rdd_3061_0 remotely
19:10:08.536 DEBUG example.spark.streaming.KafkaStreaming$FlatMapFunctionImpl:284 key=170622191007573
19:10:08.537 INFO org.apache.spark.storage.memory.MemoryStore:54 Block broadcast_17 stored as values in memory (estimated size 3.2 KB, free 983.1 MB)
19:10:08.541 INFO org.apache.spark.storage.BlockManager:54 Found block rdd_3061_0 remotely
19:10:08.542 DEBUG example.spark.streaming.KafkaStreaming$FlatMapFunctionImpl:284 key=170622191007573
19:10:08.547 INFO org.apache.spark.executor.Executor:54 Finished task 0.0 in stage 17.0 (TID 85). 1963 bytes result sent to driver
19:10:08.550 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3060
19:10:08.550 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3059
19:10:08.552 INFO org.apache.spark.storage.BlockManager:54 Removing RDD 3058

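1 Answer:

Answer 0: (score: 0)

Because you are reading data from Kafka, Spark keeps listening to the stream. So whenever data is read from Kafka, the job that processes the stream runs again. Hope this answer helps.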
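
The driver log in the question shows three jobs for the same batch (two `foreachPartitionAsync` and one `print`), and `key=170622191007573` is logged three times in total: once on executor 1 and twice on executor 2. One reading consistent with that, though not confirmed anywhere in this thread, is that each output operation evaluates the uncached `mapPartitions` step on its own. The sketch below models only that idea in plain Java, with no Spark API; `LazyRecomputeSketch`, `lazyMap`, `cached`, and `countCalls` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

// Plain-Java model of lazy recomputation; this is an analogy, not Spark code.
final class LazyRecomputeSketch {
    /** Stand-in for an uncached transformation: f runs again on every get(). */
    static <A, B> Supplier<List<B>> lazyMap(List<A> input, Function<A, B> f) {
        return () -> {
            List<B> out = new ArrayList<>();
            for (A a : input) {
                out.add(f.apply(a));
            }
            return out;
        };
    }

    /** Stand-in for caching: compute on first use, then reuse the stored result. */
    static <T> Supplier<List<T>> cached(Supplier<List<T>> source) {
        return new Supplier<List<T>>() {
            private List<T> value;

            @Override
            public synchronized List<T> get() {
                if (value == null) {
                    value = source.get();
                }
                return value;
            }
        };
    }

    /** Runs outputOps consumers and returns how many times the map function was called. */
    static int countCalls(boolean useCache, int outputOps) {
        AtomicInteger calls = new AtomicInteger();
        Function<String, String> f = key -> {
            calls.incrementAndGet();
            return "key=" + key;
        };
        Supplier<List<String>> mapped = lazyMap(Arrays.asList("170622191007573"), f);
        if (useCache) {
            mapped = cached(mapped);
        }
        for (int i = 0; i < outputOps; i++) {
            mapped.get(); // each get() models one output operation on the stream
        }
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("uncached, 3 output ops: " + countCalls(false, 3));
        System.out.println("cached, 3 output ops: " + countCalls(true, 3));
    }
}
```

In Spark Streaming the closest counterpart to `cached` here would be calling `cache()` or `persist()` on the `JavaDStream` returned by `mapPartitions` before attaching several output operations. Whether that removes the repeated calls in this particular job has not been tested here.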