A single long-running task in each executor

Time: 2018-01-27 09:33:11

Tags: apache-spark

Sorry if this question seems invalid; I have tried to find general guidance on debugging task processing times but haven't found any yet. I suspect my problem is a known issue, so any help debugging it or understanding it (links to relevant discussions or blog posts) would answer my question.

I have built several Spark Streaming jobs, and almost all of them run into the same problem: one task in each executor takes much longer than all the others:

[Image: task time distribution]

But the input sizes of the tasks do not differ that much: [Image: part of the task details]

My workflow is: a direct Kafka stream source is flat-mapped per partition (mapParitionsWithPair(flatMap)) over forty partitions, which can generate several objects per incoming event; the results are then reduced (reduceByKey) and the aggregated values are saved to a DB:

[Image]

The task timeline figures above are for the reduce stage.
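
For reference, here is a rough sketch of that pipeline in Kotlin against the Spark Java API, assuming mapParitionsWithPair refers to Spark's mapPartitionsToPair. The batch interval, topic name, Kafka parameters, and the explode()/saveToDb() helpers are hypothetical stand-ins for the real job code, not the actual implementation:

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.Durations
    import org.apache.spark.streaming.api.java.JavaPairDStream
    import org.apache.spark.streaming.api.java.JavaStreamingContext
    import org.apache.spark.streaming.kafka010.ConsumerStrategies
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies
    import scala.Tuple2

    // Hypothetical helpers standing in for the real business logic.
    fun explode(event: String): List<String> = event.split(",")        // one event -> several objects
    fun saveToDb(rows: Iterator<Tuple2<String, Long>>) { /* write aggregated values to the DB */ }

    fun main() {
        val conf = SparkConf().setAppName("streaming-job")
        val ssc = JavaStreamingContext(conf, Durations.seconds(10))    // interval is an assumption

        val kafkaParams = mapOf<String, Any>(
            "bootstrap.servers" to "kafka:9092",
            "key.deserializer" to StringDeserializer::class.java,
            "value.deserializer" to StringDeserializer::class.java,
            "group.id" to "streaming-job"
        )

        // The topic has forty partitions, so this stage runs forty tasks per batch.
        val stream = KafkaUtils.createDirectStream(
            ssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.Subscribe<String, String>(listOf("events"), kafkaParams)
        )

        // Flat-map each partition: one incoming event can produce several (key, count) pairs.
        val pairs: JavaPairDStream<String, Long> = stream.mapPartitionsToPair { records: Iterator<ConsumerRecord<String, String>> ->
            records.asSequence()
                .flatMap { explode(it.value()).asSequence() }
                .map { key -> Tuple2(key, 1L) }
                .iterator()
        }

        // The reduce stage below is where one task per executor runs much longer than the rest.
        val aggregated = pairs.reduceByKey { a, b -> a + b }

        aggregated.foreachRDD { rdd ->
            rdd.foreachPartition { rows -> saveToDb(rows) }
        }

        ssc.start()
        ssc.awaitTermination()
    }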

This is a cluster running on Apache Mesos with two nodes, each with two cores, and the second stage of every job shows this uneven distribution of task processing times.

Update

  • I replaced reduceByKey with a Java reduce operation (actually Kotlin Sequence operations), but the same problem still occurs.
  • After re-running the jobs I realized that the problem really hurts with larger input: 160K events are processed in 1.8 to 4.8 minutes (at worst about 580 events per second). Some tasks still take longer than others, but the overall effect is far more harmful than with the smaller input, whose processing rate stays between 660 and 54 events per second. In both cases the long-running tasks take about the same time (~41 seconds).
  • The problem persists even after increasing the RAM; the executors now have 30% free memory.

Update

I changed the workflow to avoid shuffling data by using a Java 8 Stream reduce inside each partition. This is the DAG of the changed job:

[Image: changed job DAG, with no data shuffle]
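
A minimal sketch of this shuffle-free variant, assuming the same hypothetical stream, explode() and saveToDb() helpers as in the first sketch. The question describes a Java 8 Stream reduce; this sketch uses the equivalent Kotlin sequence operations, but the idea is the same: grouping and counting happen entirely inside each partition, so there is no reduceByKey and therefore no shuffle:

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.spark.streaming.api.java.JavaDStream
    import org.apache.spark.streaming.api.java.JavaInputDStream
    import scala.Tuple2

    // Aggregate each partition locally and emit (key, count) pairs that can be
    // written to the DB from the same partition, so no data leaves the executor.
    fun aggregateWithoutShuffle(
        stream: JavaInputDStream<ConsumerRecord<String, String>>
    ): JavaDStream<Tuple2<String, Long>> =
        stream.mapPartitions { records: Iterator<ConsumerRecord<String, String>> ->
            records.asSequence()
                .flatMap { explode(it.value()).asSequence() }   // same hypothetical fan-out as before
                .groupingBy { it }
                .eachCount()                                    // partition-local counts
                .map { (key, count) -> Tuple2(key, count.toLong()) }
                .iterator()
        }

    // Usage: aggregateWithoutShuffle(stream).foreachRDD { rdd ->
    //     rdd.foreachPartition { rows -> saveToDb(rows) }
    // }

The trade-off is that the counts are now only partition-local; if globally aggregated values are needed, they have to be merged on the database side instead of in Spark.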

I increased the batch interval to 20 seconds and added more nodes; now there is no longer a single slow task, but rather a mix of slower and faster tasks. However:

  1. It now finishes in less time than the previous version's interval, and is faster overall.
  2. I would expect CPU usage to always be high, especially for the work inside mapPartitions, but that is not always the case.
  3. [Image: task flow for the changed job, which uses a Stream reduce operation in each partition]

    Just by adding some logging around the actual work done in each partition (a minimal sketch of this wrapper appears after the logs below), I can see something strange: sometimes a task is slow and sometimes it is fast. When a task runs slowly, the CPU is idle, and I don't see it blocked on any network or I/O. Memory usage is constant at 50%. Here are the executor logs:

    started processing partitioned input: thread 99
    started processing partitioned input: thread 98
    finished processing partitioned input: thread 99 took 40615ms
    finished processing partitioned input: thread 98 took 40469ms
    started processing partitioned input: thread 98
    started processing partitioned input: thread 99
    finished processing partitioned input: thread 98 took 40476ms
    finished processing partitioned input: thread 99 took 40523ms
    started processing partitioned input: thread 98
    started processing partitioned input: thread 99
    finished processing partitioned input: thread 98 40465ms
    finished processing partitioned input: thread 99 40379ms
    started processing partitioned input: thread 98
    finished processing partitioned input: thread 98 468
    started processing partitioned input: thread 99
    finished processing partitioned input: thread 99 525
    started processing partitioned input: thread 99
    started processing partitioned input: thread 98
    finished processing partitioned input: thread 98 738
    finished processing partitioned input: thread 99 790
    started processing partitioned input: thread 98
    finished processing partitioned input: thread 98 took 558
    started processing partitioned input: thread 99
    finished processing partitioned input: thread 99 took 461
    started processing partitioned input: thread 98
    finished processing partitioned input: thread 98 took 483
    started processing partitioned input: thread 99
    finished processing partitioned input: thread 99 took 513
    started processing partitioned input: thread 98
    finished processing partitioned input: thread 98 took 485
    started processing partitioned input: thread 99
    finished processing partitioned input: thread 99 took 454
    

    The log above only covers mapping the incoming input to objects to be saved in Cassandra; it does not include the time spent saving to Cassandra. Here is the log for the save operation, which is always fast and never leaves the CPU idle:

    18/02/07 07:41:47 INFO Executor: Running task 17.0 in stage 5.0 (TID 207)
    18/02/07 07:41:47 INFO TorrentBroadcast: Started reading broadcast variable 5
    18/02/07 07:41:47 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 7.8 KB, free 1177.1 MB)
    18/02/07 07:41:47 INFO TorrentBroadcast: Reading broadcast variable 5 took 33 ms
    18/02/07 07:41:47 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 16.4 KB, free 1177.1 MB)
    18/02/07 07:41:47 INFO BlockManager: Found block rdd_30_2 locally
    18/02/07 07:41:47 INFO BlockManager: Found block rdd_30_17 locally
    18/02/07 07:42:02 INFO TableWriter: Wrote 28926 rows to keyspace.table in 15.749 s.
    18/02/07 07:42:02 INFO Executor: Finished task 17.0 in stage 5.0 (TID 207). 923 bytes result sent to driver
    18/02/07 07:42:02 INFO CoarseGrainedExecutorBackend: Got assigned task 209
    18/02/07 07:42:02 INFO Executor: Running task 18.0 in stage 5.0 (TID 209)
    18/02/07 07:42:02 INFO BlockManager: Found block rdd_30_18 locally
    18/02/07 07:42:03 INFO TableWriter: Wrote 29288 rows to keyspace.table in 16.042 s.
    18/02/07 07:42:03 INFO Executor: Finished task 2.0 in stage 5.0 (TID 203). 1713 bytes result sent to driver
    18/02/07 07:42:03 INFO CoarseGrainedExecutorBackend: Got assigned task 211
    18/02/07 07:42:03 INFO Executor: Running task 21.0 in stage 5.0 (TID 211)
    18/02/07 07:42:03 INFO BlockManager: Found block rdd_30_21 locally
    18/02/07 07:42:19 INFO TableWriter: Wrote 29315 rows to keyspace.table in 16.308 s.
    18/02/07 07:42:19 INFO Executor: Finished task 21.0 in stage 5.0 (TID 211). 923 bytes result sent to driver
    18/02/07 07:42:19 INFO CoarseGrainedExecutorBackend: Got assigned task 217
    18/02/07 07:42:19 INFO Executor: Running task 24.0 in stage 5.0 (TID 217)
    18/02/07 07:42:19 INFO BlockManager: Found block rdd_30_24 locally
    18/02/07 07:42:19 INFO TableWriter: Wrote 29422 rows to keyspace.table in 16.783 s.
    18/02/07 07:42:19 INFO Executor: Finished task 18.0 in stage 5.0 (TID 209). 923 bytes result sent to driver
    18/02/07 07:42:19 INFO CoarseGrainedExecutorBackend: Got assigned task 218
    18/02/07 07:42:19 INFO Executor: Running task 25.0 in stage 5.0 (TID 218)
    18/02/07 07:42:19 INFO BlockManager: Found block rdd_30_25 locally
    18/02/07 07:42:35 INFO TableWriter: Wrote 29427 rows to keyspace.table in 16.509 s.
    18/02/07 07:42:35 INFO Executor: Finished task 24.0 in stage 5.0 (TID 217). 923 bytes result sent to driver
    18/02/07 07:42:35 INFO CoarseGrainedExecutorBackend: Got assigned task 225
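
    For reference, the "started/finished processing partitioned input" lines above come from a simple timing wrapper around the per-partition work, roughly like the sketch below. The logger name and the process() body are hypothetical; the important detail is that process() materializes its output (here into a List), so the measured time covers the actual work rather than just the construction of a lazy iterator:

        import org.apache.kafka.clients.consumer.ConsumerRecord
        import org.apache.spark.streaming.api.java.JavaDStream
        import org.apache.spark.streaming.api.java.JavaInputDStream
        import org.slf4j.LoggerFactory

        // Stand-in for the real per-partition mapping; it materializes the result.
        fun process(records: Iterator<ConsumerRecord<String, String>>): List<String> =
            records.asSequence().map { it.value() }.toList()

        fun timedMapPartitions(stream: JavaInputDStream<ConsumerRecord<String, String>>): JavaDStream<String> =
            stream.mapPartitions { records: Iterator<ConsumerRecord<String, String>> ->
                val log = LoggerFactory.getLogger("partition-timing")
                val thread = Thread.currentThread().id
                log.info("started processing partitioned input: thread $thread")
                val start = System.currentTimeMillis()
                val result = process(records)                   // the actual per-partition work
                log.info("finished processing partitioned input: thread $thread took ${System.currentTimeMillis() - start}ms")
                result.iterator()
            }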
    

0 Answers:

There are no answers yet.