Simple RDD job takes a long time to complete

Asked: 2016-04-22 11:03:29

Tags: windows hadoop apache-spark pyspark jupyter-notebook

Question

The following simple reduce job takes about 8 seconds to run on my machine with Spark in local mode:

import pyspark
sc = pyspark.SparkContext('local[*]')
rdd3 = sc.parallelize(range(10))

import time
s = time.time()
rdd3.reduce(lambda x, y: x + y)
print(time.time() - s)

> 8.376211166381836

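If I run the same job on a dedicated server, it completes in a few milliseconds as expected, but on my machine it takes about 8 seconds. Spark seems to lose about 5 seconds on every parallel task, no matter how small the task is. For larger files split across many tasks, the process takes minutes (if not hours) when it should take only a few seconds. Does anyone know what causes this delay?

Note that the delay is not specific to the reduce job: it shows up in most operations on an RDD that result in multiple parallel tasks (map, count, etc.).

Update

Starting pyspark with pyspark --master local[2] gives a result similar to the one above: 2.8 seconds. This time Spark splits the job into 2 tasks, so the delay per task is of the same order of magnitude.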

Update 2:

I also tried the Scala shell (spark-shell --master local[2]).

 val df = sc.parallelize(1 to 10)
 df.reduce((x,y) => (x+y))

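This time it completes as it should: 0.025909 seconds. The job was also split into two tasks. So is the problem related to pyspark?

Extra debug logs

Python Jupyter Notebook log for the execution above: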

[I 12:38:47.067 NotebookApp] Serving notebooks from local directory: [SNIP]
[I 12:38:47.067 NotebookApp] 0 active kernels 
[I 12:38:47.067 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/
[I 12:38:47.067 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 12:38:54.649 NotebookApp] 404 GET /undefined (127.0.0.1) 12.01ms referer=None
[I 12:38:54.666 NotebookApp] Creating new notebook in 
[I 12:38:55.491 NotebookApp] Kernel started: 971c2090-fd4d-40ba-9348-36823b793126
16/04/22 12:38:58 INFO spark.SparkContext: Running Spark version 1.6.1
16/04/22 12:38:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/22 12:38:58 INFO spark.SecurityManager: Changing view acls to: [SNIP]
16/04/22 12:38:58 INFO spark.SecurityManager: Changing modify acls to: [SNIP]
16/04/22 12:38:58 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set([SNIP]); users with modify permissions: Set([SNIP])
16/04/22 12:38:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 54021.
16/04/22 12:38:59 INFO slf4j.Slf4jLogger: Slf4jLogger started
16/04/22 12:38:59 INFO Remoting: Starting remoting
16/04/22 12:38:59 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@[SNIP]:54034]
16/04/22 12:38:59 INFO util.Utils: Successfully started service 'sparkDriverActorSystem' on port 54034.
16/04/22 12:38:59 INFO spark.SparkEnv: Registering MapOutputTracker
16/04/22 12:38:59 INFO spark.SparkEnv: Registering BlockManagerMaster
16/04/22 12:38:59 INFO storage.DiskBlockManager: Created local directory at [SNIP]\Local\Temp\blockmgr-70b1b49f-f740-429a-b5f0-6fac91f44dd6
16/04/22 12:38:59 INFO storage.MemoryStore: MemoryStore started with capacity 511.1 MB
16/04/22 12:39:00 INFO spark.SparkEnv: Registering OutputCommitCoordinator
16/04/22 12:39:00 INFO server.Server: jetty-8.y.z-SNAPSHOT
16/04/22 12:39:00 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/04/22 12:39:00 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
16/04/22 12:39:00 INFO ui.SparkUI: Started SparkUI at http://[SNIP]:4040
16/04/22 12:39:00 INFO executor.Executor: Starting executor ID driver on host localhost
16/04/22 12:39:00 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54071.
16/04/22 12:39:00 INFO netty.NettyBlockTransferService: Server created on 54071
16/04/22 12:39:00 INFO storage.BlockManagerMaster: Trying to register BlockManager
16/04/22 12:39:00 INFO storage.BlockManagerMasterEndpoint: Registering block manager localhost:54071 with 511.1 MB RAM, BlockManagerId(driver, localhost, 54071)
16/04/22 12:39:00 INFO storage.BlockManagerMaster: Registered BlockManager
16/04/22 12:39:00 INFO spark.SparkContext: Starting job: reduce at <ipython-input-1-1383e3e82ca4>:8
16/04/22 12:39:00 INFO scheduler.DAGScheduler: Got job 0 (reduce at <ipython-input-1-1383e3e82ca4>:8) with 8 output partitions
16/04/22 12:39:00 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at <ipython-input-1-1383e3e82ca4>:8)
16/04/22 12:39:01 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/04/22 12:39:01 INFO scheduler.DAGScheduler: Missing parents: List()
16/04/22 12:39:01 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at reduce at <ipython-input-1-1383e3e82ca4>:8), which has no missing parents
16/04/22 12:39:01 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.3 KB, free 4.3 KB)
16/04/22 12:39:01 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.9 KB, free 7.2 KB)
16/04/22 12:39:01 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54071 (size: 2.9 KB, free: 511.1 MB)
16/04/22 12:39:01 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
16/04/22 12:39:01 INFO scheduler.DAGScheduler: Submitting 8 missing tasks from ResultStage 0 (PythonRDD[1] at reduce at <ipython-input-1-1383e3e82ca4>:8)
16/04/22 12:39:01 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
16/04/22 12:39:01 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2064 bytes)
16/04/22 12:39:01 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, partition 1,PROCESS_LOCAL, 2064 bytes)
16/04/22 12:39:01 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, partition 2,PROCESS_LOCAL, 2064 bytes)
16/04/22 12:39:01 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, partition 3,PROCESS_LOCAL, 2064 bytes)
16/04/22 12:39:01 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, localhost, partition 4,PROCESS_LOCAL, 2064 bytes)
16/04/22 12:39:01 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, localhost, partition 5,PROCESS_LOCAL, 2064 bytes)
16/04/22 12:39:01 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, localhost, partition 6,PROCESS_LOCAL, 2064 bytes)
16/04/22 12:39:01 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, localhost, partition 7,PROCESS_LOCAL, 2064 bytes)
16/04/22 12:39:01 INFO executor.Executor: Running task 6.0 in stage 0.0 (TID 6)
16/04/22 12:39:01 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
16/04/22 12:39:01 INFO executor.Executor: Running task 7.0 in stage 0.0 (TID 7)
16/04/22 12:39:01 INFO executor.Executor: Running task 4.0 in stage 0.0 (TID 4)
16/04/22 12:39:01 INFO executor.Executor: Running task 2.0 in stage 0.0 (TID 2)
16/04/22 12:39:01 INFO executor.Executor: Running task 3.0 in stage 0.0 (TID 3)
16/04/22 12:39:01 INFO executor.Executor: Running task 5.0 in stage 0.0 (TID 5)
16/04/22 12:39:01 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/04/22 12:39:02 INFO python.PythonRunner: Times: total = 860, boot = 854, init = 6, finish = 0
16/04/22 12:39:02 INFO executor.Executor: Finished task 3.0 in stage 0.0 (TID 3). 995 bytes result sent to driver
16/04/22 12:39:02 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 948 ms on localhost (1/8)
16/04/22 12:39:03 INFO python.PythonRunner: Times: total = 1757, boot = 1755, init = 2, finish = 0
16/04/22 12:39:03 INFO executor.Executor: Finished task 4.0 in stage 0.0 (TID 4). 995 bytes result sent to driver
16/04/22 12:39:03 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 1832 ms on localhost (2/8)
16/04/22 12:39:03 INFO python.PythonRunner: Times: total = 2615, boot = 2614, init = 1, finish = 0
16/04/22 12:39:03 INFO executor.Executor: Finished task 7.0 in stage 0.0 (TID 7). 995 bytes result sent to driver
16/04/22 12:39:03 INFO scheduler.TaskSetManager: Finished task 7.0 in stage 0.0 (TID 7) in 2687 ms on localhost (3/8)
16/04/22 12:39:04 INFO python.PythonRunner: Times: total = 3481, boot = 3479, init = 2, finish = 0
16/04/22 12:39:04 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 995 bytes result sent to driver
16/04/22 12:39:04 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3557 ms on localhost (4/8)
16/04/22 12:39:05 INFO python.PythonRunner: Times: total = 4375, boot = 4372, init = 2, finish = 1
16/04/22 12:39:05 INFO executor.Executor: Finished task 2.0 in stage 0.0 (TID 2). 995 bytes result sent to driver
16/04/22 12:39:05 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 4451 ms on localhost (5/8)
16/04/22 12:39:06 INFO python.PythonRunner: Times: total = 5237, boot = 5235, init = 2, finish = 0
16/04/22 12:39:06 INFO executor.Executor: Finished task 6.0 in stage 0.0 (TID 6). 995 bytes result sent to driver
16/04/22 12:39:06 INFO scheduler.TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 5308 ms on localhost (6/8)
16/04/22 12:39:07 INFO python.PythonRunner: Times: total = 6072, boot = 6070, init = 2, finish = 0
16/04/22 12:39:07 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 995 bytes result sent to driver
16/04/22 12:39:07 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 6167 ms on localhost (7/8)
16/04/22 12:39:08 INFO python.PythonRunner: Times: total = 6949, boot = 6947, init = 2, finish = 0
16/04/22 12:39:08 INFO executor.Executor: Finished task 5.0 in stage 0.0 (TID 5). 995 bytes result sent to driver
16/04/22 12:39:08 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 7015 ms on localhost (8/8)
16/04/22 12:39:08 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at <ipython-input-1-1383e3e82ca4>:8) finished in 7,043 s
16/04/22 12:39:08 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
16/04/22 12:39:08 INFO scheduler.DAGScheduler: Job 0 finished: reduce at <ipython-input-1-1383e3e82ca4>:8, took 7,252357 s
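
To make the pattern in the log above easier to see, here is a minimal sketch that parses the eight python.PythonRunner lines and compares the "boot" component with the total time per task. The helper name parse_times and the variable names are hypothetical, chosen only for this illustration. As far as I understand, "boot" is the time until the Python worker process for a task has started, so treat the interpretation as an assumption rather than a diagnosis.

```python
import re

# The eight PythonRunner lines, copied verbatim from the log above.
LOG_LINES = """\
16/04/22 12:39:02 INFO python.PythonRunner: Times: total = 860, boot = 854, init = 6, finish = 0
16/04/22 12:39:03 INFO python.PythonRunner: Times: total = 1757, boot = 1755, init = 2, finish = 0
16/04/22 12:39:03 INFO python.PythonRunner: Times: total = 2615, boot = 2614, init = 1, finish = 0
16/04/22 12:39:04 INFO python.PythonRunner: Times: total = 3481, boot = 3479, init = 2, finish = 0
16/04/22 12:39:05 INFO python.PythonRunner: Times: total = 4375, boot = 4372, init = 2, finish = 1
16/04/22 12:39:06 INFO python.PythonRunner: Times: total = 5237, boot = 5235, init = 2, finish = 0
16/04/22 12:39:07 INFO python.PythonRunner: Times: total = 6072, boot = 6070, init = 2, finish = 0
16/04/22 12:39:08 INFO python.PythonRunner: Times: total = 6949, boot = 6947, init = 2, finish = 0
""".splitlines()

# Matches the four millisecond counters reported on each PythonRunner line.
PATTERN = re.compile(
    r"Times: total = (\d+), boot = (\d+), init = (\d+), finish = (\d+)"
)


def parse_times(lines):
    """Return one dict of millisecond timings per PythonRunner line (hypothetical helper)."""
    rows = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            total, boot, init, finish = map(int, match.groups())
            rows.append({"total": total, "boot": boot, "init": init, "finish": finish})
    return rows


rows = parse_times(LOG_LINES)
boots = [row["boot"] for row in rows]

# Difference in boot time between consecutive tasks, in completion order.
gaps = [later - earlier for earlier, later in zip(boots, boots[1:])]

# Fraction of the summed task time that is spent in "boot".
boot_share = sum(boots) / sum(row["total"] for row in rows)

print("boot times (ms):", boots)
print("gaps between consecutive boots (ms):", gaps)
print("share of total time spent in boot: {:.4f}".format(boot_share))
```

With these eight lines, almost all of each task's time is "boot", and each task's boot time is roughly 0.85 to 0.9 seconds longer than the previous one. That would be consistent with the Python workers being started one after another rather than in parallel, although the log alone does not prove this.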

Spark UI: [two screenshots, images not preserved]

0 answers:

No answers yet