PySpark cluster crashes during unit tests if spark.sql.shuffle.partitions is too high

Asked: 2019-07-18 09:58:28

Tags: apache-spark pyspark crash

My unit test suite, which uses PySpark fixtures, fails when spark.sql.shuffle.partitions is set above roughly 5, but passes when it is set to 1, and I can't fully understand why that fixes it. The tests consistently start failing after many repetitions of the following line:

2019-07-17 09:00:18,896 WARN executor.Executor: Managed memory leak detected; size = 2097152 bytes, TID = 11677

After that, they start failing with the following:

------------------------------------------ Captured log setup -------------------------------------------
ERROR    py4j.java_gateway:java_gateway.py:1078 An error occurred while trying to connect to the Java server (127.0.0.1:38661)
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1067, in start
    self.socket.connect((self.address, self.port))
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused

because the Spark cluster has already crashed.
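For context, the fixtures in question are of this general shape (a minimal sketch, not the actual suite; the fixture and app names are illustrative, and only the spark.sql.shuffle.partitions setting matters here):

```python
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Illustrative local-mode session used by the test suite.
    session = (
        SparkSession.builder
        .master("local[*]")
        .appName("unit-tests")  # hypothetical name
        # Setting this to 1 makes the suite pass; values above ~5 lead to the crash.
        .config("spark.sql.shuffle.partitions", "1")
        .getOrCreate()
    )
    yield session
    session.stop()
```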

Most of what I've tried is watching for the errors as they happen, but neither the Spark management UI (kept alive by pausing the tests in a debugger just before the crash) nor the PySpark profiler has been much help in diagnosing the problem. Here is the log output leading up to one of the crashes:

2019-07-18 09:47:02,033 INFO codegen.CodeGenerator: Code generated in 96.9498 ms
2019-07-18 09:47:02,037 INFO python.PythonRunner: Times: total = 67, boot = -98, init = 165, finish = 0
2019-07-18 09:47:02,039 DEBUG memory.TaskMemoryManager: Task 8051 release 256.0 KB from org.apache.spark.unsafe.map.BytesToBytesMap@7fc5f5ce
2019-07-18 09:47:02,042 INFO executor.Executor: Finished task 0.0 in stage 146.0 (TID 8051). 3018 bytes result sent to driver
2019-07-18 09:47:02,043 DEBUG scheduler.TaskSchedulerImpl: parentName: , name: TaskSet_146.0, runningTasks: 5
2019-07-18 09:47:02,044 DEBUG scheduler.TaskSetManager: No tasks for locality level NO_PREF, so moving to locality level ANY
2019-07-18 09:47:02,046 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 146.0 (TID 8051) in 1082 ms on localhost (executor driver) (1/6)
2019-07-18 09:47:02,051 DEBUG scheduler.DAGScheduler: ShuffleMapTask finished on driver
[Stage 146:=========>                                               (1 + 5) / 6]2019-07-18 09:47:02,108 INFO python.PythonRunner: Times: total = 532, boot = -103, init = 635, finish = 0
2019-07-18 09:47:02,110 DEBUG memory.TaskMemoryManager: Task 8053 release 256.0 KB from org.apache.spark.unsafe.map.BytesToBytesMap@3bab24ae
2019-07-18 09:47:02,116 INFO executor.Executor: Finished task 2.0 in stage 146.0 (TID 8053). 3018 bytes result sent to driver
2019-07-18 09:47:02,117 DEBUG scheduler.TaskSchedulerImpl: parentName: , name: TaskSet_146.0, runningTasks: 4
2019-07-18 09:47:02,119 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 146.0 (TID 8053) in 1153 ms on localhost (executor driver) (2/6)
2019-07-18 09:47:02,121 DEBUG scheduler.DAGScheduler: ShuffleMapTask finished on driver
[Stage 146:===================>                                     (2 + 4) / 6]----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 45094)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 318, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 331, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 652, in __init__
    self.handle()
  File "/opt/spark/python/pyspark/accumulators.py", line 268, in handle
    poll(accum_updates)
  File "/opt/spark/python/pyspark/accumulators.py", line 241, in poll
    if func():
  File "/opt/spark/python/pyspark/accumulators.py", line 245, in accum_updates
    num_updates = read_int(self.rfile)
  File "/opt/spark/python/pyspark/serializers.py", line 714, in read_int
    raise EOFError
EOFError
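For reference, one way to enable the PySpark profiler mentioned above (a sketch; the session setup is illustrative, and spark.python.profile has to be set before the SparkContext is created):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# The profiler flag must be in the conf before the SparkContext exists.
conf = SparkConf().set("spark.python.profile", "true")

spark = (
    SparkSession.builder
    .master("local[*]")
    .config(conf=conf)
    .getOrCreate()
)

# ... run the workload that normally triggers the crash ...

# Print the accumulated per-stage Python worker profiles.
spark.sparkContext.show_profiles()
```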

0 Answers