Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages

Asked: 2016-08-13 17:38:16

Tags: pyspark spark-dataframe

I am running a 5-node cluster with 7 cores and 25 GB of memory per node. My current job uses only 1-2 GB of input data, so why am I getting the error below? I am using PySpark DataFrames (Spark 1.6.2).

[Stage 9487:===================================================>(198 + 2) / 200]16/08/13 16:43:18 ERROR TaskSchedulerImpl: Lost executor 3 on server05: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:=================================================>(198 + -49) / 200]16/08/13 16:43:19 ERROR TaskSchedulerImpl: Lost executor 1 on server04: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:=========>                                          (24 + 15) / 125]16/08/13 16:46:38 ERROR TaskSchedulerImpl: Lost executor 2 on server01: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:==========================================>        (148 + 30) / 178]16/08/13 16:51:36 ERROR TaskSchedulerImpl: Lost executor 0 on server03: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:=============================>                       (50 + 12) / 91]16/08/13 16:55:32 ERROR TaskSchedulerImpl: Lost executor 4 on server02: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
[Stage 9487:============================>                       (50 + -39) / 91]Traceback (most recent call last):
  File "/home/example.py", line 397, in <module>

  File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 269, in count
  File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/sql/utils.py", line 45, in deco
  File "/home/spark-1.6.2-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o9162.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 9487 (count at null:-1) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 577
        at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutpu

Also, how can I change the following groupByKey to reduceByKey? Python 3.4+, Spark 1.6.2.

def testFunction(key, values):
    # Some statistical process for each group.
    # Each group will have n (300K to 1M) rows.
    # A statistical function is applied to each group.


resRDD = df.select(["key1", "key2", "key3", "val1", "val2"])\
           .map(lambda r: (Row(key1=r.key1, key2=r.key2, key3=r.key3), r))\
           .groupByKey()\
           .flatMap(lambda KeyValue: testFunction(KeyValue[0], list(KeyValue[1])))
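One way to phrase this with reduceByKey is sketched below. This is a minimal sketch, not an answer from this page: it assumes testFunction genuinely needs the whole group at once, swaps the Row key for a plain tuple, and merges groups by concatenating lists. Note that when the downstream function needs every row of a group, reduceByKey written this way still shuffles the same data as groupByKey, so the gain may be small.

# A hedged sketch: the tuple key and list concatenation are assumptions,
# not code taken from the question.
resRDD = df.select(["key1", "key2", "key3", "val1", "val2"])\
           .map(lambda r: ((r.key1, r.key2, r.key3), [r]))\
           .reduceByKey(lambda a, b: a + b)\
           .flatMap(lambda kv: testFunction(kv[0], kv[1]))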

1 Answer:

Answer 0 (score: 0)

I solved this problem by reducing spark.executor.memory. Applications other than Spark were probably running on my cluster; they ate up memory and slowed my workers down, which caused the communication to break.
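For reference, a minimal sketch of one way to lower the executor memory in PySpark 1.6; the "4g" value is illustrative and not taken from the answer:

from pyspark import SparkConf, SparkContext

# Lower spark.executor.memory before creating the context.
# "4g" is an illustrative value; pick one that leaves headroom for
# the other applications sharing the nodes.
conf = SparkConf().set("spark.executor.memory", "4g")
sc = SparkContext(conf=conf)

The same setting can also be passed on the command line at submit time instead of in code.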