What is causing the SparkContext to shut down unexpectedly?

Asked: 2017-11-08 00:55:40

Tags: apache-spark pyspark yarn apache-spark-ml

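I have a DataFrame with 2,818,615 rows, each holding a pyspark.ml.linalg.SparseVector of length 388 and a class label. I want to train a pyspark.ml RandomForestClassifier on this dataset. Every time I try to train the model, Spark runs for about 30 minutes and then fails because the SparkContext was shut down. If I limit the dataset to only 25K rows, the model trains successfully, but I need to use the larger dataset.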

What troubleshooting steps should I try here?

print(df.rdd.getNumPartitions())   
8

df.show()
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    2|
|(388,[1,355,361,3...|    2|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
+--------------------+-----+
only showing top 20 rows
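
For a rough sense of scale, the sketch below computes an upper bound on the size of the feature data, assuming every vector were stored densely as 388 eight-byte doubles. The real sparse footprint depends on the number of non-zero entries per row, which the question does not state, so `sparse_bytes` takes it as a parameter; the 4-byte index plus 8-byte value per entry is an approximation that ignores JVM object overhead.

```python
# Figures taken from the question.
NUM_ROWS = 2_818_615
VECTOR_SIZE = 388

BYTES_PER_DOUBLE = 8   # each stored value is a 64-bit float
BYTES_PER_INDEX = 4    # assumption: 32-bit index per non-zero entry


def dense_bytes(rows: int, size: int) -> int:
    """Upper bound: every vector stored as a dense array of doubles."""
    return rows * size * BYTES_PER_DOUBLE


def sparse_bytes(rows: int, nnz_per_row: int) -> int:
    """Approximate payload for sparse vectors with nnz_per_row non-zeros each."""
    return rows * nnz_per_row * (BYTES_PER_INDEX + BYTES_PER_DOUBLE)


dense_total = dense_bytes(NUM_ROWS, VECTOR_SIZE)
dense_gib = dense_total / 2**30
print(f"dense upper bound: {dense_total} bytes ({dense_gib:.2f} GiB)")
```

Even the dense upper bound is well below the combined worker memory listed in the question, so raw data size alone may not explain the failure; intermediate state built during training is a separate matter that this estimate does not cover.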

My hardware:

  • Workers: 4 vCPUs, 30.5 GiB memory, 4 instances
  • Master: 8 vCPUs, 16 GiB memory
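
A back-of-the-envelope comparison of the partition count with the cluster's cores can be sketched as follows. It assumes one task slot per worker vCPU, which is a simplification: the slots actually available depend on the executor settings, which the question does not show.

```python
# Figures taken from the question.
num_rows = 2_818_615
num_partitions = 8
worker_instances = 4
vcpus_per_worker = 4

# Assumption: one task slot per worker vCPU.
total_task_slots = worker_instances * vcpus_per_worker
rows_per_partition = num_rows / num_partitions
# Slots that have no partition to work on when a stage has only 8 tasks.
idle_slots = max(total_task_slots - num_partitions, 0)

print(total_task_slots, round(rows_per_partition), idle_slots)
# prints: 16 352327 8
```

Under that assumption, a stage with 8 tasks would leave half of the worker cores without work, and each task would process roughly 352K rows.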

Here is how I am (trying to) train the model:

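from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Assumption: SparkCV refers to pyspark.ml.tuning.CrossValidator.
from pyspark.ml.tuning import CrossValidator as SparkCV, ParamGridBuilder

rf = RandomForestClassifier(featuresCol='features', labelCol='label')
grid = ParamGridBuilder().addGrid(rf.numTrees, [30, 50, 75]).addGrid(rf.maxDepth, [10, 20]).build()
evaluator = MulticlassClassificationEvaluator(metricName="f1")
cv = SparkCV(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator, numFolds=3)
cvModel = cv.fit(df)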
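
To make the size of this job concrete, the sketch below counts how many forests a single `fit` call trains, using only the grid values and fold count from the code above. It assumes the cross-validator behaves like `pyspark.ml.tuning.CrossValidator`, which fits every parameter combination once per fold and then refits the best combination on the full dataset.

```python
from itertools import product

# Values taken from the parameter grid in the question.
num_trees_options = [30, 50, 75]
max_depth_options = [10, 20]
num_folds = 3

# Every (numTrees, maxDepth) pair is one candidate model.
param_combinations = list(product(num_trees_options, max_depth_options))

# Each candidate is trained once per fold.
cv_fits = len(param_combinations) * num_folds

# Assumption: one extra fit of the best candidate on the whole dataset.
total_fits = cv_fits + 1

# Individual decision trees grown during cross-validation alone.
trees_in_cv = sum(n for n, _ in param_combinations) * num_folds

print(len(param_combinations), cv_fits, total_fits, trees_in_cv)
# prints: 6 18 19 930
```

A smaller grid or a single `rf.fit(df)` call would be a much cheaper way to check whether training completes at all on the full dataset.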

The traceback says the job failed because:

py4j.protocol.Py4JJavaError: An error occurred while calling o417.fit.
: org.apache.spark.SparkException: Job 76 cancelled because SparkContext was shut down

Here are the last few lines of the Spark log:

17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 31.
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 13 executor(s).
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 12 executor(s).
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 12.
17/11/07 23:16:21 INFO YarnAllocator: Driver requested a total number of 9 executor(s).
17/11/07 23:16:21 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 30, 18, 19.
17/11/07 23:20:07 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
17/11/07 23:20:07 INFO ApplicationMaster: Final app status: UNDEFINED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
17/11/07 23:20:07 INFO ShutdownHookManager: Shutdown hook called
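
The executor activity in this log excerpt can be summarized with a short script. This is only a sketch: the regular expressions are written against the message formats visible in the lines above and are not guaranteed to match other Spark versions.

```python
import re

# The relevant lines, copied from the log excerpt in the question.
log_tail = """\
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 31.
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 13 executor(s).
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 12 executor(s).
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 12.
17/11/07 23:16:21 INFO YarnAllocator: Driver requested a total number of 9 executor(s).
17/11/07 23:16:21 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 30, 18, 19.
17/11/07 23:20:07 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
"""

TOTAL_RE = re.compile(r"^(\S+ \S+) INFO YarnAllocator: "
                      r"Driver requested a total number of (\d+) executor")
KILL_RE = re.compile(r"Driver requested to kill executor\(s\) ([\d, ]+)\.")

requested_totals = []   # (timestamp, requested executor count)
killed_executors = []   # executor ids the driver asked to kill

for line in log_tail.splitlines():
    total_match = TOTAL_RE.search(line)
    if total_match:
        requested_totals.append((total_match.group(1), int(total_match.group(2))))
    kill_match = KILL_RE.search(line)
    if kill_match:
        killed_executors.extend(int(x) for x in kill_match.group(1).split(","))

print(requested_totals)
print(killed_executors)
```

In this excerpt the requested total falls from 13 to 12 to 9 while six executors are killed at the driver's request, and the SIGTERM arrives about four minutes after the last of those messages.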

0 answers:

No answers yet