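I have a DataFrame with 2,818,615 rows, each consisting of a length-388 pyspark.ml.linalg.SparseVector and a class label. I want to use this dataset to train a pyspark.ml RandomForestClassifier. Every time I try to train the model, Spark runs for about 30 minutes and then fails because the SparkContext was shut down. If I limit the dataset to only 25K rows, the model trains successfully, but I need to use the larger dataset.
What troubleshooting steps can I take here?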
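One way to see why 25K rows train but 2.8M rows do not is a back-of-the-envelope size estimate. The sketch below assumes that Spark's tree learner turns every row into a dense array of integer bin indices, one 4-byte int per feature, no matter how sparse the input SparseVector is. This mirrors the `binnedFeatures` integer array in Spark's internal `TreePoint` class, but treat it as an approximation: it ignores JVM object overhead, bagging weights and extra cached copies. The helper name `binned_feature_bytes` is hypothetical.

```python
# Rough size of the binned training data for Spark's tree learner, ASSUMING one
# 4-byte int bin index per feature per row (dense, regardless of input sparsity).
# Ignores JVM object overhead, bagging weights, and any extra cached copies.

GIB = 1024 ** 3


def binned_feature_bytes(rows: int, num_features: int, bytes_per_value: int = 4) -> int:
    """Bytes for a dense rows x num_features array of integer bin indices."""
    return rows * num_features * bytes_per_value


full = binned_feature_bytes(2_818_615, 388)   # the full DataFrame
sample = binned_feature_bytes(25_000, 388)    # the subset that trained successfully

print(f"full dataset: {full / GIB:.2f} GiB")
print(f"25K sample:   {sample / GIB:.3f} GiB")
print(f"ratio:        {full / sample:.1f}x")
```

Under these assumptions the full set is about 4 GiB of binned features before any overhead, roughly 113 times the 25K subset, so a successful run on the subset says little about whether the full set fits in executor memory.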
print(df.rdd.getNumPartitions())
8
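With only 8 partitions, each task has to process roughly 352K rows at once. The arithmetic below compares the per-partition share for the current 8 partitions with a larger, purely hypothetical count of 200; in PySpark the actual change would be made with `df.repartition(n)`. The per-row size figure assumes 4 bytes per feature once the tree learner bins the rows, which is an estimate rather than a measured value.

```python
import math

ROWS = 2_818_615
NUM_FEATURES = 388
BYTES_PER_BINNED_VALUE = 4  # assumption: one 4-byte int per feature per row once binned


def rows_per_partition(rows: int, partitions: int) -> int:
    """Rows in the largest partition under a perfectly even split."""
    return math.ceil(rows / partitions)


# 8 is the current df.rdd.getNumPartitions(); 200 is a hypothetical target.
for n in (8, 200):
    per_part = rows_per_partition(ROWS, n)
    mib = per_part * NUM_FEATURES * BYTES_PER_BINNED_VALUE / 1024 ** 2
    print(f"{n:>4} partitions: {per_part:>9,} rows per partition, ~{mib:,.0f} MiB binned")
```

Real partitions are rarely perfectly even, so the largest partition is likely to be somewhat bigger than these figures.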
df.show()
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    0|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    2|
|(388,[1,355,361,3...|    2|
|(388,[1,355,361,3...|    1|
|(388,[1,355,361,3...|    0|
+--------------------+-----+
only showing top 20 rows
My hardware:
Here is how I am (trying to) train the model:
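from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# SparkCV is assumed to be pyspark.ml.tuning.CrossValidator imported under an alias
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator as SparkCV

rf = RandomForestClassifier(featuresCol='features', labelCol='label')
grid = ParamGridBuilder().addGrid(rf.numTrees, [30, 50, 75]).addGrid(rf.maxDepth, [10, 20]).build()
evaluator = MulticlassClassificationEvaluator(metricName="f1")
cv = SparkCV(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator, numFolds=3)
cvModel = cv.fit(df)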
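The grid above costs more than a single training run. Assuming `SparkCV` behaves like PySpark's `CrossValidator`, it fits every parameter combination once per fold and then refits the best combination on the whole dataset. The sketch below only counts those fits and the trees they grow from the grid values in the question; it does not call Spark.

```python
from itertools import product

num_trees_values = [30, 50, 75]
max_depth_values = [10, 20]
num_folds = 3

# Every (numTrees, maxDepth) pair is one entry of the ParamGridBuilder grid.
grid = list(product(num_trees_values, max_depth_values))

# One fit per grid entry per fold, plus one final refit of the best entry.
cv_fits = len(grid) * num_folds
total_fits = cv_fits + 1

# Trees grown during cross-validation alone (the final refit adds 30 to 75 more).
trees_in_cv = num_folds * sum(n for n, _ in grid)

print(f"grid size: {len(grid)}")
print(f"fits: {cv_fits} during CV + 1 refit = {total_fits}")
print(f"trees grown during CV: {trees_in_cv}")
```

Depth matters as well: a binary tree of depth d has at most 2^d leaves, so the maxDepth=20 entries allow up to 1024 times as many leaves per tree as the maxDepth=10 entries.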
The traceback says the job failed because:
py4j.protocol.Py4JJavaError: An error occurred while calling o417.fit.
: org.apache.spark.SparkException: Job 76 cancelled because SparkContext was shut down
Here are the last few lines of the Spark log:
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 31.
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 13 executor(s).
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 14.
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 12 executor(s).
17/11/07 23:15:04 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 12.
17/11/07 23:16:21 INFO YarnAllocator: Driver requested a total number of 9 executor(s).
17/11/07 23:16:21 INFO ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 30, 18, 19.
17/11/07 23:20:07 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
17/11/07 23:20:07 INFO ApplicationMaster: Final app status: UNDEFINED, exitCode: 16, (reason: Shutdown hook called before final status was reported.)
17/11/07 23:20:07 INFO ShutdownHookManager: Shutdown hook called
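In this excerpt the driver asks for fewer and fewer executors (13, then 12, then 9) and kills several others a few minutes before the ApplicationMaster receives SIGTERM. On a full log it can help to pull out just those lines. The sketch below is a plain-Python filter over log text; its regular expressions are written against the lines quoted above, so applying them to other lines in the same log is an assumption about the log format, and the function name `summarize` is hypothetical.

```python
import re

# Patterns written against the YARN ApplicationMaster/YarnAllocator lines quoted above.
TOTAL_RE = re.compile(r"requested a total number of (\d+) executor\(s\)")
LEVEL_RE = re.compile(r"^\S+ \S+ (ERROR|WARN) ")


def summarize(log_text: str):
    """Return (requested executor totals in order, ERROR/WARN lines)."""
    totals = []
    problems = []
    for line in log_text.splitlines():
        m = TOTAL_RE.search(line)
        if m:
            totals.append(int(m.group(1)))
        if LEVEL_RE.match(line):
            problems.append(line)
    return totals, problems


sample = """\
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 13 executor(s).
17/11/07 23:15:04 INFO YarnAllocator: Driver requested a total number of 12 executor(s).
17/11/07 23:16:21 INFO YarnAllocator: Driver requested a total number of 9 executor(s).
17/11/07 23:20:07 ERROR ApplicationMaster: RECEIVED SIGNAL TERM
"""

totals, problems = summarize(sample)
print(totals)  # [13, 12, 9]
for line in problems:
    print(line)
```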