Why does this pyspark.ml.RandomForestRegressor fail with a shut-down context?

Date: 2017-08-25 18:47:46

Tags: apache-spark pyspark apache-spark-ml

I'm trying to train a RandomForestRegressor on a dataframe called train, as follows:

from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

rf = RandomForestRegressor(featuresCol=self.featuresCol, labelCol=self.labelCol)
param_grid = ParamGridBuilder()\
    .addGrid(rf.numTrees, [5, 10, 20]) \
    .addGrid(rf.maxDepth, [5, 10, 15]) \
    .build()

crossval = CrossValidator(estimator=rf,
                          estimatorParamMaps=param_grid,
                          evaluator=RegressionEvaluator(),
                          numFolds=3)

self.model = crossval.fit(train)
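Note that the CrossValidator above trains one model per (parameter map × fold): the grid is 3 values of numTrees × 3 values of maxDepth = 9 parameter maps, and with numFolds=3 that means 27 full random-forest fits (plus a final refit on all the data), which is often where memory pressure shows up. A plain-Python sketch of how the grid expands (a stand-in mirroring ParamGridBuilder's cross-product, not using Spark itself):

```python
from itertools import product

# Hypothetical stand-in for ParamGridBuilder: expand each listed
# hyperparameter into the full cross-product of parameter maps.
def build_param_grid(grids):
    names = list(grids)
    return [dict(zip(names, combo)) for combo in product(*grids.values())]

param_grid = build_param_grid({
    "numTrees": [5, 10, 20],
    "maxDepth": [5, 10, 15],
})

num_folds = 3
# Each parameter map is fit once per fold during cross-validation.
total_fits = len(param_grid) * num_folds

print(len(param_grid))  # 9 parameter maps
print(total_fits)       # 27 fits
```

Each of those 27 fits builds a full forest over all 10,479 features, so the executors' memory budget has to cover the largest forest in the grid, not the average one.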

Here are the row count, partition count, a sample row, and the schema of the dataframe:

Training on 26398 examples with 8 partitions
{'features': SparseVector(10479, {5: 1.0, 360: 1.0, 361: 0.2444, 362: -0.9697, 363: 1.0, 10476: -0.0685}),
 'label': 989}
root
 |-- features: vector (nullable = true)
 |-- label: long (nullable = true)

Final error message after attempting to fit the model:

org.apache.spark.SparkException: Job 44 cancelled because SparkContext was shut down

What is causing this failure?

  • m4.xlarge
  • 8 vCPU
  • 16 GiB memory

Workers (4 instances)

  • r4.xlarge
  • 4 vCPU
  • 30.5 GiB memory

0 Answers:

There are no answers yet.