Using DenseVector as the input label for PySpark random forest modelling

Time: 2017-09-27 03:04:10

Tags: machine-learning pyspark spark-dataframe random-forest apache-spark-ml

I have a processed DataFrame in the following format:

+--------+----------+
| labels | features |
+--------+----------+
|[1,0,0] |    1     |
+--------+----------+
|[0,0,0] |    0     |
+--------+----------+
....

Here labels is of type DenseVector and features is an integer.
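One way to confirm which column holds the vector and which holds the integer is to look at `dataSet.dtypes`, which returns (column name, type string) pairs; as far as I know, `ml` vector columns show up there as `vector`. The sketch below is only illustrative: `split_columns_by_kind` is a hypothetical helper, and `example_dtypes` is a hand-written stand-in for `dataSet.dtypes` so that the snippet runs without a Spark session.

```python
# Type strings that Spark reports for numeric columns (decimal is handled separately).
NUMERIC_TYPE_NAMES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def split_columns_by_kind(dtypes):
    """Group column names from DataFrame.dtypes-style pairs into vector and numeric columns."""
    vector_cols = []
    numeric_cols = []
    for name, type_name in dtypes:
        if type_name == "vector":
            vector_cols.append(name)
        elif type_name in NUMERIC_TYPE_NAMES or type_name.startswith("decimal"):
            numeric_cols.append(name)
    return vector_cols, numeric_cols


# Assumption: hand-written pairs mirroring the table above, in place of dataSet.dtypes.
example_dtypes = [("labels", "vector"), ("features", "int")]
vector_cols, numeric_cols = split_columns_by_kind(example_dtypes)
print("vector columns:", vector_cols)
print("numeric columns:", numeric_cols)
```

With a real DataFrame the call would be `split_columns_by_kind(dataSet.dtypes)`.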

I am trying to use the PySpark ML library to train a random forest model on this data, but I keep getting the following error:

pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type NumericType but was actually of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.'

when running the following code:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

def generateModel(dataSet):

    # Begin generating our model object.
    modelType = RandomForestClassifier()

    # Model parameters stored here.
    modelParameters = ParamGridBuilder() \
        .baseOn({modelType.labelCol: 'features'}) \
        .baseOn([modelType.predictionCol, 'label']) \
        .addGrid(modelType.numTrees, [5]) \
        .addGrid(modelType.maxDepth, [7]) \
        .build()

    # Cross-validate the estimator over the parameter grid.
    evaluatorObject = CrossValidator(estimator=modelType,
                                     estimatorParamMaps=modelParameters,
                                     evaluator=BinaryClassificationEvaluator(),
                                     numFolds=4)

    trainedModel = evaluatorObject.fit(dataSet)

    return trainedModel

I am using the ml library rather than mllib because I understand there are some known issues when using the DenseVector approach.
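For what it is worth, the error text says that the column named `features` is a `VectorUDT`, and the code above passes that same column as `labelCol`. `RandomForestClassifier` expects `labelCol` to be a numeric column and `featuresCol` to be a vector column, while `predictionCol` only names the output column the model writes to. A minimal sketch of one possible arrangement follows, assuming the numeric column is called `label` and the vector column is called `features`. The names `check_column_roles`, `generate_model`, `label_col` and `features_col` are hypothetical, and the PySpark imports are kept inside the function so the checker can be tried without Spark.

```python
NUMERIC_TYPE_NAMES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def check_column_roles(dtypes, label_col, features_col):
    """Return a list of problems for DataFrame.dtypes-style (name, type) pairs."""
    types = dict(dtypes)
    problems = []
    label_type = types.get(label_col)
    is_numeric = label_type in NUMERIC_TYPE_NAMES or str(label_type).startswith("decimal")
    if not is_numeric:
        problems.append("label column %r should be numeric, found %r" % (label_col, label_type))
    features_type = types.get(features_col)
    if features_type != "vector":
        problems.append("features column %r should be a vector, found %r" % (features_col, features_type))
    return problems


def generate_model(data_set, label_col="label", features_col="features"):
    """Cross-validate a random forest; the column names are assumptions, not the asker's."""
    # Local imports: only needed when the model is actually trained.
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    problems = check_column_roles(data_set.dtypes, label_col, features_col)
    if problems:
        raise ValueError("; ".join(problems))

    # predictionCol is left at its default because it is an output column.
    model_type = RandomForestClassifier(labelCol=label_col, featuresCol=features_col)

    param_grid = (ParamGridBuilder()
                  .addGrid(model_type.numTrees, [5])
                  .addGrid(model_type.maxDepth, [7])
                  .build())

    cross_validator = CrossValidator(estimator=model_type,
                                     estimatorParamMaps=param_grid,
                                     evaluator=BinaryClassificationEvaluator(labelCol=label_col),
                                     numFolds=4)
    return cross_validator.fit(data_set)
```

If the label really is a DenseVector such as [1,0,0], it would presumably have to be converted to a single numeric class value before training; that conversion is not shown here.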

0 answers:

No answers yet