Question

I have a processed DataFrame in the following format:

+--------+----------+
| labels | features |
+--------+----------+
|[1,0,0] |    1     |
+--------+----------+
|[0,0,0] |    0     |
+--------+----------+
....

Here labels is of type DenseVector and features is an integer.

I am trying to train a random forest model on this data with the PySpark ML library, but I keep hitting the following error:

pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type NumericType but was actually of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.'

when running the following code:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

def generateModel(dataSet):

    # Begin generating our model object.
    modelType = RandomForestClassifier()

    # Model parameters stored here.
    modelParameters = ParamGridBuilder() \
             .baseOn({modelType.labelCol: 'features'}) \
             .baseOn([modelType.predictionCol, 'label']) \
             .addGrid(modelType.numTrees, [5]) \
             .addGrid(modelType.maxDepth, [7]) \
             .build()

    # Cross-validate over the parameter grid with 4 folds.
    evaluatorObject = CrossValidator(estimator=modelType,
                        estimatorParamMaps=modelParameters,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=4)

    trainedModel = evaluatorObject.fit(dataSet)

    return trainedModel

I am using the ml library rather than mllib because I understand there are known issues when using DenseVector.

Using a DenseVector as the input label for PySpark random forest modeling

0 answers: