I have a processed DataFrame in the following form:
+--------+----------+
| labels | features |
+--------+----------+
|[1,0,0] | 1 |
+--------+----------+
|[0,0,0] | 0 |
+--------+----------+
....
where labels is of type DenseVector and features is an integer.
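To make the layout concrete, here is a minimal plain-Python sketch of the two rows shown above. It assumes plain lists as stand-ins for DenseVector so that it runs without Spark, and `rows` and `column_kinds` are hypothetical names used only for illustration. It simply reports which column holds scalars and which holds vectors, as the table is described.

```python
from numbers import Number

# Stand-in for the two rows of the table; lists play the role of DenseVector
# (an assumption made so the sketch runs without a Spark session).
rows = [
    {"labels": [1, 0, 0], "features": 1},
    {"labels": [0, 0, 0], "features": 0},
]


def column_kinds(rows):
    """Classify each column as 'numeric', 'vector', or 'mixed'."""
    kinds = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, Number) for v in values):
            kinds[name] = "numeric"
        elif all(isinstance(v, (list, tuple)) for v in values):
            kinds[name] = "vector"
        else:
            kinds[name] = "mixed"
    return kinds


print(column_kinds(rows))
```

With the rows as described, labels comes out as the vector column and features as the numeric one.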
I am trying to train a random forest model on this data with the PySpark ML library, but I keep running into the following error:
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type NumericType but was actually of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.'
when running the following code:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

def generateModel(dataSet):
    # Begin generating our model object.
    modelType = RandomForestClassifier()
    # Model parameters stored here.
    modelParameters = ParamGridBuilder() \
        .baseOn({modelType.labelCol: 'features'}) \
        .baseOn([modelType.predictionCol, 'label']) \
        .addGrid(modelType.numTrees, [5]) \
        .addGrid(modelType.maxDepth, [7]) \
        .build()
    evaluatorObject = CrossValidator(estimator=modelType,
                                     estimatorParamMaps=modelParameters,
                                     evaluator=BinaryClassificationEvaluator(),
                                     numFolds=4)
    trainedModel = evaluatorObject.fit(dataSet)
    return trainedModel
I am using the ml library rather than mllib, because I understand there are some issues with using the DenseVector approach.