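How do I pass both numerical and categorical features to RandomForestRegressor in Apache Spark MLlib (Java)?
I can do it with either numerical or categorical features alone, but I don't know how to use both together.
My working code is below (it uses only the numerical features for prediction):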
String[] featureNumericalCols = new String[]{
"squareM",
"timeTimeToPragueCityCenter",
};
String[] featureStringCols = new String[]{ //not used
"type",
"floor",
"disposition",
};
VectorAssembler assembler = new VectorAssembler().setInputCols(featureNumericalCols).setOutputCol("features");
Dataset<Row> numericalData = assembler.transform(data);
numericalData.show();
RandomForestRegressor rf = new RandomForestRegressor().setLabelCol("price")
.setFeaturesCol("features");
// Chain assembler and forest in a Pipeline
Pipeline pipeline = new Pipeline()
.setStages(new PipelineStage[]{assembler, rf});
// Train model. This also runs the assembler.
PipelineModel model = pipeline.fit(trainingData);
// Make predictions.
Dataset<Row> predictions = model.transform(testData);
Answer 0 (score: 1)
For anyone else looking for this, here is the solution:
StringIndexer typeIndexer = new StringIndexer()
.setInputCol("type")
.setOutputCol("typeIndex");
preparedData = typeIndexer.fit(preparedData).transform(preparedData);
StringIndexer floorIndexer = new StringIndexer()
.setInputCol("floor")
.setOutputCol("floorIndex");
preparedData = floorIndexer.fit(preparedData).transform(preparedData);
StringIndexer dispositionIndexer = new StringIndexer()
.setInputCol("disposition")
.setOutputCol("dispositionIndex");
preparedData = dispositionIndexer.fit(preparedData).transform(preparedData);
String[] featureCols = new String[]{
"squareM",
"timeTimeToPragueCityCenter",
"typeIndex",
"floorIndex",
"dispositionIndex"
};
VectorAssembler assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features");
preparedData = assembler.transform(preparedData);
// ... some more implementation details
RandomForestRegressor rf = new RandomForestRegressor().setLabelCol("price")
.setFeaturesCol("features");
return rf.fit(preparedData);
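To make the indexing step above more concrete, here is a minimal plain-Java sketch of the idea behind it: each distinct string is mapped to a numeric index, and that index is then placed next to the numerical columns in one feature row. It does not use Spark. The class `StringIndexSketch`, the methods `fitIndex` and `assemble`, and the sample values are all hypothetical and exist only for illustration. Ordering by descending frequency with alphabetical tie-breaking is an assumption of this sketch, so check the StringIndexer documentation for your Spark version before relying on a specific index value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not part of the Spark API.
class StringIndexSketch {

    // Learn a label -> index mapping: the most frequent label gets 0.0,
    // the next most frequent gets 1.0, and so on.
    static Map<String, Double> fitIndex(List<String> values) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<String> labels = new ArrayList<>(counts.keySet());
        // Descending frequency; ties broken alphabetically (assumption of this sketch).
        labels.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        Map<String, Double> index = new LinkedHashMap<>();
        for (int i = 0; i < labels.size(); i++) {
            index.put(labels.get(i), (double) i);
        }
        return index;
    }

    // Put two numerical values and one indexed categorical value into a single row.
    static double[] assemble(double squareM, double timeToCenter,
                             String type, Map<String, Double> typeIndex) {
        return new double[]{squareM, timeToCenter, typeIndex.get(type)};
    }

    public static void main(String[] args) {
        // Made-up sample values for the "type" column.
        List<String> types = Arrays.asList("flat", "house", "flat", "studio", "flat", "house");
        Map<String, Double> typeIndex = fitIndex(types);
        System.out.println(typeIndex); // prints {flat=0.0, house=1.0, studio=2.0}
        System.out.println(Arrays.toString(assemble(54.0, 25.0, "house", typeIndex))); // prints [54.0, 25.0, 1.0]
    }
}
```

In the Spark code above, the `StringIndexer` stages do the mapping and `VectorAssembler` does the row assembly. The sketch only shows why the indexed columns, rather than the raw string columns, have to go into `featureCols`.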