Evaluating new data with a saved Spark model

Asked: 2017-08-31 16:59:00

Tags: scala apache-spark apache-spark-mllib

I have successfully converted my data to a LibSVM file and trained a decision tree model with Spark's MLlib package. I used the Scala code from the 1.6.2 documentation, changing only the file names:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression tree model:\n" + model.toDebugString)

// Save and load model
model.save(sc, "target/tmp/myDecisionTreeRegressionModel")
val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeRegressionModel")

The code correctly prints the model's MSE and the learned tree model. However, I'm stuck on how to take sameModel and use it to evaluate new data. For example, if the LibSVM file I used to train the model looks like this:

0 1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:9 16:19
0 1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:1 13:0 14:0 15:9 16:12
0 1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:6 16:7

How do I feed the trained model something like the following and have it predict the labels?

1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:9 16:19
1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:1 13:0 14:0 15:9 16:12
1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:6 16:7

Edit (8/31/2017 3:56 PM, Eastern)

Per the suggestion below, I tried using the predict function, but it doesn't look like the code is working correctly:

val new_data = MLUtils.loadLibSVMFile(sc, "hdfs://.../new_data/*")

val labelsAndPredictions = new_data.map { point =>
  val prediction = sameModel.predict(point.features)
  (point.label, prediction)
}

labelsAndPredictions.take(10)

If I run this with a LibSVM file that has '1' for every label (I tested ten new rows in the file), then they all come back as '1.0' from the labelsAndPredictions.take(10) command. If I give it '0' values, they all come back as '0.0', so it doesn't look like anything is being predicted correctly.

4 Answers:

Answer 0 (score: 0)

The load method should return a model. Then call predict with either an RDD[Vector] or a single Vector.
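
For example, a minimal sketch of both call styles, assuming the model was saved to the path used in the question (the HDFS path below is the question's own placeholder):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeRegressionModel")

// Predict a single point: 16 features, with LibSVM's 1-based indices shifted to 0-based
// (this vector corresponds to the first row of the training file above)
val single = Vectors.sparse(16, Array(0, 14, 15), Array(1.0, 9.0, 19.0))
val onePrediction: Double = sameModel.predict(single)

// Or predict a whole RDD[Vector] in one call
val newData = MLUtils.loadLibSVMFile(sc, "hdfs://.../new_data/*")
val allPredictions = sameModel.predict(newData.map(_.features))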

Answer 1 (score: 0)

  1. Load the new data (a LibSVM file like the one above)
  2. Provide the information about categorical features
  3. For each point in that data, make a prediction by calling savedModel.predict(point.features), as in the sketch after this list
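
If the new file has no label column (as in the second snippet in the question), MLUtils.loadLibSVMFile will not parse it as-is, so one option is to parse the index:value pairs by hand. A rough sketch, assuming 16 features, a hypothetical path, and the savedModel name from the steps above:

import org.apache.spark.mllib.linalg.Vectors

// Must match the number of features the model was trained on
val numFeatures = 16

// Parse lines like "1:1.0 2:0.0 ... 16:19" into sparse vectors
// (LibSVM indices are 1-based, Vectors.sparse indices are 0-based, hence the "- 1")
val unlabeled = sc.textFile("hdfs://.../unlabeled_data/*").map { line =>
  val pairs = line.trim.split("\\s+").map { kv =>
    val Array(i, v) = kv.split(":")
    (i.toInt - 1, v.toDouble)
  }
  Vectors.sparse(numFeatures, pairs.map(_._1), pairs.map(_._2))
}

val predictions = savedModel.predict(unlabeled)
predictions.take(10).foreach(println)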

Answer 2 (score: 0)

You can load an ML model from disk via a Pipeline:

import org.apache.spark.ml._
val pipeline = Pipeline.read.load("sample-pipeline")

scala> val stageCount = pipeline.getStages.size
stageCount: Int = 0

val pipelineModel = PipelineModel.read.load("sample-model")

scala> pipelineModel.stages

Once you have the pipeline, you can fit it and generate predictions for a dataset:

val model = pipeline.fit(dataset)
val predictions = model.transform(dataset)

You have to use the right Evaluator, e.g. a RegressionEvaluator. The Evaluator operates on a dataset that contains predictions:

import org.apache.spark.ml.evaluation.RegressionEvaluator
val regEval = new RegressionEvaluator
println(regEval.explainParams)
regEval.evaluate(predictions)

UPD: If you are dealing with HDFS, you can also load/save models easily:

One way to save the model to HDFS is as follows:

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/sample-model")

The saved model can then be loaded with:

// Assuming the MLlib LinearRegressionModel here, per the RDD-based API used above
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

val linRegModel = sc.objectFile[LinearRegressionModel]("hdfs:///user/root/sample-model").first()
linRegModel.predict(Vectors.dense(11.0, 2.0, 2.0, 1.0, 2200.0))

Or, as in the example above, but with an HDFS path instead of a local file:

PipelineModel.read.load("hdfs:///user/root/sample-model")

Answer 3 (score: 0)

Use hdfs to put the file into a directory that every node in the cluster can see, then load it in your code and predict.
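
A minimal sketch of that flow, with hypothetical paths (the file would be uploaded first, e.g. with hdfs dfs -put):

import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

// Placeholder HDFS paths; the data must live on HDFS so every executor can read it
val model = DecisionTreeModel.load(sc, "hdfs:///user/me/models/myDecisionTreeRegressionModel")
val newData = MLUtils.loadLibSVMFile(sc, "hdfs:///user/me/new_data/*")

val labelsAndPredictions = newData.map(p => (p.label, model.predict(p.features)))
labelsAndPredictions.take(10).foreach(println)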