Question

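It looks like the training and test sets should both be present at the time of classification model creation in Apache Spark? What if we have some unseen instances that did not exist when we created the model? Do we have to rebuild the model whenever we receive an unseen instance? Doesn't that make classification impractical in real-world scenarios?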

Answer 1

It looks like the training and test sets should both be present at the time of classification model creation in Apache Spark?

Test instances can be drawn from the training data, as shown in the Naive Bayes example.

from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

def parseLine(line):
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)

data = sc.textFile('data/mllib/sample_naive_bayes_data.txt').map(parseLine)

# Split data approximately into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4], seed=0)

# Train a naive Bayes model.
model = NaiveBayes.train(training, 1.0)

# Make predictions and compute accuracy on the test set.
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
# Compare prediction and label in each (prediction, label) pair.
accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()

What if we have some unseen instances that did not exist when we created the model?

The situation is the same as with scikit and other machine learning tools, although Spark provides some algorithms that can handle streams.

Answer 2

I do not really understand why both the training and the test set would have to be present at the same time "of classification model creation in Apache Spark". The whole idea of splitting the data is that you use the training set to build the model and keep the test set separate, so you can later evaluate the model's predictions on "unseen" data.

For example, when you split your data you do something like:

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

Then, when you train your model (say, a classifier such as logistic regression), you pass only trainingData to it. The model is not aware that any testData exists while it is being trained.

import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}
val modelLR = new LogisticRegressionWithLBFGS().run(trainingData)

Then you can test your results on testData like this:

val testing = testData.map { point =>
  val prediction = modelLR.predict(point.features)
  (point.label, prediction)
}

Here, making a prediction needs only a vector of features, as in val prediction = modelLR.predict(point.features). The point.label is used only for calculating performance metrics such as accuracy and precision.
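As a rough sketch of how (label, prediction) pairs turn into metrics, the plain-Python functions below compute accuracy and precision from a list of such pairs. It is written in Python, like the first answer, and without Spark so that it runs anywhere. The function names and the sample pairs are made up for illustration. On an RDD the same counts would come from filter and count.

```python
def accuracy(label_and_prediction):
    """Fraction of pairs where the prediction equals the true label."""
    pairs = list(label_and_prediction)
    if not pairs:
        raise ValueError("need at least one (label, prediction) pair")
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)


def precision(label_and_prediction, positive=1.0):
    """Of all instances predicted as `positive`, the fraction that really are."""
    pairs = list(label_and_prediction)
    # Keep the true labels of everything the model predicted as positive.
    predicted_positive = [label for label, prediction in pairs
                          if prediction == positive]
    if not predicted_positive:
        return 0.0
    true_positive = sum(1 for label in predicted_positive if label == positive)
    return true_positive / len(predicted_positive)


# Hypothetical pairs, shaped like the (point.label, prediction) tuples above.
sample_pairs = [(1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(accuracy(sample_pairs))
print(precision(sample_pairs))
```

Only the evaluation step reads the labels. The prediction step never does.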

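If you are wondering how to put the trained model into production, where new unseen instances need a prediction: you only need to build a feature vector with the same features (same number and order) that the model was trained with and pass it to predict, which returns the predicted label. A LabeledPoint is not required, because an unseen instance has no label yet, and the model does not have to be rebuilt for each new instance.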
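A minimal sketch of that production step, in Python to match the first answer. It assumes only that the trained model exposes predict(features), as the MLlib classification models shown above do. parse_features, classify_unseen and the ThresholdModel stand-in are hypothetical names, used so the snippet runs without a cluster. Only features are passed in, since an unseen instance has no label.

```python
def parse_features(line, expected_size):
    """Turn a space-separated string such as '0.5 1.2 3.0' into a list of floats."""
    features = [float(x) for x in line.split()]
    # The model expects the same number of features it was trained with.
    if len(features) != expected_size:
        raise ValueError(
            "expected %d features, got %d" % (expected_size, len(features)))
    return features


def classify_unseen(model, line, expected_size):
    """Predict the label of one new instance with an already trained model."""
    # No retraining happens here: the model is only queried.
    return model.predict(parse_features(line, expected_size))


class ThresholdModel(object):
    """Hypothetical stand-in for a trained model, used only to run the sketch."""

    def predict(self, features):
        return 1.0 if sum(features) > 3.0 else 0.0


model = ThresholdModel()
print(classify_unseen(model, "0.5 1.2 3.0", expected_size=3))
print(classify_unseen(model, "0.1 0.2 0.3", expected_size=3))
```

With a real MLlib model, such as the one returned by NaiveBayes.train in the first answer, the call has the same shape: model.predict takes a feature vector, for example one built with Vectors.dense.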

I hope this helps

How do I use Spark classification for unseen instances?
