Spark - LinearRegressionWithSGD on Coursera Machine Learning by Stanford University samples

Date: 2015-07-23 13:45:32

Tags: scala apache-spark apache-spark-mllib

Software version: Apache Spark v1.3

Background: I have been trying to "translate" Octave/MATLAB code into Scala on Apache Spark. More precisely, I am working on exercise ex1 from the practical part of the Coursera course, using ex1data1.txt and ex1data2.txt. I have already done such a translation into Julia (it went smoothly), but now I have been struggling with Spark... without success.

Problem: my implementation on Spark performs very poorly; I cannot even say it works correctly. That is why for ex1data1.txt I added polynomial features, and I tried two ways of handling theta0: setIntercept(true), and alternatively an extra unnormalized column of 1.0 values (in that case I set the intercept to false). Either way I only get nonsensical results. So I decided to start over with ex1data2.txt. Below you can find the code and the expected result. Of course the Spark result is wrong.
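On the intercept question above: for ordinary least squares the two strategies are interchangeable, since an explicit column of 1.0 values with the library intercept disabled gives the same fit as a separately estimated intercept (using both at once double-counts the bias). A minimal NumPy sketch on made-up data, not the Coursera files, illustrating the equivalence:

```python
import numpy as np

# Made-up data (not ex1data1/2.txt): y = 3*x1 - 2*x2 + 5 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([3.0, -2.0]) + 5.0 + rng.normal(scale=0.01, size=50)

# Strategy 1: explicit bias column of 1.0 values, no separate intercept
Xb = np.hstack([np.ones((50, 1)), X])
theta1 = np.linalg.lstsq(Xb, y, rcond=None)[0]  # [intercept, w1, w2]

# Strategy 2: fit weights on centered data, recover the intercept afterwards
w = np.linalg.lstsq(X - X.mean(axis=0), y - y.mean(), rcond=None)[0]
b = y.mean() - X.mean(axis=0) @ w
theta2 = np.concatenate([[b], w])

print(np.allclose(theta1, theta2))  # both strategies agree
```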

Have you had a similar experience? I would be very grateful for your help.

Scala code for ex1data2.txt:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.{LinearRegressionModel, LinearRegressionWithSGD, LabeledPoint}
import org.apache.spark.{SparkContext, SparkConf}


object MLibOnEx1data2 extends App {
  val conf = new SparkConf()
  conf.set("spark.app.name", "coursera ex1data2.txt test")

  val sc = new SparkContext(conf)
  val input = sc.textFile("hdfs:///ex1data2.txt")

  val trainData = input.map { line =>
    val parts = line.split(',')
    val y = parts(2).toDouble
    val features = Vectors.dense(parts(0).toDouble, parts(1).toDouble)
    println(s"x = $features y = $y")
    LabeledPoint(y, features)
  }.cache()

  // Building the model
  val numIterations = 1500
  val alpha = 0.01

  // Scale the features
  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(trainData.map(x => x.features))
  val scaledTrainData = trainData.map{ td =>
    val normFeatures = scaler.transform(td.features)
    println(s"normalized features = $normFeatures")
    LabeledPoint(td.label, normFeatures)
  }.cache()

  val tsize = scaledTrainData.count()
  println(s"Training set size is $tsize")


  val alg = new LinearRegressionWithSGD().setIntercept(true)
  alg.optimizer
    .setNumIterations(numIterations)
    .setStepSize(alpha)
    .setUpdater(new SquaredL2Updater)
    .setRegParam(0.0)  //regularization - off

  val model = alg.run(scaledTrainData)

  println(s"Theta is $model")

  val total1 = model.predict(scaler.transform(Vectors.dense(1650, 3)))

  println(s"Estimate the price of a 1650 sq-ft, 3 br house = $total1 dollars") //it should give ~ $289314.620338

  // Evaluate model on training examples and compute training error
  val valuesAndPreds = scaledTrainData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }
  val MSE = ((valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()) / 2)
  println("Training Mean Squared Error = " + MSE)



  // Save and load model
  val trySaveAndLoad = util.Try(model.save(sc, "myModelPath"))
    .flatMap { _ => util.Try(LinearRegressionModel.load(sc, "myModelPath")) }
    .getOrElse(-1)

  println(s"trySaveAndLoad result is $trySaveAndLoad")
}
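For reference, the pipeline the Scala code attempts (standardize the features, fit a linear model with an intercept, predict for a standardized query point) can be sketched without Spark. The following NumPy illustration uses synthetic data, not ex1data2.txt, and solves the least-squares problem in closed form instead of running SGD; all numbers below are made up:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "house" data: price = 130*sqft + 5000*bedrooms + 20000 + noise
sqft = rng.uniform(800, 4000, size=47)
beds = rng.integers(1, 6, size=47).astype(float)
X = np.column_stack([sqft, beds])
y = 130 * sqft + 5000 * beds + 20000 + rng.normal(scale=1000, size=47)

# The StandardScaler step: withMean=true, withStd=true
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xs = (X - mu) / sigma

# Closed-form fit with an intercept via an explicit bias column
Xb = np.hstack([np.ones((len(y), 1)), Xs])
theta = np.linalg.lstsq(Xb, y, rcond=None)[0]

# Predict for a query point, scaled with the *training* mu/sigma,
# exactly as the Scala code does with scaler.transform(...)
query = (np.array([1650.0, 3.0]) - mu) / sigma
price = theta[0] + query @ theta[1:]
print(round(price))
```

The key detail this mirrors from the Scala code is that the query point must be transformed with the scaler fitted on the training set before prediction.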

The STDOUT output is:

Training set size is 47

Theta is (weights=[52090.291641275864,19342.034885388926], intercept=181295.93717032953)

Estimate the price of a 1650 sq-ft, 3 br house = 153983.5541846754 dollars

Training Mean Squared Error = 1.5876093757127676E10

trySaveAndLoad result is -1

1 Answer:

Answer 0 (score: 1)

Well, after some digging I believe nothing is actually wrong here. First I saved the contents of valuesAndPreds to a text file:

valuesAndPreds.map {
   case (x, y) => s"$x,$y"
}.repartition(1).saveAsTextFile("results.txt")

The rest of the code is written in R.

First, let's create a model using the closed-form solution:

# Load data
df <- read.csv('results.txt/ex1data2.txt', header=FALSE)
# Scale features
df[, 1:2] <- apply(df[, 1:2], 2, scale)
# Build linear model 
model <- lm(V3 ~ ., df)

For reference:

> summary(model)

Call:
lm(formula = V3 ~ ., data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-130582  -43636  -10829   43698  198147 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   340413       9637  35.323  < 2e-16 ***
V1            110631      11758   9.409 4.22e-12 ***
V2             -6650      11758  -0.566    0.575    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 66070 on 44 degrees of freedom
Multiple R-squared:  0.7329,    Adjusted R-squared:  0.7208 
F-statistic: 60.38 on 2 and 44 DF,  p-value: 2.428e-13

Now the prediction:

closedFormPrediction <- predict(model, df)
closedFormRMSE <- sqrt(mean((closedFormPrediction - df$V3)**2))
plot(
   closedFormPrediction, df$V3,
   ylab="Actual", xlab="Predicted",
   main=paste("Closed form, RMSE: ", round(closedFormRMSE, 3)))

(plot: closed form, predicted vs. actual, RMSE in the title)
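The RMSE used in the plot titles is simply the square root of the mean squared residual. For reference, a tiny Python equivalent of the R expression:

```python
import math

def rmse(predicted, actual):
    """Root-mean-squared error, matching R's sqrt(mean((pred - actual)^2))."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

print(rmse([2.0, 4.0, 6.0], [1.0, 5.0, 7.0]))  # sqrt((1 + 1 + 1) / 3) = 1.0
```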

Now we can compare the above with the SGD results:

sgd <- read.csv('results.txt/part-00000', header=FALSE)
sgdRMSE <- sqrt(mean((sgd$V2 - sgd$V1)**2))

plot(
   sgd$V2, sgd$V1, ylab="Actual",
   xlab="Predicted", main=paste("SGD, RMSE: ", round(sgdRMSE, 3)))

(plot: SGD, predicted vs. actual, RMSE in the title)

Finally, let's compare the two:

plot(
   sgd$V2, closedFormPrediction,
   xlab="SGD", ylab="Closed form", main="SGD vs Closed form")

(plot: SGD predictions vs. closed-form predictions)

So the results are clearly not perfect, but nothing seems to be completely off here.
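As a sanity check on this comparison, here is a small NumPy sketch, again on synthetic data rather than the files above, showing that plain batch gradient descent with the question's settings (step size 0.01, 1500 iterations) on standardized features lands very close to the closed-form least-squares optimum:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data loosely shaped like the R coefficients above (made up)
X = rng.normal(size=(100, 2))
y = X @ np.array([110000.0, -6000.0]) + 340000.0 + rng.normal(scale=50000.0, size=100)

# Standardize features, then add an explicit bias column
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Xb = np.hstack([np.ones((100, 1)), Xs])

# Closed-form optimum (least squares)
theta_cf = np.linalg.lstsq(Xb, y, rcond=None)[0]

# Batch gradient descent with the question's settings
theta = np.zeros(3)
for _ in range(1500):
    grad = Xb.T @ (Xb @ theta - y) / len(y)
    theta -= 0.01 * grad

print(np.allclose(theta, theta_cf, rtol=1e-3))
```

With standardized features the problem is well conditioned, so these settings converge; on raw, unscaled features the same step size either diverges or crawls, which is a common source of "stupid results" with LinearRegressionWithSGD.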