Spark MLlib linear regression model intercept is always 0.0?

Time: 2014-10-08 14:42:01

Tags: scala apache-spark apache-spark-mllib

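I've just started using ML and Apache Spark, so I've been trying out linear regression based on the Spark examples. I can't seem to produce a proper model for any data other than the sample that ships with the examples, and the intercept is always 0.0 regardless of the input data.

I've prepared a simple training data set based on the function:

y = (2 * x1) + (3 * x2) + 4

i.e. I would expect the intercept to be 4 and the weights to be (2, 3).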
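For reference, rows in the `label,x1,x2` layout that the parsing code expects could be generated along these lines. This is only a sketch: the `DummyData` object, the `rows` helper and the grid of x1/x2 values are assumptions for illustration, not the actual contents of the original data file.

```scala
object DummyData {
  // Hypothetical generator: builds "y,x1,x2" lines for y = 2*x1 + 3*x2 + 4
  // over a small integer grid. The grid itself is an assumption.
  def rows(maxX1: Int, maxX2: Int): Seq[String] =
    for {
      x1 <- 1 to maxX1
      x2 <- 1 to maxX2
    } yield {
      val y = 2.0 * x1 + 3.0 * x2 + 4.0
      s"$y,${x1.toDouble},${x2.toDouble}"
    }
}
```

Calling `DummyData.rows(2, 3)` gives six lines, starting with the (x1, x2) = (1, 1) row whose label is 9.0.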

If I run LinearRegressionWithSGD.train(...) on the raw data, the model is:

Model intercept: 0.0, weights: [NaN,NaN]

and the predictions are all NaN:

Features: [1.0,1.0], Predicted: NaN, Actual: 9.0
Features: [1.0,2.0], Predicted: NaN, Actual: 12.0
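NaN weights are the usual symptom of gradient descent diverging because the step size is too large for the scale of the features. The sketch below shows the effect with a plain batch gradient descent on a one-feature problem. It is not Spark's implementation (Spark's SGD samples mini-batches and decays the step size), and `DivergenceSketch`, `fit` and the sample values are made up for illustration.

```scala
object DivergenceSketch {
  // Plain batch gradient descent for y ≈ w * x with a constant step size.
  // Gradient of the mean squared error (up to a constant factor): mean(x * (w*x - y)).
  def fit(xs: Array[Double], ys: Array[Double], step: Double, iterations: Int): Double = {
    var w = 0.0
    for (_ <- 1 to iterations) {
      val grad = xs.zip(ys).map { case (x, y) => x * (w * x - y) }.sum / xs.length
      w -= step * grad
    }
    w
  }

  // Unscaled features with fairly large magnitudes; the true weight is 2.0.
  val xs = Array(10.0, 20.0, 30.0)
  val ys = xs.map(_ * 2.0)

  // Step far too large for these features: the weight overshoots further on
  // every iteration, overflows to infinity and ends up as NaN.
  val diverged = fit(xs, ys, step = 1.0, iterations = 1000)

  // A much smaller step converges towards the true weight.
  val converged = fit(xs, ys, step = 0.001, iterations = 1000)
}
```

Standardizing the features has a similar effect to shrinking the step: it keeps the gradient magnitudes small enough for a moderate step size to remain stable.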

If I scale the data first, I get:

Model intercept: 0.0, weights: [17.407863391511754,2.463212481736855]

Features: [1.0,1.0], Predicted: 19.871075873248607, Actual: 9.0
Features: [1.0,2.0], Predicted: 22.334288354985464, Actual: 12.0
Features: [1.0,3.0], Predicted: 24.797500836722318, Actual: 15.0

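Either I'm doing something wrong or I don't understand what the output of this model should be, so can anyone suggest where I'm going wrong here?

My code is below: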

  import org.apache.spark.mllib.feature.StandardScaler
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

  // Load and parse the dummy data (y, x1, x2) for y = (2*x1) + (3*x2) + 4
  // i.e. intercept should be 4, weights (2, 3)?
  val data = sc.textFile("data/dummydata.txt")

  // LabeledPoint is (label, [features])
  val parsedData = data.map { line =>
    val parts = line.split(',')
    val label = parts(0).toDouble
    val features = Array(parts(1), parts(2)).map(_.toDouble)
    LabeledPoint(label, Vectors.dense(features))
  }

  // Scale the features (zero mean, unit standard deviation)
  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(parsedData.map(x => x.features))
  val scaledData = parsedData
    .map(x => LabeledPoint(x.label, scaler.transform(x.features)))

  // Building the model: SGD = stochastic gradient descent
  val numIterations = 1000
  val step = 0.2
  val model = LinearRegressionWithSGD.train(scaledData, numIterations, step)

  println(s">>>> Model intercept: ${model.intercept}, weights: ${model.weights}")

  // Evaluate model on training examples
  val valuesAndPreds = scaledData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, point.features, prediction)
  }
  // Print out features, actual and predicted values...
  valuesAndPreds.take(10).foreach({case (v, f, p) => 
      println(s"Features: ${f}, Predicted: ${p}, Actual: ${v}")})

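2 answers:

Answer 0 (score: 11)

@Noah: Thanks - your advice prompted me to look at this again, and I found some example code here that lets you generate an intercept and also set other parameters, such as the number of iterations, through the optimizer.

Here's my revised model-generation code, which seems to work fine on my dummy data: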

  // Building the model: SGD = stochastic gradient descent:
  // Need to setIntercept = true, and seems only to work with scaled data 
  val numIterations = 600
  val stepSize = 0.1
  val algorithm = new LinearRegressionWithSGD()
  algorithm.setIntercept(true)
  algorithm.optimizer
    .setNumIterations(numIterations)
    .setStepSize(stepSize)

  val model = algorithm.run(scaledData)

It still seems to need scaled data rather than raw data as input, but that's fine for my purposes.
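One consequence of training on scaled data is that the fitted intercept and weights live in the scaled feature space, so they will not read as 4 and (2, 3) directly. If each feature was standardized as (x - mean) / std, the model can be mapped back to the original units as sketched below. The `Unscale` object and `toOriginal` helper are hypothetical, and the per-feature `mean` and `std` arrays are assumed to be available; how to obtain them from the fitted scaler depends on the Spark version and is not shown here.

```scala
object Unscale {
  // With x_scaled = (x - mean) / std, the scaled model
  //   y = b_s + sum_j w_s(j) * (x(j) - mean(j)) / std(j)
  // is equivalent to an original-space model with
  //   w(j) = w_s(j) / std(j)
  //   b    = b_s - sum_j w(j) * mean(j)
  def toOriginal(scaledWeights: Array[Double],
                 scaledIntercept: Double,
                 mean: Array[Double],
                 std: Array[Double]): (Array[Double], Double) = {
    val weights = scaledWeights.zip(std).map { case (w, s) => w / s }
    val intercept = scaledIntercept - weights.zip(mean).map { case (w, m) => w * m }.sum
    (weights, intercept)
  }
}
```

For example, with made-up means (2, 3) and standard deviations (0.5, 2), a scaled-space model with weights (1, 6) and intercept 17 maps back to weights (2, 3) and intercept 4.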

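Answer 1 (score: 9)

The train method you are using is a shortcut that sets the intercept to zero and does not try to find one. If you use the underlying class, you can get a nonzero intercept: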

val model = new LinearRegressionWithSGD(step, numIterations, 1.0).
    setIntercept(true).
    run(scaledData)

That should now give you an intercept.
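To see why the missing intercept matters so much once the features are mean-centered: a linear model without an intercept then produces predictions that average to zero whatever the weights are, so it cannot match labels whose mean is far from zero. A small plain-Scala illustration, with made-up numbers and a hypothetical `CenteredSketch` object:

```scala
object CenteredSketch {
  // One feature, labels from y = 3*x + 4 (illustrative values only).
  val xs = Array(1.0, 2.0, 3.0)
  val ys = xs.map(x => 3.0 * x + 4.0)

  // Mean-center the feature, as a scaler with withMean = true would.
  val meanX = xs.sum / xs.length
  val centered = xs.map(_ - meanX)

  // Average prediction of a no-intercept model w * x on the centered feature.
  def meanPrediction(w: Double): Double =
    centered.map(_ * w).sum / centered.length

  // Average label; this is the offset an intercept would have to absorb.
  val meanLabel = ys.sum / ys.length
}
```

Here `meanPrediction(w)` is zero for any `w`, while `meanLabel` is 10, so only a fitted intercept can close the gap.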