I have a Scala class intended to generalize some functionality of linear models. Specifically, the user should be able to create an instance with an array of coefficients and an array of predictor names, and the class pulls the data out of a DataFrame and produces predictions over the whole DataFrame with a simple linear model, as shown below.
I'm stuck on the last line... I want to produce a column of predicted values. I've tried a number of approaches (all but one of them commented out). The code as it stands doesn't compile because of a type mismatch:
[error] found : Array[org.apache.spark.sql.Column]
[error] required: org.apache.spark.sql.Column
[error] .withColumn("prediction", colMod(preds.map(p => data(p))))
[error] ^
...I get the same thing for the pred <- preds version... and this for the foreach version:
[error] found : Unit
[error] required: org.apache.spark.sql.Column
[error] .withColumn("prediction", colMod(preds.foreach(data(_))))
[error] ^
I've tried to work around it to no avail... any suggestions would be appreciated.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

class LinearModel(coefficients: Array[Double],
                  predictors: Array[String],
                  data: DataFrame) {

  val coefs = coefficients
  val preds = Array.concat(Array("bias"), predictors)
  require(coefs.length == preds.length)

  /**
   * predict: computes linear model predictions as the dot product of the coefficients and the
   * values (X[i] in the model matrix)
   * @param values: the values from a single row of the given variables from model matrix X
   * @param coefs: array of coefficients to be applied to each of the variables in values
   *               (the first coef is assumed to be 1 for the bias/intercept term)
   * @return: the predicted value
   */
  private def predict(values: Array[Double], coefs: Array[Double]): Unit = {
    (for ((c, v) <- coefs.zip(values)) yield c * v).sum
  }

  /**
   * colMod (udf): passes the values for each relevant value to predict()
   * @param values: an Array of the numerical values of each of the specified predictors for a
   *                given record
   */
  private val colMod = udf((values: Array[Double]) => predict(values, coefs))

  val dfPred = data
    // create the column with the prediction
    .withColumn("prediction", colMod(preds.map(p => data(p))))
    //.withColumn("prediction", colMod(for (pred <- preds) yield data(pred)))
    //.withColumn("prediction", colMod(preds.foreach(data(_))))
    // prev line should = colMod(data(pred1), data(pred2), ..., data(predn))
}
Answer (score: 1)
Here is how this can be done properly:
import org.apache.spark.sql.functions.{lit, col}
import org.apache.spark.sql.{Column, DataFrame}

def predict(coefficients: Seq[Double], predictors: Seq[String], df: DataFrame) = {
  // I assume there is no predictor for bias
  // but you can easily correct for that

  // build the prediction as a single Column: bias + c1 * p1 + c2 * p2 + ...
  val prediction: Column = predictors.zip(coefficients).map {
    case (p, c) => col(p) * lit(c)
  }.foldLeft(col("bias"))(_ + _)

  df.withColumn("prediction", prediction)
}
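Note that the prediction here is expressed as one Column (a sum of col(p) * lit(c) terms folded onto the bias column), so no UDF is needed at all.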
Example usage:
import spark.implicits._ // assumes a SparkSession named spark; implicit in spark-shell

val df = Seq((1.0, -1.0, 3.0, 5.0)).toDF("bias", "x1", "x2", "x3")
predict(Seq(2.0, 3.0), Seq("x1", "x3"), df).show
The result is:
+----+----+---+---+----------+
|bias| x1| x2| x3|prediction|
+----+----+---+---+----------+
| 1.0|-1.0|3.0|5.0| 14.0|
+----+----+---+---+----------+
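For this row the prediction is bias + 2.0 * x1 + 3.0 * x3 = 1.0 + 2.0 * (-1.0) + 3.0 * 5.0 = 14.0.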
Regarding your code, you made a few mistakes:

Array[_] is not a valid external type for an ArrayType column. The valid external representation is Seq[_], so the argument of the function you pass to udf should be Seq[Double].

The function used for udf cannot return Unit. In your case it should return Double. Combined with the previous point, the valid signature is (Seq[Double], Seq[Double]) => Double.

colMod expects a single argument of type Column, so the predictor columns have to be packed into a single array column:

import org.apache.spark.sql.functions.array

colMod(array(preds.map(col): _*))

Your code is not NULL / null safe.
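Putting those points together, here is a minimal sketch (not part of the original answer) of how the question's UDF-based class could be corrected. Names mirror the question, and, like the original, it does not handle nulls:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, udf}

class LinearModel(coefficients: Array[Double],
                  predictors: Array[String],
                  data: DataFrame) {

  val coefs = coefficients
  val preds = Array.concat(Array("bias"), predictors)
  require(coefs.length == preds.length)

  // the udf takes Seq[Double] (the valid external type for an array column)
  // and returns Double, not Unit; a local copy of the coefficients keeps the
  // closure from dragging in the whole class (which holds the DataFrame)
  private val colMod = {
    val cs = coefs.toSeq
    udf((values: Seq[Double]) => cs.zip(values).map { case (c, v) => c * v }.sum)
  }

  // pack the predictor columns into a single array column for the udf
  val dfPred = data.withColumn("prediction", colMod(array(preds.map(col): _*)))
}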