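Can anyone help me resolve the following error? I am trying to convert a DataFrame to an RDD so that it can be used to build a regression model.
Spark version: 2.0.0
Error => ClassCastException: org.apache.spark.ml.linalg.DenseVector cannot be cast to org.apache.spark.mllib.linalg.Vector
Code =>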
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.Row
import org.apache.spark.ml.feature.{Binarizer, VectorAssembler}
val binarizer2: Binarizer = new Binarizer()
.setInputCol("repay_amt").setOutputCol("label").setThreshold(20.00)
df = binarizer2.transform(df)
val assembler = new VectorAssembler()
.setInputCols(Array("tot_txns", "avg_unpaiddue", "max_unpaiddue", "sale_txn", "max_amt", "tot_sale_amt")).setOutputCol("features")
df = assembler.transform(df)
df.write.mode(SaveMode.Overwrite).parquet("lazpay_final_data.parquet")
val df2 = spark.read.parquet("lazpay_final_data.parquet/")
val df3= df2.rdd.map(r => LabeledPoint(r.getDouble(0),r.getAs("features")))
Data =>
Answer 0 (score: 3)
I ran into the same problem and wrote a function that converts the values manually:
public static Function<Row, org.apache.spark.mllib.linalg.Vector> rowToVector = new Function<Row, org.apache.spark.mllib.linalg.Vector>() {
    public org.apache.spark.mllib.linalg.Vector call(Row row) throws Exception {
        Object features = row.getAs(0);
        org.apache.spark.ml.linalg.DenseVector dense = null;
        if (features instanceof org.apache.spark.ml.linalg.DenseVector) {
            dense = (org.apache.spark.ml.linalg.DenseVector) features;
        } else if (features instanceof org.apache.spark.ml.linalg.SparseVector) {
            org.apache.spark.ml.linalg.SparseVector sparse = (org.apache.spark.ml.linalg.SparseVector) features;
            dense = sparse.toDense();
        } else {
            RuntimeException e = new RuntimeException("Cannot convert to " + features.getClass().getCanonicalName());
            LOGGER.error(e.getMessage());
            throw e;
        }
        org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(dense.toArray());
        return vec;
    }
};
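For Scala users, one possible alternative to a hand-written converter is `MLUtils.convertVectorColumnsFromML` (available since Spark 2.0), which rewrites ml vector columns of a DataFrame as mllib vectors before the RDD is built. This is only a sketch, not the method used in the answer above: the object name `ConvertColumnsSketch`, the helper `toLabeledPoints`, and the toy data are assumptions made for illustration, while the column names `label` and `features` follow the question.

```scala
import org.apache.spark.ml.linalg.{Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, named for illustration only.
object ConvertColumnsSketch {

  // Converts the ml vector column "features" to an mllib vector column,
  // then maps each row to an mllib LabeledPoint.
  def toLabeledPoints(df: DataFrame): RDD[LabeledPoint] = {
    val converted = MLUtils.convertVectorColumnsFromML(df, "features")
    converted.rdd.map { r =>
      LabeledPoint(r.getAs[Double]("label"), r.getAs[OldVector]("features"))
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("convert-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data (assumed): one dense and one sparse ml vector.
    val df = Seq(
      (1.0, MLVectors.dense(1.0, 2.0)),
      (0.0, MLVectors.sparse(2, Array(0), Array(3.0)))
    ).toDF("label", "features")

    toLabeledPoints(df).collect().foreach(println)
    spark.stop()
  }
}
```

Because the conversion happens at the DataFrame level, dense and sparse vectors are handled without any per-row type checks.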
Answer 1 (score: 3)
Since you are using Spark 2.0 or later, import org.apache.spark.ml.linalg.Vectors instead of org.apache.spark.mllib.linalg.Vectors.
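Switching the import helps most when the rest of the code also stays in the DataFrame-based `org.apache.spark.ml` API, because the mllib `LabeledPoint` still expects an mllib vector. As a rough sketch under that assumption, `org.apache.spark.ml.classification.LogisticRegression` can be fitted directly on a DataFrame with `label` and `features` columns, so no RDD conversion is needed. The object name `MlApiSketch`, the helper `fit`, the toy data, and the `setMaxIter(10)` setting are illustrative assumptions, not part of the original question.

```scala
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, named for illustration only.
object MlApiSketch {

  // Fits a logistic regression on a DataFrame that has
  // a Double "label" column and an ml vector "features" column.
  def fit(training: DataFrame): LogisticRegressionModel =
    new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(10) // assumed value, tune as needed
      .fit(training)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ml-api-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data (assumed); in the question this would be the
    // output of the Binarizer and VectorAssembler steps.
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    ).toDF("label", "features")

    val model = fit(training)
    println(model.coefficients)
    spark.stop()
  }
}
```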
Answer 2 (score: 2)
I solved this by first converting the ml SparseVector to a DenseVector and then converting that to an mllib Vector.
For example:
val denseVector = r.getAs[org.apache.spark.ml.linalg.SparseVector]("features").toDense
org.apache.spark.mllib.linalg.Vectors.fromML(denseVector)
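Note that `getAs[org.apache.spark.ml.linalg.SparseVector]` will itself fail with a ClassCastException when the column holds a DenseVector, as it does in the question. Since `Vectors.fromML` accepts any ml vector, a variant that reads the column as the general `org.apache.spark.ml.linalg.Vector` type may be safer. The following is a minimal sketch of that idea applied to the question's `df2.rdd.map` step; the object name `FromMLSketch`, the helper `toOldLabeledPoint`, and the toy data are assumptions for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, named for illustration only.
object FromMLSketch {

  // Builds an mllib LabeledPoint from a label and any ml vector
  // (dense or sparse); fromML keeps the dense/sparse representation.
  def toOldLabeledPoint(label: Double, features: MLVector): LabeledPoint =
    LabeledPoint(label, OldVectors.fromML(features))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fromML-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data (assumed): one dense and one sparse ml vector.
    val df = Seq(
      (1.0, MLVectors.dense(1.0, 2.0)),
      (0.0, MLVectors.sparse(2, Array(0), Array(3.0)))
    ).toDF("label", "features")

    // Read the column as the general ml Vector type, not a concrete subclass.
    val labeled = df.rdd.map { r =>
      toOldLabeledPoint(r.getAs[Double]("label"), r.getAs[MLVector]("features"))
    }

    labeled.collect().foreach(println)
    spark.stop()
  }
}
```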