I wrote this code using Spark 2.1:
val mycolumns = originalFile.schema.fieldNames
mycolumns.map(cname => stddevPerColumnName(df.select(cname), cname))

def stddevPerColumnName(df: DataFrame, cname: String): DataFrame =
  new StandardScaler()
    .setInputCol(cname)
    .setOutputCol("stddev")
    .setWithStd(true)
    .fit(df)
    .transform(df)
Every column is of type DoubleType, originally inferred from the CSV file. When I run the code I get this exception:
Column FirstColumn must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually DoubleType.
How can I convert the column type from Double to VectorUDT?
Answer 0 (score: 0)
You need to pass a vector to the ML model: use a VectorAssembler to pack the double values into a vector, run the ML stage, and then pull the value back out of the vector if you need a double again.
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions._

// Packs the double column into a vector column named "features"
val assembler = new VectorAssembler().setInputCols(Array("yourDoubleValue")).setOutputCol("features")
// Optional helper to apply the assembler to any Dataset
def assemble(ds: Dataset[_]): DataFrame = assembler.transform(ds)
// Pulls one element of a vector column back out as a double
val vectorToColumn = udf { (x: DenseVector, index: Int) => x(index) }
val scaler = new StandardScaler().setInputCol("features").setOutputCol("featuresScaled")
* Use DenseVector or SparseVector depending on your data.
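If a column may hold either kind, a UDF written against the generic Vector trait covers both; a minimal sketch (vectorToColumnAny is a hypothetical name, the rest follows the snippet above):

import org.apache.spark.ml.linalg.Vector
// Handles dense and sparse values alike, since both implement Vector
val vectorToColumnAny = udf { (x: Vector, index: Int) => x(index) }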
Full example:
val data = spark.read....
val data_assembled = assembler.transform(data)
val data_scaled = scaler.fit(data_assembled).transform(data_assembled)
  .withColumn("backToMyDouble", round(vectorToColumn(col("featuresScaled"), lit(0)), 2))
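For the original per-column loop, the same pattern can be folded into the question's helper; a sketch, assuming the df and originalFile variables from the question:

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Assemble the single Double column into a vector, then scale that vector
def stddevPerColumnName(df: DataFrame, cname: String): DataFrame = {
  val assembled = new VectorAssembler()
    .setInputCols(Array(cname))
    .setOutputCol(cname + "_vec")
    .transform(df)
  new StandardScaler()
    .setInputCol(cname + "_vec")
    .setOutputCol(cname + "_scaled")
    .setWithStd(true)
    .fit(assembled)
    .transform(assembled)
}

val mycolumns = originalFile.schema.fieldNames
val results = mycolumns.map(cname => stddevPerColumnName(df.select(cname), cname))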