(Array / ML Vector / MLlib Vector) RDD to ML Vector DataFrame column

Time: 2016-09-02 18:27:39

Tags: scala apache-spark apache-spark-sql apache-spark-mllib apache-spark-ml

I need to convert an RDD into a single-column DataFrame of o.a.s.ml.linalg.Vector so that I can use the ML algorithms, specifically K-Means. This is my RDD:

val parsedData = sc.textFile("/digits480x.csv").map(s => Row(org.apache.spark.mllib.linalg.Vectors.dense(s.split(',').slice(0,64).map(_.toDouble))))

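I tried what this answer suggests, with no luck. I suppose that is because you end up with an MLlib Vector, which throws a type mismatch error when the algorithm runs. Now, if I change this: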

import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}

val schema = new StructType()
  .add("features", new VectorUDT())

to this:

import org.apache.spark.ml.linalg.{Vectors, VectorUDT}

val parsedData = sc.textFile("/digits480x.csv").map(s => Row(org.apache.spark.ml.linalg.Vectors.dense(s.split(',').slice(0,64).map(_.toDouble))))

val schema = new StructType()
  .add("features", new VectorUDT())

I get an error, because the ML VectorUDT is private.
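
For what it's worth, Spark 2.x appears to expose the ML vector type publicly as `org.apache.spark.ml.linalg.SQLDataTypes.VectorType`, so a schema can be built without touching the private `VectorUDT`. A minimal sketch, assuming a local `SparkSession` and two made-up in-memory lines standing in for the CSV file (the sample values and the three-field slice are illustrative only):

```scala
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

// Assumption: a local session. In spark-shell `spark` already exists and
// getOrCreate() simply returns it.
val spark = SparkSession.builder().master("local[*]").appName("vector-schema").getOrCreate()

// Hypothetical sample rows standing in for lines of digits480x.csv.
val lines = spark.sparkContext.parallelize(Seq("0.0,1.0,2.0,7", "3.0,4.0,5.0,9"))

// Keep the first three fields (the question keeps 64) as an ML dense vector.
val rows = lines.map(s => Row(Vectors.dense(s.split(',').slice(0, 3).map(_.toDouble))))

// SQLDataTypes.VectorType is the public DataType backed by the private VectorUDT.
val schema = new StructType().add("features", SQLDataTypes.VectorType)

val vectorDf = spark.createDataFrame(rows, schema)
vectorDf.printSchema()
```

The rest of the question's code can stay as it is; only the schema's data type changes.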

I also tried converting the RDD, as arrays of doubles, to a DataFrame and then building the ML dense Vector like this:

var parsedData = sc.textFile("/home/pililo/Documents/Mi_Memoria/Codigo/Datasets/Digits/digits480x.csv").map(s => Row(s.split(',').slice(0,64).map(_.toDouble)))

parsedData: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]

val schema2 = new StructType().add("features", ArrayType(DoubleType))

schema2: org.apache.spark.sql.types.StructType = StructType(StructField(features,ArrayType(DoubleType,true),true))

val df = spark.createDataFrame(parsedData, schema2)

df: org.apache.spark.sql.DataFrame = [features: array<double>]

val df2 = df.map{ case Row(features: Array[Double]) => Row(org.apache.spark.ml.linalg.Vectors.dense(features)) }

which throws the following error, even with spark.implicits._ imported:

error: Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
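
The error arises because `map` on a DataFrame needs an implicit encoder for its result type, and none is provided for `Row`. One way around it is to stay in the DataFrame API and do the conversion in a UDF. The sketch below assumes a local session and a tiny made-up `array<double>` DataFrame; the names `arrayDf`, `toVector` and `vectorDf2` are hypothetical. Note that an array column reaches Scala code as a `Seq[Double]`, so a pattern on `Array[Double]` would most likely not match even with an encoder in scope.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session. In spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("array-to-vector").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the array<double> DataFrame from the question.
val arrayDf = Seq(Array(0.0, 1.0, 2.0), Array(3.0, 4.0, 5.0)).toDF("features")

// Array columns are passed to a UDF as Seq[Double], not Array[Double].
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))

// Replace the array column with an ML vector column of the same name.
val vectorDf2 = arrayDf.select(toVector(col("features")).alias("features"))
vectorDf2.printSchema()
```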

Any help is greatly appreciated, thanks!

1 answer:

Answer 0 (score: 2)

Off the top of my head:

  1. Use the csv source and VectorAssembler:

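    import scala.util.Try
    import org.apache.spark.ml.linalg._
    import org.apache.spark.ml.feature.VectorAssembler
    // needed for col and udf when running outside spark-shell
    import org.apache.spark.sql.functions.{col, udf}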
    
    val path: String = ???
    
    val n: Int = ???
    val m: Int = ???
    
    val raw = spark.read.csv(path)
    val featureCols = raw.columns.slice(n, m)
    
    val exprs = featureCols.map(c => col(c).cast("double"))
    val assembler = new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
    
    assembler.transform(raw.select(exprs: _*)).select($"features")
    
  2. Use the text source and a UDF:

    def parse_(n: Int, m: Int)(s: String) = Try(
      Vectors.dense(s.split(',').slice(n, m).map(_.toDouble))
    ).toOption
    
    def parse(n: Int, m: Int) = udf(parse_(n, m) _)
    
    val raw = spark.read.text(path)
    
    raw.select(parse(n, m)(col(raw.columns.head)).alias("features"))
    
  3. Use the text source and drop the wrapping Row:

    spark.read.text(path).as[String].map(parse_(n, m)).toDF("features")
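
Each variant should yield a DataFrame with an ML vector column, which is what the question needs for K-Means. A rough sketch of that last step, assuming a local session, a vector column named `features`, tiny made-up data, and `k = 2` picked only for illustration. Since `parse_` returns an `Option`, lines that fail to parse would presumably end up as nulls, hence the `na.drop()` before fitting.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// Assumption: a local session. In spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("kmeans-sketch").getOrCreate()
import spark.implicits._

// Hypothetical data: two well-separated groups of two points each.
val features = Seq(
  Tuple1(Vectors.dense(0.0, 0.1)),
  Tuple1(Vectors.dense(0.1, 0.0)),
  Tuple1(Vectors.dense(9.0, 9.1)),
  Tuple1(Vectors.dense(9.1, 9.0))
).toDF("features")

// Drop null vectors (a no-op here), then fit with an illustrative k and seed.
val cleaned = features.na.drop()
val model = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features").fit(cleaned)

// transform appends a "prediction" column holding the cluster index.
val clustered = model.transform(cleaned)
clustered.show()
```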