How to convert org.apache.spark.rdd.RDD[Array[Double]] to the Array[Double] required by Spark MLlib

Time: 2015-01-08 06:29:33

Tags: apache-spark apache-spark-mllib

I am trying to implement KMeans using Apache Spark:

val data = sc.textFile(irisDatasetString)
val parsedData = data.map(_.split(',').map(_.toDouble)).cache()

val clusters = KMeans.train(parsedData,3,numIterations = 20)

I get the following error:

error: overloaded method value train with alternatives:
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int,runs: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int)org.apache.spark.mllib.clustering.KMeansModel <and>
  (data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector],k: Int,maxIterations: Int,runs: Int,initializationMode: String)org.apache.spark.mllib.clustering.KMeansModel
 cannot be applied to (org.apache.spark.rdd.RDD[Array[Double]], Int, numIterations: Int)
       val clusters = KMeans.train(parsedData,3,numIterations = 20)

So I tried to convert the Array[Double] to a Vector, as shown here:
scala> val vectorData: Vector = Vectors.dense(parsedData)

and I received the following errors:

error: type Vector takes type parameters
   val vectorData: Vector = Vectors.dense(parsedData)
                   ^
error: overloaded method value dense with alternatives:
  (values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
  (firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
 cannot be applied to (org.apache.spark.rdd.RDD[Array[Double]])
       val vectorData: Vector = Vectors.dense(parsedData)

So I inferred that org.apache.spark.rdd.RDD[Array[Double]] is not the same as Array[Double].

How can I proceed with my data as an org.apache.spark.rdd.RDD[Array[Double]]? Or how can I convert org.apache.spark.rdd.RDD[Array[Double]] to Array[Double]?

1 answer:

Answer 0 (score: 6)

KMeans.train expects an RDD[Vector], not an RDD[Array[Double]]. As far as I can tell, all you need to do is change

val parsedData = data.map(_.split(',').map(_.toDouble)).cache()

to

val parsedData = data.map(x => Vectors.dense(x.split(',').map(_.toDouble))).cache()
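As a side note, the per-line transformation inside that map can be checked without a Spark cluster, since the split-and-parse step is plain Scala. The sketch below (the sample rows are made up, iris-like values) applies the same function to ordinary strings; in the real pipeline each resulting Array[Double] would then be wrapped with Vectors.dense from org.apache.spark.mllib.linalg:

```scala
// Exercises the same per-line logic used inside the RDD map,
// on plain Scala strings (no Spark needed).
object ParseDemo {
  // Split a CSV line and convert each field to Double --
  // exactly what data.map(_.split(',').map(_.toDouble)) does per line.
  def parseLine(line: String): Array[Double] =
    line.split(',').map(_.toDouble)

  def main(args: Array[String]): Unit = {
    // Made-up rows standing in for lines of the iris dataset file.
    val lines = Seq("5.1,3.5,1.4,0.2", "4.9,3.0,1.4,0.2")
    val parsed = lines.map(parseLine)
    println(parsed.map(_.mkString("[", ",", "]")).mkString(" "))
    // prints: [5.1,3.5,1.4,0.2] [4.9,3.0,1.4,0.2]
  }
}
```

With Spark on the classpath, `Vectors.dense(parseLine(line))` gives the `org.apache.spark.mllib.linalg.Vector` per row that KMeans.train requires.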