Merging two RDDs of different types

Date: 2017-06-15 12:47:03

Tags: scala apache-spark apache-spark-mllib

Hi everyone, I want to combine an RDD[Vector] and an RDD[Int] into a single RDD[Vector]. I use KMeans to predict the clusters, and the idea is to add to each vector its corresponding cluster. Here is what I did:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
    val data = spark.sparkContext.textFile("C:/spark/data/mllib/kmeans_data.txt")

    // Cluster the data into two classes using KMeans
    val numClusters = 2
    val numIterations = 20
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache() // RDD[Vector]
    val clusters = KMeans.train(parsedData, numClusters, numIterations)
    val resultatOfprediction = clusters.predict(parsedData) // RDD[Int]
    val finalData = parsedData.zip(resultatOfprediction)
    finalData.collect().foreach(println)

The result is:

    ([0.0,0.0,0.0],0)
    ([0.1,0.1,0.1],0)
    ([0.2,0.2,0.2],0)
    ([9.0,9.0,9.0],1)
    ([9.1,9.1,9.1],1)
    ([9.2,9.2,9.2],1)

The output I want:

    [0.0,0.0,0.0,1.0]
    [0.1,0.1,0.1,1.0]
    [0.2,0.2,0.2,1.0]
    [9.0,9.0,9.0,0.0]
    [9.1,9.1,9.1,0.0]
    [9.2,9.2,9.2,0.0]

The goal is to save the final RDD[Vector] to a txt file so I can display it in a grid, but the result I get this way is not an RDD[Vector].
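(For reference, once I do have an RDD[Vector] the save step itself is simple; a minimal sketch, where `finalVectors` is just a placeholder name for that RDD and the output path is only an example:)

    // Hypothetical: finalVectors stands for the RDD[Vector] I am trying to build,
    // with the predicted cluster appended as the last element of each vector.
    // saveAsTextFile writes each element on its own line as plain text.
    finalVectors
      .map(v => v.toArray.mkString("[", ",", "]"))
      .saveAsTextFile("C:/spark/output/clustered") // placeholder path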

1 Answer:

Answer 0 (score: 2)

To get the result you want, you need to zip the two RDDs. Here is how you can do it:

    val parsedData = spark.sparkContext.parallelize(Seq(1.0, 1.0, 1.0, 0.0, 0.0, 0.0))

    val resultatOfprediction = spark.sparkContext.parallelize(Seq(
      (0.0, 0.0, 0.0),
      (0.1, 0.1, 0.1),
      (0.2, 0.2, 0.2),
      (9.0, 9.0, 9.0),
      (9.1, 9.1, 9.1),
      (9.2, 9.2, 9.2)
    ))

    resultatOfprediction.zip(parsedData)

Since this returns a tuple, you can get the result like this:

    resultatOfprediction.zip(parsedData)
      .map(t => (t._1._1, t._1._2, t._1._3, t._2))

To handle an arbitrary number of columns dynamically, you can do it as @Rahul-Sukla suggested:

    resultatOfprediction.zip(parsedData)
      .map(t => t._1.productIterator.toList.map(_.asInstanceOf[Double]) :+ t._2)
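Applied back to the variables from your question, a minimal sketch of the full path could look like this: zip the original RDD[Vector] with the RDD[Int] of predictions, append the cluster as an extra dimension of each vector, and write the result out as text. The name `withCluster` and the output path below are just placeholders.

    import org.apache.spark.mllib.linalg.Vectors

    // parsedData: RDD[Vector] and resultatOfprediction: RDD[Int] from the question.
    // Append the predicted cluster as the last element of each vector,
    // which gives back an RDD[Vector] that can be written out as text.
    val withCluster = parsedData.zip(resultatOfprediction).map { case (v, cluster) =>
      Vectors.dense(v.toArray :+ cluster.toDouble)
    }

    // Each element prints as e.g. [0.0,0.0,0.0,0.0], one per line in the output files.
    withCluster.map(_.toString).saveAsTextFile("C:/spark/output/clustered")

This keeps everything as an RDD[Vector], so saveAsTextFile gives you one vector per line in the format you asked for.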

Hope this helps!