Hi everyone, I want to combine an RDD[Vector] and an RDD[Int] into an RDD[Vector]. I used KMeans to predict the clusters, and the idea is to append the corresponding cluster to each vector. Here is what I did:
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.clustering.KMeans

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val data = spark.sparkContext.textFile("C:/spark/data/mllib/kmeans_data.txt")

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache() // RDD[Vector]
val clusters = KMeans.train(parsedData, numClusters, numIterations)
val resultatOfprediction = clusters.predict(parsedData) // RDD[Int]
val finalData = parsedData.zip(resultatOfprediction) // RDD[(Vector, Int)]
finalData.collect().foreach(println)
The result is:
([0.0,0.0,0.0],0)
([0.1,0.1,0.1],0)
([0.2,0.2,0.2],0)
([9.0,9.0,9.0],1)
([9.1,9.1,9.1],1)
([9.2,9.2,9.2],1)
The output I want:
[0.0,0.0,0.0,1.0]
[0.1,0.1,0.1,1.0]
[0.2,0.2,0.2,1.0]
[9.0,9.0,9.0,0.0]
[9.1,9.1,9.1,0.0]
[9.2,9.2,9.2,0.0]
The goal is to save the final RDD[Vector] to a txt file so I can display it in a grid. But the result above is not an RDD[Vector].
Answer 0 (score: 2)
To get the result you want, you need to zip the two RDDs. Here is how you can do it:
// Mock data (names kept from the question, but note they are swapped here:
// parsedData holds the predicted labels, resultatOfprediction the feature tuples)
val parsedData = spark.sparkContext.parallelize(Seq(1.0, 1.0, 1.0, 0.0, 0.0, 0.0))
val resultatOfprediction = spark.sparkContext.parallelize(Seq(
  (0.0, 0.0, 0.0),
  (0.1, 0.1, 0.1),
  (0.2, 0.2, 0.2),
  (9.0, 9.0, 9.0),
  (9.1, 9.1, 9.1),
  (9.2, 9.2, 9.2)
))
resultatOfprediction.zip(parsedData)
Note that zip pairs elements by position, so both RDDs must have the same number of partitions and the same number of elements per partition. Since it returns an RDD of tuples, you can get the result like this:
resultatOfprediction.zip(parsedData)
  .map(t => (t._1._1, t._1._2, t._1._3, t._2))
To handle an arbitrary number of columns dynamically, you can do as @Rahul-Sukla suggested:

resultatOfprediction.zip(parsedData)
  .map(t => t._1.productIterator.toList.map(_.asInstanceOf[Double]) :+ t._2)
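Since the question works with MLlib Vectors rather than tuples, here is a minimal sketch of the same idea applied directly to the question's parsedData (RDD[Vector]) and resultatOfprediction (RDD[Int]); Vector.toArray and Vectors.dense are standard MLlib calls, and the output path is just a placeholder:

import org.apache.spark.mllib.linalg.Vectors

// Append each predicted cluster id as one extra coordinate of its vector
val finalVectors = parsedData.zip(resultatOfprediction)
  .map { case (v, cluster) => Vectors.dense(v.toArray :+ cluster.toDouble) }

// Dense vectors print as [0.0,0.0,0.0,0.0], matching the desired output
finalVectors.saveAsTextFile("C:/spark/output/clustered") // placeholder path

If you want the cluster in front of the coordinates instead, prepend with cluster.toDouble +: v.toArray.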
Hope this helps!