I have an RDD[(timestamp, (a, b, c))] that looks like this:
Timestamp a b c
5:00 PM 523 384 40
6:00 PM 384 60 nan
I need to convert it into the following RDD:
key values
a [523,384]
b [384,60]
c [40,nan]
What is the most efficient way to do this in Spark?
Answer 0 (score: 1)
// needed for toDF and for the encoder used by map (spark-shell imports this automatically)
import spark.implicits._

val raw = spark.sparkContext.parallelize(Seq(
  ("5:00 PM", "523", "384", "40"),
  ("6:00 PM", "384", "60", "nan")))
  .toDF("Timestamp", "a", "b", "c")
// drop the timestamp column
val data = raw.drop("Timestamp")
// for each column, collect its values to the driver and pair them with the column name
// (this runs one Spark job per column)
val newData = data.columns.map(colName =>
  (colName, data.select(colName).map(r => r.getAs[String](0)).collect())
)
// create a new DataFrame with one row per original column
val finalData = spark.sparkContext.parallelize(newData).toDF("key", "value")
finalData.show()
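Since the question starts from an RDD rather than a DataFrame, here is a rough sketch of the same transpose done directly on the RDD, without collecting each column to the driver. This is not the approach from the answer above, just one possible alternative. It assumes the values are Double (so nan becomes Double.NaN), a local SparkSession, and that row order should follow the order of the input RDD, which is why zipWithIndex is used. The names rdd, byColumn and result are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("transpose-sketch").getOrCreate()
val sc = spark.sparkContext

// Assumption: the columns hold Doubles and the missing value is Double.NaN.
val rdd = sc.parallelize(Seq(
  ("5:00 PM", (523.0, 384.0, 40.0)),
  ("6:00 PM", (384.0, 60.0, Double.NaN))
))

val byColumn = rdd
  .zipWithIndex() // attach a row index so the original row order can be restored later
  .flatMap { case ((_, (a, b, c)), idx) =>
    // emit one (columnName, (rowIndex, value)) record per column
    Seq(("a", (idx, a)), ("b", (idx, b)), ("c", (idx, c)))
  }
  .groupByKey() // gather all records of one column under its name
  .mapValues(_.toSeq.sortBy(_._1).map(_._2)) // order by row index, keep only the values

// Only three small records remain here, so collecting them is cheap.
val result = byColumn.collect().toMap
result.toSeq.sortBy(_._1).foreach { case (key, values) =>
  println(key + " " + values.mkString("[", ",", "]"))
}

spark.stop()
```

One caveat: groupByKey puts all values of a column into a single record, so each column has to fit in the memory of one executor, and with only three keys the grouped stage has little parallelism. For very tall data it may be better to keep the long (key, value) form instead of building the lists.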