Transposing an RDD in Spark

Date: 2017-05-06 15:24:01

Tags: apache-spark

I have an RDD[(timestamp, (a, b, c))] that looks like this:

Timestamp   a       b       c
5:00 PM      523    384      40
6:00 PM      384    60     nan

I need to convert it into the following RDD:

key  values
a    [523,384]
b    [384,60]
c    [40,nan]

What is the most efficient way to do this in Spark?

1 answer:

Answer 0 (score: 1)

    // Needed for toDF and for the String encoder used by map below
    import spark.implicits._

    val raw = spark.sparkContext.parallelize(Seq(
      ("5:00 PM", "523", "384", "40"),
      ("6:00 PM", "384", "60", "nan")))
      .toDF("Timestamp", "a", "b", "c")

    // Drop the timestamp column
    val data = raw.drop("Timestamp")

    // For each column, collect its values to the driver and pair them with the column name
    val newData = data.columns.map(colName =>
      (colName, data.select(colName).map(r => r.getAs[String](0)).collect())
    )

    // Create a new DataFrame with one row per original column
    val finalData = spark.sparkContext.parallelize(newData).toDF("key", "value")

    finalData.show()
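
Note that the snippet above runs one Spark job per column and collects every value to the driver, which may not scale to large inputs. As an alternative, here is a minimal sketch that stays on the RDD API, assuming the input really is an `RDD[(String, (String, String, String))]` as described in the question. The object name `TransposeRdd`, the helper `transpose`, and the local session settings are hypothetical and only there for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object TransposeRdd {
  // Hypothetical helper: turns rows of (timestamp, (a, b, c)) into (columnName, values).
  def transpose(rows: RDD[(String, (String, String, String))]): RDD[(String, Seq[String])] =
    rows
      // Tag each row with its position so row order can be restored after the shuffle.
      .zipWithIndex()
      // Emit one (columnName, (rowIndex, value)) record per cell.
      .flatMap { case ((_, (a, b, c)), idx) =>
        Seq(("a", (idx, a)), ("b", (idx, b)), ("c", (idx, c)))
      }
      // Gather all cells of a column, then sort them back into row order.
      .groupByKey()
      .mapValues(cells => cells.toSeq.sortBy(_._1).map(_._2))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("transpose-rdd")
      .master("local[*]")
      .getOrCreate()

    // Same sample data as in the question.
    val rows = spark.sparkContext.parallelize(Seq(
      ("5:00 PM", ("523", "384", "40")),
      ("6:00 PM", ("384", "60", "nan"))
    ))

    transpose(rows).sortByKey().collect().foreach { case (key, values) =>
      println(s"$key ${values.mkString("[", ",", "]")}")
    }

    spark.stop()
  }
}
```

One trade-off of this sketch: `groupByKey` places all values of a column on a single executor, so with only three columns the final stage has little parallelism. Whether this is more efficient than the DataFrame version depends on the data size.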