I have the following RDD:
| Key | Value | Date |
|-----|-------|------------|
| 1 | A | 10/30/2016 |
| 1 | B | 10/31/2016 |
| 1 | C | 11/1/2016 |
| 1 | D | 11/2/2016 |
| 2 | A | 11/2/2016 |
| 2 | B | 11/2/2016 |
| 2 | C | 11/2/2016 |
| 3 | A | 10/30/2016 |
| 3 | B | 10/31/2016 |
| 3 | C | 11/1/2016 |
| 3 | D | 11/2/2016 |
I want to transform it into the following RDD:
| Key | List |
|-----|--------------|
| 1 | (A, B, C, D) |
| 2 | (A, B, C) |
| 3 | (A, B, C, D) |
That is, (Key, List(Value)), where the list of values is sorted by the corresponding date. All keys are unique, but the values are not necessarily unique; I still want to keep every value in the list. How can I do this?
Answer 0 (score: 1)
Create a model that represents the data (you could also use tuples, but tuple-based code quickly gets ugly; it is always good to name your fields):
case class DataItem(key: Int, value: String, timeInMillis: Long)
Then parse the data (you can use joda-time's DateTimeFormat to parse the dates) and create your RDD:
val rdd = sc.parallelize(List(DataItem(1, "A", 123), DataItem(2, "B", 1234), DataItem(2, "C", 12345)))
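The answer hard-codes the timestamps above; as a minimal sketch of the parsing step (assuming joda-time is on the classpath; the "MM/dd/yyyy" pattern and the parseRow helper are illustrative, not part of the original answer):

import org.joda.time.format.DateTimeFormat

// "MM/dd/yyyy" matches dates like "10/30/2016" from the question.
val fmt = DateTimeFormat.forPattern("MM/dd/yyyy")

// Hypothetical helper: build a DataItem, converting the date string to epoch millis.
def parseRow(key: Int, value: String, date: String): DataItem =
  DataItem(key, value, fmt.parseDateTime(date).getMillis)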
Then the final step: groupBy the key and sortBy the time:
rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}
Scala REPL
scala> case class DataItem(key: Int, value: String, timeInMillis: Long)
defined class DataItem
scala> sc.parallelize(List(DataItem(1, "A", 123), DataItem(2, "B", 1234), DataItem(2, "C", 12345)))
res10: org.apache.spark.rdd.RDD[DataItem] = ParallelCollectionRDD[12] at parallelize at <console>:36
scala> val rdd = sc.parallelize(List(DataItem(1, "A", 123), DataItem(2, "B", 1234), DataItem(2, "C", 12345)))
rdd: org.apache.spark.rdd.RDD[DataItem] = ParallelCollectionRDD[13] at parallelize at <console>:35
scala> rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}
res11: org.apache.spark.rdd.RDD[(Int, List[DataItem])] = MapPartitionsRDD[16] at map at <console>:38
scala> rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}.foreach(println)
(1,List(DataItem(1,A,123)))
(2,List(DataItem(2,B,1234), DataItem(2,C,12345)))
scala> rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}.map { case (k, v) => (k, v.map(_.value)) }.foreach(println)
(1,List(A))
(2,List(B, C))
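Putting it all together on the question's actual sample data, here is a sketch (assuming the DataItem case class and the parseRow helper from above) of the full pipeline and its expected output:

val data = sc.parallelize(List(
  parseRow(1, "A", "10/30/2016"), parseRow(1, "B", "10/31/2016"),
  parseRow(1, "C", "11/1/2016"),  parseRow(1, "D", "11/2/2016"),
  parseRow(2, "A", "11/2/2016"),  parseRow(2, "B", "11/2/2016"),
  parseRow(2, "C", "11/2/2016"),
  parseRow(3, "A", "10/30/2016"), parseRow(3, "B", "10/31/2016"),
  parseRow(3, "C", "11/1/2016"),  parseRow(3, "D", "11/2/2016")))

data.groupBy(_.key)
  .map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis).map(_.value) }
  .collect()          // collect only to print; on real data keep it as an RDD
  .sortBy(_._1)
  .foreach(println)
// (1,List(A, B, C, D))
// (2,List(A, B, C))
// (3,List(A, B, C, D))
// Note: key 2's values all share the same date; they keep their input order
// (A, B, C) because Scala's sortBy is stable.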