Reduce by key with an ordered list of values

Asked: 2016-11-03 21:33:50

Tags: scala apache-spark

I have the following RDD:

| Key | Value | Date       |
|-----|-------|------------|
| 1   | A     | 10/30/2016 |
| 1   | B     | 10/31/2016 |
| 1   | C     | 11/1/2016  |
| 1   | D     | 11/2/2016  |
| 2   | A     | 11/2/2016  |
| 2   | B     | 11/2/2016  |
| 2   | C     | 11/2/2016  |
| 3   | A     | 10/30/2016 |
| 3   | B     | 10/31/2016 |
| 3   | C     | 11/1/2016  |
| 3   | D     | 11/2/2016  |

I want to transform it into the following RDD:

| Key | List         |
|-----|--------------|
| 1   | (A, B, C, D) |
| 2   | (A, B, C)    |
| 3   | (A, B, C, D) |

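That is, Key, List(Value), where the list of values is ordered by the corresponding date. All keys in the result are unique, but the values need not be; I still want every value listed, duplicates included. How can I accomplish this?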

1 answer:

Answer 0: (score: 1)

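Create a model that represents the data. (You could also use tuples, but tuple-based code gets ugly quickly; giving the fields names is always a good idea.)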

case class DataItem(key: Int, value: String, timeInMillis: Long)

Then parse the data (you can use Joda's DateTimeFormat to parse each date into a DateTime) and create your RDD:

val rdd = sc.parallelize(List(DataItem(1, "A", 123), DataItem(2, "B", 1234), DataItem(2, "C", 12345)))
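For the parsing step, here is a minimal sketch that turns the date strings from the question into epoch milliseconds. It uses `java.time` from the standard library rather than Joda, which is a swapped-in choice and not what the answer names. The object name `DateParsing`, the helper `toMillis`, the `M/d/yyyy` pattern and the decision to read each date as midnight UTC are all assumptions made for illustration.

```scala
import java.time.{LocalDate, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of the original answer.
object DateParsing {
  // "M/d/yyyy" accepts one- or two-digit months and days,
  // so both "10/30/2016" and "11/1/2016" parse.
  private val fmt = DateTimeFormatter.ofPattern("M/d/yyyy")

  // Epoch milliseconds at the start of the given day.
  // Assumption: the date is interpreted in UTC.
  def toMillis(date: String): Long =
    LocalDate.parse(date, fmt).atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli
}
```

With this in scope, a row could be built as `DataItem(1, "A", DateParsing.toMillis("10/30/2016"))`. Any monotonic encoding of the date works for sorting; milliseconds just match the `timeInMillis` field above.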

The final step is to groupBy the key and sortBy the time:

rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}
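To check the group-then-sort logic against the rows in the question without starting Spark, the same `groupBy` / `sortBy` calls can be tried on a plain Scala collection. This is only a sketch: the object name `OrderedValuesSketch`, the helper `orderedValues` and the sample data are hypothetical, and the `Long` field holds a stand-in day number (1 for 10/30/2016 up to 4 for 11/2/2016) instead of real milliseconds.

```scala
// Plain-collections sketch of the grouping logic; no Spark required.
object OrderedValuesSketch {
  final case class DataItem(key: Int, value: String, timeInMillis: Long)

  // Group by key, sort each group by time, then keep only the values.
  def orderedValues(items: Seq[DataItem]): Map[Int, List[String]] =
    items.groupBy(_.key).map { case (k, v) =>
      k -> v.toList.sortBy(_.timeInMillis).map(_.value)
    }

  // Rows from the question, with key 1 deliberately shuffled.
  // Assumption: the third field is a day index standing in for the date.
  val sample: Seq[DataItem] = Seq(
    DataItem(1, "C", 3), DataItem(1, "A", 1), DataItem(1, "D", 4), DataItem(1, "B", 2),
    DataItem(2, "A", 4), DataItem(2, "B", 4), DataItem(2, "C", 4),
    DataItem(3, "A", 1), DataItem(3, "B", 2), DataItem(3, "C", 3), DataItem(3, "D", 4)
  )
}
```

Note that key 2 has three rows with the same date. On a local collection `sortBy` is stable, so those rows keep their input order. On an RDD, Spark does not guarantee the order of elements within a group, so ties can come out in any order; if that matters, sort by a compound key such as `(timeInMillis, value)`.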

Scala REPL

scala> case class DataItem(key: Int, value: String, timeInMillis: Long)
defined class DataItem

scala> sc.parallelize(List(DataItem(1, "A", 123), DataItem(2, "B", 1234), DataItem(2, "C", 12345)))
res10: org.apache.spark.rdd.RDD[DataItem] = ParallelCollectionRDD[12] at parallelize at <console>:36

scala> val rdd = sc.parallelize(List(DataItem(1, "A", 123), DataItem(2, "B", 1234), DataItem(2, "C", 12345)))
rdd: org.apache.spark.rdd.RDD[DataItem] = ParallelCollectionRDD[13] at parallelize at <console>:35

scala> rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}
res11: org.apache.spark.rdd.RDD[(Int, List[DataItem])] = MapPartitionsRDD[16] at map at <console>:38

scala> rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}.foreach(println)
(1,List(DataItem(1,A,123)))
(2,List(DataItem(2,B,1234), DataItem(2,C,12345)))

scala> rdd.groupBy(_.key).map { case (k, v) => k -> v.toList.sortBy(_.timeInMillis)}.map { case (k, v) => (k, v.map(_.value)) }.foreach(println)
(1,List(A))
(2,List(B, C))
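The two `map` steps in the last REPL line can also be folded into a single pipeline. The following is one possible sketch, not the answer's own code: it keys the RDD explicitly, uses `groupByKey` and `mapValues`, and sorts `(time, value)` pairs so that rows with equal timestamps come out in a deterministic order. The object name `OrderedValuesSpark`, the app name and the `local[*]` master are assumptions, and Spark must be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical standalone variant of the REPL session above.
object OrderedValuesSpark {
  final case class DataItem(key: Int, value: String, timeInMillis: Long)

  // Key each row, group, then sort each group by (time, value) and keep the values.
  def orderedValues(rdd: RDD[DataItem]): RDD[(Int, List[String])] =
    rdd
      .map(d => d.key -> (d.timeInMillis, d.value))
      .groupByKey()
      .mapValues(_.toList.sorted.map(_._2))

  def main(args: Array[String]): Unit = {
    // Assumption: run locally using all available cores.
    val conf = new SparkConf().setAppName("ordered-values").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(
        List(DataItem(1, "A", 123), DataItem(2, "B", 1234), DataItem(2, "C", 12345)))
      // collect() brings the results to the driver before printing.
      orderedValues(rdd).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Like `groupBy`, `groupByKey` pulls every value for a key into memory on one executor, so this suits keys with a modest number of rows each.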