How to get the earliest date from an RDD[(String, List[java.sql.Date])] in Scala

Date: 2018-09-30 02:41:56

Tags: scala apache-spark rdd

I have the RDD below, t1RDD2, showing only the first five rows:


(000471242-01,CompactBuffer(2012-05-07, 2006-11-15, 2014-10-08, 2010-05-20))
(996006688-01,CompactBuffer(2011-01-18, 2005-08-19, 2008-08-27, 2014-09-05, 2006-06-26, 2012-05-10, 2013-11-22, 2005-10-14, 2007-03-26, 2007-05-17, 2010-05-19, 2008-07-11, 2009-03-09))
(788000995-01,CompactBuffer(2006-01-06, 2013-05-01))
(525570000-01,CompactBuffer(2009-07-06, 2010-06-10, 2013-01-22, 2005-03-09, 2008-06-09, 2008-11-07))
(418500000-01,CompactBuffer(2007-07-09, 2011-02-16, 2012-10-16, 2005-10-18, 2009-05-11, 2008-01-22, 2014-07-08, 2010-01-04, 2009-03-23, 2013-08-16))

I am trying to get the earliest date from each buffer, but my code produces an error.

Code:

val t1RDD = t1RDD2.reduceByKey((date1, date2) => if (date1.before(date2)) date1 else date2)

Any suggestions?

1 answer:

Answer 0 (score: 1)

Apparently, your t1RDD2 is the result of applying groupByKey to a PairRDD, like the following (with simplified sample data):

import java.sql.Date

val rdd = sc.parallelize(Seq(
  ("000471242-01", Date.valueOf("2012-05-07")),
  ("000471242-01", Date.valueOf("2006-11-15")),
  ("996006688-01", Date.valueOf("2011-01-18")),
  ("996006688-01", Date.valueOf("2005-08-19")),
  ("996006688-01", Date.valueOf("2008-08-27"))
))

val t1RDD2 = rdd.groupByKey
// t1RDD2: org.apache.spark.rdd.RDD[(String, Iterable[java.sql.Date])] = ...

t1RDD2.collect
// res1: Array[(String, Iterable[java.sql.Date])] = Array(
//   (996006688-01,CompactBuffer(2011-01-18, 2005-08-19, 2008-08-27)),
//   (000471242-01,CompactBuffer(2012-05-07, 2006-11-15))
// )

To get the earliest date per key from t1RDD2, map over the key-value pairs and reduce each value collection to its minimum:

t1RDD2.map{ case (k, v) => ( k, v.reduce((min, d) => if (min.before(d)) min else d) ) }.
  collect
// res2: Array[(String, java.sql.Date)] = Array((996006688-01,2005-08-19), (000471242-01,2006-11-15))

However, where applicable, it would be better to apply reduceByKey directly to the RDD before grouping:

rdd.reduceByKey( (min, d) => if (min.before(d)) min else d ).
  collect
// res3: Array[(String, java.sql.Date)] = Array((996006688-01,2005-08-19), (000471242-01,2006-11-15))
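As a side note, the comparator passed to reduceByKey above is just a binary "earlier of two dates" over java.sql.Date, so the logic can be exercised without a Spark cluster. A minimal Spark-free sketch (plain Scala collections, with groupBy standing in for groupByKey; the names rows, earlier, and earliest are illustrative, not from the original post):

```scala
import java.sql.Date

// Sample rows mirroring the answer's simplified data.
val rows = Seq(
  ("000471242-01", Date.valueOf("2012-05-07")),
  ("000471242-01", Date.valueOf("2006-11-15")),
  ("996006688-01", Date.valueOf("2011-01-18")),
  ("996006688-01", Date.valueOf("2005-08-19")),
  ("996006688-01", Date.valueOf("2008-08-27"))
)

// The binary "min" used by reduceByKey: keep the earlier of two dates.
val earlier: (Date, Date) => Date =
  (a, b) => if (a.before(b)) a else b

// Local equivalent of rdd.reduceByKey(earlier).
val earliest: Map[String, Date] =
  rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(earlier) }

println(earliest)
```

The same per-key minimum could also be written with minBy (e.g. vs.map(_._2).minBy(_.getTime)); reduce with before is used here only to match the comparator shown in the answer.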