ReduceByKey function - Spark Python

Time: 2017-03-19 18:53:16

Tags: python apache-spark apache-spark-sql spark-dataframe

I have an RDD:

[(25995522, '2013-03-04 21:55:42.000000'),
 (25995522, '2013-03-15 03:51:30.000000'),
 (25995522, '2013-03-07 01:47:45.000000'),
 (52198733, '2013-03-17 20:54:41.000000'),
 (52198733, '2013-03-11 02:56:47.000000'),
 (52198733, '2013-03-13 10:00:04.000000'),
 (52198733, '2013-03-13 23:29:26.000000'),
 (52198733, '2013-03-04 21:44:58.000000'),
 (53967034, '2013-03-13 17:55:40.000000'),
 (53967034, '2013-03-14 04:03:55.000000')]

I want to reduce it so that each key keeps only its earliest date. The output should be:

[(25995522, '2013-03-04 21:55:42.000000'),
 (52198733, '2013-03-04 21:44:58.000000'),
 (53967034, '2013-03-13 17:55:40.000000')]

How can I reduce by date here, rather than with ".reduceByKey(add)"? Thanks in advance.

1 answer:

Answer 0 (score: 0)

import datetime

# Parse each timestamp string, then keep the earliest datetime per key.
res = rdd.mapValues(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f')) \
         .reduceByKey(lambda x, y: min(x, y))
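For completeness, a minimal self-contained sketch of that approach (the local SparkContext setup and the variable names data and res are illustrative, not from the original post; the sample rows are taken from the question):

import datetime
from pyspark import SparkContext

sc = SparkContext("local", "reduce-by-min-date")

data = [(25995522, '2013-03-04 21:55:42.000000'),
        (25995522, '2013-03-15 03:51:30.000000'),
        (52198733, '2013-03-17 20:54:41.000000'),
        (52198733, '2013-03-04 21:44:58.000000'),
        (53967034, '2013-03-13 17:55:40.000000')]

rdd = sc.parallelize(data)

# Convert the value to a datetime, then reduce each key to its minimum.
res = (rdd
       .mapValues(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
       .reduceByKey(lambda x, y: min(x, y)))

print(res.collect())
# [(25995522, datetime.datetime(2013, 3, 4, 21, 55, 42)), ...]

Note that the reduced values come back as datetime objects rather than the original strings; format them with strftime('%Y-%m-%d %H:%M:%S.%f') if you need the question's string form.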

Alternatively, you could use rdd.groupByKey, but it will not give better performance; see the sketch below.
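For comparison, a sketch of the groupByKey variant referred to above (assuming the same rdd as in the example; res2 is an illustrative name). It shuffles every value for a key before taking the minimum, which is why it is no faster than reduceByKey:

res2 = (rdd
        .mapValues(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))
        .groupByKey()      # gathers all values per key
        .mapValues(min))   # then picks the earliest datetime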