Question

I want to run a simple MapReduce-style job in Spark. Here is the code:

val testRDD = someRDD.map((_, 1)).reduceByKey(_+_)

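In the map stage the value is an Int. What happens if, in the reduce stage, the sum grows too large and exceeds the Int range? I could do this instead: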

val testRDD = someRDD.map((_, 1.toLong)).reduceByKey(_+_)

But is there a better idea?

Answer 1

There is nothing Spark-specific here. The sum simply overflows, just as Int addition does anywhere on the JVM:

sc.parallelize(Seq(("a", Integer.MAX_VALUE), ("a", 1))).reduceByKey(_ + _).first

// (String, Int) = (a,-2147483648)
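
The same wraparound can be reproduced without a cluster. This is only a minimal sketch, assuming Scala 2.13+ (for `groupMapReduce`) and using a local `Seq` as a stand-in for the RDD; the names `pairs` and `summed` are made up for illustration:

```scala
// A plain Scala collection stands in for the RDD here (an assumption for
// illustration); groupMapReduce plays the role of reduceByKey.
val pairs = Seq(("a", Int.MaxValue), ("a", 1))

// Int addition on the JVM wraps around silently on overflow.
val summed: Map[String, Int] = pairs.groupMapReduce(_._1)(_._2)(_ + _)

println(summed("a")) // -2147483648, i.e. Int.MinValue
```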

If you suspect that an overflow may occur, you should definitely use a more appropriate data type. Long is a good choice for integral values:

sc.parallelize(Seq(
  ("a", Integer.MAX_VALUE.toLong), ("a", 1L)
)).reduceByKey(_ + _).first

// (String, Long) = (a,2147483648)
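
Long has a limit too, so if a silent wraparound is unacceptable, one option is to make the overflow fail loudly. The sketch below is not part of the original answer: it assumes Scala 2.13+, uses a local `Seq` instead of an RDD, and relies on `java.lang.Math.addExact`, which throws an `ArithmeticException` instead of wrapping. The names `pairs`, `asLong` and `checked` are hypothetical.

```scala
import scala.util.Try

// Local stand-in for the RDD (an assumption for illustration).
val pairs = Seq(("a", Int.MaxValue), ("a", 1))

// Widening to Long before summing avoids the Int wraparound.
val asLong: Map[String, Long] =
  pairs.groupMapReduce(_._1)(_._2.toLong)(_ + _)
println(asLong("a")) // 2147483648

// Math.addExact throws ArithmeticException on Int overflow instead of
// wrapping, so the problem is visible rather than silent.
val checked = Try(pairs.map(_._2).reduce((x, y) => Math.addExact(x, y)))
println(checked.isFailure) // true
```

In Spark the same function could presumably be passed to `reduceByKey`, in which case an overflow would surface as a failed task rather than a wrong number. Whether that is preferable to simply using Long depends on the job.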