How to find the maximum and minimum at the same time using aggregateByKey in Spark?

Time: 2020-02-21 07:16:54

Tags: scala apache-spark

I tried to find them using this code, but I get an error:

val keysWithValuesList = Array("1=2000", "2=1800", "2=3000", "3=2500", "4=1500")
val data = sc.parallelize(keysWithValuesList,2)
val kv = data.map(_.split("=")).map(v => (1, v(1).toInt))
val initialCount = kv.first._2
val maxi = (x: Int, y: Int) => if (x>y) x else y 
val mini = (x: Int, y: Int) => if (x>y) y else x 
val maxP = (p1: Int, p2: Int) => if (p1>p2) p1 else p2
val minP = (p1: Int, p2: Int) => if (p1>p2) p2 else p1
val max_min = kv.aggregateByKey(initialCount)((maxi,mini),(maxP,minP))

The error is:

command-2654386024166474:13: error: type mismatch;
 found   : ((Int, Int) => Int, (Int, Int) => Int)
 required: (Int, Int) => Int
val max_min = kv.aggregateByKey(initialCount)((maxi,mini),(maxP,minP))
                                              ^
command-2654386024166474:13: error: type mismatch;
 found   : ((Int, Int) => Int, (Int, Int) => Int)
 required: (Int, Int) => Int
val max_min = kv.aggregateByKey(initialCount)((maxi,mini),(maxP,minP))

Is there any other way to do this? Please suggest.

2 answers:

Answer 0 (score: 0)

The error occurs because aggregateByKey expects a single seqOp function and a single combOp function, not a tuple of functions. You can still perform both reductions in a single pass, but you will need to use tuples as the values. First reformat your RDD so the value is duplicated:

// duplicate each value so the pair can track (min, max) independently
val rddMinMax = kv.map(x => (x._1, (x._2, x._2)))

Then reduce with a single function that applies both operations to each pair:

// merge two (min, max) pairs: take the min of the firsts and the max of the seconds
val minAndMax = (l1: (Int, Int), l2: (Int, Int)) => (mini(l1._1, l2._1), maxi(l1._2, l2._2))
rddMinMax.reduceByKey(minAndMax).collect()
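
For reference, here is the whole approach as one self-contained sketch (the sample data and the mini/maxi helpers are taken from the question; the expected result is worked out from that sample data):

val keysWithValuesList = Array("1=2000", "2=1800", "2=3000", "3=2500", "4=1500")
val data = sc.parallelize(keysWithValuesList, 2)
// every record gets key 1, as in the question, so this computes one global (min, max)
val kv = data.map(_.split("=")).map(v => (1, v(1).toInt))
val mini = (x: Int, y: Int) => if (x > y) y else x
val maxi = (x: Int, y: Int) => if (x > y) x else y
// duplicate each value into a (min, max) seed pair, then merge pairs component-wise
val rddMinMax = kv.map(x => (x._1, (x._2, x._2)))
val minAndMax = (l1: (Int, Int), l2: (Int, Int)) => (mini(l1._1, l2._1), maxi(l1._2, l2._2))
rddMinMax.reduceByKey(minAndMax).collect()
// expected: Array((1,(1500,3000)))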

Answer 1 (score: 0)

I found a solution:

val list = Array("1=2000", "2=1800", "2=500", "3=2500", "4=4500")
val data = sc.parallelize(list, 6)
// Create key value pairs (every record gets key 1, so this aggregates globally)
val kv = data.map(_.split("=")).map(v => (1, v(1).toInt))
// Zero value: seed the (max, min) accumulator with the first value in the data
val initialCount = (kv.first._2, kv.first._2)
// seqOp: fold one new value into the (max, min) accumulator
val min_max = (x: (Int, Int), y: Int) => (if (x._1 > y) x._1 else y, if (x._2 > y) y else x._2)
// combOp: merge two (max, min) accumulators from different partitions
val min_maxP = (p1: (Int, Int), p2: (Int, Int)) => (if (p1._1 > p2._1) p1._1 else p2._1, if (p1._2 > p2._2) p2._2 else p1._2)
val minimum = kv.aggregateByKey(initialCount)(min_max, min_maxP)
minimum.first._2

The output is:

list: Array[String] = Array(1=2000, 2=1800, 2=500, 3=2500, 4=4500)
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[164] at parallelize at command-110260081440638:2
kv: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[166] at map at command-110260081440638:4
initialCount: (Int, Int) = (2000,2000)
min_max: ((Int, Int), Int) => (Int, Int) = <function2>
min_maxP: ((Int, Int), (Int, Int)) => (Int, Int) = <function2>
minimum: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = ShuffledRDD[167] at aggregateByKey at command-110260081440638:8
res29: (Int, Int) = (4500,500)
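
As a side note, the zero value above is seeded with kv.first, which runs an extra Spark job and is only safe because that value actually occurs in the data. A variant of the same aggregation (my own sketch, not part of the original answer) avoids this by using neutral seeds:

// neutral (max, min) seeds, so no extra job is needed to fetch the first element
val zero = (Int.MinValue, Int.MaxValue)
val maxAndMin = kv.aggregateByKey(zero)(
  (acc, v) => (math.max(acc._1, v), math.min(acc._2, v)),  // seqOp: fold a value in
  (a, b) => (math.max(a._1, b._1), math.min(a._2, b._2))   // combOp: merge partitions
)
maxAndMin.first._2  // expected: (4500,500) for the sample data above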