I tried to work this out with the following code, but I get an error:
val keysWithValuesList = Array("1=2000", "2=1800", "2=3000", "3=2500", "4=1500")
val data = sc.parallelize(keysWithValuesList,2)
val kv = data.map(_.split("=")).map(v => (1, v(1).toInt))
val initialCount = kv.first._2
val maxi = (x: Int, y: Int) => if (x>y) x else y
val mini = (x: Int, y: Int) => if (x>y) y else x
val maxP = (p1: Int, p2: Int) => if (p1>p2) p1 else p2
val minP = (p1: Int, p2: Int) => if (p1>p2) p2 else p1
val max_min = kv.aggregateByKey(initialCount)((maxi,mini),(maxP,minP))
The error is:
command-2654386024166474:13: error: type mismatch;
 found   : ((Int, Int) => Int, (Int, Int) => Int)
 required: (Int, Int) => Int
val max_min = kv.aggregateByKey(initialCount)((maxi,mini),(maxP,minP))
                                              ^
command-2654386024166474:13: error: type mismatch;
 found   : ((Int, Int) => Int, (Int, Int) => Int)
 required: (Int, Int) => Int
val max_min = kv.aggregateByKey(initialCount)((maxi,mini),(maxP,minP))
                                                          ^
Is there another way to do this? Any suggestions are welcome.
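For context on the error: `aggregateByKey` expects two single functions, a `seqOp: (U, V) => U` and a `combOp: (U, U) => U`, not a tuple of two functions, which is exactly what the type-mismatch message says. To track min and max at once, the accumulator type `U` itself has to be a tuple. A minimal sketch of that accumulator logic, simulated with plain Scala collections so it runs without a SparkContext:

```scala
// Sketch: the (min, max) tuple accumulator that aggregateByKey needs,
// simulated with foldLeft over two "partitions" of a plain List.
val values = List(2000, 1800, 3000, 2500, 1500)

// seqOp: fold one value into the (min, max) accumulator
val seqOp = (acc: (Int, Int), v: Int) =>
  (math.min(acc._1, v), math.max(acc._2, v))

// combOp: merge two partition-level accumulators
val combOp = (a: (Int, Int), b: (Int, Int)) =>
  (math.min(a._1, b._1), math.max(a._2, b._2))

// Aggregate each "partition" separately, then merge, as Spark would
val (p1, p2) = values.splitAt(2)
val zero = (Int.MaxValue, Int.MinValue)
val result = combOp(p1.foldLeft(zero)(seqOp), p2.foldLeft(zero)(seqOp))
println(result) // (1500,3000)
```

The split into `p1` and `p2` only mimics Spark's two partitions; the zero value `(Int.MaxValue, Int.MinValue)` is a neutral element, an assumption not taken from the question.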
Answer 0 (score: 0)
You can perform both reductions in a single pass, but you will need to use tuples. First reshape your RDD so that the value is duplicated:
val rddMinMax = kv.map(x => (x._1, (x._2, x._2)))
Then reduce each pair with this function, which applies both operations at once:
val minAndMax = (l1: (Int, Int), l2: (Int, Int)) => (mini(l1._1, l2._1), maxi(l1._2, l2._2))
rddMinMax.reduceByKey(minAndMax).collect()
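This answer's snippet reuses `mini` and `maxi` from the question. A self-contained version of the same idea, with `groupBy` plus `reduce` on a plain Scala list standing in for `reduceByKey` (an illustration only, not Spark's actual shuffle), and with the keys taken from the strings rather than the constant key `1` used in the question so that per-key results are visible:

```scala
// Self-contained sketch of the tuple-based reduce, no SparkContext needed.
val pairs = List("1=2000", "2=1800", "2=3000", "3=2500", "4=1500")
  .map(_.split("="))
  .map(a => (a(0).toInt, a(1).toInt))

// Duplicate the value so one reduce can carry (min, max) together
val minMaxPairs = pairs.map { case (k, v) => (k, (v, v)) }

val minAndMax = (l1: (Int, Int), l2: (Int, Int)) =>
  (math.min(l1._1, l2._1), math.max(l1._2, l2._2))

// groupBy + reduce stands in for reduceByKey
val perKey = minMaxPairs.groupBy(_._1)
  .map { case (k, vs) => (k, vs.map(_._2).reduce(minAndMax)) }

println(perKey(2)) // (1800,3000) -- key 2 has values 1800 and 3000
```

The design point is that `reduceByKey` needs an associative, commutative function of type `(U, U) => U`, which is why the value is first duplicated into a `(v, v)` tuple.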
Answer 1 (score: 0)
I found a solution:
val list = Array("1=2000", "2=1800", "2=500", "3=2500", "4=4500")
val data = sc.parallelize(list,6)
// Create key-value pairs (a single constant key, so min/max is taken over all values)
val kv = data.map(_.split("=")).map(v => (1, v(1).toInt))
val initialCount = (kv.first._2, kv.first._2)
// seqOp: note the accumulator tuple is actually (max, min), as the final output shows
val min_max = (x: (Int, Int), y: Int) => (if (x._1 > y) x._1 else y, if (x._2 > y) y else x._2)
// combOp: merges two (max, min) accumulators
val min_maxP = (p1: (Int, Int), p2: (Int, Int)) => (if (p1._1 > p2._1) p1._1 else p2._1, if (p1._2 > p2._2) p2._2 else p1._2)
val minimum = kv.aggregateByKey(initialCount)(min_max,min_maxP)
minimum.first._2
The output is:
list: Array[String] = Array(1=2000, 2=1800, 2=500, 3=2500, 4=4500)
data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[164] at parallelize at command-110260081440638:2
kv: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[166] at map at command-110260081440638:4
initialCount: (Int, Int) = (2000,2000)
min_max: ((Int, Int), Int) => (Int, Int) = <function2>
min_maxP: ((Int, Int), (Int, Int)) => (Int, Int) = <function2>
minimum: org.apache.spark.rdd.RDD[(Int, (Int, Int))] = ShuffledRDD[167] at aggregateByKey at command-110260081440638:8
res29: (Int, Int) = (4500,500)
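One variant worth noting (an assumption on my part, not from the answers above): seeding the accumulator with the neutral elements `(Int.MinValue, Int.MaxValue)` instead of `kv.first` avoids triggering an extra Spark action just to build the zero value. The accumulator logic, simulated with a plain `foldLeft` over the same data as this answer:

```scala
// Sketch: neutral zero value instead of kv.first, keeping this answer's
// (max, min) tuple order; foldLeft stands in for the per-partition seqOp.
val values = List(2000, 1800, 500, 2500, 4500)

val zero = (Int.MinValue, Int.MaxValue) // (runningMax, runningMin)
val maxMin = values.foldLeft(zero) { case ((mx, mn), v) =>
  (math.max(mx, v), math.min(mn, v))
}
println(maxMin) // (4500,500), matching res29 above
```

With a neutral zero the result is correct even if the first element happens to sit in a different partition, which the `kv.first`-based seed silently relies on not mattering.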