I am writing a function for mapPartitions to compute the maximum and minimum value of each partition. I wrote the function in PySpark, but I have not been able to port it to Scala. I apply the function twice and want to zip the two results. This is the error I get:
result.zip(res)
type mismatch;
[error] found : org.apache.spark.rdd.RDD[(Int, Int)]
[error] required: scala.collection.GenIterable[?]
Here is the function in Python:
def minmaxInt(iterator):
    firsttime = 0
    min = 0
    max = 0
    for x in iterator:
        if (x != '' and x != 'NULL' and x is not None):
            y = int(x)
            if (firsttime == 0):
                min = y
                max = y
                firsttime = 1
            else:
                if y > max:
                    max = y
                if y < min:
                    min = y
    return (min, max)
Here is my code in Scala:
def minmaxInt(iterator: Iterator[String]): Iterator[(Int, Int)] = {
  var firsttime = 0
  var min = 0
  var max = 0
  var res = List[(Int, Int)]()
  for (x <- iterator) {
    if (x != "" && x != null) {
      var y = x.toInt
      if (firsttime == 0) {
        min = y
        max = y
        firsttime = 1
      } else {
        if (y > max) {
          max = y
        }
        if (y < min) {
          min = y
        }
      }
    }
  }
  res = (min, max) :: res
  res.iterator
}
Thanks in advance.
Update:
Thanks for the quick reply! The code is great, but I still have the zip problem. I run this latest code with rdd.mapPartitions twice and then do the zip:
[error] found : org.apache.spark.rdd.RDD[(Int, Int)]
[error] required: scala.collection.GenIterable[?]
[error] result.zip(res)
Answer 0 (score: 0):
Here is minMaxInt:
def minMaxInt(iterator: Iterator[String]): Iterator[(Int, Int)] = {
  // drop nulls and empty strings, parse the rest, and pair each value with itself
  val tuple = iterator
    .filter(_ != null).filter(!_.isEmpty)
    .map(_.toInt).map(i => (i, i))
    // combine the pairs, keeping the running minimum and maximum
    .reduce[(Int, Int)] { case ((min, max), (i1, i2)) => (Math.min(min, i1), Math.max(max, i2)) }
  // wrap the single result pair so mapPartitions gets an Iterator back
  Seq(tuple).iterator
}
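One caveat, not part of the answer above: reduce throws an UnsupportedOperationException if a partition contains no parseable values, because the filtered iterator is then empty. A defensive variant is sketched below, using reduceOption with a (0, 0) fallback to mirror the Python version's defaults; the name minMaxIntSafe is only illustrative:

def minMaxIntSafe(iterator: Iterator[String]): Iterator[(Int, Int)] = {
  val tuple = iterator
    .filter(_ != null).filter(!_.isEmpty)
    .map(_.toInt).map(i => (i, i))
    // reduceOption yields None instead of throwing when nothing is left after filtering
    .reduceOption[(Int, Int)] { case ((min, max), (i1, i2)) => (Math.min(min, i1), Math.max(max, i2)) }
  // fall back to (0, 0) for a partition with no valid values (assumed default)
  Seq(tuple.getOrElse((0, 0))).iterator
}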
minMaxInt can be applied to an RDD[String] as follows:
import org.apache.spark.rdd.RDD

// some sample data
def col = sc.parallelize(Seq("1", "4", "12", "3", "", null, "2"))

// "use twice" and zip
val result: RDD[(Int, Int)] = col.mapPartitions(minMaxInt)
val res: RDD[(Int, Int)] = col.mapPartitions(minMaxInt)
result.zip(res).foreach(println)
// prints:
// ((1,1),(1,1))
// ((2,2),(2,2))
// ((3,3),(3,3))
// ((4,12),(4,12))
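A note on the original type-mismatch error: the message "required: scala.collection.GenIterable[?]" indicates that the zip being resolved was the Scala-collections one, which suggests result was a local collection at the call site (for example something already collected to the driver) rather than an RDD[(Int, Int)]. With both result and res declared as RDD[(Int, Int)] as above, RDD.zip is used instead. Keep in mind that RDD.zip requires both RDDs to have the same number of partitions and the same number of elements per partition; that holds here because both are produced by mapPartitions over the same source RDD. A quick optional sanity check, sketched here and not part of the answer's code:

// partition counts must match for RDD.zip; element counts per partition
// also match because result and res are derived from the same col
assert(result.partitions.length == res.partitions.length)
assert(result.count() == res.count())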