Spark (Scala): returning an iterator

Date: 2017-02-06 12:06:40

Tags: scala apache-spark

I am writing a function for mapPartitions that computes the maximum and minimum value of each partition. I wrote the function in PySpark, but I cannot successfully translate it to Scala. I apply the function twice and then want to zip the two results. This is the error I get:

result.zip(RES)

Type mismatch;

[error]  found   : org.apache.spark.rdd.RDD[(Int, Int)]
[error]  required: scala.collection.GenIterable[?]

Here is the function in Python:

def minmaxInt(iterator):
    firsttime = 0
    min = 0
    max = 0
    for x in iterator:
        if x != '' and x != 'NULL' and x is not None:
            y = int(x)
            if firsttime == 0:
                min = y
                max = y
                firsttime = 1
            else:
                if y > max:
                    max = y
                if y < min:
                    min = y
    return (min, max)

And here is my code in Scala:

def minmaxInt(iterator: Iterator[String]): Iterator[(Int, Int)] = {
  var firsttime = 0
  var min = 0
  var max = 0
  var res = List[(Int, Int)]()

  for (x <- iterator) {
    if (x != null && x != "") {
      val y = x.toInt
      if (firsttime == 0) {
        min = y
        max = y
        firsttime = 1
      } else {
        if (y > max) max = y
        if (y < min) min = y
      }
    }
  }

  res = (min, max) :: res
  res.iterator
}

Thanks in advance.

Update:

Thanks for the quick reply! The code is great, but I still have the zip problem. I run the code below through rdd.mapPartitions twice and then perform the zip:

[error]  found   : org.apache.spark.rdd.RDD[(Int, Int)]
[error]  required: scala.collection.GenIterable[?]
[error]             result.zip(res)
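As an aside (my reading of the error, not part of the original question): `required: scala.collection.GenIterable[?]` means the `zip` being resolved is the local-collection method, i.e. the receiver was a plain Scala collection rather than an RDD, while the argument was an RDD. `RDD.zip` works when both sides are RDDs with the same number of partitions and the same number of elements per partition. A minimal sketch, assuming a live `SparkContext` named `sc`:

    import org.apache.spark.rdd.RDD

    // Both sides have the same element count and partition count,
    // so RDD.zip can pair them element-by-element.
    val a: RDD[Int] = sc.parallelize(1 to 4, 2)
    val b: RDD[Int] = sc.parallelize(5 to 8, 2)
    a.zip(b).collect()  // Array((1,5), (2,6), (3,7), (4,8))

If either side had been collected to the driver first, zipping it with an RDD produces exactly the type mismatch shown above.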

1 Answer:

Answer 0 (score: 0)

Here is a simpler (and more idiomatic) implementation of minMaxInt:
def minMaxInt(iterator: Iterator[String]) : Iterator[(Int,Int)]= {
  val tuple = iterator
    .filter(_ != null).filter(!_.isEmpty)
    .map(_.toInt).map(i => (i, i))
    .reduce[(Int, Int)] { case ((min, max), (i1, i2)) => (Math.min(min, i1), Math.max(max, i2)) }

  Seq(tuple).iterator
}
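One caveat worth noting (my observation, not from the answer): `reduce` throws an exception on an empty iterator, so a partition whose values are all filtered out would fail. A hedged variant using `reduceOption`, under a hypothetical name `minMaxIntSafe`:

    def minMaxIntSafe(iterator: Iterator[String]): Iterator[(Int, Int)] =
      iterator
        .filter(x => x != null && x.nonEmpty)
        .map(_.toInt)
        .map(i => (i, i))
        .reduceOption { case ((min, max), (i1, i2)) => (Math.min(min, i1), Math.max(max, i2)) }
        .iterator  // an empty partition simply contributes no pair

Since `reduceOption` returns an `Option`, its `.iterator` yields either one `(min, max)` pair or nothing, so empty partitions are skipped instead of crashing the job.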

It can be applied to an RDD[String] as follows:

// some sample data
def col = sc.parallelize(Seq("1", "4", "12", "3", "", null, "2"))

// "use twice" and zip
val result: RDD[(Int, Int)] = col.mapPartitions(minMaxInt)
val res: RDD[(Int, Int)] = col.mapPartitions(minMaxInt)

result.zip(res).foreach(println)
// prints:
// ((1,1),(1,1))
// ((2,2),(2,2))
// ((3,3),(3,3))
// ((4,12),(4,12))
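If the end goal is a single global (min, max) rather than one pair per partition, the per-partition results can be combined with a plain `RDD.reduce` (a sketch building on the code above, not part of the original answer):

    // Merge the per-partition (min, max) pairs into one global pair.
    val (globalMin, globalMax) = result.reduce {
      case ((min1, max1), (min2, max2)) => (Math.min(min1, min2), Math.max(max1, max2))
    }
    // For the sample data above this yields globalMin == 1, globalMax == 12.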