How to use Spark / Scala to compress adjacent data

Time: 2017-06-09 01:15:57

Tags: scala apache-spark apache-spark-sql

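I have an RDD whose element type is Tuple2(value, timestamp). The value is 1 or 0, the timestamps are in ascending order, and there is a variable limitTime = 4. When I map over the RDD, if a value is 1, then the output value is 1 for every element from the current timestamp up to (timestamp + limitTime); otherwise the current value is 0. I call this span a period. There is one special case: when a value is 1 and its timestamp falls inside an existing period, it is ignored and the current output value is 0.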

input :          (0,0),(1,1),(0,3),(0,5),(0,7),(0,8),(0,10),(1,12),(0,14),(0,15)
expected output :(0,0),(1,1),(1,3),(1,5),(0,7),(0,8),(0,10),(1,12),(1,14),(1,15)

special input2:  (0,0),(1,1),(0,3),(1,5),(0,7),(1,8),(0,10),(1,12),(0,14),(0,15)
expected output2:(0,0),(1,1),(1,3),(1,5),(0,7),(1,8),(1,10),(1,12),(0,14),(0,15)
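
The rule can be checked against both examples on a plain Scala collection, with no Spark involved. The sketch below is only an illustration: `PeriodSpec` and `markPeriods` are hypothetical names, and it assumes that "ignored" means a 1 inside an open period does not start a new period, which is what expected output2 suggests.

```scala
object PeriodSpec {
  // Hypothetical helper: applies the period rule to events already sorted by timestamp.
  def markPeriods(events: Seq[(Int, Int)], limitTime: Int): Seq[(Int, Int)] = {
    // State: (start of the most recent period, if any; output built so far)
    val initial = (Option.empty[Int], Vector.empty[(Int, Int)])
    val (_, out) = events.foldLeft(initial) {
      case ((start, acc), (value, ts)) =>
        val inPeriod = start.exists(s => ts >= s && ts <= s + limitTime)
        if (inPeriod) (start, acc :+ ((1, ts)))            // covered by the open period; a 1 here starts nothing new
        else if (value == 1) (Some(ts), acc :+ ((1, ts)))  // a 1 outside any period opens a new one
        else (start, acc :+ ((0, ts)))                     // a 0 outside any period stays 0
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val input2 = Seq((0,0),(1,1),(0,3),(1,5),(0,7),(1,8),(0,10),(1,12),(0,14),(0,15))
    println(markPeriods(input2, 4).mkString(","))
  }
}
```

With limitTime = 4, the 1 at timestamp 5 falls inside the period opened at timestamp 1, so the next period only starts at timestamp 8.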

This is my attempt:

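import scala.collection.mutable.ArrayBuffer

var limitTime = 4
// Start of the most recent period; initialised so that no period is open at first
var startTime = -limitTime
val rdd = sc.parallelize(List((0,0),(1,1),(0,3),(1,5),(0,7),(1,8),(0,10),(1,12),(0,14),(0,15)), 4)
val results = rdd.mapPartitions(parIter => {
  // Buffers the whole partition before returning it
  var resultIter = new ArrayBuffer[Tuple2[Int,Int]]()
  while (parIter.hasNext) {
    val iter = parIter.next()
    if (iter._1 == 1) {
      if (iter._2 <= startTime + limitTime && iter._2 != 0 && iter._2 >= startTime) {
        // A 1 inside the current period: keep it, but do not start a new period
        resultIter.append(iter)
      } else {
        // A 1 outside any period: keep it and start a new period here
        resultIter.append(iter)
        startTime = iter._2
      }
    } else {
      if (iter._2 <= startTime + limitTime && iter._2 != 0 && iter._2 >= startTime) {
        // A 0 inside the current period becomes 1
        resultIter.append((1, iter._2))
      } else {
        resultIter.append(iter)
      }
    }
  }
  resultIter.toIterator
})
results.collect().foreach(println)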

This is very inefficient. How can I get the same result without an array?

1 answer:

Answer 0: (score: 0)

The following code should work for both of your cases.

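// Number of consecutive elements (including the triggering 1) still to emit as 1;
// this counts elements rather than comparing timestamps
var limitTime=3
var first = true
var previousValue = 0
val rdd=sc.parallelize(List((0,0),(1,1),(0,3),(0,5),(0,7),(0,8),(0,10),(1,12),(0,14),(0,15)), 4)
// collect brings all elements to the driver, so the map below runs locally and in order
val tempResult = rdd.collect.map(pair => {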
  if(first){
    first = false
    previousValue = pair._1
    (pair._1, pair._2)
  }
  else {
    if ((pair._1 == 1 || previousValue == 1) && limitTime > 0) {
      limitTime -= 1
      previousValue = 1
      (1, pair._2)
    }
    else {
      if (limitTime == 0) limitTime = 3
      previousValue = pair._1
      (pair._1, pair._2)
    }
  }
})
tempResult.foreach(print)

If it doesn't, please let me know.
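
As a possible alternative that compares timestamps directly and avoids both the array and the collect, the rule can be written as a lazy transformation of an iterator. This is a minimal sketch, not a tested Spark job: `PeriodIterator` and `markPeriods` are hypothetical names, and it assumes the elements arrive in timestamp order and that a 1 inside an open period does not start a new one.

```scala
object PeriodIterator {
  // Hypothetical helper: lazily rewrites a timestamp-ordered iterator without buffering it.
  def markPeriods(events: Iterator[(Int, Int)], limitTime: Int): Iterator[(Int, Int)] = {
    // Local to this call, so every invocation (e.g. every partition) has its own state
    var periodStart: Option[Int] = None
    events.map { case (value, ts) =>
      val inPeriod = periodStart.exists(s => ts >= s && ts <= s + limitTime)
      // A 1 outside any period opens a new one; a 1 inside a period is ignored
      if (!inPeriod && value == 1) periodStart = Some(ts)
      (if (inPeriod || value == 1) 1 else 0, ts)
    }
  }
}
```

Because the function takes and returns an Iterator, it could be passed to mapPartitions, for example `rdd.mapPartitions(it => PeriodIterator.markPeriods(it, 4))`. The state is not carried across partition boundaries, so a period that starts at the end of one partition would be cut off there; for the result to match the expected output, the sketch assumes the data sits in a single ordered partition, e.g. an RDD built with `sc.parallelize(list, 1)`.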