Filter in Apache Spark removes more rows from an RDD than expected

Date: 2017-06-13 15:46:01

Tags: scala apache-spark filter

I have an RDD that looks very much like this:

    1.0,2.0,0.0019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
    1.0,3.0,0.0,3.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
    1.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
    1.0,5.0,-0.0019,-2.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.4294
    1.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
    1.0,7.0,0.0,1.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
    1.0,8.0,0.0,3.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9040.8,0.0,0.0,0.0,0.0,0.0,0.0
    1.0,9.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
    1.0,10.0,-0.0033,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,47.03,0.0,0.0,0.0,0.0
    1.0,11.0,0.0,-3.0E-4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,554.54,0.0,0.0,0.0,0.0,0.0,0.0,8140.58,0.0

I need to filter out rows in which the number of zeros equals a specific value, say 15. This definition of the filter method filters out more rows than expected.

def filterZeroRowsWReadings(row: Array[String]): Boolean = {
    // Count how many fields in the row parse to 0.0.
    var flag: Int = 0
    for (value <- row) {
      if (value.toDouble == 0.0)
        flag += 1
    }
    // Drop the row (return false) only when it contains exactly 15 zeros.
    flag match {
        case 15 => false
        case _  => true
    }
}

However, I manually counted the rows with that many zeros in a subset of my RDD and got 3,834, yet the filter method above removes 3,960 rows. I don't understand where these extra 126 rows come from. Is there a way to find out what is going on? On smaller RDDs the result is as expected, but on the large RDD it is somehow not.
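One way to investigate is to invert the predicate and inspect the rows it would drop. A minimal sketch, assuming the rows have already been split on `,` into `Array[String]` as in the method above (the helper name `countZeros` and the local `sample` standing in for the RDD are hypothetical; on the real data you would use the same predicate inside `rdd.filter(...)` and `take` a sample of the matches):

```scala
// Hypothetical helper: same zero-counting logic as filterZeroRowsWReadings.
def countZeros(row: Array[String]): Int =
  row.count(_.toDouble == 0.0)

// A local stand-in for the RDD, using two of the sample rows above.
// On a real RDD: rdd.filter(countZeros(_) == 15).take(20)
val sample: Seq[Array[String]] = Seq(
  "1.0,2.0,0.0019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0",
  "1.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0"
).map(_.split(","))

// Print each row with its zero count, to verify the predicate by eye.
sample.foreach(r => println(s"${countZeros(r)} zeros: ${r.mkString(",")}"))
```

Comparing a printed sample of the dropped rows against the manually counted subset should show which rows are being counted as having 15 zeros unexpectedly.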

Thanks.

1 Answer:

Answer 0 (score: 1)

Perhaps it is a precision issue? You could try comparing each value, as a string, against "0.0" and see whether anything changes.
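A minimal sketch of that suggestion, assuming rows split into `Array[String]` as in the question (the names `countLiteralZeros` and `keepRow` are made up for illustration). Note the caveat: this matches only the literal text `"0.0"`, so values written as `"0"`, `"-0.0"`, or `"0.0E0"` would no longer be counted as zeros:

```scala
// Count zeros by comparing the raw string tokens to "0.0", avoiding
// any floating-point parsing entirely.
def countLiteralZeros(row: Array[String]): Int =
  row.count(_.trim == "0.0")

// Same contract as the question's filter: keep the row unless it has
// exactly 15 zeros.
def keepRow(row: Array[String]): Boolean =
  countLiteralZeros(row) != 15
```

On the RDD this would be used as `rdd.filter(keepRow)`. If the string version gives the expected 3,834 rows, the discrepancy is in how some tokens parse as doubles rather than in the filter logic itself.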