Accumulator value is always 0

Time: 2019-05-20 06:56:49

Tags: scala apache-spark

I always get 0 as the accumulator value.

package com.fast.processing.data

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object AccumulatorExample {
      def main(args:Array[String]){

      val spark = new SparkConf().setAppName("AccumulatorExample").setMaster("local")
      val sc = new SparkContext(spark)

      val data = sc.textFile("C:\\Users\\SportsData.txt")
      val badLines = sc.accumulator(0,"badLines");

      val datVal = data.foreach(line =>(line.split(",").map{x=>{
                  if(x(0).toInt < 0) badLines +=1
                }  
              }

      ) )
      println("Val of bad lines is:::"+badLines)
  }

}

Below is the data. I expect the accumulator value to be 4, because the first value in each of the first four rows is less than 0.

-1,10,India,2019,01-01-2019,Cricket,5,6,7,18 
-2,11,Japan,2018,01-01-2018,Football,6,6,6,18
-3,12,China,2017,01-01-2017,Tennis,7,7,7,21 
-4,13,India,2018,01-01-2017,Swimming,8,8,8,24 
A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25 
A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25 
A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25 
A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25 

2 answers:

Answer 0 (score: 1)

The problem is not in the accumulator, but here:

if(x(0).toInt < 0)

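x is of type String, so x(0) refers to the first character of the string, and toInt converts that character to its code point value (for example, 45 for '-'). A code point is never negative, so the condition is never true.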
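To make the difference concrete, here is a small sketch in plain Scala (no Spark needed). The object and method names are hypothetical and exist only for illustration; they are not part of the original code.

```scala
object CharVsString {
  // Indexing a String yields a Char; Char.toInt returns its code point.
  def firstCharCode(field: String): Int = field(0).toInt

  // Calling toInt on the whole String parses it as a number.
  def parsedValue(field: String): Int = field.trim.toInt

  val code: Int = firstCharCode("-1")  // 45, the code point of '-'
  val value: Int = parsedValue("-1")   // -1
}
```

Only the second form can ever produce a negative number, which is what the check in the question needs.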

There are many ways to do this; for example, this will work:

val datVal = data.foreach { line => "^-\\d+,".r.findFirstMatchIn(line).foreach(_ => badLines += 1) }

P.S. The Scala method map is not intended for side effects; use foreach instead.
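An alternative to the regex is to parse the first field and test its sign. The sketch below is one possible way to do that, assuming a "bad" line is one whose first comma-separated field is a negative integer. BadLines and hasNegativeFirstField are hypothetical names, and the sample rows are copied from the question's data.

```scala
import scala.util.Try

object BadLines {
  // True when the first comma-separated field parses as an Int and is negative.
  // Fields that do not parse (such as "A5") or empty lines yield false.
  def hasNegativeFirstField(line: String): Boolean =
    Try(line.split(",")(0).trim.toInt).toOption.exists(_ < 0)

  // A few rows from the question, used here as a local collection.
  val sample: Seq[String] = Seq(
    "-1,10,India,2019,01-01-2019,Cricket,5,6,7,18 ",
    "-4,13,India,2018,01-01-2017,Swimming,8,8,8,24 ",
    "A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25 "
  )

  // Counts the rows whose first field is negative.
  val negativeCount: Int = sample.count(hasNegativeFirstField)
}
```

In the Spark job from the question, the same predicate could be used as data.foreach(line => if (BadLines.hasNegativeFirstField(line)) badLines += 1), keeping the accumulator update inside foreach.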

Answer 1 (score: -1)

You can also use a filter to count the bad records; no accumulator is needed.

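import scala.util.{Success, Try}

// Keep only the lines whose first field cannot be parsed as an Int.
// Note: this treats unparseable rows (such as the "A5" rows) as bad,
// not rows whose first field is negative.
val result = data.filter(line => {
  Try {
    line.split(",")(0).toInt
  } match {
    case Success(_) => false
    case _ => true
  }
})
println(result.count())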