The accumulator value is always set to 0.
package com.fast.processing.data

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object AccumulatorExample {
  def main(args: Array[String]): Unit = {
    val spark = new SparkConf().setAppName("AccumulatorExample").setMaster("local")
    val sc = new SparkContext(spark)
    val data = sc.textFile("C:\\Users\\SportsData.txt")
    val badLines = sc.accumulator(0, "badLines")
    val datVal = data.foreach(line => line.split(",").map { x =>
      if (x(0).toInt < 0) badLines += 1
    })
    println("Val of bad lines is:::" + badLines)
  }
}
Below is the data. I expect the accumulator value to be 4, because the first value of each line is less than 0.
-1,10,India,2019,01-01-2019,Cricket,5,6,7,18
-2,11,Japan,2018,01-01-2018,Football,6,6,6,18
-3,12,China,2017,01-01-2017,Tennis,7,7,7,21
-4,13,India,2018,01-01-2017,Swimming,8,8,8,24
A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25
A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25
A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25
A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25
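The expected count can be sanity-checked with plain Scala, without a Spark cluster: exactly four of the rows above have a negative first field (the regex check below is an illustrative assumption, not code from the question).

```scala
object ExpectedCountCheck {
  def main(args: Array[String]): Unit = {
    val rows = Seq(
      "-1,10,India,2019,01-01-2019,Cricket,5,6,7,18",
      "-2,11,Japan,2018,01-01-2018,Football,6,6,6,18",
      "-3,12,China,2017,01-01-2017,Tennis,7,7,7,21",
      "-4,13,India,2018,01-01-2017,Swimming,8,8,8,24",
      "A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25",
      "A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25",
      "A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25",
      "A5,14,Bhutan,2019,01-01-2017,Swimming,5,5,5,25"
    )
    // Count rows whose first comma-separated field is a negative integer.
    val bad = rows.count(_.split(",")(0).matches("-\\d+"))
    println(bad) // 4
  }
}
```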
Answer 0 (score: 1)
The problem is not in the accumulator, but here:

if (x(0).toInt < 0)

x has the type String, so x(0) refers to the first character of the string, and toInt converts it to the corresponding code point value (45 for '-'), which is never less than 0.
There are many ways to do this; for example, this will work:

val datVal = data.foreach { line =>
  "^-\\d+,".r.findFirstMatchIn(line).foreach(_ => badLines += 1)
}

P.S. The Scala map method is not meant for side effects; use foreach instead.
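The code point behavior described above can be verified in plain Scala, without Spark (this standalone demo is an illustration, not part of the original answer):

```scala
object CodePointDemo {
  def main(args: Array[String]): Unit = {
    val field = "-1"
    // Indexing a String yields a Char; Char.toInt is the code point,
    // not the parsed number, so the original check never fires.
    println(field(0))        // '-'
    println(field(0).toInt)  // 45, the code point of '-', never < 0
    // Parsing the whole field gives the intended numeric comparison.
    println(field.toInt < 0) // true
  }
}
```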
Answer 1 (score: -1)
You can also count the bad records with a filter; no accumulator is needed:

import scala.util.Try

val result = data.filter(line =>
  Try(line.split(",")(0).toInt) match {
    case scala.util.Success(value) => false
    case _ => true
  }
)
println(result.count())
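Note that this filter flags rows whose first field fails to parse as an integer (the "A5" rows in the sample data), which is a different notion of "bad" than the negative-first-field rows the question wants to count. A plain-Scala sketch of both predicates (the sample rows here are abbreviated for illustration):

```scala
import scala.util.Try

object FilterCountDemo {
  def main(args: Array[String]): Unit = {
    val rows = Seq("-1,10,India", "-2,11,Japan", "A5,14,Bhutan", "A5,14,Bhutan")
    // Counts rows whose first field is not an integer at all.
    val unparseable = rows.count(r => Try(r.split(",")(0).toInt).isFailure)
    println(unparseable) // 2
    // Counts rows whose first field parses and is negative,
    // matching the question's intent.
    val negative = rows.count(r => Try(r.split(",")(0).toInt).toOption.exists(_ < 0))
    println(negative) // 2
  }
}
```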