Spark, Scala: How do I subtract the values in a pair RDD based on key?

Asked: 2016-10-03 05:05:06

Tags: scala apache-spark

I have several RDDs of type RDD[(String, Int)]. I want to subtract the integer values based on the key.

Here is an example. Suppose the input RDDs are:

valid_record   = (TcustomerTDL_2016266, 16)
deleted_record = (TcustomerTDL_2016266, 8)

Since the keys are the same, the integer values must be subtracted. I tried using "subtractByKey" but it doesn't seem to work. The expected result is (TcustomerTDL_2016266, 8), which is 16 - 8 = 8.

I used the following code:

val changes_total = valid_record.subtractByKey(deleted_record)

Please let me know if there is another way to do this, or if this approach is incorrect.

Here is the full code:

import org.apache.spark.{SparkConf, SparkContext}

val Conf = new SparkConf().setAppName("Module").setMaster("local")
val sc = new SparkContext(Conf)
// Each element is (file path, whole file content)
val incoming_file = sc.wholeTextFiles("D:/Users/Documents/siva_hourly") // changed code
// Key by the seventh path segment and split the content into lines
val output = incoming_file.map { case (k, v) => (k.split("/")(6), v.split("\\r?\\n")) }
output.cache()
// Extract the record-type flag: the third \001-delimited field of each line
val change_type = output.map { case (k, v) => (k, v.toList.map(x => x.split("\001")(2))) } // changed code
// Count the deleted records ("D") per key
val change_delete_count = change_type.map { case (k, v) => (k, v.filter(x => x == "D").length) }
val change_record_foreach4 = change_delete_count.map { case (k, v) => (k.split("_"), v) }
// Rebuild the key from the first two underscore-separated parts
val change_record_foreach3 = change_record_foreach4.map { case (k, v) => (k(0) + '_' + k(1), v) }
// Count the valid records ("A" or "I") per key
val change_valid_count = change_type.map { case (k, v) => (k, v.filter(x => x == "A" || x == "I").length) }
val change_record_foreach = change_valid_count.map { case (k, v) => (k.split("_"), v) }
val change_record_foreach1 = change_record_foreach.map { case (k, v) => (k(0) + '_' + k(1), v) }
// Sum the counts per key
val valid_record = change_record_foreach1.reduceByKey((x, y) => x + y)
val deleted_record = change_record_foreach3.reduceByKey((x, y) => x + y)
val changes_total = valid_record.subtractByKey(deleted_record)

1 Answer:

Answer 0 (score: 5)

This is not the correct usage of subtractByKey. subtractByKey removes every pair from the first RDD whose key also appears in the other RDD; it does not subtract the values.

Here is an example of how subtractByKey works.

Suppose you have two pair RDDs, as shown below:

rdd   = {(1, 2), (3, 4), (3, 6)}
other = {(3, 9)}

rdd.subtractByKey(other)

The result is:

{(1, 2)}
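
Here is a minimal runnable sketch of that behavior (the local SparkContext setup and the app name are assumptions, used only for this demonstration):

import org.apache.spark.{SparkConf, SparkContext}

// Assumed local setup, only for this demonstration
val sc = new SparkContext(new SparkConf().setAppName("SubtractByKeyDemo").setMaster("local"))

val rdd = sc.parallelize(Seq((1, 2), (3, 4), (3, 6)))
val other = sc.parallelize(Seq((3, 9)))

// subtractByKey keeps only the pairs whose key does NOT appear in other
rdd.subtractByKey(other).collect() // Array((1, 2))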

To subtract the values by key, you can do this instead:

val joinRDD = valid_record.join(deleted_record)
val resultRDD = joinRDD.mapValues(x => x._1 - x._2)
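
Note that join is an inner join: any key present in valid_record but absent from deleted_record is dropped from the result. If some keys may have no deleted records at all, a sketch of an alternative (assuming a missing key should count as zero deletions) is:

val resultRDD2 = valid_record.leftOuterJoin(deleted_record)
  .mapValues { case (valid, deletedOpt) => valid - deletedOpt.getOrElse(0) }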