Not getting the expected output from the reduceGroups function in Scala

Time: 2018-04-11 12:51:01

Tags: scala apache-spark dataframe

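I am trying to sum a value based on conditions. One variable (mTotal) should be accumulated only when the gender is "m"; the other (total) should be accumulated for every record in the group: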

val record = file.map(rec => (rec.state,rec.gender,rec.Generated.toInt)).groupByKey(_._1)
    .reduceGroups((a,b)=>{
    var total:Int = 0
    var mTotal:Int = 0
    if(a._2.trim().equalsIgnoreCase("m")){
      mTotal = a._3 + b._3
      total = a._3 + b._3
    }else{
      total = a._3 + b._3
    }
    (a._1,mTotal.toString(),total)
    }).collect

I get the correct sum for total, but mTotal is always 0. Any idea why I am getting 0? Sample data:

20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur,224152,F,20,1,0,0,1)
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur,224152,F,28,1,0,0,0)
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur,224152,F,38,1,0,0,0)
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur,224152,F,50,1,0,0,0)
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur,224152,F,54,1,0,0,0)
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur,224152,F,72,1,0,0,0)
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur,224155,m,6,1,0,0,1)
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur,224155,m,7,2,0,0,2)
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur,224155,m,8,2,0,0,2)
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur,224155,m,9,3,0,0,3)

It would also help if you could point me to links where I can read about functions such as reduceGroups, flatMap, sortBy, etc. in depth.

Thanks in advance...

1 answer:

Answer 0 (score: 1)

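As I understand your question, you want to count how many records have "m" in the gender field (the M / F value)?

I would suggest the following to make your code more readable.

1) Define a case class to hold your data: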

case class Record(date: Int, bank: String, generated: Int, gender:Char, age: Int, ignore1: Int, ignore2: Int, ignore3: Int, ignore4: Int)    

Then, let's put the samples into a list:

val samples = List(
Record(20150420, "Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur", 224152, 'F', 20, 1, 0, 0, 1),
Record(20150420, "Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur", 224152, 'F', 28, 1, 0, 0, 0),
Record(20150420, "Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur", 224152, 'F', 38, 1, 0, 0, 0),
Record(20150420, "Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur", 224152, 'F', 50, 1, 0, 0, 0),
Record(20150420, "Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur", 224152, 'F', 54, 1, 0, 0, 0),
Record(20150420, "Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur", 224152, 'F', 72, 1, 0, 0, 0),
Record(20150420, "Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur", 224155, 'm', 6, 1, 0, 0, 1),
Record(20150420, "Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur", 224155, 'm', 7, 2, 0, 0, 2),
Record(20150420, "Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur", 224155, 'm', 8, 2, 0, 0, 2),
Record(20150420, "Allahabad Bank,A-Onerealtors Pvt Ltd,Uttar Pradesh,Ambedkar Nagar,Akbarpur", 224155, 'm', 9, 3, 0, 0, 3)) 

Now we can use a for comprehension to loop over the collection, with an if guard to keep only the elements we need. Finally, we call size to count the matching records.

val howManyMen = { for (record <- samples if (record.gender.toLower.equals('m'))) yield record }.size
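The same count can also be written without building an intermediate collection. This is only a sketch of an alternative: `Person` and `CountMen` are hypothetical names introduced for illustration, reduced to the single field that matters here.

```scala
// Hypothetical minimal record: only the gender field matters for the count.
final case class Person(gender: Char)

object CountMen {
  // count applies the predicate to each element and returns how many match,
  // so no filtered copy of the collection is created.
  def countMen(people: Seq[Person]): Int =
    people.count(_.gender.toLower == 'm')
}
```

With the Record class above, the equivalent call would be samples.count(_.gender.toLower == 'm').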

Finally, we can print the value:

println(s"Found men :$howManyMen") //> Found men :4

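As for why mTotal comes out as 0 in the question's code, as far as I can tell: the function passed to reduceGroups has to return the same tuple type it receives, and the code writes mTotal.toString into the second slot, which is where the gender lives. On the next combine step a._2 is therefore a number such as "13" rather than "m", so the else branch runs and mTotal is reset to 0. Only a's gender is ever checked, too, so b's gender is ignored. A minimal sketch of one way around this, using plain Scala collections rather than a Spark Dataset, is to fold into a separate accumulator. `StateRow`, `Totals`, and `GroupTotals` are hypothetical names, and the rows in main are made up for illustration.

```scala
// Hypothetical, simplified row: only the three fields the question maps to.
final case class StateRow(state: String, gender: String, generated: Int)

// Per-state result: sum over male rows only, and sum over all rows.
final case class Totals(maleTotal: Int, total: Int)

object GroupTotals {
  // Fold each group into a separate accumulator so that the gender of every
  // row is inspected exactly once and is never overwritten by a partial sum.
  def totalsByState(rows: Seq[StateRow]): Map[String, Totals] =
    rows.groupBy(_.state).map { case (state, group) =>
      val totals = group.foldLeft(Totals(0, 0)) { (acc, row) =>
        val isMale = row.gender.trim.equalsIgnoreCase("m")
        Totals(
          maleTotal = acc.maleTotal + (if (isMale) row.generated else 0),
          total = acc.total + row.generated
        )
      }
      state -> totals
    }

  def main(args: Array[String]): Unit = {
    // Made-up rows, not the sample data from the question.
    val rows = Seq(
      StateRow("Uttar Pradesh", "F", 1),
      StateRow("Uttar Pradesh", "F", 1),
      StateRow("Uttar Pradesh", "m", 2),
      StateRow("Uttar Pradesh", "m", 3)
    )
    println(totalsByState(rows)) // Map(Uttar Pradesh -> Totals(5,7))
  }
}
```

The design point is that the accumulator type (Totals) is different from the row type, which reduce-style functions do not allow; a fold or a dedicated aggregation does.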

Hope this helps. Try to structure your code in a readable way!