I'm trying to compute averages over a dataset. The data I'm testing with is:
val arr = Array(("D1", List(("k1",100.10,4), ("k2",50.5,3))),
                ("D2", List(("k1",230.24,7), ("k3",157.2,5))),
                ("D3", List(("k2",120,6), ("k4",340.8,16))))
What I've done so far:
val s1 = sc.parallelize(arr.toSeq).flatMap { x => x._2.groupBy(_._1) }
val s2 = s1.map {
  case (k, v) => (v(0)._1, (v(0)._2, v(0)._3))
}
val s3 = s2.groupByKey()
This s3 is an org.apache.spark.rdd.RDD[(String, Iterable[(AnyVal, Int)])]:
(k3,CompactBuffer((157.2,5)))
(k4,CompactBuffer((340.8,16)))
(k2,CompactBuffer((50.5,3), (120,6)))
(k1,CompactBuffer((100.1,4), (230.24,7)))
Now I want to perform an operation whose result is:
(k3, (157.2 / 5))
(k4, (340.8 / 16))
(k2, ((50.5 + 120) / (3 + 6)))
(k1, ((100.1 + 230.24) / (4 + 7)))
I'm really confused. How can I get this result?
Answer 0 (score: 0)
First, you should get rid of sc.parallelize(arr.toSeq) and just do sc.parallelize(arr). When dealing with tuples... try to use pattern matching to stay sane. Also... from the looks of it, you want to use the t._1 of each tuple inside the List as the key for your aggregated averages. In that case, you don't need any groupBy operations.
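As an aside, the AnyVal in your s3 type comes from the test data itself: in ("k2",120,6) the literal 120 is an Int while the other middle elements are Doubles, so Scala widens their common element type to AnyVal. A minimal illustration (hypothetical names, just two of your rows):

val mixed   = List(("k2", 120, 6), ("k4", 340.8, 16))   // List[(String, AnyVal, Int)]
val doubles = List(("k2", 120.0, 6), ("k4", 340.8, 16)) // List[(String, Double, Int)]

The version below therefore writes 120.0 so that every value stays a Double.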
val arr = Array(
  ("D1", List(("k1",100.10,4), ("k2",50.5,3))),
  ("D2", List(("k1",230.24,7), ("k3",157.2,5))),
  ("D3", List(("k2",120.0,6), ("k4",340.8,16)))   // 120.0, not 120, so the middle element stays a Double
)
// RDD[(String, List[(String, Double, Int)])]
val s = sc.parallelize(arr)
// RDD[(String, (Double, Int))]
val s2 = s.flatMap {
  case (id, list) => list.map {
    case (key, f1, i1) => (key, (f1, i1))
  }
}
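// s2 now holds pairs such as ("k1",(100.1,4)), ("k2",(50.5,3)), ("k1",(230.24,7)), ...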
// you do not need s3 at all
// `groupByKey` in Spark is very costly.
// you already have a PairRDD in s2 with key -> String and value -> (Double, Int)
// just go ahead and aggregate them
// RDD[(String, (Double, Int))]
val initial = (0.0, 0)
val s4 = s2.aggregateByKey(initial)(
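  // seqOp: fold one (value, count) element into the running (sum, count) accumulator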
{ case ((total, count), (f1, i1)) => (total + f1, count + i1) },
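  // combOp: merge (sum, count) accumulators coming from different partitions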
{ case ((total1, count1), (total2, count2)) => (total1 + total2, count1 + count2) }
)
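// s4 now holds per-key (sum, count) pairs, approximately:
// ("k1",(330.34,11)), ("k2",(170.5,9)), ("k3",(157.2,5)), ("k4",(340.8,16))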
// RDD[(String, Double)]
val s5 = s4.map { case (key, (total, count)) => (key, total / count) }
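To sanity-check the result, you can collect the RDD back to the driver (fine here only because the dataset is tiny) and print it; the expected averages below are worked out by hand from the input:

// gather the per-key averages and print them
s5.collect().foreach(println)
// roughly: (k1,30.0309...), (k2,18.9444...), (k3,31.44), (k4,21.3); order may vary

Since the accumulator here has the same shape as the values, the same sums could also be written as s2.reduceByKey { case ((t1, c1), (t2, c2)) => (t1 + t2, c1 + c2) }; aggregateByKey is the more general tool when the zero value or accumulator type differs from the value type.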