How to get Avg and Sum in a Spark RDD

Asked: 2015-12-31 13:11:15

Tags: scala apache-spark

Given that I have the following Spark function:

val group = whereRdd.map(collection => collection.getLong("location_id") -> collection.getInt("feel"))
  .groupByKey
  .map(grouped => grouped._1 -> grouped._2.toSet)

group.foreach(g => println(g))

I got:

(639461796080961,Set(15))
(214680441881239,Set(5, 10, 25, -99, 99, 19, 100))
(203328349712668,Set(5, 10, 15, -99, 99))

Is it possible to add a map() to this function that adds the avg and sum of each set? For example:

(214680441881239,Map("data" -> Set(5, 10, 25, -99, 99, 19, 100), "avg" -> 22.71, "sum" -> 159))
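(For that set, sum = 5 + 10 + 25 - 99 + 99 + 19 + 100 = 159 and avg = 159 / 7 ≈ 22.71.)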

2 Answers:

Answer 0 (score: 2)

I would suggest using a tuple or a case class instead of a Map. I mean roughly something like this:

case class Location(id: Long, values: Set[Int], sum: Int, avg: Double)

val group = whereRdd
  .map(collection =>
    collection.getLong("location_id") -> collection.getInt("feel"))
  .groupByKey
  .map { case (id, values) =>
    val set = values.toSet               // drop duplicate readings per location
    val sum = set.sum
    val mean = sum / set.size.toDouble   // integer sum divided by Double size
    Location(id, set, sum, mean)
  }
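For the sample data above this should print something along these lines (159 / 7 ≈ 22.71; ordering across partitions may vary):

Location(639461796080961,Set(15),15,15.0)
Location(214680441881239,Set(5, 10, 25, -99, 99, 19, 100),159,22.71...)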

The biggest advantage over a Map is that it keeps the types in order.
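For example, downstream code keeps full type information (a small sketch; group is the RDD[Location] built above):

val avgs = group.map(_.avg)   // RDD[Double], no casts needed

With a Map[String, Any] the same read needs a lookup plus a runtime cast, e.g. row("avg").asInstanceOf[Double].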

Answer 1 (score: 1)

After reading @zero323's answer, I added a map() step that builds the Map, and it works.


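A minimal sketch of what that extra map() step could look like, assuming the same whereRdd pipeline from the question (the result is a Map[String, Any], so values need a cast when read back):

val group = whereRdd
  .map(collection => collection.getLong("location_id") -> collection.getInt("feel"))
  .groupByKey
  .map(grouped => grouped._1 -> grouped._2.toSet)
  .map { case (id, set) =>
    // attach the two aggregates next to the raw data, as asked in the question
    id -> Map("data" -> set, "sum" -> set.sum, "avg" -> set.sum / set.size.toDouble)
  }

group.foreach(g => println(g))

For the first key above this prints something like (639461796080961,Map(data -> Set(15), sum -> 15, avg -> 15.0)).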