How to get Avg and Sum in a Spark RDD

Asked: 2015-12-31 13:11:15

Tags: scala apache-spark

Given that I have the following Spark function:

val group = whereRdd.map(collection => collection.getLong("location_id") -> collection.getInt("feel"))
  .groupByKey
  .map(grouped => grouped._1 -> grouped._2.toSet)

group.foreach(g => println(g))

I got:

(639461796080961,Set(15))
(214680441881239,Set(5, 10, 25, -99, 99, 19, 100))
(203328349712668,Set(5, 10, 15, -99, 99))

Is it possible to add a map() to this function that adds the avg and sum of each set? For example:

(214680441881239,Map("data" -> Set(5, 10, 25, -99, 99, 19, 100), "avg" -> 22.71, "sum" -> 159))
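(For that set, sum = 5 + 10 + 25 - 99 + 99 + 19 + 100 = 159 and avg = 159 / 7 ≈ 22.71.)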

2 Answers:

Answer 0 (score: 2)

I would suggest using a tuple or a case class instead of a Map. I mean roughly something like this:

case class Location(id: Long, values: Set[Int], sum: Int, avg: Double)

val group = whereRdd
  .map(collection =>
    collection.getLong("location_id") -> collection.getInt("feel"))
  .groupByKey
  .map { case (id, values) =>
    val set = values.toSet               // drop duplicate readings per location
    val sum = set.sum
    val mean = sum / set.size.toDouble   // integer sum divided by Double size
    Location(id, set, sum, mean)
  }
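For the sample data above this should print something along these lines (159 / 7 ≈ 22.71; ordering across partitions may vary):

Location(639461796080961,Set(15),15,15.0)
Location(214680441881239,Set(5, 10, 25, -99, 99, 19, 100),159,22.71...)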

The biggest advantage over a Map is that it keeps the types in order.
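For example, downstream code keeps full type information (a small sketch; group is the RDD[Location] built above):

val avgs = group.map(_.avg)   // RDD[Double], no casts needed

With a Map[String, Any] the same read needs a lookup plus a runtime cast, e.g. row("avg").asInstanceOf[Double].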

Answer 1 (score: 1)

After reading @zero323's answer, I added a map() step that builds the Map, and it works.


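A minimal sketch of what that extra map() step could look like, assuming the same whereRdd pipeline from the question (the result is a Map[String, Any], so values need a cast when read back):

val group = whereRdd
  .map(collection => collection.getLong("location_id") -> collection.getInt("feel"))
  .groupByKey
  .map(grouped => grouped._1 -> grouped._2.toSet)
  .map { case (id, set) =>
    // attach the two aggregates next to the raw data, as asked in the question
    id -> Map("data" -> set, "sum" -> set.sum, "avg" -> set.sum / set.size.toDouble)
  }

group.foreach(g => println(g))

For the first key above this prints something like (639461796080961,Map(data -> Set(15), sum -> 15, avg -> 15.0)).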