Filtering using reduce in Spark

Date: 2017-05-02 19:31:02

Tags: scala apache-spark

(accountid, mid, url, spent)
RDD(("55E5", 5, "https://www.google.com/", 5774),
("55E5", 5, "https://www.google.com/", 543),
("55E5", 5, "https://www.google.com/", 52),
("55E5", 5, "https://www.google.com/", 85),
("55E5", 5, "https://www.google.com/", 54),
("55E5", 5, "https://www.google.com/", 287),
("54XJ", 5, "https://www.google.com/", 853),
("54XJ", 5, "https://www.google.com/", 2),
("54XJ", 5, "https://www.google.com/", 55),
("54XJ", 5, "https://www.google.com/", 984),
("54XJ", 5, "https://www.google.com/", 24),
("54XJ", 5, "https://www.google.com/", 57))
("745K", 5, "https://www.google.com/", 853),
("745K", 5, "https://www.google.com/", 2),
("745K", 5, "https://www.google.com/", 55),
("745K", 5, "https://www.google.com/", 984),
("745K", 5, "https://www.google.com/", 24),
("745K", 5, "https://www.google.com/", 57))

Suppose I have an RDD of tuples like this, except that they are not ordered as shown above. I want to return only the top 3 (by spent) for each account ID.

I was thinking of ordering them with .sortBy(x => (x._1, x._4)) and then doing a fold, but I don't know how to add back to my RDD. There must be a more elegant way to do this. In some cases a given account ID may have exactly 3 or fewer than 3 items, and in that case I want to keep all of them.
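For reference, here is a minimal sketch of how such an RDD might be constructed for experimenting (the SparkContext sc and the particular sample rows shown are assumptions for illustration, not part of the original data):

// hypothetical setup, assuming an existing SparkContext named sc
val rdd: org.apache.spark.rdd.RDD[(String, Int, String, Int)] = sc.parallelize(Seq(
  ("55E5", 5, "https://www.google.com/", 5774),
  ("55E5", 5, "https://www.google.com/", 543),
  ("54XJ", 5, "https://www.google.com/", 853),
  ("745K", 5, "https://www.google.com/", 984)
  // ... remaining rows as listed above
))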

1 answer:

Answer 0 (score: 1)


"...I don't know how to add back to my RDD..."

When working with Spark, you should always think in terms of transforming your data into new RDDs, rather than "updating" some existing RDD: RDDs are immutable, and Spark supports computation by transforming one RDD into another.
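As a tiny sketch of what this means in practice (the variable name below is purely illustrative):

// a transformation such as filter produces a brand-new RDD;
// the original rdd is left completely unchanged
val bigSpenders = rdd.filter(_._4 > 500)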

Specifically, it sounds like what you want to do is "group" your data by ID, and then apply some logic (sort and take the top 3) to each resulting "group". Here are two ways to achieve this - one that implements this flow quite directly (group, then map the values using sort + take), and an optimization that can be crucial in specific cases (namely, when a single key has thousands of records or more).

import org.apache.spark.rdd.RDD
import scala.collection.mutable

// rdd is assumed to be an RDD of the records shown in the question

// just an alias to make things shorter to write...
type Record = (String, Int, String, Int)

// simple, but potentially slow / risky:
// groupBy "collects" all records with same key into a single record, which means
// it can't scale well if a single key has many records:
val result1: RDD[Record] = rdd.groupBy(_._1).values.flatMap(_.toList.sortBy(-_._4).take(3))

// an alternative approach that does the same, but should be faster
// and less fragile - at no point do we collect all records of a single key
// into a collection in one worker's memory. We do that by replacing "groupBy"
// with "aggregateByKey", using functions that keep only the top 3 items per key at all times.
// The ordering sorts by spent (descending) and uses the remaining fields as a tie-breaker,
// so that distinct records with the same spent value are not collapsed by the SortedSet.
val result2: RDD[Record] = rdd.keyBy(_._1)
  .aggregateByKey(mutable.SortedSet[Record]()(Ordering.by((r: Record) => (-r._4, r._1, r._2, r._3))))(
    { case (set, item) => (set + item).take(3) },
    { case (set1, set2) => (set1 ++ set2).take(3) }
  ).values
  .flatMap(_.toList)
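Either result can then be materialized with a normal action once it is small enough to fit on the driver, for example (a sketch; the order of the collected rows is not guaranteed):

// bring the results back to the driver for inspection
result2.collect().foreach(println)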