(accountid, mid, url, spent)
RDD(("55E5", 5, "https://www.google.com/", 5774),
("55E5", 5, "https://www.google.com/", 543),
("55E5", 5, "https://www.google.com/", 52),
("55E5", 5, "https://www.google.com/", 85),
("55E5", 5, "https://www.google.com/", 54),
("55E5", 5, "https://www.google.com/", 287),
("54XJ", 5, "https://www.google.com/", 853),
("54XJ", 5, "https://www.google.com/", 2),
("54XJ", 5, "https://www.google.com/", 55),
("54XJ", 5, "https://www.google.com/", 984),
("54XJ", 5, "https://www.google.com/", 24),
("54XJ", 5, "https://www.google.com/", 57),
("745K", 5, "https://www.google.com/", 853),
("745K", 5, "https://www.google.com/", 2),
("745K", 5, "https://www.google.com/", 55),
("745K", 5, "https://www.google.com/", 984),
("745K", 5, "https://www.google.com/", 24),
("745K", 5, "https://www.google.com/", 57))
Suppose I have an RDD of tuples like this, except they are not ordered as above. I want to return only the top 3 records (by spent) for each accountid. I was thinking of ordering them with .sortBy(x => (x._1, x._4)) and then doing a fold, but I don't know how to add the results back into my RDD. There must be a more elegant way to do this. In some cases a key may have exactly 3 or fewer items, in which case I want to keep all of them.
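To make the goal concrete, here is a local (non-Spark) sketch of the desired result on a plain Scala collection, using a subset of the sample data above:

```scala
// Local sketch: keep the top 3 records per accountid by spent.
val records = List(
  ("55E5", 5, "https://www.google.com/", 5774),
  ("55E5", 5, "https://www.google.com/", 543),
  ("55E5", 5, "https://www.google.com/", 52),
  ("55E5", 5, "https://www.google.com/", 85),
  ("54XJ", 5, "https://www.google.com/", 984),
  ("54XJ", 5, "https://www.google.com/", 2)
)

val top3PerAccount = records
  .groupBy(_._1) // Map[accountid, List[record]]
  .map { case (id, rs) => id -> rs.sortBy(-_._4).take(3) }

// keys with fewer than 3 records simply keep everything they have
println(top3PerAccount("55E5").map(_._4)) // List(5774, 543, 85)
println(top3PerAccount("54XJ").map(_._4)) // List(984, 2)
```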
Answer 0 (score: 1)
...我不知道如何添加回我的RDD ...
When working with Spark, you should always think in terms of transforming your data into a new RDD rather than "updating" some existing RDD: RDDs are immutable, and Spark computes results by transforming one RDD into another.
Specifically, it looks like what you want to do is "group" your data by ID and then apply some logic (sort and take the top 3) to each resulting "group". Below are two ways to achieve this: one is a very direct implementation of that flow (group, then map values using sort + take), and the other is an optimization that can be crucial in certain situations (namely, when a single key has thousands of records or more).
import scala.collection.mutable
import org.apache.spark.rdd.RDD

// just an alias to make things shorter to write...
type Record = (String, Int, String, Int)

// simple, but potentially slow / risky:
// groupBy "collects" all records with the same key into a single record, which means
// it can't scale well if a single key has many records:
val result1: RDD[Record] = rdd.groupBy(_._1).values.flatMap(_.toList.sortBy(-_._4).take(3))

// an alternative approach that does the same, but should be faster
// and less fragile - at no point do we collect all the records of a single key
// into one worker's memory. We do that by replacing "groupBy" with "aggregateByKey",
// using functions that keep only the top 3 items per key at all times.
// Note: the ordering falls back to the full tuple so that records tied on
// spent aren't silently collapsed by the SortedSet.
val result2: RDD[Record] = rdd.keyBy(_._1)
  .aggregateByKey(mutable.SortedSet[Record]()(Ordering.by(r => (-r._4, r._1, r._2, r._3))))(
    { case (set, item) => (set + item).take(3) },
    { case (set1, set2) => (set1 ++ set2).take(3) }
  ).values
  .flatMap(_.toList)
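The two functions passed to aggregateByKey (the per-record fold and the per-partition merge) can be exercised locally without a Spark cluster. A minimal sketch, using an immutable SortedSet for clarity and a full-tuple ordering so ties on spent are not collapsed:

```scala
import scala.collection.immutable.SortedSet

type Record = (String, Int, String, Int)

// Order by spent descending; fall back to the full tuple so that
// records tied on spent are kept as distinct set elements.
val topOrdering: Ordering[Record] = Ordering.by(r => (-r._4, r._1, r._2, r._3))

// seqOp: fold one record into the running top-3 set
def seqOp(acc: SortedSet[Record], item: Record): SortedSet[Record] =
  (acc + item).take(3)

// combOp: merge the top-3 sets of two partitions
def combOp(a: SortedSet[Record], b: SortedSet[Record]): SortedSet[Record] =
  (a ++ b).take(3)

val empty = SortedSet.empty[Record](topOrdering)
// simulate two partitions holding records for the same key
val merged = combOp(
  Seq(("55E5", 5, "u", 5774), ("55E5", 5, "u", 543)).foldLeft(empty)(seqOp),
  Seq(("55E5", 5, "u", 52), ("55E5", 5, "u", 85)).foldLeft(empty)(seqOp)
)
println(merged.toList.map(_._4)) // List(5774, 543, 85)
```

Because both functions only ever hold at most three records per key, memory use stays bounded no matter how many records a key has.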