Filtering using reduce in Spark

Date: 2017-05-02 19:31:02

Tags: scala apache-spark

(accountid, mid, url, spent)
RDD(("55E5", 5, "https://www.google.com/", 5774),
("55E5", 5, "https://www.google.com/", 543),
("55E5", 5, "https://www.google.com/", 52),
("55E5", 5, "https://www.google.com/", 85),
("55E5", 5, "https://www.google.com/", 54),
("55E5", 5, "https://www.google.com/", 287),
("54XJ", 5, "https://www.google.com/", 853),
("54XJ", 5, "https://www.google.com/", 2),
("54XJ", 5, "https://www.google.com/", 55),
("54XJ", 5, "https://www.google.com/", 984),
("54XJ", 5, "https://www.google.com/", 24),
("54XJ", 5, "https://www.google.com/", 57))
("745K", 5, "https://www.google.com/", 853),
("745K", 5, "https://www.google.com/", 2),
("745K", 5, "https://www.google.com/", 55),
("745K", 5, "https://www.google.com/", 984),
("745K", 5, "https://www.google.com/", 24),
("745K", 5, "https://www.google.com/", 57))

Suppose I have an RDD of tuples like this, except that they are not ordered as shown above. I want to return only the top 3 (by spent) for each account ID.

I was thinking of ordering them with .sortBy(x => (x._1, x._4)) and then doing a fold, but I don't know how to add back to my RDD. There must be a more elegant way to do this. In some cases a given account ID may have exactly 3 or fewer than 3 items, and in that case I want to keep all of them.
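For reference, here is a minimal sketch of how such an RDD might be constructed for experimenting (the SparkContext sc and the particular sample rows shown are assumptions for illustration, not part of the original data):

// hypothetical setup, assuming an existing SparkContext named sc
val rdd: org.apache.spark.rdd.RDD[(String, Int, String, Int)] = sc.parallelize(Seq(
  ("55E5", 5, "https://www.google.com/", 5774),
  ("55E5", 5, "https://www.google.com/", 543),
  ("54XJ", 5, "https://www.google.com/", 853),
  ("745K", 5, "https://www.google.com/", 984)
  // ... remaining rows as listed above
))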

1 answer:

Answer 0 (score: 1)


"...I don't know how to add back to my RDD..."

When working with Spark, you should always think in terms of transforming your data into new RDDs, rather than "updating" some existing RDD: RDDs are immutable, and Spark supports computation by transforming one RDD into another.
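As a tiny sketch of what this means in practice (the variable name below is purely illustrative):

// a transformation such as filter produces a brand-new RDD;
// the original rdd is left completely unchanged
val bigSpenders = rdd.filter(_._4 > 500)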

Specifically, it sounds like what you want to do is "group" your data by ID, and then apply some logic (sort and take the top 3) to each resulting "group". Here are two ways to achieve this - one that implements this flow quite directly (group, then map the values using sort + take), and an optimization that can be crucial in specific cases (namely, when a single key has thousands of records or more).

import org.apache.spark.rdd.RDD
import scala.collection.mutable

// rdd is assumed to be an RDD of the records shown in the question

// just an alias to make things shorter to write...
type Record = (String, Int, String, Int)

// simple, but potentially slow / risky:
// groupBy "collects" all records with same key into a single record, which means
// it can't scale well if a single key has many records:
val result1: RDD[Record] = rdd.groupBy(_._1).values.flatMap(_.toList.sortBy(-_._4).take(3))

// an alternative approach that does the same, but should be faster
// and less fragile - at no point do we collect all records of a single key
// into a collection in one worker's memory. We do that by replacing "groupBy"
// with "aggregateByKey", using functions that keep only the top 3 items per key at all times.
// The ordering sorts by spent (descending) and uses the remaining fields as a tie-breaker,
// so that distinct records with the same spent value are not collapsed by the SortedSet.
val result2: RDD[Record] = rdd.keyBy(_._1)
  .aggregateByKey(mutable.SortedSet[Record]()(Ordering.by((r: Record) => (-r._4, r._1, r._2, r._3))))(
    { case (set, item) => (set + item).take(3) },
    { case (set1, set2) => (set1 ++ set2).take(3) }
  ).values
  .flatMap(_.toList)
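Either result can then be materialized with a normal action once it is small enough to fit on the driver, for example (a sketch; the order of the collected rows is not guaranteed):

// bring the results back to the driver for inspection
result2.collect().foreach(println)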