Reducing a Spark RDD to return multiple values

Date: 2016-07-23 05:57:42

Tags: apache-spark reduce

I have the following RDD containing collections of items, and I want to group the items by similarity (items in the same group are considered similar; similarity is transitive, so all items in sets that share at least one common item are also considered similar):

Input RDD:

Set(w1, w2)
Set(w1, w2, w3, w4)
Set(w5, w2, w6)
Set(w7, w8, w9)
Set(w10, w5, w8) --> The first 5 sets are all similar, since each shares at least one item with another set in the group
Set(w11, w12, w13)

I want to reduce the above RDD to:

Set(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10)
Set(w11, w12, w13)

Any suggestions on how to do this? I am unable to do something like the following, where I would skip merging two sets if they share no common elements (the else (a, b) branch below is the part that does not work):

data.reduce((a, b) => if (a.intersect(b).size > 0) a ++ b else (a, b))

Thanks.

1 answer:

Answer 0 (score: 0):

Your reduce algorithm is actually incorrect: a set might not merge with the set it is paired with next, yet still need to merge with another set elsewhere in the collection, so a single pairwise reduction can miss merges.
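
A minimal local (non-Spark) sketch with made-up data illustrates the issue: a single-pass fold that merges each incoming set into the first overlapping group still leaves transitively connected sets in separate groups.

val sets = Seq(Set("w1", "w2"), Set("w7", "w8"), Set("w2", "w7"))

// Merge each incoming set into the first group it overlaps with,
// otherwise start a new group.
val groups = sets.foldLeft(List.empty[Set[String]]) { (acc, s) =>
  acc.indexWhere(g => (g & s).nonEmpty) match {
    case -1 => s :: acc
    case i  => acc.updated(i, acc(i) ++ s)
  }
}

// groups ends up with two groups, {w7, w8, w2} and {w1, w2},
// even though w2 and w7 link all three input sets into a single component.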

There may be a better way, but I would solve this by turning it into a graph problem and using GraphX.

import org.apache.spark.graphx.{Edge, Graph}

val data = Array(Set("w1", "w2", "w3"), Set("w5", "w6"), Set("w7"), Set("w2", "w3", "w4"))
val setRdd = sc.parallelize(data).cache

// Generate an unique id for each item to use as vertex's id in the graph
val itemToId = setRdd.flatMap(_.toSeq).distinct.zipWithUniqueId.cache
val idToItem = itemToId.map { case (item, itemId) => (itemId, item) }

// Convert to a RDD of set of itemId
val newSetRdd = setRdd.zipWithUniqueId
  .flatMap { case (sets, setId) =>
    sets.map { item => (item, setId) }
  }.join(itemToId).values.groupByKey().values
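// newSetRdd is an RDD[Iterable[Long]]: each element is one of the original
// sets, with item names replaced by their numeric itemIds.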

// Create an RDD containing edges of the graph
val edgeRdd = newSetRdd.flatMap { set =>
    val seq = set.toSeq
    val head = seq.head
    // Add an edge from the first item to each item in a set, 
    // including itself
    seq.map { item => Edge[Long](head, item)}
  }

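// Build the graph from the edge list. Every itemId appearing in an edge
// becomes a vertex; Nil is just a placeholder vertex attribute, since only
// connectivity matters here.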
val graph = Graph.fromEdges(edgeRdd, Nil)

// Run connected component algorithm to check which items are similar.
// Items in the same component are similar
val verticesRDD = graph.connectedComponents().vertices

verticesRDD.join(idToItem).values.groupByKey.values.collect.foreach(println)
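
For the sample data in this answer, the final collect should print three groups (element and group order may vary): one with w1, w2, w3, w4; one with w5, w6; and one with just w7.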