Spark RDD: recursive operations on a simple collection

Asked: 2018-04-29 20:06:34

Tags: apache-spark recursion rdd

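I have user information in an RDD:

(Id:10, Name:bla, Address:50, ...)

I also have another collection that contains the successive identity changes we have recorded for each user.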

(lastId, newId)
    (10, 43)
    (85, 90)
    (43, 50)

I need to get the last identity for each user ID. In this example:

getFinalIdentity(10) = 50     (10 -> 43 -> 50)

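For a while I used a broadcast variable containing these identities and iterated over the collection to get the final ID. Everything worked fine until the reference data became too large to fit in a broadcast variable...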
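The lookup described here amounts to following a chain in an in-memory map. A minimal plain-Scala sketch of that idea, without Spark: `ChainLookup`, `changes`, and this particular `getFinalIdentity` signature are hypothetical names chosen for illustration, and the chains are assumed to be acyclic.

```scala
import scala.annotation.tailrec

object ChainLookup {
  // Hypothetical in-memory copy of the (lastId, newId) pairs
  val changes: Map[Long, Long] = Map(10L -> 43L, 85L -> 90L, 43L -> 50L)

  // Follow the chain until an ID has no successor.
  // Assumes there are no cycles; a cycle would loop forever.
  @tailrec
  def getFinalIdentity(id: Long): Long = changes.get(id) match {
    case Some(next) => getFinalIdentity(next)
    case None       => id
  }

  def main(args: Array[String]): Unit =
    println(getFinalIdentity(10L)) // follows 10 -> 43 -> 50
}
```

This works only while the whole map fits in memory on every executor, which is exactly the limit mentioned above.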

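I came up with a solution that stores the identities in an RDD and iterates over it recursively, but it is not very fast and looks very complicated.

Is there an elegant and fast way to do this?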

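1 answer:

Answer 0 (score: 1)

Have you thought about graphs?

You can build a graph from the list of edges (lastId, newId). In that graph, a node with no outgoing edges is the final ID of its chain, and a node with no incoming edges is the chain's first ID.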
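To make the graph view concrete, here is a small plain-Scala sketch (no Spark) that classifies the IDs from the question's sample data. `TerminalNodes` and its fields are hypothetical names used only for illustration.

```scala
object TerminalNodes {
  // Edge list as (lastId, newId); sample data from the question
  val edges: Seq[(Long, Long)] = Seq((10L, 43L), (85L, 90L), (43L, 50L))

  val sources: Set[Long] = edges.map(_._1).toSet
  val targets: Set[Long] = edges.map(_._2).toSet

  // No outgoing edge: the ID was never replaced, so it ends a chain
  val finalIds: Set[Long] = targets -- sources
  // No incoming edge: the ID never replaced another one, so it starts a chain
  val firstIds: Set[Long] = sources -- targets

  def main(args: Array[String]): Unit =
    println(s"final: $finalIds, first: $firstIds")
}
```

This only identifies the endpoints. Linking each ID to the endpoint of its own chain still requires propagating information along the edges.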

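This can be done in Spark with GraphX.

Below is an example. For each ID it shows the first ID in its chain. That is, for the chain of changes (1 -> 2 -> 3), the result will be (1, 1), (2, 1), (3, 1).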

import org.apache.spark.graphx.{EdgeDirection, EdgeTriplet, Graph, VertexId}
import org.apache.spark.{SparkConf, SparkContext}

object Main {

  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {

    sc.setLogLevel("ERROR")

    // RDD of pairs (oldId, newId)
    val changedIds = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 4L), (10L, 20L), (20L, 31L), (30L, 40L), (100L, 200L), (200L, 300L)))

    // case classes for pregel operation
    case class Value(originId: VertexId)      // vertex value
    case class Message(value: VertexId)       // message sent from one vertex to another

    // Create graph from id pairs
    val graph = Graph.fromEdgeTuples(changedIds, Value(0))

    // Initial message, sent to every vertex at the start
    val initialMsg = Message(0)

    // How a vertex processes a received message
    def onMsgReceive(vertexId: VertexId, value: Value, msg: Message): Value = {
      // The initial message has value 0; in that case the vertex initializes its value to its own ID
      // (this relies on 0 never being a real ID)
      if (msg.value == 0) Value(vertexId)
      // Otherwise received value is initial ID
      else Value(msg.value)
    }

    // How vertices send messages
    def sendMsg(triplet: EdgeTriplet[Value, Int]): Iterator[(VertexId, Message)] = {
      // Each triplet sends a single message to its destination vertex;
      // the payload is the source vertex's origin ID
      Iterator((triplet.dstId, Message(triplet.srcAttr.originId)))
    }

    // How incoming messages to one vertex are merged
    def mergeMsg(msg1: Message, msg2: Message): Message = {
      // Two messages for the same vertex indicate a data error here,
      // because one ID cannot have two different origin IDs
      msg2    // Just return either of the incoming messages
    }

    // Kick off the Pregel computation
    val res = graph
      .pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(onMsgReceive, sendMsg, mergeMsg)

    // Print results
    res.vertices.collect().foreach(println)
  }
}

Output, as (id, Value(firstId)) pairs:

(100,Value(100))
(4,Value(1))
(300,Value(100))
(200,Value(100))
(40,Value(30))
(20,Value(10))
(1,Value(1))
(30,Value(30))
(10,Value(10))
(2,Value(1))
(3,Value(1))
(31,Value(10))
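The job above yields the first ID of each chain, while the question asks for the last one. One way to adapt it, sketched here under assumptions, is to swap each pair so that edges point from newId back to oldId and let the terminal ID propagate in the same manner. `FinalIds` and `finalIds` are hypothetical names. The sketch assumes acyclic chains (a cycle would keep the Pregel loop running) and that 0 is never a real ID, since 0 serves as the initial-message sentinel.

```scala
import org.apache.spark.graphx.{EdgeDirection, EdgeTriplet, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object FinalIds {

  // Hypothetical helper: maps every ID to the last ID of its chain.
  // Assumes acyclic chains and that 0 is never used as a real ID.
  def finalIds(changedIds: RDD[(Long, Long)]): RDD[(VertexId, VertexId)] = {
    // Swap each pair so that edges point from newId back to oldId
    val reversed = changedIds.map(_.swap)

    // Vertex value = best known final ID (0 until initialized)
    val graph = Graph.fromEdgeTuples(reversed, 0L)

    graph.pregel(0L, Int.MaxValue, EdgeDirection.Out)(
      // On the initial message (0) a vertex starts with its own ID,
      // otherwise it adopts the final ID it received
      (id, _, msg) => if (msg == 0L) id else msg,
      // Forward the source's known final ID to the older ID
      (t: EdgeTriplet[VertexId, Int]) => Iterator((t.dstId, t.srcAttr)),
      // Two messages mean one old ID has two successors; keep either
      (a, _) => a
    ).vertices
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("final-ids").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")

    // Sample data from the question: (lastId, newId)
    val changedIds = sc.parallelize(Seq((10L, 43L), (85L, 90L), (43L, 50L)))

    // Prints (id, finalId) pairs; ordering is not guaranteed
    finalIds(changedIds).collect().foreach(println)

    sc.stop()
  }
}
```

The resulting (id, finalId) RDD can then be joined with the user RDD by ID, which avoids broadcasting the identity table.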