I have user information in an RDD:
(Id:10, Name:bla, Adress:50, ...)
I also have another collection that holds the successive identity changes we have recorded for each user:
(lastId, newId)
(10, 43)
(85, 90)
(43, 50)
I need to get the last identity for each user ID. In this example:
getFinalIdentity(10) = 50 (10 -> 43 -> 50)
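For a while I used a broadcast variable containing these identities and iterated over the collection to get the final ID. Everything worked fine until the reference data became too large to fit in a broadcast variable...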
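For reference, the broadcast-style lookup described above can be sketched in plain Scala. This is only an illustration of the chain-following logic on an in-memory map. The object name `FinalIdentityLookup` and the `changes` map are assumptions, and the sketch assumes the chains contain no cycles.

```scala
import scala.annotation.tailrec

object FinalIdentityLookup {
  // Hypothetical in-memory form of the (lastId, newId) pairs from the question.
  val changes: Map[Long, Long] = Map(10L -> 43L, 85L -> 90L, 43L -> 50L)

  // Follow the chain until an ID has no successor.
  // Assumes there are no cycles; a cycle would make this loop forever.
  @tailrec
  def getFinalIdentity(id: Long, changes: Map[Long, Long]): Long =
    changes.get(id) match {
      case Some(next) => getFinalIdentity(next, changes)
      case None       => id
    }
}
```

With the sample data, `FinalIdentityLookup.getFinalIdentity(10L, FinalIdentityLookup.changes)` follows 10 -> 43 -> 50 and returns 50. An ID that never changed is returned as-is.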
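I came up with a solution that stores the identities in an RDD and iterates over it recursively, but it is not very fast and looks very complicated.
Is there an elegant and fast way to do this?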
Answer 0 (score: 1)
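Have you considered a graph?
You can build a graph from the edge list (lastId, newId). In that graph, a node with no outgoing edges is the final ID for the node with no incoming edges at the start of the same chain.
This can be done in Spark with GraphX.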
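The observation about nodes without outgoing or incoming edges can be checked on the question's sample pairs without Spark. The following is a minimal plain-Scala sketch. The object name `TerminalIds` and its fields are illustrative and not part of the original answer.

```scala
object TerminalIds {
  // Edge list from the question: (lastId, newId)
  val edges: Seq[(Long, Long)] = Seq((10L, 43L), (85L, 90L), (43L, 50L))

  // IDs that appear as the start or the end of an edge
  val sources: Set[Long] = edges.map(_._1).toSet
  val targets: Set[Long] = edges.map(_._2).toSet

  // No outgoing edge: the ID never appears as a lastId, so it ends a chain.
  val finalIds: Set[Long] = targets -- sources

  // No incoming edge: the ID never appears as a newId, so it starts a chain.
  val firstIds: Set[Long] = sources -- targets
}
```

For the sample data, `finalIds` contains 50 and 90, and `firstIds` contains 10 and 85. The remaining work, matching each starting ID to the end of its own chain, is what the graph traversal is for.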
Here is an example. For each ID it shows the first ID in its chain. That is, for the chain of changes (1 -> 2 -> 3) the result will be (1, 1), (2, 1), (3, 1).
import org.apache.spark.graphx.{EdgeDirection, EdgeTriplet, Graph, VertexId}
import org.apache.spark.{SparkConf, SparkContext}

object Main {
  val conf = new SparkConf().setAppName("myapp").setMaster("local[*]")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    sc.setLogLevel("ERROR")

    // RDD of pairs (oldId, newId)
    val changedIds = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 4L), (10L, 20L), (20L, 31L), (30L, 40L), (100L, 200L), (200L, 300L)))

    // Case classes for the Pregel operation
    case class Value(originId: VertexId) // vertex value
    case class Message(value: VertexId)  // message sent from one vertex to another

    // Create the graph from the ID pairs
    val graph = Graph.fromEdgeTuples(changedIds, Value(0))

    // The initial message is sent to all vertices at the start
    val initialMsg = Message(0)

    // How a vertex should process a received message
    def onMsgReceive(vertexId: VertexId, value: Value, msg: Message): Value = {
      // The initial message has value 0 (this assumes 0 is never used as a real ID).
      // In that case the vertex initializes its value to its own ID
      if (msg.value == 0) Value(vertexId)
      // Otherwise the received value is the origin ID
      else Value(msg.value)
    }

    // How vertices should send messages
    def sendMsg(triplet: EdgeTriplet[Value, Int]): Iterator[(VertexId, Message)] = {
      // For each triplet a single message is sent to the destination vertex
      // Its payload is the origin ID of the source vertex
      Iterator((triplet.dstId, Message(triplet.srcAttr.originId)))
    }

    // How incoming messages to one vertex should be merged
    def mergeMsg(msg1: Message, msg2: Message): Message = {
      // Generally this case is an error,
      // because one ID can't have 2 different origin IDs
      msg2 // Just return one of the incoming messages
    }

    // Kick off the Pregel calculation
    val res = graph
      .pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(onMsgReceive, sendMsg, mergeMsg)

    // Print results
    res.vertices.collect().foreach(println)
  }
}
Output, as (id, Value(firstId)):
(100,Value(100))
(4,Value(1))
(300,Value(100))
(200,Value(100))
(40,Value(30))
(20,Value(10))
(1,Value(1))
(30,Value(30))
(10,Value(10))
(2,Value(1))
(3,Value(1))
(31,Value(10))
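The example above yields the first ID of each chain, whereas the question asks for the last one. One possible adaptation, sketched here under stated assumptions, is to reverse the edges with `Graph.reverse` so that values flow from newId back to lastId. The object name `FinalIdsSketch`, the helper `finalIds` and the `-1L` sentinel are illustrative choices, not part of the original answer. The sketch assumes that chains contain no cycles, that real IDs are never negative, and that each ID changes to at most one new ID.

```scala
import org.apache.spark.graphx.{EdgeDirection, EdgeTriplet, Graph, VertexId}
import org.apache.spark.{SparkConf, SparkContext}

object FinalIdsSketch {
  // Sentinel for "not initialized yet"; assumes real IDs are never negative.
  val NoId: Long = -1L

  // Returns a map from every ID seen in the pairs to the last ID of its chain.
  def finalIds(sc: SparkContext, pairs: Seq[(Long, Long)]): Map[Long, Long] = {
    // Build the graph from (lastId, newId) and reverse every edge,
    // so that messages travel from newId back to lastId.
    val graph = Graph.fromEdgeTuples(sc.parallelize(pairs), NoId).reverse

    val res = graph.pregel(NoId, Int.MaxValue, EdgeDirection.Out)(
      // On the initial message a vertex adopts its own ID, later the received one.
      (id: VertexId, value: Long, msg: Long) => if (msg == NoId) id else msg,
      // Pass the currently known final ID on to the previous ID in the chain.
      (t: EdgeTriplet[Long, Int]) => Iterator((t.dstId, t.srcAttr)),
      // With at most one successor per ID there is only one message per vertex.
      (a: Long, b: Long) => a
    )
    res.vertices.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("final-ids").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    // Sample pairs from the question
    finalIds(sc, Seq((10L, 43L), (85L, 90L), (43L, 50L))).toSeq.sorted.foreach(println)
    sc.stop()
  }
}
```

For the pairs from the question, this should map 10 and 43 to 50, and 85 to 90, with the final IDs 50 and 90 mapping to themselves. The resulting vertices could then be joined with the user RDD by ID instead of being collected to the driver.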