Handling multiple subgraphs in Apache Spark GraphX

Posted: 2015-09-02 15:40:11

Tags: scala apache-spark spark-graphx

I have a parent graph that I want to filter into multiple subgraphs, so that I can apply a function to each subgraph and extract some data. My code looks like this:

val myTerms = <RDD of terms I want to use to filter the graph>
val myVertices = ...
val myEdges = ...
val myGraph = Graph(myVertices, myEdges)

val myResults : RDD[(<Tuple>)] = myTerms.map { x => mySubgraphFunction(myGraph, x) }

where mySubgraphFunction is a function that creates the subgraph, performs a computation, and returns a tuple of result data.
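mySubgraphFunction itself isn't reproduced here; a minimal sketch of its shape, assuming String vertex attributes and an edge count as the computation (both assumptions, not specified above), might look like:

import org.apache.spark.graphx._

// Hypothetical sketch only: restrict the parent graph to vertices whose
// attribute contains the term, then run some computation on the subgraph.
def mySubgraphFunction(graph: Graph[String, String], term: String): (String, Long) = {
  val sub = graph.subgraph(
    epred = _ => true,                          // keep all edges between surviving vertices
    vpred = (id, attr) => attr.contains(term)   // keep vertices matching the term
  )
  (term, sub.edges.count())                     // placeholder computation; count() triggers the job
}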

When I run this, I get a Java NullPointerException where mySubgraphFunction calls GraphX's subgraph. I can get it to work if I call collect on the RDD of terms (and also add persistence to the RDDs):

val myTerms = <RDD of terms I want to use to filter the graph>
val myVertices = <read RDD>.persist(StorageLevel.MEMORY_ONLY_SER)
val myEdges = <read RDD>.persist(StorageLevel.MEMORY_ONLY_SER)
val myGraph = Graph(myVertices, myEdges)

val myResults : Array[(<Tuple>)] = myTerms.collect().map { x =>
                 mySubgraphFunction(myGraph, x) }

Is there a way to get this to work without having to call collect() (i.e., keeping it a distributed operation)? I'm creating ~1k subgraphs and performance is slow.
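For context, the NullPointerException is consistent with the graph's underlying RDDs being referenced inside a transformation on myTerms: Spark cannot nest RDD operations, so the Graph captured in the closure is unusable on the executors. One stopgap sketch, assuming the terms are Strings and fit comfortably on the driver, keeps the collect() but overlaps the ~1k per-term Spark jobs using a Scala parallel collection:

// Sketch: .par (standard in the Scala 2.10/2.11 builds Spark used at the time)
// lets the driver submit the per-term subgraph jobs concurrently instead of
// strictly one after another.
val myResults: Array[(String, Long)] =          // tuple type follows the sketch above
  myTerms.collect().par
    .map(term => mySubgraphFunction(myGraph, term))
    .toArray

This doesn't make any single subgraph cheaper to build; it only lets the jobs overlap in Spark's scheduler.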

0 Answers