Handling multiple subgraphs in Apache Spark GraphX

Posted: 2015-09-02 15:40:11

Tags: scala apache-spark spark-graphx

I have a parent graph that I want to filter into multiple subgraphs, so that I can apply a function to each subgraph and extract some data. My code looks like this:

val myTerms = <RDD of terms I want to use to filter the graph>
val myVertices = ...
val myEdges = ...
val myGraph = Graph(myVertices, myEdges)

val myResults : RDD[(<Tuple>)] = myTerms.map { x => mySubgraphFunction(myGraph, x) }

where mySubgraphFunction is a function that creates the subgraph, performs a computation, and returns a tuple of result data.
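mySubgraphFunction itself isn't reproduced here; a minimal sketch of its shape, assuming String vertex attributes and an edge count as the computation (both assumptions, not specified above), might look like:

import org.apache.spark.graphx._

// Hypothetical sketch only: restrict the parent graph to vertices whose
// attribute contains the term, then run some computation on the subgraph.
def mySubgraphFunction(graph: Graph[String, String], term: String): (String, Long) = {
  val sub = graph.subgraph(
    epred = _ => true,                          // keep all edges between surviving vertices
    vpred = (id, attr) => attr.contains(term)   // keep vertices matching the term
  )
  (term, sub.edges.count())                     // placeholder computation; count() triggers the job
}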

When I run this, I get a Java NullPointerException where mySubgraphFunction calls GraphX's subgraph. I can get it to work if I call collect on the RDD of terms (and also add persistence to the RDDs):

val myTerms = <RDD of terms I want to use to filter the graph>
val myVertices = <read RDD>.persist(StorageLevel.MEMORY_ONLY_SER)
val myEdges = <read RDD>.persist(StorageLevel.MEMORY_ONLY_SER)
val myGraph = Graph(myVertices, myEdges)

val myResults : Array[(<Tuple>)] = myTerms.collect().map { x =>
                 mySubgraphFunction(myGraph, x) }

Is there a way to get this to work without having to call collect() (i.e., keeping it a distributed operation)? I'm creating ~1k subgraphs and performance is slow.
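For context, the NullPointerException is consistent with the graph's underlying RDDs being referenced inside a transformation on myTerms: Spark cannot nest RDD operations, so the Graph captured in the closure is unusable on the executors. One stopgap sketch, assuming the terms are Strings and fit comfortably on the driver, keeps the collect() but overlaps the ~1k per-term Spark jobs using a Scala parallel collection:

// Sketch: .par (standard in the Scala 2.10/2.11 builds Spark used at the time)
// lets the driver submit the per-term subgraph jobs concurrently instead of
// strictly one after another.
val myResults: Array[(String, Long)] =          // tuple type follows the sketch above
  myTerms.collect().par
    .map(term => mySubgraphFunction(myGraph, term))
    .toArray

This doesn't make any single subgraph cheaper to build; it only lets the jobs overlap in Spark's scheduler.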

0 Answers