Spark: GraphX fails to find connected components in graphs with few edges and long paths

Date: 2016-04-27 14:07:34

Tags: apache-spark spark-graphx connected-components

I am new to Spark and GraphX and ran some experiments with its algorithm for finding connected components. I noticed that the structure of the graph seems to have a big impact on performance.

It was able to compute graphs with millions of vertices and edges, but for a certain class of graphs the algorithm did not finish in time and eventually failed with OutOfMemoryError: GC overhead limit exceeded.

The algorithm seems to have problems with graphs that contain long paths. For example, the computation fails for this graph: { (i,i+1) | i <- {1..200} }. However, when I add transitive edges, the computation finishes instantly:

{ (i,j) | i <- {1..200}, j <- {i+1..200} }

A graph like this is also no problem:

{ (i,1) | i <- {1..200} }
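
For readers who want to try these shapes themselves, the following sketch (an editor's addition, not part of the original post) writes the three example graphs in the whitespace-separated edge-list format that GraphLoader.edgeListFile expects; the file names and the small writeEdges helper are made up for illustration:

import java.io.PrintWriter

// Write edges as "src dst" lines, one edge per line.
def writeEdges(path: String, edges: Seq[(Int, Int)]): Unit = {
  val out = new PrintWriter(path)
  try edges.foreach { case (src, dst) => out.println(s"$src $dst") }
  finally out.close()
}

// Long path 1 -> 2 -> ... -> 201: the case that fails.
writeEdges("path.graph", (1 to 200).map(i => (i, i + 1)))

// Path plus all transitive edges: finishes instantly.
writeEdges("transitive.graph", for (i <- 1 to 200; j <- (i + 1) to 200) yield (i, j))

// Every vertex connected directly to vertex 1 (includes the harmless self-loop (1,1),
// as in the notation above): also no problem.
writeEdges("star.graph", (1 to 200).map(i => (i, 1)))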

Here is a minimal example that reproduces the problem:

import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._
import org.apache.spark.storage.StorageLevel
import scala.collection.mutable

object Matching extends Logging {

  def main(args: Array[String]): Unit = {
    val fname = "input.graph"

    // Parse the remaining arguments as --key=value options.
    val optionsList = args.drop(1).map { arg =>
      arg.dropWhile(_ == '-').split('=') match {
        case Array(opt, v) => opt -> v
        case _ => throw new IllegalArgumentException("Invalid argument: " + arg)
      }
    }
    val options = mutable.Map(optionsList: _*)

    val conf = new SparkConf()
    GraphXUtils.registerKryoClasses(conf)

    val partitionStrategy: Option[PartitionStrategy] = options.remove("partStrategy")
      .map(PartitionStrategy.fromString(_))
    val edgeStorageLevel = options.remove("edgeStorageLevel")
      .map(StorageLevel.fromString(_)).getOrElse(StorageLevel.MEMORY_ONLY)
    val vertexStorageLevel = options.remove("vertexStorageLevel")
      .map(StorageLevel.fromString(_)).getOrElse(StorageLevel.MEMORY_ONLY)

    val sc = new SparkContext(conf.setAppName("ConnectedComponents(" + fname + ")"))

    // Load the edge list and optionally repartition it.
    val unpartitionedGraph = GraphLoader.edgeListFile(sc, fname,
      edgeStorageLevel = edgeStorageLevel,
      vertexStorageLevel = vertexStorageLevel).cache()
    log.info("Loading graph...")
    val graph = partitionStrategy.foldLeft(unpartitionedGraph)(_.partitionBy(_))
    log.info("Loading graph...done")

    log.info("Computing connected components...")
    val cc = ConnectedComponents.run(graph)
    log.info("Computed connected components...done")

    sc.stop()
  }
}

The file input.graph looks like this (10 nodes, with 9 edges connecting them):

1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10

When it fails, it hangs in ConnectedComponents.run(graph). The error message is:

Exception in thread "dag-scheduler-event-loop" java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.regex.Pattern.compile(Pattern.java:1054)
    at java.lang.String.replace(String.java:2239)
    at org.apache.spark.util.Utils$.getFormattedClassName(Utils.scala:1632)
    at org.apache.spark.storage.RDDInfo$$anonfun$1.apply(RDDInfo.scala:58)
    at org.apache.spark.storage.RDDInfo$$anonfun$1.apply(RDDInfo.scala:58)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:58)
    at org.apache.spark.scheduler.StageInfo$$anonfun$1.apply(StageInfo.scala:80)
    at org.apache.spark.scheduler.StageInfo$$anonfun$1.apply(StageInfo.scala:80)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at org.apache.spark.scheduler.StageInfo$.fromStage(StageInfo.scala:80)
    at org.apache.spark.scheduler.Stage.<init>(Stage.scala:99)
    at org.apache.spark.scheduler.ShuffleMapStage.<init>(ShuffleMapStage.scala:44)
    at org.apache.spark.scheduler.DAGScheduler.newShuffleMapStage(DAGScheduler.scala:317)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$newOrUsedShuffleStage(DAGScheduler.scala:352)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage$1.apply(DAGScheduler.scala:286)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage$1.apply(DAGScheduler.scala:285)
    at scala.collection.Iterator$class.foreach(Iterator.scala:742)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.mutable.Stack.foreach(Stack.scala:170)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:285)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:389)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$visit$1$1.apply(DAGScheduler.scala:386)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.scheduler.DAGScheduler.visit$1(DAGScheduler.scala:386)
    at org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:398)

I am running a local Spark node and start the JVM with the following options:

-Dspark.master=local -Dspark.local.dir=/home/phil/tmp/spark-tmp -Xms8g -Xmx8g

Can you help me understand why it has problems with this toy graph (201 nodes and 200 edges), but can on the other hand solve a real graph with millions of edges within 80 seconds? (I used the same settings and configuration in both cases.)

Update

It can also be reproduced in the spark-shell.
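
A minimal session along the following lines reproduces the hang; this is an editor's sketch assuming the same input.graph file as above, not necessarily the exact commands from the original post:

// spark-shell already provides a SparkContext as `sc`.
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

// Load the path graph from the edge-list file.
val graph = GraphLoader.edgeListFile(sc, "input.graph")

// This call never finishes and eventually fails with the GC overhead error shown above.
val cc = ConnectedComponents.run(graph)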

I have created a bug report: SPARK-15042

1 Answer:

Answer 0 (score: 0)

According to SPARK-15042, the problem still exists in 2.1.0-SNAPSHOT.

The progress of the bug fix can be followed in SPARK-5484.
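
As background for what those tickets address (an editor's note, not part of the original answer): ConnectedComponents is built on Pregel, which needs roughly one superstep per unit of graph diameter, so the 200-edge path requires about 200 iterations; each iteration extends the RDD lineage, and without periodic checkpointing the driver-side DAG keeps growing until the GC overhead error occurs. The sketch below shows the kind of configuration the SPARK-5484 work enables; the property name spark.graphx.pregel.checkpointInterval and the checkpoint directory are assumptions based on my reading of that ticket and may differ in your Spark version:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib._

val conf = new SparkConf()
  .setAppName("ConnectedComponents(input.graph)")
  .setMaster("local")
  // Assumed property from the SPARK-5484 fix (Spark 2.2+): checkpoint the Pregel
  // graphs every N iterations so the lineage stays short. Older versions ignore it.
  .set("spark.graphx.pregel.checkpointInterval", "10")

val sc = new SparkContext(conf)
// Checkpointing needs a directory on reliable storage; this path is illustrative.
sc.setCheckpointDir("/home/phil/tmp/spark-checkpoints")

val graph = GraphLoader.edgeListFile(sc, "input.graph")
val cc = ConnectedComponents.run(graph)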