GraphX on Spark throws a StackOverflowError in a loop

Date: 2019-02-19 07:51:29

Tags: scala apache-spark

I am using GraphX with Scala to do some computation. I need to repeatedly remove vertices from the graph in a loop, but the code always throws java.lang.StackOverflowError:

        var rawG = GraphLoader.edgeListFile(sc, inputFilePath,
            edgeStorageLevel = StorageLevel.MEMORY_AND_DISK, vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
        while (rawG.vertices.count() > 0) {
            // Minimum degree in the current graph
            val stepDegree = rawG.degrees.map(v => v._2).min()
            // Store each vertex's degree as its vertex attribute
            val g = rawG.joinVertices[Int](rawG.degrees)((_, _, newDeg) => newDeg)
            // Keep only the vertices whose degree is strictly greater than the minimum
            rawG = g.subgraph(vpred = (vid, nodeDegree) => {
                nodeDegree > stepDegree
            })
        }

The initial graph has 100 million vertices. The graph gets smaller on every iteration, but the StackOverflowError always shows up after about 100 iterations. My guess is that g.subgraph creates a new subgraph each time while the memory of the old graph objects is not released immediately; could that be what causes the exception?
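One workaround I am considering is to truncate the RDD lineage by checkpointing the graph every few iterations and unpersisting the previous iteration's graph. A rough sketch of that idea (the checkpoint directory and the interval of 20 are placeholder values):

        import org.apache.spark.graphx._
        import org.apache.spark.storage.StorageLevel

        // Placeholder directory; lineage is only truncated once a checkpoint is materialized
        sc.setCheckpointDir("/tmp/graphx-checkpoints")

        var rawG = GraphLoader.edgeListFile(sc, inputFilePath,
            edgeStorageLevel = StorageLevel.MEMORY_AND_DISK, vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
        var iter = 0
        while (rawG.vertices.count() > 0) {
            val stepDegree = rawG.degrees.map(_._2).min()
            val g = rawG.joinVertices(rawG.degrees)((_, _, newDeg) => newDeg)
            val next = g.subgraph(vpred = (_, deg) => deg > stepDegree).cache()
            iter += 1
            if (iter % 20 == 0) {        // placeholder interval
                next.checkpoint()
                // Graph.checkpoint is lazy; run an action on both RDDs to materialize it
                next.vertices.count()
                next.edges.count()
            }
            rawG.unpersist(blocking = false) // drop the previous iteration's cached data
            rawG = next
        }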

I also tried setting the graph's storage level to StorageLevel.MEMORY_AND_DISK, but it throws the error "Cannot change storage level of an RDD after it was already assigned a level":

scala> g.checkpoint

scala> g
res15: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.impl.GraphImpl@79863795

scala> g.persist(StorageLevel.MEMORY_AND_DISK)
java.lang.UnsupportedOperationException: Cannot change storage level of an RDD after it was already assigned a level
  at org.apache.spark.rdd.RDD.persist(RDD.scala:170)
  at org.apache.spark.rdd.RDD.persist(RDD.scala:195)
  at org.apache.spark.graphx.impl.VertexRDDImpl.persist(VertexRDDImpl.scala:57)
  at org.apache.spark.graphx.impl.VertexRDDImpl.persist(VertexRDDImpl.scala:27)
  at org.apache.spark.graphx.impl.GraphImpl.persist(GraphImpl.scala:54)
  ... 53 elided

I read the GraphX source: Graph.persist eventually calls RDD.persist(newLevel: StorageLevel). In my case RDD.isLocallyCheckpointed always returns false, so RDD.persist() refuses to change a storage level that was already assigned.
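Since RDD.persist only throws when a different level is being assigned, one way to avoid the exception is to check the current storage level first and only persist when none has been set yet. A minimal sketch, assuming g is the graph from the transcript above (in my case the level was already assigned at load time, so the branch is simply skipped):

        import org.apache.spark.storage.StorageLevel

        // persist() only fails when changing an already-assigned level,
        // so call it only if no storage level has been set yet
        if (g.vertices.getStorageLevel == StorageLevel.NONE) {
            g.persist(StorageLevel.MEMORY_AND_DISK)
        }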

0 Answers