Spark GraphX: class not found error on EMR cluster

Date: 2018-03-22 16:33:31

Tags: scala apache-spark rdd spark-graphx

I am trying to process hierarchical data using GraphX Pregel, and my code works fine locally.

But when I run it on my Amazon EMR cluster, it gives me this error:

java.lang.NoClassDefFoundError: Could not initialize class

What could be the reason for this? I know the class is in the jar file, because it runs fine locally and there are no build errors.

I have included the GraphX dependency in the pom file.

Here is the code snippet that throws the error:

import scala.util.hashing.MurmurHash3

import org.apache.spark.graphx.{Edge, EdgeDirection, Graph}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

def calcTopLevelHierarcy(vertexDF: DataFrame, edgeDF: DataFrame): RDD[(Any, (Int, Any, String, Int, Int))] = {

  // create the vertex RDD, keyed by a hash of the first column
  val verticesRDD = vertexDF.rdd
    .map { x => (x.get(0), x.get(1), x.get(2)) }
    .map { x => (MurmurHash3.stringHash(x._1.toString).toLong, (x._1.asInstanceOf[Any], x._2.asInstanceOf[Any], x._3.asInstanceOf[String])) }

  // create the edge RDD: top-down relationship
  val EdgesRDD = edgeDF.rdd.map { x => (x.get(0), x.get(1)) }
    .map { x => Edge(MurmurHash3.stringHash(x._1.toString).toLong, MurmurHash3.stringHash(x._2.toString).toLong, "topdown") }

  // build the graph
  val graph = Graph(verticesRDD, EdgesRDD).cache()
  val pathSeperator = """/"""

  // initialize id, level, root, path, iscyclic, isleaf
  val initialMsg = (0L, 0, 0.asInstanceOf[Any], List("dummy"), 0, 1)
  val initialGraph = graph.mapVertices((id, v) => (id, 0, v._2, List(v._3), 0, v._3, 1, v._1))

  // setMsg, sendMsg and mergeMsg are user-defined functions (not shown here)
  val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)

  // build the path from the list
  val hrchyOutRDD = hrchyRDD.vertices.map { case (id, v) => (v._8, (v._2, v._3, pathSeperator + v._4.reverse.mkString(pathSeperator), v._5, v._7)) }
  hrchyOutRDD
}
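
For context, setMsg, sendMsg and mergeMsg are not shown in the snippet above; their shapes are dictated by the GraphX Pregel API. Below is a minimal sketch of the signatures they would need, given the vertex and message types produced above; the bodies are placeholders, not the asker's actual logic.

  import org.apache.spark.graphx.{EdgeTriplet, VertexId}

  // vertex attribute produced by mapVertices, and message type matching initialMsg above
  type VData = (VertexId, Int, Any, List[String], Int, String, Int, Any)
  type Msg   = (Long, Int, Any, List[String], Int, Int)

  // vertex program: merge an incoming message into the vertex value
  def setMsg(id: VertexId, value: VData, msg: Msg): VData = value        // placeholder

  // send messages along out-edges; an empty iterator stops propagation
  def sendMsg(triplet: EdgeTriplet[VData, String]): Iterator[(VertexId, Msg)] =
    Iterator.empty                                                       // placeholder

  // combine two messages addressed to the same vertex
  def mergeMsg(msg1: Msg, msg2: Msg): Msg = msg1                         // placeholder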

I was able to narrow it down to the line that causes the error:

val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)

1 Answer:

Answer 0 (score: 0)

I ran into the same issue: I was able to run it from spark-submit, while it was failing in spark-shell. Here is an example from the code I was trying to execute (it looks the same as yours).

The error that pointed me to the right solution was:

org.apache.spark.SparkException: A master URL must be set in your configuration

In my case, I was getting that error because the SparkContext was defined outside the main function, so it was created while the object itself was being initialized:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object Test {
  // created during object initialization, not inside main
  val sc = SparkContext.getOrCreate
  val sqlContext = new SQLContext(sc)

  def main(args: Array[String]) {
    ...
  }
}

I was able to fix it by moving the SparkContext and sqlContext inside the main function, as described in this other post.
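
For illustration, here is a minimal sketch of the corrected structure, assuming a standalone application submitted with spark-submit (the app name and the job body are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object Test {
    def main(args: Array[String]): Unit = {
      // contexts are created inside main, so they only ever exist on the driver
      val conf = new SparkConf().setAppName("Test")  // master URL comes from spark-submit
      val sc = SparkContext.getOrCreate(conf)
      val sqlContext = new SQLContext(sc)

      // ... rest of the job ...

      sc.stop()
    }
  }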