Question

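I am using Spark GraphX. I am building a graph from a file (about 620 MB, 50K vertices and nearly 50 million edges). I am running on a Spark cluster with 4 workers, each with 8 cores and 13.4 GB of RAM, plus 1 driver with the same specs. When I submit my .jar to the cluster, one of the workers, seemingly chosen at random, loads all the data. All the tasks needed for the computation are sent to that worker. While it computes, the other three do nothing. I have tried everything I could think of, and I have not found anything that forces all the workers to take part in the computation.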

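When Spark builds the graph, I checked the number of partitions of the vertex RDD and it is 5. If I repartition that RDD (for example to 32, the total number of cores), Spark loads data on every worker, but the computation becomes slower.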
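For reference, the partition counts mentioned above can be inspected roughly as follows. This is only a minimal sketch: the object name `PartitionCheck` and the helper `partitionSizes` are hypothetical names introduced for illustration, and the input path is assumed to be passed as the first argument.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.rdd.RDD

// Hypothetical helper object, not part of the original application.
object PartitionCheck {
  // Number of records held by each partition, ordered by partition index.
  def partitionSizes[T](rdd: RDD[T]): Array[Int] =
    rdd.mapPartitions(it => Iterator(it.size)).collect()

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("partition-check"))
    val graph = GraphLoader.edgeListFile(sc, args(0), canonicalOrientation = true)

    println("vertex partitions: " + graph.vertices.partitions.length)
    println("edge partitions:   " + graph.edges.partitions.length)
    // A very uneven list here would point to skewed data placement.
    println("edges per partition: " + partitionSizes(graph.edges).mkString(", "))

    sc.stop()
  }
}
```

The per-partition sizes only show how records are split across partitions; which executor hosts each partition is easier to read from the Executors and Storage tabs of the Spark web UI.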

I launch spark-submit this way:

spark-submit --master spark://172.30.200.20:7077 --driver-memory 12g --executor-memory 12g --class interscore.InterScore /root/interscore/interscore.jar hdfs://172.30.200.20:9000/user/hadoop/interscore/network.dat hdfs://172.30.200.20:9000/user/hadoop/interscore/community.dat 111

The code is here:

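import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

object InterScore extends App {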
  val sparkConf = new SparkConf().setAppName("Big-InterScore")
  val sc = new SparkContext(sparkConf)

  val t0 = System.currentTimeMillis
  runInterScore(args(0), args(1), args(2))
  println("Running time " + (System.currentTimeMillis - t0).toDouble / 1000)

  sc.stop()

  def runInterScore(netPath:String, communitiesPath:String, outputPath:String) = {
    val communities = sc.textFile(communitiesPath).map(x => {
      val a = x.split('\t')
      (a(0).toLong, a(1).toInt)
    }).cache

    val graph = GraphLoader.edgeListFile(sc, netPath, true)
      .partitionBy(PartitionStrategy.RandomVertexCut)
      .groupEdges(_ + _)
      .joinVertices(communities)((_, _, c) => c)
      .cache

    val lvalues = graph.aggregateMessages[Double](
      m => {
        m.sendToDst(if (m.srcAttr != m.dstAttr) 1 else 0)
        m.sendToSrc(if (m.srcAttr != m.dstAttr) 1 else 0)
      }, _ + _)

    val communitiesIndices = communities.map(x => x._2).distinct.collect
    val verticesWithLValue = graph.vertices.repartition(32).join(lvalues).cache
    println("K = " + communitiesIndices.size)
    graph.unpersist()
    graph.vertices.unpersist()
    communitiesIndices.foreach(c => {
      // COMPUTE c
    })
  }
}

Why is Spark partitioning all the data into a single executor?
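Related to the repartitioning mentioned earlier, one variation is to request the edge partition count while loading instead of repartitioning the vertices afterwards. The sketch below is an assumption-laden illustration, not a confirmed fix: `LoadWithPartitions` and `load` are hypothetical names, and the fourth positional argument of `GraphLoader.edgeListFile` is assumed to be the edge partition count (named `numEdgePartitions` in recent Spark releases; older releases treat it as a minimum).

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}

// Hypothetical helper object, not part of the original application.
object LoadWithPartitions {
  // Loads the edge list asking for `numParts` edge partitions up front,
  // then redistributes edges with the same strategy used in the question.
  def load(sc: SparkContext, netPath: String, numParts: Int): Graph[Int, Int] =
    GraphLoader.edgeListFile(sc, netPath, true, numParts)
      .partitionBy(PartitionStrategy.RandomVertexCut)
}
```

It could be called as `LoadWithPartitions.load(sc, args(0), 32)`, with 32 being the total core count from the question. Whether this spreads the work across all four workers without the slowdown seen with `repartition(32)` would still have to be measured.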

0 answers: