RpcTimeoutException when building a co-occurrence graph in Apache Spark

Asked: 2015-12-28 12:22:23

Tags: scala apache-spark

I have a file that maps each documentId to its entities, from which I extract document co-occurrences. The entity RDD looks like this:

//documentId -> (name, type, frequency per document)
val docEntityTupleRDD: RDD[(Int, Iterable[(String, String, Int)])] 
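
For illustration, a single element of this RDD could look like this (the entity names and counts are made up):

// Hypothetical sample: document 42 contains three entities,
// each with its type and per-document frequency
(42, Seq(("Angela Merkel", "PERSON", 5), ("Berlin", "LOCATION", 2), ("EU", "ORGANIZATION", 1)))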

To extract the relationships between the entities in each document, together with their frequencies, I use the following code:

import com.google.common.base.Charsets
import com.google.common.hash.Hashing

def hashId(str: String) = {
    Hashing.md5().hashString(str, Charsets.UTF_8).asLong()
}

val docRelTupleRDD = docEntityTupleRDD
  //flatMap at SampleGraph.scala:62
  .flatMap { case(docId, entities) =>
    val entitiesWithId = entities.map { case(name, _, freq) => (hashId(name), freq) }.toList
    val relationships = entitiesWithId.combinations(2).collect {
      case Seq((id1, freq1), (id2, freq2)) if id1 != id2 =>
        // Make sure left side is less than right side
        val (first, second) = if (id1 < id2) (id1, id2) else (id2, id1)
        ((first, second), (docId.toInt, freq1 * freq2))
    }
    relationships
  }
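
Since combinations(2) emits one record per unordered entity pair, a document with n distinct entities produces roughly n*(n-1)/2 pair records. A quick sketch along these lines (not part of the job above, names are assumed) could show whether a few huge documents dominate the flatMap output:

// Sketch: number of entity pairs each document contributes to the flatMap output
val pairsPerDoc = docEntityTupleRDD.map { case (docId, entities) =>
  val n = entities.size.toLong
  (docId, n * (n - 1) / 2)
}
// Ten documents generating the most pairs
pairsPerDoc.sortBy({ case (_, pairs) => pairs }, ascending = false).take(10).foreach(println)

The pair records are then aggregated into one edge per (entity, entity) key: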


import scala.collection.immutable
import org.apache.spark.graphx.Edge

val zero = collection.mutable.Map[Int, Int]()
val edges: RDD[Edge[immutable.Map[Int, Int]]] = docRelTupleRDD
  .aggregateByKey(zero)(
    (map, v) => map += v,
    (map1, map2) => map1 ++= map2
  )
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap.toMap) }
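
For comparison, a roughly equivalent formulation with reduceByKey and immutable maps would look like this (just a sketch; the behaviour should match the aggregateByKey version above):

// Sketch: merge immutable per-document maps instead of mutating a shared map
val edgesAlt: RDD[Edge[Map[Int, Int]]] = docRelTupleRDD
  .mapValues { case (docId, freqProduct) => Map(docId -> freqProduct) }
  .reduceByKey(_ ++ _)
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap) }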

Each edge stores the relationship frequency per document in a Map. When I try to write the edges to a file:

edges.saveAsTextFile(outputFile + "_edges")
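
each output line would simply be the Edge case class's toString, e.g. (made-up ids and counts):

Edge(-3918116384431656240,5230983752195145597,Map(3 -> 12, 7 -> 4))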

After a while, however, I get the following error:

15/12/28 02:39:40 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 127198 ms exceeds timeout 120000 ms
15/12/28 02:39:40 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 127198 ms
15/12/28 02:39:40 INFO TaskSetManager: Re-queueing tasks for driver from TaskSet 0.0
15/12/28 02:42:50 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:42:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): ExecutorLostFailure (executor driver lost)
15/12/28 02:43:55 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
15/12/28 02:46:04 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, localhost): ExecutorLostFailure (executor driver lost)
[...]
15/12/28 02:47:07 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/28 02:48:36 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:49:39 INFO TaskSchedulerImpl: Cancelling stage 0
15/12/28 02:49:39 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at SampleGraph.scala:62) failed in 3321.145 s
15/12/28 02:51:06 WARN SparkContext: Killing executors is only supported in coarse-grained mode
[...]

My Spark configuration looks like this:

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .setAppName("wordCount")
  .setMaster("local[8]")
  .set("spark.executor.memory", "8g")
  .set("spark.driver.maxResultSize", "8g")
  // Increase memory fraction to prevent disk spilling
  .set("spark.shuffle.memoryFraction", "0.3")
  // Disable spilling
  // If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
  // This spilling threshold is specified by spark.shuffle.memoryFraction.
  .set("spark.shuffle.spill", "false")

I have already increased the executor memory and, after researching on the internet, replaced a former reduceByKey construct with the aggregateByKey shown above. The error stays the same. Can anyone help me?

0 Answers:

There are no answers yet.