I have a file that maps each documentId to its entities, and I want to extract document co-occurrences from it. The entity RDD looks like this:
//documentId -> (name, type, frequency per document)
val docEntityTupleRDD: RDD[(Int, Iterable[(String, String, Int)])]
To extract the relationships between entities within each document, together with their frequencies, I use the following code:
import com.google.common.base.Charsets
import com.google.common.hash.Hashing
import org.apache.spark.graphx.Edge
import org.apache.spark.rdd.RDD
import scala.collection.immutable

// Derive a Long vertex id from the entity name (first 8 bytes of its MD5 hash)
def hashId(str: String) = {
  Hashing.md5().hashString(str, Charsets.UTF_8).asLong()
}

val docRelTupleRDD = docEntityTupleRDD
  //flatMap at SampleGraph.scala:62
  .flatMap { case (docId, entities) =>
    val entitiesWithId = entities.map { case (name, _, freq) => (hashId(name), freq) }.toList
    // Every unordered pair of distinct entities in the document becomes one relationship
    val relationships = entitiesWithId.combinations(2).collect {
      case Seq((id1, freq1), (id2, freq2)) if id1 != id2 =>
        // Make sure left side is less than right side
        val (first, second) = if (id1 < id2) (id1, id2) else (id2, id1)
        ((first, second), (docId, freq1 * freq2))
    }
    relationships
  }

// Per edge key, collect docId -> frequency product into a single map
val zero = collection.mutable.Map[Int, Int]()
val edges: RDD[Edge[immutable.Map[Int, Int]]] = docRelTupleRDD
  .aggregateByKey(zero)(
    (map, v) => map += v,
    (map1, map2) => map1 ++= map2
  )
  .map { case ((e1, e2), freqMap) => Edge(e1, e2, freqMap.toMap) }
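To make the intended result concrete, here is a minimal sketch of the same pairing and aggregation on plain Scala collections, without Spark. The object name `CooccurrenceSketch`, the helpers `pairsForDoc` and `edgeMaps`, the sample entities, and the small ids in `ids` are all made up for illustration; the ids merely stand in for `hashId(name)` and are not real MD5 values.

```scala
object CooccurrenceSketch {
  type EdgeKey = (Long, Long)

  // Hypothetical ids standing in for hashId(name); not real MD5 values.
  val ids: Map[String, Long] = Map("Alice" -> 3L, "Berlin" -> 1L, "ACME" -> 2L)

  // All unordered pairs of distinct entities in one document,
  // keyed by (smaller id, larger id) and carrying (docId, freq1 * freq2).
  def pairsForDoc(docId: Int,
                  entities: Iterable[(String, String, Int)]): List[(EdgeKey, (Int, Int))] = {
    val withId = entities.map { case (name, _, freq) => (ids(name), freq) }.toList
    withId.combinations(2).collect {
      case Seq((id1, f1), (id2, f2)) if id1 != id2 =>
        val key = if (id1 < id2) (id1, id2) else (id2, id1)
        (key, (docId, f1 * f2))
    }.toList
  }

  // Local analogue of the aggregateByKey step: one docId -> frequency map per edge key.
  def edgeMaps(docs: Seq[(Int, Iterable[(String, String, Int)])]): Map[EdgeKey, Map[Int, Int]] =
    docs
      .flatMap { case (docId, entities) => pairsForDoc(docId, entities) }
      .groupBy(_._1)
      .map { case (key, hits) => key -> hits.map(_._2).toMap }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      1 -> List(("Alice", "PER", 2), ("Berlin", "LOC", 3), ("ACME", "ORG", 1)),
      2 -> List(("Alice", "PER", 1), ("Berlin", "LOC", 4))
    )
    edgeMaps(docs).foreach(println)
  }
}
```

In this sketch a document with n distinct entities yields n * (n - 1) / 2 pairs, so the three-entity document above contributes three edge entries and the two-entity document contributes one.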
Each edge stores the relationship frequency per document in a Map. When I try to write the edges to a file:
edges.saveAsTextFile(outputFile + "_edges")
I get the following error after some time:
15/12/28 02:39:40 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 127198 ms exceeds timeout 120000 ms
15/12/28 02:39:40 ERROR TaskSchedulerImpl: Lost executor driver on localhost: Executor heartbeat timed out after 127198 ms
15/12/28 02:39:40 INFO TaskSetManager: Re-queueing tasks for driver from TaskSet 0.0
15/12/28 02:42:50 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 1 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:42:50 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): ExecutorLostFailure (executor driver lost)
15/12/28 02:43:55 ERROR TaskSetManager: Task 2 in stage 0.0 failed 1 times; aborting job
15/12/28 02:46:04 WARN TaskSetManager: Lost task 5.0 in stage 0.0 (TID 5, localhost): ExecutorLostFailure (executor driver lost)
[...]
15/12/28 02:47:07 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/12/28 02:48:36 WARN AkkaRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@64c6e4c4,BlockManagerId(driver, localhost, 35375))] in 2 attempts
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
15/12/28 02:49:39 INFO TaskSchedulerImpl: Cancelling stage 0
15/12/28 02:49:39 INFO DAGScheduler: ShuffleMapStage 0 (flatMap at SampleGraph.scala:62) failed in 3321.145 s
15/12/28 02:51:06 WARN SparkContext: Killing executors is only supported in coarse-grained mode
[...]
My Spark configuration looks like this:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .setAppName("wordCount")
  .setMaster("local[8]")
  .set("spark.executor.memory", "8g")
  .set("spark.driver.maxResultSize", "8g")
  // Increase memory fraction to prevent disk spilling
  .set("spark.shuffle.memoryFraction", "0.3")
  // Disable spilling
  // If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
  // This spilling threshold is specified by spark.shuffle.memoryFraction.
  .set("spark.shuffle.spill", "false")
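I have already increased the executor memory and, after some research online, refactored an earlier reduceByKey construct to use aggregateByKey. The error stays the same. Can someone help me?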