Question

I'm building a relatively simple Spark application. In general, the logic looks like this:

val file1 = sc.textFile("s3://file1/*")
val file2 = sc.textFile("s3://file2/*")
// map over files
val file1Map = file1.map(word => (word, "val1"))
val file2Map = file2.map(differentword => (differentword, "val2"))
val unionRdd = file1Map.union(file2Map)
val groupedUnion = unionRdd.groupByKey()
val output = groupedUnion.map(tuple => {
    // do something that requires all the values, return new object
    if(oneThingIsTrue) tuple._1 else "null"
}).filter(line => line != "null")
output.saveAsTextFile("s3://newfile/")

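The problem appears when I run this on a larger dataset. With a dataset of around 700GB, it runs without errors. When I roughly double it to 1.6TB, the job gets about halfway before timing out. Here is the error log: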

INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@172.31.4.36:39743)
ERROR MapOutputTrackerWorker: Error communicating with MapOutputTracker
    org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [800 seconds]. This timeout is controlled by spark.network.timeout

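I tried increasing the network timeout to 800 seconds and then 1600 seconds, but all that does is delay the error. I'm running the code on 10 r4.2xl instances, each with 8 cores and 62GB of memory. I have EBS set up with 3TB of storage. I'm running this code through Zeppelin on Amazon EMR.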

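Can anyone help me debug this? The cluster's CPU usage stays close to 90% the whole time until the job gets about halfway, then drops all the way to 0. Another interesting thing is that it appears to fail during the second stage. As you can see from the trace, it starts fetching the map outputs and never receives them.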

Here is a screenshot from Ganglia.

Answer 1

I'm still not sure what caused this, but I was able to work around it by coalescing unionRdd and then grouping the result. I changed the code above to:

...
// union rdd is 30k partitions, coalesce into 8k
val unionRdd = file1Map.union(file2Map)
val col = unionRdd.coalesce(8000) 
val groupedUnion = col.groupByKey()
...

It may not be the most efficient approach, but it works.
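For anyone who wants to try the same idea on a small scale, here is a minimal, self-contained sketch of the coalesce-then-group pattern. It is only an illustration under assumptions: the helper name `coalesceThenGroup`, the local master, the tiny in-memory inputs, and the partition counts (6 + 6 coalesced to 4) are all made up and stand in for the 30k and 8k partitions above. Because `coalesce` does not shuffle by default, it simply merges input partitions, so the following `groupByKey` shuffle starts from fewer map tasks.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CoalesceBeforeGroupBy {
  // Hypothetical helper: merge partitions without a shuffle, then group by key.
  def coalesceThenGroup(
      pairs: RDD[(String, String)],
      numPartitions: Int): RDD[(String, Iterable[String])] =
    pairs.coalesce(numPartitions).groupByKey()

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; on EMR/Zeppelin `sc` already exists.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("coalesce-before-groupByKey")
      .getOrCreate()
    val sc = spark.sparkContext

    // Small stand-ins for the two S3 inputs, 6 partitions each (assumed values).
    val file1Map = sc.parallelize(Seq("a", "b", "c"), numSlices = 6).map(w => (w, "val1"))
    val file2Map = sc.parallelize(Seq("b", "c", "d"), numSlices = 6).map(w => (w, "val2"))

    // A union keeps the partitions of both inputs side by side.
    val unionRdd = file1Map.union(file2Map)
    println(s"union partitions: ${unionRdd.getNumPartitions}")

    // Fewer partitions go into the shuffle that groupByKey triggers.
    val grouped = coalesceThenGroup(unionRdd, 4)
    println(s"grouped partitions: ${grouped.getNumPartitions}")
    grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(",")}") }

    spark.stop()
  }
}
```

A related option is `unionRdd.groupByKey(8000)`, which sets the number of reduce partitions directly. Note that it leaves the map side at the original partition count, so it is not equivalent to the coalesce above, and I have not verified whether it avoids the timeout.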

Running a large dataset causes a timeout
