Spark Shuffle Configuration: Evaluating Disk Spill

Date: 2016-05-01 09:32:11

Tags: scala apache-spark apache-spark-sql

Below is a snapshot taken at the last stage of the following code:

val df = ... /* a DataFrame with 800+ partitions holding 26+ GB of data */
val partitionedDf = df.repartition(col("myColumn"))
partitionedDf.write.partitionBy("myColumn").parquet("output/path/")

In a rare, skewed case, all of the data moves to a single partition, and that one task has to load the entire dataset. To confirm this skew before writing, something like the sketch below could be used.
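The following is a minimal sketch, not part of the original post; it assumes the partitionedDf value defined above and counts the rows that land in each shuffle partition (note that it triggers a full scan of the data):

// Sketch: count rows per partition of partitionedDf to confirm that one
// partition holds (nearly) all of the 26+ GB of data.
val rowsPerPartition = partitionedDf.rdd
  .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
  .collect()

rowsPerPartition.foreach { case (idx, count) =>
  println(s"partition $idx -> $count rows")
}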

[Spark UI screenshot: Summary Metrics for the last stage]

My Spark configuration is:

"spark.master" :"yarn-client",
"spark.yarn.am.memory": "5G",
"spark.executor.memory" : "8G",
"spark.executor.cores" : "1",
"spark.yarn.executor.memoryOverhead":"2048",
"spark.core.connection.ack.wait.timeout" : "600",
"spark.rdd.compress" : "false",
"spark.executor.instances" : "6",
"spark.sql.shuffle.partitions" : "8",
"spark.hadoop.parquet.enable.summary-metadata" : "false"

Question

Since the executors cannot hold 26+ GB, I expected a shuffle spill to disk I/O. In the Summary Metrics section of the Spark UI (version 1.5.2), I can see the rows Shuffle spill (memory) and Shuffle spill (disk). How can I confirm what exactly is happening there?
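For reference, one way to observe the spill outside the UI would be a listener like the minimal sketch below. It assumes Spark's SparkListener API and its per-task memoryBytesSpilled / diskBytesSpilled metrics, and logs them as each task finishes:

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sketch: log per-task spill metrics so the spill can be checked in the driver
// logs. Register it on the active SparkContext before running the write.
class SpillLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
        s"memorySpilled=${m.memoryBytesSpilled}B diskSpilled=${m.diskBytesSpilled}B")
    }
  }
}

// sc.addSparkListener(new SpillLogger())  // attach to the active SparkContext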

0 Answers:

No answers yet.