Question

I have a four-node Hadoop cluster (MapR) with 40 GB of memory per node. My Spark launch parameters are as follows:

MASTER="yarn-client" /opt/mapr/spark/spark-1.6.1/bin/pyspark --num-executors 8 --executor-memory 10g --executor-cores 5 --driver-memory 20g --driver-cores 10 --conf spark.driver.maxResultSize="0" --conf spark.default.parallelism="100"

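Now, when I run my Spark job with 100K records and call results.count() or result.saveTable(), it runs on all 8 executors. However, if I run the job with 1M records, the job is split into 3 stages and the final stage runs on only one executor. Is this related to partitioning?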

Answer 1

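I solved this by converting the DataFrame to an RDD and repartitioning it into a large number of partitions (more than 500), instead of using df.withColumn().

Pseudocode: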

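# transform() stands for the per-record function of the job (not shown here)
df_rdd = df.rdd
df_rdd_partitioned = df_rdd.repartition(1000)  # spread the rows over 1000 partitions
df_rdd_partitioned.cache().count()  # materialize the repartitioned RDD
result = df_rdd_partitioned.map(lambda r: (r, transform(r)), preservesPartitioning=True).toDF()
result.cache()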
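
To see why the partition count matters here, a rough back-of-the-envelope model may help. Spark schedules one task per partition, so a stage with a single partition can keep only one executor busy. The sketch below is plain Python, not Spark code. The function name max_parallelism is made up for illustration, and it assumes the default of one core per task (spark.task.cpus=1). Whether the last stage in the question really had a single partition is also an assumption; calling getNumPartitions() on the RDD is one way to check.

```python
# Rough model, not Spark code: Spark schedules one task per partition,
# so the partition count of a stage caps how many executors can be busy.
# Assumes one core per task (the spark.task.cpus default).

def max_parallelism(num_partitions, num_executors, cores_per_executor):
    """Return upper bounds (concurrent tasks, busy executors) for one stage."""
    slots = num_executors * cores_per_executor  # total task slots in the cluster
    concurrent_tasks = min(num_partitions, slots)
    busy_executors = min(num_partitions, num_executors)
    return concurrent_tasks, busy_executors

# Settings from the question: --num-executors 8 --executor-cores 5
for partitions in (1, 8, 1000):
    tasks, executors = max_parallelism(partitions, 8, 5)
    print(partitions, tasks, executors)
```

With 1000 partitions, all 40 task slots (8 executors x 5 cores) can be in use at once, which is consistent with the repartition(1000) call above spreading the work over every executor.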

Spark runs only one executor for large jobs
