I need to optimize my PySpark code so that the execution plan is as parallel as possible. I'd like to know whether there is a better way than the .explain method (whose output is hard to read) to explore the DAG, e.g. as a "normal" graph object.
For example, it would be very useful to know the total number of stages, the number of "first-level nodes" of the DAG, etc. Thanks.
Answer 0: (score: 1)
You can get a more detailed explain plan from the Catalyst optimizer by passing True to .explain(); perhaps this is what you are looking for:
df = spark.range(10)
df.explain(True)  # prints the parsed, analyzed, and optimized logical plans plus the physical plan
...output...
== Parsed Logical Plan ==
Range (0, 10, step=1, splits=Some(8))
== Analyzed Logical Plan ==
id: bigint
Range (0, 10, step=1, splits=Some(8))
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8))
== Physical Plan ==
*(1) Range (0, 10, step=1, splits=8)
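If you are on Spark 3.0 or later (an assumption about your version), .explain() also accepts a mode argument that renders the plan in more readable layouts; a minimal sketch:

df = spark.range(10)
# "formatted" splits the physical plan into a short outline plus per-node details;
# other accepted modes are "simple", "extended", "codegen", and "cost"
df.explain(mode="formatted")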
For more detail you can also use the Spark UI, which provides a DAG visualization and a breakdown of jobs, stages, tasks, cached objects, executor distribution, and environment variables. You can reach it at 'driver_node_host:4040' (4040 is the default port). Docs for additional configuration are here => https://spark.apache.org/docs/latest/configuration.html#spark-ui
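If you want things like the total stage count programmatically rather than by reading the UI, one option (a sketch I'm adding, not part of the original answer) is Spark's monitoring REST API, which serves JSON on the same port as the UI. The localhost host below is an assumption; point it at your driver node:

import requests

# Assumed driver host; replace with your driver_node_host
BASE = "http://localhost:4040/api/v1"

# Look up the id of the running application first
app_id = requests.get(f"{BASE}/applications").json()[0]["id"]

# One JSON object per stage; status is e.g. COMPLETE, ACTIVE, or PENDING
stages = requests.get(f"{BASE}/applications/{app_id}/stages").json()
print(f"total stages: {len(stages)}")
for s in stages:
    print(s["stageId"], s["status"], s["numTasks"])

This gives you plain Python dicts per job/stage/task, which you can then load into whatever graph structure you like.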