I need to optimize my PySpark code so that the execution plan is as parallel as possible. I'd like to know whether there is a better way than the .explain method (whose output is hard to read) to explore the DAG, e.g. as a "normal" graph object.
For example, it would be very useful to know the total number of stages, the number of "first-level nodes" of the DAG, etc. Thanks.
Answer 0: (score: 1)
You can get a more detailed explain plan from the Catalyst optimizer by passing True to .explain(); perhaps this is what you are looking for:
df = spark.range(10)
df.explain(True)  # prints the parsed, analyzed, and optimized logical plans plus the physical plan
...output...
== Parsed Logical Plan ==
Range (0, 10, step=1, splits=Some(8))
== Analyzed Logical Plan ==
id: bigint
Range (0, 10, step=1, splits=Some(8))
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8))
== Physical Plan ==
*(1) Range (0, 10, step=1, splits=8)
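If you are on Spark 3.0 or later (an assumption about your version), .explain() also accepts a mode argument that renders the plan in more readable layouts; a minimal sketch:

df = spark.range(10)
# "formatted" splits the physical plan into a short outline plus per-node details;
# other accepted modes are "simple", "extended", "codegen", and "cost"
df.explain(mode="formatted")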
For more detail you can also use the Spark UI, which provides a DAG visualization and a breakdown of jobs, stages, tasks, cached objects, executor distribution, and environment variables. You can reach it at 'driver_node_host:4040' (4040 is the default port). Docs for additional configuration are here => https://spark.apache.org/docs/latest/configuration.html#spark-ui
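If you want things like the total stage count programmatically rather than by reading the UI, one option (a sketch I'm adding, not part of the original answer) is Spark's monitoring REST API, which serves JSON on the same port as the UI. The localhost host below is an assumption; point it at your driver node:

import requests

# Assumed driver host; replace with your driver_node_host
BASE = "http://localhost:4040/api/v1"

# Look up the id of the running application first
app_id = requests.get(f"{BASE}/applications").json()[0]["id"]

# One JSON object per stage; status is e.g. COMPLETE, ACTIVE, or PENDING
stages = requests.get(f"{BASE}/applications/{app_id}/stages").json()
print(f"total stages: {len(stages)}")
for s in stages:
    print(s["stageId"], s["status"], s["numTasks"])

This gives you plain Python dicts per job/stage/task, which you can then load into whatever graph structure you like.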