SparkUI for pyspark - corresponding line of code for each stage?

Date: 2016-07-11 20:08:02

Tags: apache-spark pyspark emr

I have a PySpark program running on an AWS cluster, and I am monitoring the job through the Spark UI (see attached). With a Scala or Java Spark program, the UI shows which line of code each stage corresponds to. With PySpark, I can't find which line of my code a given stage corresponds to.

Is there a way to figure out which stage corresponds to which line of the PySpark code?

Thanks!

[Attached image: Spark UI screenshot]

2 Answers:

Answer 0 (score: 0)

Is there a way to figure out which stage corresponds to which line of the PySpark code?

Yes. The Spark UI shows the Scala methods that are called by the PySpark actions in your Python code. With the PySpark codebase at hand, you can readily identify the PySpark method that makes the call. In your example, the Scala call site listed for the stage is self-explanatory, and a quick search for cache shows that it is called from the PySpark DataFrame.rdd method.
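If searching the PySpark codebase for every stage gets tedious, one workaround is to label each job with the Python file and line that triggers it. The following is only a minimal sketch, not something the Spark UI does for you: tag_next_job is a hypothetical helper name, and it assumes a PySpark version whose SparkContext provides setJobDescription (setJobGroup is an older alternative). The label should then appear in the Description column of the Spark UI.

```python
import inspect
import os


def tag_next_job(sc, note=""):
    """Label the next Spark job with the caller's file name and line number.

    `sc` is assumed to be a pyspark.SparkContext; any object that exposes
    setJobDescription(value) works, so the helper can be tried without a cluster.
    """
    # The frame one level up is the line of user code that called this helper.
    caller = inspect.currentframe().f_back
    filename = os.path.basename(caller.f_code.co_filename)
    label = f"{filename}:{caller.f_lineno} {note}".strip()
    sc.setJobDescription(label)
    return label


# Hypothetical usage in a PySpark script (needs a running SparkSession `spark`
# and some DataFrame `df`; both names are placeholders):
#
#     tag_next_job(spark.sparkContext, "count rows")
#     df.count()
```

Call the helper immediately before each action (count, collect, a write, and so on), because the description stays in effect for later jobs on the same thread until it is set again.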

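Answer 1 (score: 0)

When you run a toPandas call, the corresponding line of the Python code is shown in the SQL tab. Other collecting commands, such as count or parquet, do not show a line number. I'm not sure why that is, but I find it very handy.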