Question

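PySpark profiling uses cProfile and works as documented for the RDD API, but there seems to be no way to get the profiler to print results after running a bunch of DataFrame API operations?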

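from pyspark import SparkConf, SparkContext, SQLContext
# Python profiling is off by default; enable it before creating the context
# (assumed here; it can also be passed on the command line with --conf)
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)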
sqlContext = SQLContext(sc)
rdd = sc.parallelize([('a', 0), ('b', 1)])
df = sqlContext.createDataFrame(rdd)
rdd.count()         # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out
sc.show_profiles()  # here prints nothing (no new profiling to show)

rdd.count()         # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out

df.count()          # why does this NOT get profiled?!?
sc.show_profiles()  # prints nothing?!

# and again it works when converting to RDD but not 

df.rdd.count()      # this ACTUALLY gets profiled :)
sc.show_profiles()  # here is where the profiling prints out

df.count()          # why does this NOT get profiled?!?
sc.show_profiles()  # prints nothing?!

Answer 1

This is expected behavior.

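Unlike the RDD API, which runs native Python logic, the DataFrame / SQL API is JVM-native. Unless you invoke a Python udf* (including pandas_udf), no Python code is executed on the worker machines. All the work done on the Python side is simple API calls through the Py4j gateway.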

Therefore, there is no profiling information to show.

*Note that udfs also seem to be excluded from the profiles.
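The underlying principle can be illustrated without Spark. The following is a rough analogy only, not PySpark code: a standard-library sketch in which a child process stands in for the JVM. All names (`square`, `python_side_work`, `delegated_work`, `profile_call`) are hypothetical and chosen for illustration. It shows that cProfile only records functions executed inside the profiled Python interpreter, so work that is delegated elsewhere leaves no trace beyond the call that hands it off.

```python
import cProfile
import pstats
import subprocess
import sys


def square(i):
    return i * i


def python_side_work(n):
    """Pure-Python computation: runs inside the profiled interpreter."""
    return sum(square(i) for i in range(n))


def delegated_work(n):
    """Same computation, handed off to a separate process (stand-in for the JVM)."""
    code = f"print(sum(i * i for i in range({n})))"
    result = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True, check=True
    )
    return int(result.stdout)


def profile_call(func, *args):
    """Run func under cProfile; return its result and the recorded function names."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    # stats.stats maps (filename, line number, function name) to timing data
    names = {name for (_filename, _lineno, name) in stats.stats}
    return result, names


local_result, local_names = profile_call(python_side_work, 10_000)
remote_result, remote_names = profile_call(delegated_work, 10_000)

print(local_result == remote_result)  # both paths compute the same sum
print("square" in local_names)        # the Python-side work is visible to the profiler
print("square" in remote_names)       # the delegated computation is not
```

In this analogy, `df.count()` resembles `delegated_work`: the Python side only issues the request, so there is nothing for a Python profiler to report about the computation itself, whereas `df.rdd.count()` resembles `python_side_work`.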

PySpark show_profiles() prints nothing with DataFrame API operations
