As far as I can tell, rdd.toDF() introduces a PythonRDD, which shows up as ExistingRDD in the query plan.
df1 = spark.range(100, numPartitions=5)
df2 = df1.rdd.toDF()
print(df1.rdd.toDebugString())
# (5) MapPartitionsRDD[2097] at javaToPython at <unknown>:0 []
# | MapPartitionsRDD[2096] at javaToPython at <unknown>:0 []
# | MapPartitionsRDD[2095] at javaToPython at <unknown>:0 []
# | MapPartitionsRDD[2094] at javaToPython at <unknown>:0 []
# | ParallelCollectionRDD[2093] at javaToPython at <unknown>:0 []
print(df2.rdd.toDebugString())
# (5) MapPartitionsRDD[2132] at javaToPython at <unknown>:0 []
# | MapPartitionsRDD[2131] at javaToPython at <unknown>:0 []
# | MapPartitionsRDD[2130] at javaToPython at <unknown>:0 []
# | MapPartitionsRDD[2129] at applySchemaToPythonRDD at <unknown>:0 []
# | MapPartitionsRDD[2128] at map at SerDeUtil.scala:137 []
# | MapPartitionsRDD[2127] at mapPartitions at SerDeUtil.scala:184 []
# | PythonRDD[2126] at RDD at PythonRDD.scala:53 []
# | MapPartitionsRDD[2097] at javaToPython at <unknown>:0 []
# | MapPartitionsRDD[2096] at javaToPython at <unknown>:0 []
# | MapPartitionsRDD[2095] at javaToPython at <unknown>:0 []
# | MapPartitionsRDD[2094] at javaToPython at <unknown>:0 []
# | ParallelCollectionRDD[2093] at javaToPython at <unknown>:0 []
If I cache df1 through the DataFrame API with df1.cache(), Spark SQL is smart enough to reuse the cached data when I run an equivalent query.
spark.range(100, numPartitions=5).groupby().count().explain()
# == Physical Plan ==
# *(2) HashAggregate(keys=[], functions=[count(1)])
# +- Exchange SinglePartition
#    +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
#       +- *(1) InMemoryTableScan
#             +- InMemoryRelation [id#2525L], StorageLevel(disk, memory, deserialized, 1 replicas)
#                   +- *(1) Range (0, 100, step=1, splits=5)
However, the ExistingRDD does not benefit from this.
df2.groupby().count().explain()
# == Physical Plan ==
# *(2) HashAggregate(keys=[], functions=[count(1)])
# +- Exchange SinglePartition
#    +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
#       +- *(1) Project
#          +- Scan ExistingRDD[id#2573L]
It seems the Spark SQL optimizer cannot trace through an ExistingRDD. Is that true?
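To make the comparison between the two plans above less error-prone than reading them by eye, here is a minimal sketch of how I check them. The helper names explain_text and plan_summary are my own and not part of PySpark. explain_text assumes that DataFrame.explain() prints the plan through Python's stdout, which is what I observe in PySpark. The substring test is only a heuristic on the plan text, not an official API.

```python
import contextlib
import io


def explain_text(df):
    """Capture whatever df.explain() prints.

    Assumption: explain() writes the plan with Python's print(),
    so redirecting stdout is enough to get it as a string.
    """
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain()
    return buf.getvalue()


def plan_summary(plan_text):
    """Report which scan operators appear in a physical-plan string."""
    return {
        "reads_dataframe_cache": "InMemoryTableScan" in plan_text,
        "scans_existing_rdd": "Scan ExistingRDD" in plan_text,
    }


# Plan text copied from the two explain() outputs shown above.
cached_plan = """\
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(1) InMemoryTableScan
            +- InMemoryRelation [id#2525L], StorageLevel(disk, memory, deserialized, 1 replicas)
                  +- *(1) Range (0, 100, step=1, splits=5)
"""

existing_rdd_plan = """\
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[count(1)])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_count(1)])
      +- *(1) Project
         +- Scan ExistingRDD[id#2573L]
"""

print(plan_summary(cached_plan))
# {'reads_dataframe_cache': True, 'scans_existing_rdd': False}
print(plan_summary(existing_rdd_plan))
# {'reads_dataframe_cache': False, 'scans_existing_rdd': True}
```

With a live session, plan_summary(explain_text(df2.groupby().count())) should give the same kind of report without copying the plan by hand.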
If I instead call df1.rdd.cache().count(), will df2 still benefit from the RDD cache, given that df2.rdd is a descendant of df1.rdd?
I would also like to know which operations produce an ExistingRDD, since it seems to put a barrier in the query plan and hurt performance.