Duplicate RDDs shown when RDDs from the same source are combined

Date: 2019-02-07 14:26:04

Tags: apache-spark pyspark directed-acyclic-graphs

r00 = sc.parallelize(range(9))           # 9 elements: 0..8
r01 = sc.parallelize(range(0, 90, 10))   # 9 elements: 0, 10, ..., 80
r10 = r00.cartesian(r01)                 # every (r00, r01) pair
r11 = r00.map(lambda n: (n, n))          # pair each element with itself
r12 = r00.zip(r01)                       # element-wise pairing (equal partition counts/sizes)
r13 = r01.keyBy(lambda x: x / 20)        # key each element by x / 20
r20 = r11.union(r12).union(r13).union(r10)
r20.collect()

The preceding PySpark code produces the following job DAG:

Job DAG

However, the job's stage DAG shows several ParallelCollectionRDDs inside the PythonRDDs, even though they are the same RDD (e.g. ParallelCollectionRDD[0] appears under PythonRDD[2] and PythonRDD[5], among others):

Stage DAG

Why does PythonRDD[8] exist? Why is there no direct connection between the ParallelCollectionRDD, PythonRDD, UnionRDD, and ZippedPartitionRDD nodes?
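Not part of the original post, but a toy model may clarify what the stage DAG is doing. The `Node` class below is hypothetical plain Python, not Spark's actual internals: each transformation creates a new object that holds a reference to its parent, so a single shared source object shows up once in the lineage of every branch that consumes it, even though it is the same underlying RDD.

```python
# Hypothetical toy model of RDD lineage -- NOT Spark's real classes.
# Each transformation wraps its parent in a new node, so one shared
# source appears once per consuming branch when lineages are listed.

class Node:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)

    def lineage(self):
        """Depth-first flattening of this node's lineage, as names."""
        out = [self.name]
        for p in self.parents:
            out.extend(p.lineage())
        return out

source = Node("ParallelCollectionRDD[0]")   # one shared source object
mapped = Node("PythonRDD[2]", [source])     # stands in for r00.map(...)
zipped = Node("PythonRDD[5]", [source])     # stands in for r00.zip(...)
union  = Node("UnionRDD", [mapped, zipped])

# The source is one object, shared by identity across both branches ...
print(mapped.parents[0] is zipped.parents[0])              # True
# ... yet a per-branch rendering of the lineage lists it twice:
print(union.lineage().count("ParallelCollectionRDD[0]"))   # 2
```

If Spark's stage visualization renders lineage per branch in a similar way, the repeated ParallelCollectionRDD boxes would be duplicate drawings of one RDD, not duplicate RDDs.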

0 Answers:

There are no answers.