Different output from Scala toDebugString and Python toDebugString

Time: 2017-10-19 04:30:11

Tags: python scala apache-spark pyspark

I am using the word count example code from http://twistedmatrix.com/pipermail/twisted-python/2016-September/030783.html. The output of "toDebugString", called on the RDD produced by the flatMap and map operations, differs between Python (pySpark) and Scala Apache Spark.

Python (pySpark) code =>

text_file = sc.textFile("/Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md")
countsRDD = text_file.flatMap(lambda line: line.split(" "))
mapRDD = countsRDD.map(lambda word: (word, 1))
reducedRDD = mapRDD.reduceByKey(lambda a, b: a + b)
print(mapRDD.toDebugString())

Output:

(1) PythonRDD[3] at RDD at PythonRDD.scala:48 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []
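
As an aside, PySpark's RDD.toDebugString() returns a bytes object, so under Python 3 decoding it first gives the clean multi-line output shown above; a minimal sketch:

# toDebugString() returns bytes in PySpark; decode for readable output (Python 3)
print(mapRDD.toDebugString().decode("utf-8"))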

Scala code =>

val text_file = sc.textFile("/Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md")
val countsRDD = text_file.flatMap(line => line.split(" "))
val mapRDD = countsRDD.map(word => (word, 1))
val reducedRDD = mapRDD.reduceByKey(_ + _)
print(mapRDD.toDebugString)

Output:

(1) MapPartitionsRDD[3] at map at Test.scala:67 []
 |  MapPartitionsRDD[2] at flatMap at Test.scala:66 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md MapPartitionsRDD[1] at textFile at Test.scala:65 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md HadoopRDD[0] at textFile at Test.scala:65 []

The Scala output shows two distinct RDDs generated by the flatMap and map operations. The Python output, on the other hand, does not show these operations; more importantly, it shows only a single PythonRDD[3]. I assume PythonRDD[3] was generated by the map operation, but in that case shouldn't PythonRDD[3] depend on a preceding parent RDD, PythonRDD[2], generated by the flatMap operation, just as the Scala output shows?
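
One way to check whether the flatMap stage gets its own PythonRDD at all is to print the debug string of the intermediate RDD as well. A minimal sketch, assuming Spark 2.x, where chained Python-side transformations are represented by pyspark.rdd.PipelinedRDD (the exact RDD numbers in brackets depend on the session):

# Both Python-side transformations report a PythonRDD wrapping the same
# Java-side MapPartitionsRDD[1] from textFile; the flatMap does not show up
# as a separate parent inside mapRDD's lineage.
print(countsRDD.toDebugString())
print(mapRDD.toDebugString())
print(type(countsRDD).__name__, type(mapRDD).__name__)  # both PipelinedRDD in Spark 2.x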

Is there a way to trace these missing links? Or does pySpark behave differently internally from Scala Spark?
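
For what it's worth, a speculative way to probe this relies on PySpark internals (private attributes of pyspark.rdd.PipelinedRDD in Spark 2.x, which may change between versions): a PipelinedRDD keeps the fused Python closure in func and the Java-side parent in _prev_jrdd, so the flatMap never materializes as a separate RDD in the JVM lineage. A sketch under that assumption:

# CAUTION: private PySpark internals (Spark 2.x), not a stable API.
# mapRDD's Java-side parent is the textFile RDD itself; the flatMap and map
# closures have been fused into a single Python function evaluated in one pass.
print(type(mapRDD).__name__)              # PipelinedRDD
print(mapRDD._prev_jrdd.toDebugString())  # Java parent: MapPartitionsRDD[1] at textFile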

0 Answers:

No answers yet.