I am using the word count example code from http://twistedmatrix.com/pipermail/twisted-python/2016-September/030783.html. The output of toDebugString on the RDDs produced by the flatMap and map operations differs between Python (PySpark) and Scala Apache Spark.
Python (PySpark) code =>
text_file = sc.textFile("/Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md")
countsRDD = text_file.flatMap(lambda line: line.split(" "))
mapRDD = countsRDD.map(lambda word: (word, 1))
reducedRDD = mapRDD.reduceByKey(lambda a, b: a + b)
print(mapRDD.toDebugString())
Output:
(1) PythonRDD[3] at RDD at PythonRDD.scala:48 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []
Scala code =>
val text_file = sc.textFile("/Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md")
val countsRDD = text_file.flatMap(line => line.split(" "))
val mapRDD = countsRDD.map(word => (word, 1))
val reducedRDD = mapRDD.reduceByKey(_ + _)
print(mapRDD.toDebugString)
Output:
(1) MapPartitionsRDD[3] at map at Test.scala:67 []
 |  MapPartitionsRDD[2] at flatMap at Test.scala:66 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md MapPartitionsRDD[1] at textFile at Test.scala:65 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md HadoopRDD[0] at textFile at Test.scala:65 []
The Scala output shows two distinct RDDs generated by the flatMap and map operations. The Python output, on the other hand, does not show these operations and, more importantly, shows only a single PythonRDD[3]. I assume PythonRDD[3] was produced by the map operation, but shouldn't it then depend on an earlier parent RDD, say PythonRDD[2], produced by the flatMap operation, the way the Scala output does?
Is there a way to trace these missing links? Or does PySpark behave differently internally from Scala Spark?
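For reference, the closest I can imagine getting to the missing links on the Python side is walking PySpark's internal PipelinedRDD chain. A rough sketch below; it relies on the internal prev attribute of PipelinedRDD, which is not part of the public API and may differ between Spark versions:

from pyspark.rdd import PipelinedRDD

# Hypothetical probe: walk the Python-side chain that toDebugString()
# appears to fuse into a single PythonRDD. `prev` is an internal attribute.
rdd = mapRDD
while isinstance(rdd, PipelinedRDD):
    print(type(rdd).__name__, "<-", type(rdd.prev).__name__)
    rdd = rdd.prev

# At this point rdd should be the textFile RDD, whose JVM-side lineage
# (MapPartitionsRDD / HadoopRDD) shows up in toDebugString().
print(rdd.toDebugString())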