I am using the word count example code from http://twistedmatrix.com/pipermail/twisted-python/2016-September/030783.html. The output of toDebugString on the RDDs produced by the flatMap and map operations differs between Python (PySpark) and Scala Apache Spark.
Python (PySpark) code =>
text_file = sc.textFile("/Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md")
countsRDD = text_file.flatMap(lambda line: line.split(" "))
mapRDD = countsRDD.map(lambda word: (word, 1))
reducedRDD = mapRDD.reduceByKey(lambda a, b: a + b)
print(mapRDD.toDebugString())
Output:
(1) PythonRDD[3] at RDD at PythonRDD.scala:48 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []
Scala code =>
val text_file = sc.textFile("/Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md")
val countsRDD = text_file.flatMap(line => line.split(" "))
val mapRDD = countsRDD.map(word => (word, 1))
val reducedRDD = mapRDD.reduceByKey(_ + _)
print(mapRDD.toDebugString)
Output:
(1) MapPartitionsRDD[3] at map at Test.scala:67 []
 |  MapPartitionsRDD[2] at flatMap at Test.scala:66 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md MapPartitionsRDD[1] at textFile at Test.scala:65 []
 |  /Users/Aarti/Documents/Fall2017/Code/spark-2.2.0/Readme.md HadoopRDD[0] at textFile at Test.scala:65 []
The Scala output shows two distinct RDDs generated by the flatMap and map operations. The Python output, on the other hand, does not show these operations and, more importantly, shows only a single PythonRDD[3]. I assume PythonRDD[3] was produced by the map operation, but shouldn't it then depend on an earlier parent RDD, say PythonRDD[2], produced by the flatMap operation, the way the Scala output does?
Is there a way to trace these missing links? Or does PySpark behave differently internally from Scala Spark?
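For reference, the closest I can imagine getting to the missing links on the Python side is walking PySpark's internal PipelinedRDD chain. A rough sketch below; it relies on the internal prev attribute of PipelinedRDD, which is not part of the public API and may differ between Spark versions:

from pyspark.rdd import PipelinedRDD

# Hypothetical probe: walk the Python-side chain that toDebugString()
# appears to fuse into a single PythonRDD. `prev` is an internal attribute.
rdd = mapRDD
while isinstance(rdd, PipelinedRDD):
    print(type(rdd).__name__, "<-", type(rdd.prev).__name__)
    rdd = rdd.prev

# At this point rdd should be the textFile RDD, whose JVM-side lineage
# (MapPartitionsRDD / HadoopRDD) shows up in toDebugString().
print(rdd.toDebugString())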