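I am trying to find out whether calling persist() on an RDD after partitionBy saves any of the subsequent work, and the Spark UI seems to show that nothing is being saved.
If persist were working, I would expect stage 7 or stage 8 to be skipped.
(Either way, my test code may be wrong, so please let me know if it is.)
This is the code I am using: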
from pyspark import SparkContext, SparkConf
from pyspark.rdd import portable_hash
from pyspark.sql import SparkSession, Row
from pyspark.storagelevel import StorageLevel
transactions = [
    {'name': 'Bob', 'amount': 100, 'country': 'United Kingdom'},
    {'name': 'James', 'amount': 15, 'country': 'United Kingdom'},
    {'name': 'Marek', 'amount': 51, 'country': 'Poland'},
    {'name': 'Johannes', 'amount': 200, 'country': 'Germany'},
    {'name': 'Paul', 'amount': 75, 'country': 'Poland'},
]
conf = SparkConf().setAppName("word count4").setMaster("local[3]")
sc = SparkContext(conf=conf)
lines = sc.textFile("in/word_count.text")
words = lines.flatMap(lambda line: line.split(" "))
rdd = words.map(lambda word: (word, 1))
rdd = rdd.partitionBy(4)
rdd = rdd.persist(StorageLevel.MEMORY_ONLY)
rdd = rdd.reduceByKey(lambda x, y: x+y)
# Each element is a (word, count) pair; the output keeps the original "count : word" order.
for word, count in rdd.collect():
    print("{} : {}".format(count, word))
rdd = rdd.sortByKey(ascending=False)
for word, count in rdd.collect():
    print("{} : {}".format(count, word))
Answer 0 (score: 1)
Your expectation is incorrect. If you inspect the DAG:
(4) PythonRDD[28] at collect at <ipython-input-15-a9f47c6b3258>:3 []
 |  MapPartitionsRDD[27] at mapPartitions at PythonRDD.scala:133 []
 |  ShuffledRDD[26] at partitionBy at NativeMethodAccessorImpl.java:0 []
 +-(4) PairwiseRDD[25] at sortByKey at <ipython-input-15-a9f47c6b3258>:1 []
    |  PythonRDD[24] at sortByKey at <ipython-input-15-a9f47c6b3258>:1 []
    |  MapPartitionsRDD[20] at mapPartitions at PythonRDD.scala:133 []
    |      CachedPartitions: 4; MemorySize: 6.6 KB; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B
    |  ShuffledRDD[19] at partitionBy at NativeMethodAccessorImpl.java:0 []
    +-(1) PairwiseRDD[18] at partitionBy at <ipython-input-13-fff304ea68c9>:6 []
       |  PythonRDD[17] at partitionBy at <ipython-input-13-fff304ea68c9>:6 []
       |  in/word_count.text MapPartitionsRDD[16] at textFile at NativeMethodAccessorImpl.java:0 []
       |  in/word_count.text HadoopRDD[15] at textFile at NativeMethodAccessorImpl.java:0 []
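you will see that the cached component is only one of the many operations that make up the stages above. The cached data can indeed be reused, but the remaining operations (such as preparing the shuffle for sortByKey) still have to be computed.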
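To make that concrete, here is a toy model in plain Python. It is not the Spark API: the `Node` class, its `then`/`persist`/`collect` methods, and the step names are hypothetical stand-ins chosen to mirror the question's pipeline. It only illustrates the general idea that a persisted step lets later jobs skip what came before it, while everything after it is still recomputed.

```python
# Toy model of a lazy lineage with caching. Plain Python, NOT the Spark API;
# all names below are hypothetical and exist only for illustration.

class Node:
    """One step in a lineage. The shared `log` records every real evaluation."""

    def __init__(self, name, fn, parent=None, log=None):
        self.name = name
        self.fn = fn
        self.parent = parent
        self.log = log if log is not None else []
        self._persist = False
        self._cache = None

    def then(self, name, fn):
        # Chain a new step that shares the same evaluation log.
        return Node(name, fn, parent=self, log=self.log)

    def persist(self):
        self._persist = True
        return self

    def collect(self):
        # A persisted step that has already been computed is served from cache,
        # so nothing upstream of it runs again.
        if self._persist and self._cache is not None:
            return self._cache
        data = self.parent.collect() if self.parent else None
        self.log.append(self.name)
        result = self.fn(data)
        if self._persist:
            self._cache = result
        return result


def reduce_by_key(pairs):
    # Sum the values for each key, preserving first-seen key order.
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


words = "a b a c b a".split()
source = Node("textFile", lambda _: list(words))
pairs = source.then("map", lambda ws: [(w, 1) for w in ws])
partitioned = pairs.then("partitionBy", lambda kv: sorted(kv)).persist()
counts = partitioned.then("reduceByKey", reduce_by_key)
ordered = counts.then("sortByKey", lambda kv: sorted(kv, reverse=True))

# First job: everything runs, and the persisted step fills its cache.
counts.collect()
first_run = list(source.log)
source.log.clear()

# Second job: steps before the persisted one are skipped, later ones rerun.
result = ordered.collect()
second_run = list(source.log)

print(first_run)
print(second_run)
print(result)
```

In this sketch the second job skips the steps up to and including the persisted one, but still reruns the reduce step and performs the sort, which is the same pattern the DAG above shows.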