Saving the intermediate state of my Apache Spark pipeline

Time: 2017-08-25 21:23:01

Tags: python apache-spark pyspark bigdata

I have a fairly complex Apache PySpark pipeline that performs several transformations on a (very large) set of text files. The intended output of my pipeline consists of the different stages of the pipeline itself. Which approach is best (i.e., more efficient, but also more "sparkling", in the sense of better fitting the Spark programming model and style)?

Right now, my code looks like this:

# initialize the pipeline and perform the first set of transformations.
ctx = pyspark.SparkContext('local', 'MyPipeline')
rdd = ctx.textFile(...).map(...).map(...)

# first checkpoint: the `first_serialization` function serializes
# the data into a properly formatted string.
rdd.map(first_serialization).saveAsTextFile("ckpt1")

# here, I have to read again from the previously saved checkpoint,
# using a `first_deserialization` function that deserializes what was
# serialized by the `first_serialization` function, and then perform
# further transformations.
rdd = ctx.textFile("ckpt1").map(...).map(...)

And so on. I would like to get rid of the serialization methods and of the multiple save/read steps. By the way, does this affect efficiency? I think it does.

Any hints? Thanks in advance.

1 Answer:

Answer 0 (score: 1)

It may sound like a trivial suggestion, because it is, but I would suggest writing out the intermediate stages while continuing to reuse the existing RDD (sidebar: use Datasets/DataFrames instead of RDDs for better performance) and to keep processing, writing out the intermediate results as you go.

There is no need to pay the penalty of reading from disk/network when you already have the data processed (ideally even cached!) for further use.
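In RDD terms, the caching part is a single extra call; a minimal sketch, assuming the `rdd` variable from the question is still in scope:

# mark the RDD for caching: the first action materializes it in memory,
# and every later action reuses it instead of re-reading the input files
rdd = rdd.cache()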

An example using your own code:

# initialize the pipeline and perform the first set of transformations.
ctx = pyspark.SparkContext('local', 'MyPipeline')
rdd = ctx.textFile(...).map(...).map(...)

# first checkpoint: the `first_serialization` function serializes
# the data into a properly formatted string.
string_rdd = rdd.map(first_serialization)
string_rdd.saveAsTextFile("ckpt1")

# reuse the existing RDD after writing out the intermediate results;
# `rdd` here is the same variable used to create `string_rdd` above.
# Alternatively, you may want to continue from `string_rdd` instead.
rdd = rdd.map(...).map(...)
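
And here is a rough sketch of the DataFrame variant from the sidebar, assuming Spark 2.x with a SparkSession; the input path and the `upper`/`reverse` column expressions are only placeholders standing in for whatever your real transformations and serialization do:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('MyPipeline').getOrCreate()

# read the text files as a DataFrame with a single string column named `value`
df = spark.read.text('input/*.txt')  # placeholder input path

# first set of transformations; cache so later stages reuse the result
stage1 = df.select(F.upper(F.col('value')).alias('value')).cache()

# write out the intermediate stage (text output expects one string column)
stage1.write.mode('overwrite').text('ckpt1')

# keep processing from the cached DataFrame, with no re-read of 'ckpt1'
stage2 = stage1.select(F.reverse(F.col('value')).alias('value'))

The shape is the same as the RDD version above: one variable per stage, an intermediate write per stage, and no round trip through disk to continue the pipeline.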