Saving a Spark DataFrame's logical plan or lineage for replay

Asked: 2019-02-19 18:35:25

Tags: apache-spark apache-spark-sql

Is it possible to save or serialize the logical plan of a Spark DataFrame and replay it later? For example, consider the following plan:

val df = spark.read.option("multiLine", true).json("/home/rtf.json").withColumn("double", col("ROW_ID") * 2)
df.explain
== Physical Plan ==
*Project [ROW_ID#0L, TEXT#1, (ROW_ID#0L * 2) AS double#5L]
+- *FileScan json [ROW_ID#0L,TEXT#1] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/home/rtf.json], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ROW_ID:bigint,TEXT:string>
df.count
res1: Long = 10
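For reference, the Catalyst plan behind this DataFrame is reachable from user code and can at least be dumped for inspection. A minimal sketch against the df above, using Spark's developer-facing internals (queryExecution, TreeNode.toJSON); note that Spark ships no public API to deserialize this JSON back into a plan:

// Inspecting and dumping the existing logical plan.
val logicalPlan = df.queryExecution.logical
println(logicalPlan.numberedTreeString)  // human-readable plan tree
val planJson = logicalPlan.toJSON        // JSON dump, inspection only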

What I'd like to do is take a snapshot of that plan so that, if I go to /home/rtf.json and add a row, I can replay it like this:

val newDF = spark.plan.apply("path_to_saved_plan")
newDF.explain
== Physical Plan ==
*Project [ROW_ID#0L, TEXT#1, (ROW_ID#0L * 2) AS double#5L]
+- *FileScan json [ROW_ID#0L,TEXT#1] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/home/rtf.json], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<ROW_ID:bigint,TEXT:string>
newDF.count
res2: Long = 11 // Increased!

...producing a DataFrame with the same logical plan, but including the new row.
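There is no spark.plan API like the one imagined above, and Catalyst has no public plan deserializer, so one workaround (a sketch, not Spark's own mechanism; the /tmp path, view name rtf, and replayPlan helper are all hypothetical) is to persist equivalent SQL text and re-parse it at replay time. The JSON file is scanned when the query executes, so rows added after the snapshot are picked up:

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.{DataFrame, SparkSession}

// "Snapshot": write SQL that reproduces the logical plan to disk.
// The temp-view statement carries the reader options (multiLine),
// the SELECT carries the projection.
val savedPlanPath = Paths.get("/tmp/saved_plan.sql") // hypothetical location
val planSql =
  """CREATE OR REPLACE TEMPORARY VIEW rtf
    |USING json
    |OPTIONS (path '/home/rtf.json', multiLine 'true');
    |SELECT ROW_ID, TEXT, ROW_ID * 2 AS double FROM rtf""".stripMargin
Files.write(savedPlanPath, planSql.getBytes("UTF-8"))

// "Replay": read the statements back and run them in order; the
// DataFrame of the final SELECT is the replayed plan.
def replayPlan(spark: SparkSession, path: String): DataFrame = {
  val text = new String(Files.readAllBytes(Paths.get(path)), "UTF-8")
  text.split(";").map(stmt => spark.sql(stmt.trim)).last
}

val newDF = replayPlan(spark, savedPlanPath.toString)
newDF.explain()
newDF.count() // reflects the file's current contents

This only covers plans that can be expressed as SQL; a plan built from arbitrary DataFrame API calls or UDFs would need the building code itself to be re-run.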

0 Answers:

No answers yet.