Question

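I am currently experimenting with Spark to get a better understanding of the performance I can expect from highly chained queries.

It seems to me that calling persist() (or cache()) on intermediate results makes my execution time grow exponentially.

Consider this minimal example: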

SparkSession spark = SparkSession.builder()
        .appName(getClass().getName())
        .master("local[*]")
        .getOrCreate();

Dataset<Row> ds = spark.range(1).toDF();
for (int i = 1; i < 200; i++) {
    ds = ds.withColumn("id", ds.col("id"));

    ds.cache();

    long t0 = System.currentTimeMillis();
    long cnt = ds.count();
    long t1 = System.currentTimeMillis();

    System.out.println("Iteration " + String.format("%3d", i) + " count: " + cnt + " time: " + (t1 - t0) + "ms");
}

Without ds.cache() in the code, the time taken by count() stays fairly constant. With ds.cache(), however, the execution time starts to grow exponentially:

iteration   without cache() [ms]   with cache() [ms]
      ...                    ...                 ...
       24                     61                 297
       25                     74                 515
       26                     86                1036
       27                     78                1904
       28                     73                3233
       29                     79                6815
       30                     75               12549
       31                    107               26379
       32                     69               46207
       33                     54              102172
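
To put a number on "exponential", the ratio between successive timings can be computed. The following is a minimal sketch in plain Java, assuming the "with cache()" column is in milliseconds and that a dot in the original figures is a thousands separator; the class and method names are illustrative only.

```java
public class GrowthRatio {
    // Ratio of each timing to the one before it; values near 2 mean the time doubles per iteration.
    static double[] ratios(long[] timingsMs) {
        double[] r = new double[timingsMs.length - 1];
        for (int i = 1; i < timingsMs.length; i++) {
            r[i - 1] = (double) timingsMs[i] / timingsMs[i - 1];
        }
        return r;
    }

    public static void main(String[] args) {
        // "with cache()" column for iterations 24..33, in milliseconds (assumed unit)
        long[] withCache = {297, 515, 1036, 1904, 3233, 6815, 12549, 26379, 46207, 102172};
        double[] r = ratios(withCache);
        for (int i = 0; i < r.length; i++) {
            System.out.printf("iteration %d -> %d: x%.2f%n", 24 + i, 25 + i, r[i]);
        }
    }
}
```

For these figures the ratios fall between roughly 1.7 and 2.2, so the time approximately doubles with every additional iteration.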

Any idea what is happening here? Given my understanding of what .persist does, this does not make much sense.

Thanks,
Thorsten

Answer 1

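Caching is not a free lunch. It requires expensive state management, possible cache eviction (the default storage level for a Dataset is MEMORY_AND_DISK, so evicted data is written to disk), and possible garbage collection cycles.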
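
As a rough illustration of the storage level mentioned above, the sketch below prints the level a Dataset reports after cache() and then persists it again with an explicitly chosen level. It assumes Spark SQL is on the classpath; the class name and the helper persistWith are hypothetical and not part of the question's code.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class StorageLevelSketch {
    // Hypothetical helper: persist with the given level and return what the Dataset reports.
    static StorageLevel persistWith(Dataset<Row> ds, StorageLevel level) {
        ds.persist(level);
        return ds.storageLevel();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("StorageLevelSketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> ds = spark.range(1000).toDF();

        // cache() is shorthand for persist() with the default level.
        ds.cache();
        System.out.println("level after cache(): " + ds.storageLevel());

        // Drop the cached data (blocking), then choose a level explicitly.
        ds.unpersist(true);
        System.out.println("level after persist(MEMORY_ONLY): "
                + persistWith(ds, StorageLevel.MEMORY_ONLY()));

        ds.unpersist(true);
        spark.stop();
    }
}
```

Whether a different level helps depends on the workload; in the question's loop the cost comes from caching every intermediate result, not from the level chosen.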

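See the related question "Spark: Why do i have to explicitly tell what to cache?".
In the scenario provided, caching is completely useless and your code does not measure anything.
- Since all you do is count, all the expressions generated by withColumn would be removed from the execution plan, unless cache prevents it.
- If evaluation were forced (foreach(_ => ()) is one way to do it), all operations could be merged into a single stage, because each previous step is overwritten and eliminated from the plan, so caching has no value here.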
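
One way to check these two points is to compare the plans Spark prints with and without the cache() calls, and to force evaluation with foreach. This is a minimal sketch, assuming Spark SQL is on the classpath; the class name, the build helper, and the iteration count of 5 are illustrative choices, not taken from the question.

```java
import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PlanInspectionSketch {
    // Repeats the question's withColumn step n times, optionally caching each intermediate result.
    static Dataset<Row> build(SparkSession spark, int n, boolean cacheEachStep) {
        Dataset<Row> ds = spark.range(1).toDF();
        for (int i = 0; i < n; i++) {
            ds = ds.withColumn("id", ds.col("id"));
            if (cacheEachStep) {
                ds.cache();
            }
        }
        return ds;
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("PlanInspectionSketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> plain = build(spark, 5, false);
        Dataset<Row> cached = build(spark, 5, true);

        // Print both physical plans so they can be compared side by side.
        plain.explain();
        cached.explain();

        // Force every row to be processed instead of only counted (the Java form of foreach(_ => ())).
        plain.foreach((ForeachFunction<Row>) row -> { });

        // Remove everything this sketch cached before shutting down.
        spark.catalog().clearCache();
        spark.stop();
    }
}
```

The plan for the uncached variant is expected to stay small, while the cached variant should show the in-memory cache nodes layered on top of each other; the exact output depends on the Spark version.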

Negative performance impact of Dataset.persist
