Question

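I have always understood that persist() or cache(), followed by an action to trigger the DAG, computes the result and keeps it in memory for later use. Plenty of threads here will tell you to cache frequently used dataframes to improve performance.

I recently ran a test and came away confused, because that does not seem to be the case.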

    temp_tab_name = "mytablename"
    x = spark.sql("select * from " + temp_tab_name + " limit 10")
    x = x.persist()
    x.count()      # action to trigger all of the steps above
    x.show()       # x should be persisted in memory by now: DAG evaluated, no going back to "select..." on later references
    x.is_cached    # True
    spark.sql("drop table " + temp_tab_name)
    x.is_cached    # still True!
    x.show()       # error: table not found

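So it looks to me as if x was never computed and persisted. The next reference to x still goes back to evaluating its DAG definition, "select...". Am I missing something here?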

Answer 1

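cache and persist do not completely detach the computed result from its source. They only make a best effort to avoid recomputation. So, generally speaking, dropping the source before you are done with the dataset is a bad idea.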

What might have gone wrong in your particular case (off the top of my head):

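~~1) show does not need all of the table's records, so it may trigger the computation of only a few partitions. Most partitions would therefore still be uncomputed at that point.~~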

2) Spark needs some auxiliary information from the table (e.g. for partitioning).

Answer 2

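Below is the correct syntax... Here is some additional documentation on "uncaching" tables => https://spark.apache.org/docs/latest/sql-performance-tuning.html ... You can confirm the example below in the Spark UI under the "Storage" tab, where you can see the object being "cached" and "uncached".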

"taxi"

Does Spark persist() (followed by an action) really persist?
