Question

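I'm working with Spark SQL DataFrames and have run into an issue with persisting them to speed up later computations. Specifically, when I call persist(StorageLevel.MEMORY_AND_DISK) and then check the "Storage" tab of the Spark UI, I can see the cached RDDs, but the Storage Level always shows Memory Deserialized 1x Replicated and the "Size on Disk" column shows 0.0 B for all of the RDDs.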

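I also tried MEMORY_AND_DISK_SER and got the same result. I'm curious whether anyone has seen this, or whether I'm doing something wrong here. The Spark documentation says that calling cache() or persist() on a DataFrame defaults to the storage level MEMORY_AND_DISK, and the cacheTable method in SQLContext is described as "Caches the specified table in-memory."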

For some additional context, the general outline of my program flow is:

import org.apache.spark.sql.SaveMode
import org.apache.spark.storage.StorageLevel

// Here computeHeavyMethod is some code that returns a DataFrame
val tableData = computeHeavyMethod().persist(StorageLevel.MEMORY_AND_DISK)
tableData.write.mode(SaveMode.Overwrite).json(outputLocation)
tableData.createOrReplaceTempView(tableName)

spark.sql("Some sql statement that uses the table created above")

Answer 1

The documentation says:

MEMORY_AND_DISK

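Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.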

So disk storage is only used once the (storage) memory has been exhausted.
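To separate the level that was requested from what the UI reports, the requested level can be read back from the DataFrame itself. The following is a minimal sketch, assuming Spark 2.1 or later (where Dataset.storageLevel is available) and a local session; the object name, the helper persistAndReport, and the spark.range stand-in for computeHeavyMethod() are hypothetical and not from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object RequestedStorageLevel {
  // Hypothetical helper: persists a small DataFrame and returns the
  // storage level that Spark recorded for it.
  def persistAndReport(spark: SparkSession): StorageLevel = {
    // Stand-in for the question's computeHeavyMethod() (assumption).
    val tableData = spark.range(0L, 1000000L).toDF("id")
      .persist(StorageLevel.MEMORY_AND_DISK)

    // persist() is lazy; an action is needed to materialize the cache.
    tableData.count()

    val level = tableData.storageLevel
    println(s"requested level: ${level.description}")
    println(s"useMemory=${level.useMemory}, useDisk=${level.useDisk}, " +
      s"deserialized=${level.deserialized}")
    level
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("requested-storage-level")
      .getOrCreate()
    try persistAndReport(spark) finally spark.stop()
  }
}
```

If useDisk is true here while the Storage tab still reports 0.0 B on disk, that would be consistent with the explanation above: the level permits spilling, but nothing has had to spill yet.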

Answer 2

At least with Spark 2.2.0, it looks like "disk" is only shown when the RDD is completely spilled to disk:

StorageLevel: StorageLevel(disk, 1 replicas); CachedPartitions: 36; 
TotalPartitions: 36; MemorySize: 0.0 B; DiskSize: 3.3 GB

For a partially spilled RDD, the StorageLevel is shown as "memory":

StorageLevel: StorageLevel(memory, deserialized, 1 replicas); 
CachedPartitions: 36; TotalPartitions: 36; MemorySize: 3.4 GB; DiskSize: 1158.0 MB
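Per-RDD figures like the ones above can also be read programmatically instead of from the UI. The sketch below assumes SparkContext.getRDDStorageInfo, which is marked as a developer API and may change between releases; the object name, the helper cachedRdds, and the spark.range test data are hypothetical. Whether the output above was produced this way is not stated in the answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.{RDDInfo, StorageLevel}

object CachedRddReport {
  // Hypothetical helper: returns storage info for the RDDs that
  // currently have cached partitions.
  def cachedRdds(spark: SparkSession): Seq[RDDInfo] =
    spark.sparkContext.getRDDStorageInfo.toSeq

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("cached-rdd-report")
      .getOrCreate()
    try {
      // Stand-in data; cache it and force materialization with an action.
      val df = spark.range(0L, 1000000L).toDF("id")
        .persist(StorageLevel.MEMORY_AND_DISK)
      df.count()

      cachedRdds(spark).foreach { info =>
        println(
          s"${info.name}: level=${info.storageLevel.description}; " +
          s"cached=${info.numCachedPartitions}/${info.numPartitions}; " +
          s"memory=${info.memSize} B; disk=${info.diskSize} B")
      }
    } finally spark.stop()
  }
}
```

Comparing memSize and diskSize per RDD shows directly whether any partitions were spilled, independent of how the storage level label is rendered.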

Persisting a DataFrame ignores StorageLevel
