Question

Here are the steps:

scala> val df = sql("select * from table")
df: org.apache.spark.sql.DataFrame = [num: int]

scala> df.cache
res13: df.type = [num: int]

scala> df.collect
res14: Array[org.apache.spark.sql.Row] = Array([10], [10])

scala> df
res15: org.apache.spark.sql.DataFrame = [num: int]

scala> df.show
+---+
|num|
+---+
| 10|
| 10|
+---+


scala> sql("truncate table table")
res17: org.apache.spark.sql.DataFrame = []

scala> df.show
+---+
|num|
+---+
+---+

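My question is: why was df refreshed? I expected it to stay cached in memory, so truncating the table should not have removed the data.

Any ideas would be highly appreciated.

Thanks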

Answer 1

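You should never rely on cache for correctness. Spark cache is a performance optimization, and even the most defensive StorageLevel (MEMORY_AND_DISK_SER_2) does not guarantee that data is retained in case of worker failure, executor decommissioning, or insufficient resources.

Code similar to the code in your question may work in some cases, but don't assume that this is guaranteed or deterministic behavior.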
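If the rows really have to survive the truncation, one option is to write them somewhere the truncate cannot reach instead of relying on the cache. The following is only a minimal sketch under assumptions: the table names `nums` and `nums_snapshot` and the helper `snapshotThenTruncate` are hypothetical, and it runs in a local session with Spark's default warehouse directory.

```scala
import org.apache.spark.sql.SparkSession

object SnapshotSketch {
  // Hypothetical helper: copies `nums` into `nums_snapshot`, truncates `nums`,
  // and returns (rows in the snapshot, rows left in the source table).
  def snapshotThenTruncate(spark: SparkSession): (Long, Long) = {
    // Recreate the hypothetical tables so the sketch is self-contained.
    spark.sql("DROP TABLE IF EXISTS nums")
    spark.sql("DROP TABLE IF EXISTS nums_snapshot")
    spark.sql("CREATE TABLE nums (num INT) USING parquet")
    spark.sql("INSERT INTO nums VALUES (10), (10)")

    // Persist a copy to separate storage instead of calling df.cache.
    spark.table("nums").write.mode("overwrite").saveAsTable("nums_snapshot")

    spark.sql("TRUNCATE TABLE nums")

    // The snapshot is a different table, so the truncate does not touch it.
    val counts = (spark.table("nums_snapshot").count(), spark.table("nums").count())

    // Clean up the illustration tables.
    spark.sql("DROP TABLE IF EXISTS nums")
    spark.sql("DROP TABLE IF EXISTS nums_snapshot")
    counts
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("snapshot-sketch")
      .master("local[*]")
      .getOrCreate()

    val (snapshotRows, sourceRows) = snapshotThenTruncate(spark)
    println(s"snapshot rows: $snapshotRows, source rows: $sourceRows")

    spark.stop()
  }
}
```

This costs an extra write, but the copy no longer depends on whether the cache entry is still around.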

Answer 2

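The truncate table command removes the cached data, then uncaches and empties the table. If you look at the TruncateTableCommand source code, at the bottom of the case class you will see the following about how the cache and the table are handled when a table is truncated: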

// After deleting the data, invalidate the table to make sure we don't keep around a stale
// file relation in the metastore cache.
spark.sessionState.refreshTable(tableName.unquotedString)
// Also try to drop the contents of the table from the columnar cache
try {
  spark.sharedState.cacheManager.uncacheQuery(spark.table(table.identifier))
} catch {
  case NonFatal(e) =>
    log.warn(s"Exception when attempting to uncache table $tableIdentWithDB", e)
}

if (table.stats.nonEmpty) {
  // empty table after truncation
  val newStats = CatalogStatistics(sizeInBytes = 0, rowCount = Some(0))
  catalog.alterTableStats(tableName, Some(newStats))
}
Seq.empty[Row]
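One way to check what the excerpt above does to your own DataFrame is to print `df.storageLevel` before and after the truncate. This is a minimal sketch, not taken from the Spark source: the table name `nums` and the helper `storageLevelsAroundTruncate` are hypothetical, and what gets reported after the truncate may differ between Spark versions, so no expected output is given.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheStatusSketch {
  // Hypothetical helper: returns df.storageLevel before and after TRUNCATE TABLE.
  def storageLevelsAroundTruncate(spark: SparkSession): (StorageLevel, StorageLevel) = {
    // Recreate the hypothetical table so the sketch is self-contained.
    spark.sql("DROP TABLE IF EXISTS nums")
    spark.sql("CREATE TABLE nums (num INT) USING parquet")
    spark.sql("INSERT INTO nums VALUES (10), (10)")

    val df = spark.sql("SELECT * FROM nums")
    df.cache()
    df.count() // run an action so the cache is actually populated

    val before = df.storageLevel
    spark.sql("TRUNCATE TABLE nums")
    // StorageLevel.NONE here would mean the cache entry for df is gone.
    val after = df.storageLevel

    println(s"rows visible through df after truncate: ${df.count()}")

    spark.sql("DROP TABLE IF EXISTS nums")
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("cache-status-sketch")
      .master("local[*]")
      .getOrCreate()

    val (before, after) = storageLevelsAroundTruncate(spark)
    println(s"storage level before truncate: $before")
    println(s"storage level after truncate:  $after")

    spark.stop()
  }
}
```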

Cached DataFrame refreshed after truncating table
