I'm a bit confused by Spark's caching behavior. I want to compute a dependent dataset (b), cache it, and then unpersist the source dataset (a). Here is my code:
val spark = SparkSession.builder().appName("test").master("local[4]").getOrCreate()
import spark.implicits._
val a = spark.createDataset(Seq(("a", 1), ("b", 2), ("c", 3)))
a.createTempView("a")
a.cache
println(s"Is a cached: ${spark.catalog.isCached("a")}")
val b = a.filter(x => x._2 < 3)
b.createTempView("b")
// calling action
b.cache.first
println(s"Is b cached: ${spark.catalog.isCached("b")}")
spark.catalog.uncacheTable("a")
println(s"Is b cached after a was unpersisted: ${spark.catalog.isCached("b")}")
With Spark 2.0.2 this works as expected:
Is a cached: true
Is b cached: true
Is b cached after a was unpersisted: true
But with 2.1.1:
Is a cached: true
Is b cached: true
Is b cached after a was unpersisted: false
How can I get the same behavior in 2.1.1?
Thanks.
Answer 0 (score: 1)
I don't know what the intended behavior is. Judging by Spark's own tests, 2.1.1 behaves the way it is meant to, but the test contains comments that suggest some doubt about it. You could open a JIRA against the Spark project to get the situation clarified.
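In the meantime, one possible workaround (my own sketch, not something I have verified against the 2.1.1 cache manager) is to break b's lineage from a before caching it, for example with the eager Dataset.checkpoint() available since 2.1, so that uncaching the "a" view should no longer touch b's cached plan. Reusing the SparkSession and import from the snippet above, in a fresh session with a checkpoint directory you can write to:

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
val a = spark.createDataset(Seq(("a", 1), ("b", 2), ("c", 3)))
a.createTempView("a")
a.cache
// checkpoint() is eager by default: it materializes the data and replaces the
// logical plan with the checkpointed result, cutting the dependency on a
val b = a.filter(x => x._2 < 3).checkpoint()
b.createTempView("b")
b.cache.first
spark.catalog.uncacheTable("a")
println(s"Is b cached: ${spark.catalog.isCached("b")}")
// Alternatively, simply re-cache and re-materialize b after uncaching a,
// accepting that it is recomputed once:
// spark.catalog.uncacheTable("a"); b.cache.first

As for why the behavior changed, here is the test that exercises it: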
CachedTableSuite.scala
test("uncaching temp table") {
testData.select('key).createOrReplaceTempView("tempTable1")
testData.select('key).createOrReplaceTempView("tempTable2")
spark.catalog.cacheTable("tempTable1")
assertCached(sql("SELECT COUNT(*) FROM tempTable1"))
assertCached(sql("SELECT COUNT(*) FROM tempTable2"))
// Is this valid?
spark.catalog.uncacheTable("tempTable2")
// Should this be cached?
assertCached(sql("SELECT COUNT(*) FROM tempTable1"), 0)
}
The assertCached method checks that the number of cached relations in the query plan equals its second argument, numCachedTables.
QueryTest.scala
/**
 * Asserts that a given [[Dataset]] will be executed using the given number of cached results.
 */
def assertCached(query: Dataset[_], numCachedTables: Int = 1): Unit = {
  val planWithCaching = query.queryExecution.withCachedData
  val cachedData = planWithCaching collect {
    case cached: InMemoryRelation => cached
  }

  assert(
    cachedData.size == numCachedTables,
    s"Expected query to contain $numCachedTables, but it actually had ${cachedData.size}\n" +
      planWithCaching)
}
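If you want to run the same kind of check interactively, a rough equivalent of assertCached (my own sketch, leaning on the developer-facing queryExecution API rather than anything guaranteed stable) is:

// Count the InMemoryRelation nodes in the plan after cached subplans have
// been substituted in; assertCached compares this count to numCachedTables
def countCachedRelations(ds: org.apache.spark.sql.Dataset[_]): Int =
  ds.queryExecution.withCachedData.collect {
    case plan if plan.nodeName == "InMemoryRelation" => plan
  }.size

println(countCachedRelations(spark.table("b")))  // 1 while b's cache entry is alive, 0 once it is dropped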