Differences in Spark caching between 2.0.2 and 2.1.1

Date: 2017-07-29 14:46:55

Tags: scala apache-spark

I am a bit confused about Spark's caching behavior. I want to compute a derived dataset (b), cache it, and unpersist the source dataset (a). Here is my code:

val spark = SparkSession.builder().appName("test").master("local[4]").getOrCreate()
import spark.implicits._
val a = spark.createDataset(Seq(("a", 1), ("b", 2), ("c", 3)))
a.createTempView("a")
a.cache
println(s"Is a cached: ${spark.catalog.isCached("a")}")
val b = a.filter(x => x._2 < 3)
b.createTempView("b")
// calling action
b.cache.first
println(s"Is b cached: ${spark.catalog.isCached("b")}")

spark.catalog.uncacheTable("a")
println(s"Is b cached after a was unpersisted: ${spark.catalog.isCached("b")}")

With Spark 2.0.2 this works as expected:

Is a cached: true
Is b cached: true
Is b cached after a was unpersisted: true

But with 2.1.1:

Is a cached: true
Is b cached: true
Is b cached after a was unpersisted: false

How can I achieve the same behavior in 2.1.1?

Thanks.

1 Answer:

Answer 0 (score: 1)

I don't know what the intended behavior is. Judging by Spark's own test suite, 2.1.1 behaves as designed, but the test contains comments that suggest some doubt about it. Perhaps you could open a JIRA ticket in the Spark project to clarify the situation.
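As a possible workaround (a sketch, not confirmed against 2.1.1 internals): if `b`'s cached plan in 2.1.1 is dropped because it still references `a`'s in-memory relation, you could try breaking `b`'s lineage before caching it, for example with `Dataset.checkpoint` (available since Spark 2.1). The checkpoint directory path below is an arbitrary example:

```scala
import org.apache.spark.sql.SparkSession

object CacheWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("test").master("local[4]").getOrCreate()
    import spark.implicits._

    // checkpoint() needs a checkpoint directory; this path is just an example
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    val a = spark.createDataset(Seq(("a", 1), ("b", 2), ("c", 3)))
    a.createTempView("a")
    a.cache()

    // checkpoint() is eager by default: it materializes the result and
    // truncates the lineage, so b's plan no longer refers to a's cached data
    val b = a.filter(x => x._2 < 3).checkpoint()
    b.createTempView("b")
    b.cache().first()

    spark.catalog.uncacheTable("a")
    println(s"Is b cached after a was unpersisted: ${spark.catalog.isCached("b")}")
  }
}
```

Whether this keeps `b` cached in 2.1.1 would need to be verified; the idea is only that after checkpointing, `b` is backed by its own materialized RDD rather than by `a`'s plan.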

CachedTableSuite.scala

test("uncaching temp table") {
  testData.select('key).createOrReplaceTempView("tempTable1")
  testData.select('key).createOrReplaceTempView("tempTable2")
  spark.catalog.cacheTable("tempTable1")

  assertCached(sql("SELECT COUNT(*) FROM tempTable1"))
  assertCached(sql("SELECT COUNT(*) FROM tempTable2"))

  // Is this valid?
  spark.catalog.uncacheTable("tempTable2")

  // Should this be cached?
  assertCached(sql("SELECT COUNT(*) FROM tempTable1"), 0)
}

The assertCached method checks that the number of cached relations in the plan equals its second argument, numCachedTables.

QueryTest.scala

/**
 * Asserts that a given [[Dataset]] will be executed using the given number of cached results.
 */
def assertCached(query: Dataset[_], numCachedTables: Int = 1): Unit = {
  val planWithCaching = query.queryExecution.withCachedData
  val cachedData = planWithCaching collect {
    case cached: InMemoryRelation => cached
  }

  assert(
    cachedData.size == numCachedTables,
    s"Expected query to contain $numCachedTables, but it actually had ${cachedData.size}\n" +
    planWithCaching)
}