Question

I am loading large datasets and then caching them for reference throughout my code. The code looks something like this:

val conversations = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcUrl)
  .option("tempdir", tempDir)
  .option("forward_spark_s3_credentials","true")
  .option("query", "SELECT * FROM my_table "+
                   "WHERE date <= '2017-06-03' "+
                   "AND date >= '2017-03-06' ")
  .load()
  .cache()

If I don't use cache, the code executes quickly because Datasets are evaluated lazily. But if I add cache(), the block takes a long time to run.

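From the Event Timeline in the Spark UI, it appears that the SQL table is being transferred to the worker nodes and then cached on the worker nodes.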

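Why is cache executing immediately? The source code appears to only mark the Dataset for caching, with the actual caching happening when the data is computed: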

When cache or persist is called, the source code for Dataset calls this code in CacheManager.scala:

  /**
   * Caches the data produced by the logical representation of the given [[Dataset]].
   * Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
   * recomputing the in-memory columnar representation of the underlying table is expensive.
   */
  def cacheQuery(
      query: Dataset[_],
      tableName: Option[String] = None,
      storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
    val planToCache = query.logicalPlan
    if (lookupCachedData(planToCache).nonEmpty) {
      logWarning("Asked to cache already cached data.")
    } else {
      val sparkSession = query.sparkSession
      cachedData.add(CachedData(
        planToCache,
        InMemoryRelation(
          sparkSession.sessionState.conf.useCompression,
          sparkSession.sessionState.conf.columnBatchSize,
          storageLevel,
          sparkSession.sessionState.executePlan(planToCache).executedPlan,
          tableName)))
    }
  }

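That only appears to mark the data for caching rather than actually caching it. Based on other answers on Stack Overflow, I would also expect cache to return immediately.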
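Since cacheQuery in the excerpt above evaluates executePlan(planToCache).executedPlan, one way to narrow this down might be to time physical planning, cache(), and an action separately. The following is only a minimal sketch under assumptions: it uses a local SparkSession and a synthetic spark.range dataset as stand-ins for my Redshift source, and timed is a hypothetical helper, not part of Spark. It assumes Spark 2.1 or later (for Dataset.storageLevel).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CacheTiming {
  // Hypothetical helper (not a Spark API): runs `body` and prints the elapsed wall-clock time.
  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    val ms = (System.nanoTime() - start) / 1e6
    println(f"$label%-30s $ms%.1f ms")
    result
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session stands in for the real cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-timing")
      .getOrCreate()

    // Assumption: a synthetic dataset stands in for the Redshift query result.
    val df: DataFrame = spark.range(0L, 1000000L).toDF("id")

    // Physical planning only; no job should be submitted by this step alone.
    timed("executedPlan (planning only)") { df.queryExecution.executedPlan }

    // Registers the plan with the CacheManager; the data is not yet materialized.
    timed("cache()") { df.cache() }

    // Shows the storage level recorded for the Dataset after cache().
    println(s"storage level after cache(): ${df.storageLevel}")

    // First action: this is where the cached data would actually be built.
    timed("count() (first action)") { df.count() }

    spark.stop()
  }
}
```

Replacing the synthetic DataFrame with the Redshift-backed one (without the trailing .cache()) and comparing the first two timings should show whether the time goes into planning for that data source or into the cache call itself.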

Has anyone else seen caching happen immediately, before an action is performed on the dataset? Why does this happen?

Why does calling cache take a long time on a Spark Dataset?
