Question

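I have the following code. I call count to trigger each persist and materialize the transformations above it. However, I noticed that in the DAG and stages of the two count jobs the first persist shows up twice, whereas I expected the second count call to involve only the second persist.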

val df = sparkSession.read
      .parquet(bigData)
      .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
      .as[SafegraphRawData]
      // Repartition here so that shuffle operations can be performed later
      // Other transformations and minor filtering go here
      .repartition(nrInputPartitions)
      // First persist: the objects do not fit in memory (Persist 67)
      .persist(StorageLevel.MEMORY_AND_DISK)

    LOG.info(s"First count  = " + df.count)

    val filter: BaseFilter = new BaseFilter()

    LOG.info(s"Number of partitions: " + df.rdd.getNumPartitions)
    val rddPoints= df
      .map(parse)
      .filter(filter.IsValid(_, deviceStageMetricService, providerdevicelist, sparkSession))
      .map(convert)
    // Second persist: count and partitionBy both run on this data, so materialize all the transformations above once
    val dsPoints = rddPoints.persist(StorageLevel.MEMORY_AND_DISK)
    val totalPoints = dsPoints.count()
    LOG.info(s"Second count  = $totalPoints")

Answer 1

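With StorageLevel.MEMORY_AND_DISK, Spark tries to fit all the data in memory and spills to disk whatever does not fit.

You are persisting more than once here. Spark's in-memory cache uses LRU eviction, so the later persist can push out data cached by the earlier one.

Even if you specify StorageLevel.MEMORY_AND_DISK, when data is evicted from memory by other cached data, Spark does not spill it to disk. So when you run the next count, it has to re-evaluate the DAG in order to recover the partitions that are no longer in the cache.

I suggest using StorageLevel.DISK_ONLY to avoid this recomputation.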
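The suggestion above could look roughly like the sketch below. It is not the asker's pipeline: `spark.range` stands in for the parquet input, the `map`/`filter` chain stands in for parse, filter and convert, and the local master, object name and partition count are assumptions for illustration. Whether DISK_ONLY actually removes the recomputation in a real job is best confirmed in the Storage tab of the Spark UI.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object DiskOnlyPersistSketch {
  // Returns (first count, second count, storage level reported for the first Dataset).
  def run(): (Long, Long, StorageLevel) = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, for illustration only
      .appName("disk-only-persist-sketch")
      .getOrCreate()
    import spark.implicits._

    try {
      // Hypothetical stand-in for the filtered, repartitioned parquet input.
      val df = spark.range(0, 100000).as[Long]
        .repartition(8)
        .persist(StorageLevel.DISK_ONLY)
      val firstCount = df.count() // the action that materializes the first cache

      // Hypothetical stand-in for the parse / filter / convert chain.
      val points = df.map(_ * 2).filter(_ % 3 == 0)
        .persist(StorageLevel.DISK_ONLY)
      val secondCount = points.count() // materializes the second cache

      val level = df.storageLevel
      // Release both caches once nothing downstream needs them.
      df.unpersist()
      points.unpersist()
      (firstCount, secondCount, level)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (first, second, level) = run()
    println(s"First count = $first, second count = $second, level = ${level.description}")
  }
}
```

Calling `unpersist()` on the first Dataset as soon as the second one is materialized also reduces the pressure on storage, which is the situation this answer is concerned with.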

Answer 2

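Here is the whole picture.

persist and cache are lazy in Spark, just like transformations. After calling either of them, an action has to run before the RDD or DataFrame is actually cached.

Second, the unit of caching is the partition. When cache or persist takes effect, only the partitions that can be held in memory are saved. For the remaining partitions, which could not be stored, the whole DAG is executed again as soon as a new action is encountered.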
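The laziness described above can be observed with an accumulator that counts how often the upstream `map` function runs. This is a minimal sketch under assumptions: a small local RDD that fits entirely in the cache, and hypothetical names (`LazyPersistSketch`, `evaluations`). Accumulators updated inside transformations can over-count when tasks are retried, so treat it as a demonstration rather than a measurement tool.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object LazyPersistSketch {
  // Returns the number of map evaluations seen
  // (right after persist, after the first count, after the second count).
  def run(): (Long, Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, for illustration only
      .appName("lazy-persist-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    try {
      val evaluations = sc.longAccumulator("evaluations")

      val doubled = sc.parallelize(1 to 1000, 4)
        .map { x => evaluations.add(1); x * 2 }
        .persist(StorageLevel.MEMORY_AND_DISK) // only marks the RDD, computes nothing

      val afterPersist = evaluations.value.longValue()

      doubled.count() // first action: runs the map and fills the cache
      val afterFirstCount = evaluations.value.longValue()

      doubled.count() // second action: served from cached partitions
      val afterSecondCount = evaluations.value.longValue()

      (afterPersist, afterFirstCount, afterSecondCount)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (a, b, c) = run()
    println(s"after persist = $a, after first count = $b, after second count = $c")
  }
}
```

If some partitions could not be kept in the cache, the third number would grow beyond the second, because only those missing partitions would be recomputed.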

Answer 3

Try this:

val df = sparkSession.read
      .parquet(bigData)
      .filter(row => dateRange(row.getLong(5), lowerTimeBound, upperTimeBound))
      .as[SafegraphRawData]
      // Repartition here so that shuffle operations can be performed later
      // Other transformations and minor filtering go here
      .repartition(nrInputPartitions)
      // First persist: the objects do not fit in memory (Persist 67)

df.persist(StorageLevel.MEMORY_AND_DISK)
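Whichever way persist is called, one way to check whether a later job reads the first cache instead of recomputing it is to look at the physical plan of the derived Dataset. The sketch below assumes a small local Dataset and hypothetical names. It relies on the cached scan appearing as `InMemoryTableScan` in the plan text, which is how the Spark versions I am aware of print it, and on `queryExecution`, which is a developer-facing API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CachePlanCheckSketch {
  // Returns true when the derived Dataset's physical plan scans the cached data.
  def run(): Boolean = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local run, for illustration only
      .appName("cache-plan-check-sketch")
      .getOrCreate()
    import spark.implicits._

    try {
      // Hypothetical stand-in for the first persisted Dataset in the question.
      val df = spark.range(0, 1000).as[Long]
        .persist(StorageLevel.MEMORY_AND_DISK)
      df.count() // materialize the cache

      // Hypothetical stand-in for the transformations applied afterwards.
      val derived = df.map(_ + 1)

      derived.explain() // prints the physical plan for a visual check
      derived.queryExecution.executedPlan.toString.contains("InMemoryTableScan")
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    println(s"Derived plan reads from the cache: ${run()}")
}
```

A cached scan in the plan means the first persist is part of the second job's lineage, so seeing it in the second job's DAG does not by itself show that the upstream work was repeated.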

Why does Spark repeat transformations after persist?