Question

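I have been running very large use cases, and because of resource constraints I have to limit the amount of data being processed at any given time. So I run a counter (e.g. 1 to 10), with each run processing a fixed batch (e.g. 200000 rows).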

My question is this: if my source has 1 million rows, I end up processing 5 runs (1M / 200000). I also have multiple sources. For example:

Source A has 1M rows
Source B has 2M rows.
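The run count per source follows from the row count and the batch size. Below is a minimal sketch of that arithmetic; `RunCount` and `runsNeeded` are hypothetical names introduced only for illustration, and the batch size of 200000 is the example figure from above.

```scala
object RunCount {
  // Number of runs needed to cover totalRows in batches of batchSize rows,
  // rounding up so that a final partial batch still gets its own run.
  def runsNeeded(totalRows: Long, batchSize: Long): Long = {
    require(batchSize > 0, "batchSize must be positive")
    (totalRows + batchSize - 1) / batchSize
  }

  def main(args: Array[String]): Unit = {
    val batchSize = 200000L
    println(runsNeeded(1000000L, batchSize)) // Source A: 5
    println(runsNeeded(2000000L, batchSize)) // Source B: 10
  }
}
```

With these numbers, a loop bound fixed at 5 would cover all of Source A but only the first 1M rows of Source B, so the bound probably needs to be derived per source.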

When I am inside the loop over each source, say:

for (source <- Sources) {
  val rddSource = spark.read.table(source)
  rddSource.persist()
  rddSource.count() // action that materializes the persisted data

  var counter = 0
  while (counter < 5) {
    // process every 200000 rows with the above persisted rddSource
    counter += 1
  }
}

I want to persist the data as shown above. This helps across the 5 runs over Source A, each of which processes 200000 rows.

But on the next iteration of the for loop, for Source B, does it replace the previously cached {rddSource}, or do I need something like this?

for (source <- Sources) {
      val rddSource = spark.read.table(source)
      rddSource.unpersist()
      rddSource.persist()

      rddSource.count() // action that materializes the persisted data

      var counter = 0
      while (counter < 5) {
           // process every 200000 rows with the above persisted rddSource
           counter += 1
        }
    }
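For comparison, here is a hedged sketch of a third arrangement, not taken from the question: release each source's cache at the end of its own iteration. As far as I understand Spark's caching, calling `persist` on a new DataFrame does not replace or release what was cached for an earlier one, and calling `unpersist` on a freshly read DataFrame before `persist` only targets that new DataFrame, so it would leave Source A's cache untouched. The table names, the `PersistPerSource` object, and the `processSource` helper are hypothetical, and the batch processing itself is left as a placeholder.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistPerSource {
  // Caches one source, runs the batch loop, and always releases the cache
  // afterwards. Returns the storage level observed after unpersist, which
  // should be StorageLevel.NONE once the cached data has been released.
  def processSource(spark: SparkSession, source: String, runs: Int): StorageLevel = {
    val rddSource = spark.read.table(source)
    rddSource.persist()
    rddSource.count() // action that materializes the persisted data

    try {
      var counter = 0
      while (counter < runs) {
        // placeholder: process the next 200000 rows of rddSource here
        counter += 1
      }
    } finally {
      // Release this source's cached data before the next source is read.
      rddSource.unpersist(blocking = true)
    }
    rddSource.storageLevel
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-per-source").getOrCreate()

    // Hypothetical table names standing in for Source A and Source B.
    val sources = Seq("source_a", "source_b")

    for (source <- sources) {
      val levelAfter = processSource(spark, source, runs = 5)
      println(s"$source storage level after unpersist: $levelAfter")
    }

    spark.stop()
  }
}
```

Putting `unpersist` in a `finally` block is a design choice rather than a requirement: it keeps at most one source cached at a time even if a batch fails partway through.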

Apache Spark unpersist and persist

0 answers: