Question

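I have to process about 500 GB of compressed data and apply several different filters to it. I am very unsure how Spark's persist operation on RDDs actually behaves. Here is my code: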

val mypath = paths(0)

val df = sparkSession.read
  .parquet(mypath)
  .as[SafegraphRawData]
  // Persist here because the uncompressed Java objects do not fit in memory
  .persist(StorageLevel.MEMORY_AND_DISK)

val filter: BaseFilter = new BaseFilter()

val upperProcessingDate = processingDate.plusDays(appConf.duration)
LOG.info(s"Filter between $processingDate and $upperProcessingDate")
val lowerTimeBound = processingDate.getMillis()
val upperTimeBound = upperProcessingDate.getMillis() - 1

LOG.info(s"Number of partitions: ${df.rdd.getNumPartitions}")
val rddPoints = df
  // This filter reduces the amount of data
  .filter(dateRange(_, lowerTimeBound, upperTimeBound))
  // So repartition here to be able to perform shuffle operations later
  .repartition(nrInputPartitions)
  // Further transformations and minor filtering
  .map(parse)
  .filter(filter.IsValid(_, deviceStageMetricService, providerdevicelist, sparkSession))
  .map(convert)

LOG.info(s"Number of partitions: ${rddPoints.rdd.getNumPartitions}")
// count and the later partitionBy write both run on this data,
// so persist the result of all the transformations above
val dsPoints = rddPoints.persist(StorageLevel.MEMORY_AND_DISK)
val totalPoints = dsPoints.count()
LOG.info(s"in safegraph.load: totalPoints = $totalPoints")

dsPoints.show()
LOG.info("show results...")
dsPoints
// ...
val nrOutputPartitions = appConf.getNrOutputPartitions()

var exportRdd = stage
if (nrOutputPartitions > 0) {
  LOG.info(s"Coalescing parquet preExportRdd to $nrOutputPartitions partitions")
  exportRdd = stage.coalesce(nrOutputPartitions)
}

exportRdd.toDF()
  .write
  .partitionBy("y", "m", "d", "r", "p")
  .format("parquet")
  .mode(SaveMode.Append)
  .save(appConf.getS3DestinationUrl())

exportRdd.unpersist()

I expected this code to produce a fairly simple DAG, but for the count action I get a job with 3 stages and a large DAG that I cannot make sense of.
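To make the DAG easier to discuss, the plan and RDD lineage can be printed next to the Spark UI graph. The following is only a minimal sketch, assuming a local SparkSession and synthetic data from `spark.range` in place of the parquet input. The names `LineageSketch` and `run` are made up for illustration and are not part of my job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: mimics the filter -> repartition -> persist -> count
// shape of the real job on synthetic data.
object LineageSketch {
  def run(spark: SparkSession): Long = {
    val points = spark.range(0, 1000)   // stand-in for the parquet input
      .filter(col("id") % 2 === 0)      // stand-in for the date-range filter
      .repartition(4)                   // introduces a shuffle boundary
      .persist(StorageLevel.MEMORY_AND_DISK)

    points.explain()                    // prints the physical plan
    println(points.rdd.toDebugString)   // prints the RDD lineage

    val total = points.count()          // first action, materializes the cache
    points.unpersist()
    total
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; the real job runs on a cluster.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("lineage-sketch")
      .getOrCreate()
    try println(s"total = ${run(spark)}")
    finally spark.stop()
  }
}
```

Comparing this output for the real `dsPoints` with the stages shown in the UI should at least show where the shuffle boundaries are.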

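Stage 12 is perfectly clear: it performs all the transformations and persists the result. But why is it fully repeated in stage 13? Which storage level is the better choice? And if MEMORY_AND_DISK exists, why would anyone need DISK_ONLY?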
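For reference, the storage levels in question differ only in a handful of flags on `StorageLevel`. A minimal sketch, assuming spark-core is on the classpath (the object name `StorageLevelSketch` is made up for illustration), that prints those flags side by side:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: lists the flags of the storage levels being compared.
object StorageLevelSketch {
  val levels: Seq[(String, StorageLevel)] = Seq(
    "MEMORY_ONLY"         -> StorageLevel.MEMORY_ONLY,
    "MEMORY_AND_DISK"     -> StorageLevel.MEMORY_AND_DISK,
    "MEMORY_AND_DISK_SER" -> StorageLevel.MEMORY_AND_DISK_SER,
    "DISK_ONLY"           -> StorageLevel.DISK_ONLY
  )

  def main(args: Array[String]): Unit = {
    levels.foreach { case (name, level) =>
      // useMemory / useDisk say where blocks may be stored;
      // deserialized says whether they are kept as JVM objects.
      println(s"$name: memory=${level.useMemory}, disk=${level.useDisk}, " +
        s"deserialized=${level.deserialized}, replication=${level.replication}")
    }
  }
}
```

As far as I understand it, MEMORY_AND_DISK keeps partitions in memory as deserialized objects and writes only the ones that do not fit to disk, whereas DISK_ONLY never holds cached partitions in executor memory. What I cannot judge is which of these matters for a 500 GB input.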

What persistence options are there for loading a large dataset in Spark?

0 answers: