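I have to process about 500 GB of compressed data and apply several different filters to it. I am quite unsure about how Spark handles persist operations on RDDs. Here is my code: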
import org.apache.spark.storage.StorageLevel

val mypath = paths(0)
val df = sparkSession.read
  .parquet(mypath)
  .as[SafegraphRawData]
  // Persist here because the uncompressed Java objects do not fit in memory
  .persist(StorageLevel.MEMORY_AND_DISK)
val filter: BaseFilter = new BaseFilter()
val upperProcessingDate = processingDate.plusDays(appConf.duration)
LOG.info(s"Filter between $processingDate and $upperProcessingDate")
val lowerTimeBound = processingDate.getMillis()
val upperTimeBound = upperProcessingDate.getMillis() - 1
LOG.info(s"Number of partitions: ${df.rdd.getNumPartitions}")
val rddPoints = df
  // This filter reduces the amount of data
  .filter(dateRange(_, lowerTimeBound, upperTimeBound))
  // so repartition here to be able to perform shuffle operations later
  .repartition(nrInputPartitions)
  // Further transformations and minor filtering
  .map(parse)
  .filter(filter.IsValid(_, deviceStageMetricService, providerdevicelist, sparkSession))
  .map(convert)
LOG.info(s"Number of partitions: ${rddPoints.rdd.getNumPartitions}")
// Since count and partitionBy will both run on this data, persist it
// so that all of the transformations above are computed here
val dsPoints = rddPoints.persist(StorageLevel.MEMORY_AND_DISK)
val totalPoints = dsPoints.count()
LOG.info(s"in safegraph.load: totalpoints = $totalPoints")
dsPoints.show()
LOG.info("show results...")
dsPoints
// ...
import org.apache.spark.sql.SaveMode

val nrOutputPartitions = appConf.getNrOutputPartitions()
var exportRdd = stage
if (nrOutputPartitions > 0) {
  LOG.info(s"Coalescing parquet preExportRdd to $nrOutputPartitions partitions")
  exportRdd = stage.coalesce(nrOutputPartitions)
}
exportRdd.toDF().write.partitionBy("y", "m", "d", "r", "p")
  .format("parquet")
  .mode(SaveMode.Append)
  .save(appConf.getS3DestinationUrl())
// Note: after coalesce, exportRdd is a new object, so this call may not
// release data that was persisted under `stage`
exportRdd.unpersist()
I expected this code to produce a fairly simple DAG, but for the count action I get a job with 3 stages and a large DAG that I cannot make sense of.
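Stage 12 is clear enough: it performs all the transformations and persists the result. But why is it repeated in full in stage 13? Which storage level is the better choice? And if MEMORY_AND_DISK exists, why would anyone need DISK_ONLY?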
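To narrow the question down, here is a minimal sketch (not my production code) of the same filter, repartition, persist, count pattern on synthetic data. It assumes a local SparkSession; the object name PersistCheck, the method run, and the numbers 1000, 3, and 4 are made up for illustration. It only uses persist, count, explain, storageLevel, and unpersist from the Dataset API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Hypothetical helper for experimenting; not part of the job above.
object PersistCheck {
  // Returns the row count and the storage level reported for the persisted data.
  def run(spark: SparkSession): (Long, StorageLevel) = {
    val cached = spark.range(0, 1000)         // synthetic stand-in for the parquet input
      .filter(col("id") % 3 === 0)            // stand-in for the date-range filter
      .repartition(4)                         // introduces a shuffle, so the job gets an extra stage
      .persist(StorageLevel.MEMORY_AND_DISK)  // lazy: nothing is computed or stored yet

    val n = cached.count()                    // first action: computes the plan and fills the cache
    cached.explain()                          // prints the physical plan for inspection
    val level = cached.storageLevel           // storage level currently assigned to this Dataset
    cached.unpersist()
    (n, level)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-check")
      .getOrCreate()
    try {
      val (n, level) = run(spark)
      println(s"count = $n, storage level = $level")
    } finally {
      spark.stop()
    }
  }
}
```

Running this locally and comparing its stages in the Spark UI with those of the real job might show whether the extra stages come from the repartition shuffle or from the persisted data being recomputed.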