I have to process about 500 GB of compressed data and apply several different filters to it. I am quite unsure how Spark's persist behaves on RDDs. Here is my code:
val mypath = paths(0)
val df = sparkSession.read
// Persist here, since the uncompressed Java objects do not fit in memory
val filter: BaseFilter = new BaseFilter()
val upperProcessingDate = processingDate.plusDays(appConf.duration)
LOG.info(s"Filter between $processingDate and $upperProcessingDate")
val lowerTimeBound = processingDate.getMillis()
val upperTimeBound = upperProcessingDate.getMillis() - 1
LOG.info(s"Number of partitions: ${df.rdd.getNumPartitions}")
val rddPoints = df
  // This transformation reduces the data
  .filter(dateRange(_, lowerTimeBound, upperTimeBound))
  // So repartition here to be able to perform shuffle operations later
  // Further transformations and minor filtering
  .filter(filter.IsValid(_, deviceStageMetricService, providerdevicelist, sparkSession))
LOG.info(s"Number of partitions: ${rddPoints.rdd.getNumPartitions}")
// A count and a partitionBy write both follow, so persist the result of all the transformations above
val dsPoints = rddPoints.persist(StorageLevel.MEMORY_AND_DISK)
val totalPoints = dsPoints.count()
LOG.info(s"in safegraph.load: totalpoints = $totalPoints")
LOG.info(s"show results...")
// ...
val nrOutputPartitions = appConf.getNrOutputPartitions()
var exportRdd = stage
if (nrOutputPartitions > 0) {
  LOG.info(s"Coalescing parquet preExportRdd to $nrOutputPartitions partitions")
  exportRdd = stage.coalesce(nrOutputPartitions)
}
exportRdd.toDF().write.partitionBy("y", "m", "d", "r", "p")
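Stage 12 is clear enough: it executes all the transformations and persists the result. But why is all of that work repeated in stage 13? Which storage level is the better choice here? And if MEMORY_AND_DISK exists, why would anyone use DISK_ONLY?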
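For reference, here is a minimal sketch of how persist interacts with actions. It is not the job above: it assumes a local SparkSession, and `PersistSketch`, `counts` and the `spark.range` input are hypothetical stand-ins for the real filtered points.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Returns (count of all persisted rows, count of a later filter over them).
  def counts(spark: SparkSession): (Long, Long) = {
    // Hypothetical stand-in for the filtered points: even ids below 1000.
    val filtered = spark.range(0, 1000).filter("id % 2 = 0")

    // persist is lazy: it only marks the Dataset for caching, nothing runs yet.
    val cached = filtered.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action computes the filter and fills the cache.
    val total = cached.count()

    // Later queries built on the persisted Dataset can read the cached blocks
    // instead of recomputing the filter from the source.
    val small = cached.filter("id < 100").count()

    // Reports the storage level the Dataset is currently persisted with.
    println(s"storage level: ${cached.storageLevel}")

    // Release the cached blocks once they are no longer needed.
    cached.unpersist()
    (total, small)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("persist-sketch")
      .getOrCreate()
    val (total, small) = counts(spark)
    println(s"total = $total, small = $small")
    spark.stop()
  }
}
```

Replacing `StorageLevel.MEMORY_AND_DISK` with `StorageLevel.DISK_ONLY` in this sketch changes only where the cached blocks are kept, not the results. Whether a later stage reuses the cache in the real job depends on whether it is built from the persisted Dataset, which the Storage tab of the Spark UI can help confirm.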