I have two data sources containing similar data, and I want to compare them with Scala Spark code. Currently I have the code below, but in the Spark UI I can see that my rawDF DataFrame is being created twice, pulling 40 GB from the raw file each time; this shows up as Job 0 and Job 1 while my code executes. How do I prevent it from pulling the data twice? Am I using multiple DataFrames correctly?
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.storage.StorageLevel

// Create the SQL context.
val sqlContext = new SQLContext(context)
// Pull the data from the database, filter it down to what will be output, and place it in a DataFrame.
val databaseDF: DataFrame = DataFrameUtils.getDataBaseDataInDataFrame.persist(StorageLevel.MEMORY_AND_DISK)
// Pull the data from the raw file into a DataFrame.
/** THIS IS CREATED TWICE, FROM WHAT I SEE IN THE SPARK UI **/
val rawDF: DataFrame = DataFrameUtils.getRawDataInDataFrame(sqlContext, filePath).persist(StorageLevel.MEMORY_AND_DISK)
// Compute the counts for the report by comparing the two DataFrames.
val teacherCountDF: DataFrame = DataFrameUtils.getTeacherCountDF(rawDF).persist(StorageLevel.MEMORY_AND_DISK)
val teacherCoverageCountDF: DataFrame = DataFrameUtils.getTeacherCoverageCountDF(databaseDF).persist(StorageLevel.MEMORY_AND_DISK)
val classCountDF: DataFrame = DataFrameUtils.getClassCountDF(databaseDF).persist(StorageLevel.MEMORY_AND_DISK)
val falseNegativeDF: DataFrame = DataFrameUtils.getFalseNegativeCountDF(rawDF, databaseDF).persist(StorageLevel.MEMORY_AND_DISK)
val falsePositiveDF: DataFrame = DataFrameUtils.getFalsePositivesCountDF(rawDF, databaseDF, sqlContext).persist(StorageLevel.MEMORY_AND_DISK)
// Build the report: one row per distinct class value found in either source.
var report: DataFrame = rawDF.select(CLASS).unionAll(databaseDF.select(CLASS)).distinct()
report = sqlContext.createDataFrame(report.map { case Row(cls: String) =>
  Row(cls, lookupIavmTitle(cls), lookupClassNum(cls)) }, lookupClassDesc)
report = report.join(classCountDF, Seq(CLASS), "left")
  .join(teacherCountDF, Seq(CLASS), "left")
  .join(teacherCoverageCountDF, Seq(CLASS), "left")
  .join(falseNegativeDF, Seq(CLASS), "left")
  .join(falsePositiveDF, Seq(CLASS), "left")
  .na.fill(0) // fill the nulls produced by the left joins with 0
report.write
.format("json")
.mode("overwrite")
.save(outputFileName)
Answer 0 (score: 0)
The reason it is created twice is that rawDF is used to build both falseNegativeDF and falsePositiveDF:
val falseNegativeDF: DataFrame = DataFrameUtils.getFalseNegativeCountDF(rawDF, databaseDF).persist(StorageLevel.MEMORY_AND_DISK)
val falsePositiveDF: DataFrame = DataFrameUtils.getFalsePositivesCountDF(rawDF, databaseDF, sqlContext).persist(StorageLevel.MEMORY_AND_DISK)
I see that you persist rawDF, but persist in Spark is also a lazy operation: nothing is actually cached until an action materializes the DataFrame. So if you don't want rawDF to be generated twice, you need to force the computation by running an action on rawDF first, for example rawDF.count.
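A minimal sketch of that suggestion, reusing the helper calls and storage level from the question (their signatures are assumed to be as shown there):
// Persist is lazy: this only marks rawDF for caching.
val rawDF: DataFrame = DataFrameUtils
  .getRawDataInDataFrame(sqlContext, filePath)
  .persist(StorageLevel.MEMORY_AND_DISK)

// Run one action to materialize the cache; this scans the 40 GB file exactly once.
rawDF.count()

// DataFrames derived from rawDF afterwards read from the MEMORY_AND_DISK
// cache instead of re-reading the raw file.
val falseNegativeDF = DataFrameUtils.getFalseNegativeCountDF(rawDF, databaseDF)
val falsePositiveDF = DataFrameUtils.getFalsePositivesCountDF(rawDF, databaseDF, sqlContext)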