How to prevent a DataFrame from being created twice / using multiple DataFrames

Asked: 2017-11-28 15:41:57

Tags: scala apache-spark spark-dataframe

I have two data sources containing similar data and want to compare them with Scala Spark code. I currently have the code below, but in the Spark UI I can see that my rawDF DataFrame is being created twice, pulling 40GB from the raw file each time; I observed this as Job0 and Job1 while my code was executing. How do I prevent it from pulling the data twice? Am I using multiple DataFrames correctly?

    // Create the SQL context.
    val sqlContext = new SQLContext(context)

    // Pull the data from the database, filter it down to what will be output, and place it in a DataFrame.
    val gemDF: DataFrame = DataFrameUtils.getDataBaseDataInDataFrame.persist(StorageLevel.MEMORY_AND_DISK)

    // Pull the data from the raw file into a DataFrame.
    /** THIS IS CREATED TWICE, FROM WHAT I SEE IN THE SPARK UI **/
    val rawDF: DataFrame = DataFrameUtils.getRawDataInDataFrame(sqlContext, filePath).persist(StorageLevel.MEMORY_AND_DISK)

    // Compute the counts for the report by comparing the DataFrames.
    val teacherCountDF: DataFrame = DataFrameUtils.getTeacherCountDF(rawDF).persist(StorageLevel.MEMORY_AND_DISK)
    val teacherCoverageCountDF: DataFrame = DataFrameUtils.getTeacherCoverageCountDF(gemDF).persist(StorageLevel.MEMORY_AND_DISK)
    val classCountDF: DataFrame = DataFrameUtils.getClassCountDF(gemDF).persist(StorageLevel.MEMORY_AND_DISK)
    val falseNegativeDF: DataFrame = DataFrameUtils.getFalseNegativeCountDF(rawDF, gemDF).persist(StorageLevel.MEMORY_AND_DISK)
    val falsePositiveDF: DataFrame = DataFrameUtils.getFalsePositivesCountDF(rawDF, gemDF, sqlContext).persist(StorageLevel.MEMORY_AND_DISK)

    // Build the report from the distinct class values of both sources.
    var report: DataFrame = rawDF.select(CLASS).unionAll(gemDF.select(CLASS)).distinct()
    report = sqlContext.createDataFrame(report.map { case Row(className: String) =>
      Row(className, lookupIavmTitle(className), lookupClassNum(className)) }, lookupClassDesc)

    report = report.join(classCountDF, Seq(CLASS), "left")
      .join(teacherCountDF, Seq(CLASS), "left")
      .join(teacherCoverageCountDF, Seq(CLASS), "left")
      .join(falseNegativeDF, Seq(CLASS), "left")
      .join(falsePositiveDF, Seq(CLASS), "left")
      .na.fill(0, report.columns)

    report.write
      .format("json")
      .mode("overwrite")
      .save(outputFileName)

1 Answer:

Answer 0 (score: 0)

The reason it is created twice is that rawDF is used to build both falseNegativeDF and falsePositiveDF:

    val falseNegativeDF: DataFrame = DataFrameUtils.getFalseNegativeCountDF(rawDF, gemDF).persist(StorageLevel.MEMORY_AND_DISK)
    val falsePositiveDF: DataFrame = DataFrameUtils.getFalsePositivesCountDF(rawDF, gemDF, sqlContext).persist(StorageLevel.MEMORY_AND_DISK)

I see that you persist rawDF, but persist in Spark is also a lazy operation: it only marks the DataFrame for caching, and the cache is not populated until an action actually computes it. So if you don't want rawDF to be generated twice, force the computation by running an action on rawDF right after persisting it, for example rawDF.count().
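The persist-then-count idea can be illustrated without a Spark cluster using plain Scala laziness: a `lazy val` computes nothing until its first use, after which every later use reuses the cached value, much like a persisted DataFrame whose cache is only populated by the first action. This is a minimal sketch of the concept, not Spark code; `loadRawData` is a hypothetical stand-in for the expensive 40GB file scan:

```scala
object LazyCacheDemo {
  // Counts how many times the expensive load actually runs.
  var loadCount = 0

  // Stand-in for the expensive raw-file scan (hypothetical).
  def loadRawData(): Seq[Int] = {
    loadCount += 1
    1 to 5
  }

  // Like persist(): declares that the result should be kept,
  // but computes nothing until first use.
  lazy val rawData: Seq[Int] = loadRawData()
}

// Nothing has been loaded yet: declaring the cache alone is lazy.
assert(LazyCacheDemo.loadCount == 0)

// The first "action" materializes the data...
val total = LazyCacheDemo.rawData.sum

// ...and every later use hits the cached value instead of reloading.
val size = LazyCacheDemo.rawData.size
assert(LazyCacheDemo.loadCount == 1)
```

In Spark the analogous move is to call an eager action such as `rawDF.count()` immediately after `persist`, so that the single job populating the cache runs before any of the downstream DataFrames fan out from rawDF.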