Unable to store results in HDFS when the code runs a second iteration

Date: 2018-01-08 18:27:51

Tags: scala apache-spark spark-dataframe rdd

I am new to Spark and Scala and have been trying to implement data cleaning in Spark. The code below checks one column for missing values, stores them in outputrdd, and runs a loop to compute the missing values. The code works fine when there is only one missing value in the file, but because HDFS does not allow writing to the same location twice, it fails when there are multiple missing values. Could you help me write finalrdd to a specific location once the missing values have been computed for all occurrences?

def main(args: Array[String]) {

  val conf = new SparkConf().setAppName("app").setMaster("local")
  val sc = new SparkContext(conf)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  val files = sc.wholeTextFiles("/input/raw_files/")
  val file = files.map { case (filename, content) => filename }

  file.collect.foreach(filename => {
    cleaningData(filename)
  })

  // headers, outputrdd, finalrdd and logger are defined elsewhere (not shown)
  def cleaningData(file: String) = {
    // headers holds the column headers of the files
    var hdr = headers.toString()
    var vl = hdr.split("\t")
    sqlContext.clearCache()
    if (hdr.contains("COLUMN_HEADER")) {
      // checks for missing values in the dataframe and stores them in outputrdd
      if (!outputrdd.isEmpty()) {
        logger.info("value is zero then performing further operation")
        val outputdatetimedf = sqlContext.sql("select date,'/t',time from cpc where kwh = 0")
        val outputdatetimerdd = outputdatetimedf.rdd
        val strings = outputdatetimerdd.map(row => row.mkString).collect()
        for (i <- strings) {
          if (Coddition check) { // pseudocode condition, as posted
            // calculates the missing value and stores it in finalrdd
            finalrdd.map { x => x.mkString("\t") }.saveAsTextFile("/output")
            logger.info("file is written in file")
          }
        }
      }
    }
  }
}
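A stripped-down sketch of the failure itself (paths illustrative, reusing sc from above): the first save succeeds, the second is rejected because the output directory already exists:

sc.parallelize(Seq("first run")).saveAsTextFile("/output")  // succeeds
sc.parallelize(Seq("second run")).saveAsTextFile("/output") // typically fails with
// org.apache.hadoop.mapred.FileAlreadyExistsException:
// Output directory /output already exists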

1 answer:

Answer 0 (score: 0)

It is not clear how (Coddition check) works in your example. In any case, the function .saveAsTextFile("/output") should be called only once.

So I would rewrite your example as:

// keep the data as an RDD (drop '.collect()') so it can still be saved with saveAsTextFile
val finalrdd = outputdatetimerdd
   .filter(row => Coddition check) // don't know how this Coddition works
   .map(row => row.mkString("\t"))

// this part is called only once, not in a loop
finalrdd.saveAsTextFile("/output")
logger.info("file is written in file")
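The rewrite above still writes to one fixed path, so running it for a second input file would hit the same HDFS error. Below is a minimal sketch of one way around that, assuming cleaningData is refactored into a hypothetical cleanedRowsFor function that returns the cleaned rows for one file instead of saving them; all per-file results are unioned and saved exactly once:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CleanAndWriteOnce {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("app").setMaster("local")
    val sc = new SparkContext(conf)

    val filenames = sc.wholeTextFiles("/input/raw_files/")
      .map { case (filename, _) => filename }
      .collect()

    // hypothetical refactoring of cleaningData: returns the cleaned,
    // tab-joined rows for one file instead of calling saveAsTextFile
    def cleanedRowsFor(filename: String): RDD[String] = {
      // ... body of cleaningData, ending with
      // finalrdd.map(_.mkString("\t")) instead of a save
      sc.emptyRDD[String] // placeholder so the sketch compiles
    }

    // union every file's result and save the whole run exactly once
    val allResults = filenames
      .map(cleanedRowsFor)
      .reduceOption(_ union _)
      .getOrElse(sc.emptyRDD[String])

    allResults.saveAsTextFile("/output")
  }
}

Alternatively, each file can be saved to its own subdirectory (for example saveAsTextFile(s"/output/${new org.apache.hadoop.fs.Path(filename).getName}")), or, if overwriting is acceptable, the results can be written through the DataFrame writer with .mode(SaveMode.Overwrite).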