How to process files in parallel in Spark using spark.read

Time: 2018-05-24 10:36:23

Tags: scala apache-spark foreach apache-spark-sql

I have a text file that contains a list of files. Currently, I iterate over that list sequentially.

My file list looks like this:

D:\Users\bramasam\Documents\sampleFile1.txt
D:\Users\Documents\sampleFile2.txt

and I run the following code for each file:

val df = spark.read
   .format("csv") // built-in CSV source; the csv(...) call below selects it as well
   .option("header", false)
   .option("inferSchema", false)
   .option("delimiter", "|")
   .schema(StructType(fields)) //calling a method to find schema
   .csv(fileName_fetched_foreach)
   .toDF(old_column_string: _*)

 df.write.format("orc").save(target_file_location)

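What I want is to run the above code in parallel rather than sequentially, since there are no dependencies between the files. So I am trying something like the following, but it fails with an error: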

  //read the file which has the file list
    spark.read.textFile("D:\\Users\\Documents\\ORC\\fileList.txt").foreach { line =>
      val tempTableName = line.substring(line.lastIndexOf("\\"),line.lastIndexOf("."))
      val df = spark.read
        .format("csv")
        .option("header", false)
        .option("inferSchema", false)
        .option("delimiter", "|")
        .schema(StructType(fields))
        .csv(line)
        .toDF(old_column_string: _*)
        .registerTempTable(tempTableName)

      val result = spark.sql(s"select $new_column_string from $tempTableName") //reordering column order on how it has to be stored

      //Note: writing to ORC needs Hive support. So, make sure the syntax is right
      result.write.format("orc").save("D:\\Users\\bramasam\\Documents\\SCB\\ORCFile")
    }
  }

I get the following error:

java.lang.NullPointerException
    at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:135)
    at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:133)
    at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:689)
    at org.apache.spark.sql.SparkSession.read(SparkSession.scala:645)
    at ConvertToOrc$$anonfun$main$1.apply(ConvertToOrc.scala:25)
    at ConvertToOrc$$anonfun$main$1.apply(ConvertToOrc.scala:23)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)

2 Answers:

Answer 0: (score: 0)

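What you should do is put the files in a directory and let Spark read the whole directory. If needed, add a column holding the file name for each file.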

spark.read.textFile("D:\\Users\\Documents\\ORC\\*")

The * wildcard reads all files in the directory.
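The suggestion above can be sketched as follows. This is only an illustration under assumptions: the two-column schema and the column name source_file are hypothetical (the question builds its schema from a `fields` value that is not shown), the session runs locally, and the CSV reader options are copied from the question rather than from this answer. `input_file_name()` is Spark's built-in function that returns the path of the file each row came from.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReadDirectoryWithFileName {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("read-directory-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical schema; the question derives its own from `fields`.
    val schema = StructType(Seq(
      StructField("col1", StringType),
      StructField("col2", StringType)
    ))

    // Glob path taken from the answer above.
    val inputGlob = "D:\\Users\\Documents\\ORC\\*"

    // A single read covers every matching file; Spark splits the work
    // across partitions, so the files are processed in parallel.
    val df = spark.read
      .option("header", false)
      .option("delimiter", "|")
      .schema(schema)
      .csv(inputGlob)
      .withColumn("source_file", input_file_name()) // file each row was read from

    df.show(truncate = false)
    spark.stop()
  }
}
```

The source_file column makes it possible to tell the rows of different files apart after they have been combined into one DataFrame.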

Answer 1: (score: 0)

See @samthebest's answer.

In your case, you should pass the whole directory, for example:

spark.read.textFile("D:\\Users\\Documents\\ORC")
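When the files do not share a directory, as with the two paths in the question's list, a related option is to read the list on the driver and hand all paths to one read call, since `DataFrameReader.csv` accepts several paths. The sketch below assumes a hypothetical two-column schema and a local session; the helper name `readPaths` is made up for illustration, and the list file and output location are the ones shown in the question.

```scala
import scala.io.Source
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ReadListedFiles {
  // Hypothetical helper: reads the list file on the driver and returns
  // its non-empty lines with surrounding whitespace removed.
  def readPaths(listFile: String): Seq[String] = {
    val source = Source.fromFile(listFile)
    try source.getLines().map(_.trim).filter(_.nonEmpty).toList
    finally source.close()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("read-listed-files-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical schema; the question derives its own from `fields`.
    val schema = StructType(Seq(
      StructField("col1", StringType),
      StructField("col2", StringType)
    ))

    val paths = readPaths("D:\\Users\\Documents\\ORC\\fileList.txt")

    // csv(paths: _*) loads all listed files into one DataFrame in a single job.
    val df = spark.read
      .option("header", false)
      .option("delimiter", "|")
      .schema(schema)
      .csv(paths: _*)

    // Per the question, writing ORC may need Hive support depending on the setup.
    df.write.format("orc").save("D:\\Users\\bramasam\\Documents\\SCB\\ORCFile")
    spark.stop()
  }
}
```

Here `spark.read` is called only on the driver, never inside a `foreach` over a Dataset.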

If you want to read directories recursively, refer to this answer:

spark.read.textFile("D:\\Users\\Documents\\ORC\\*\\*")
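Reading a directory combines all files into one DataFrame. If each input file should keep its own ORC output, as in the question, one possible alternative (not taken from either answer) is to keep the loop on the driver and submit one Spark job per file from separate threads. The SparkSession is only usable on the driver, which is consistent with the NullPointerException raised when `spark.read` is called inside `foreach` on a Dataset. The sketch below assumes a hypothetical schema, a local session, and a made-up helper `baseName`; the input and output paths are the ones from the question.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ConvertFilesConcurrently {
  // Hypothetical helper: file name of a Windows path without its extension.
  def baseName(path: String): String = {
    val fileName = path.substring(path.lastIndexOf('\\') + 1)
    val dot = fileName.lastIndexOf('.')
    if (dot > 0) fileName.substring(0, dot) else fileName
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("concurrent-convert-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical schema; the question derives its own from `fields`.
    val schema = StructType(Seq(
      StructField("col1", StringType),
      StructField("col2", StringType)
    ))

    // Paths from the question's file list.
    val paths = Seq(
      "D:\\Users\\bramasam\\Documents\\sampleFile1.txt",
      "D:\\Users\\Documents\\sampleFile2.txt"
    )
    val outputRoot = "D:\\Users\\bramasam\\Documents\\SCB\\ORCFile"

    // Each Future runs on a driver thread and submits its own Spark job,
    // so the conversions can proceed concurrently.
    val jobs = paths.map { path =>
      Future {
        spark.read
          .option("header", false)
          .option("delimiter", "|")
          .schema(schema)
          .csv(path)
          .write
          .format("orc")
          .save(outputRoot + "\\" + baseName(path)) // one output directory per input file
      }
    }

    // Block until every conversion has finished (or one has failed).
    Await.result(Future.sequence(jobs), Duration.Inf)
    spark.stop()
  }
}
```

How many jobs actually run at the same time depends on the thread pool and on the resources available to the application.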