I have a text file that contains a list of files. At the moment I loop over that file list sequentially.
My file list looks like this,
D:\Users\bramasam\Documents\sampleFile1.txt
D:\Users\Documents\sampleFile2.txt
and run the following code for each file,
val df = spark.read
.format("org.apache.spark.csv")
.option("header", false)
.option("inferSchema", false)
.option("delimiter", "|")
.schema(StructType(fields)) //calling a method to find schema
.csv(fileName_fetched_foreach)
.toDF(old_column_string: _*)
df.write.format("orc").save(target_file_location)
What I want to do is run the above code in parallel rather than sequentially, since there are no dependencies between the files. So I am trying something like the code below, but I am running into an error,
//read the file which has the file list
spark.read.textFile("D:\\Users\\Documents\\ORC\\fileList.txt").foreach { line =>
val tempTableName = line.substring(line.lastIndexOf("\\"),line.lastIndexOf("."))
val df = spark.read
.format("org.apache.spark.csv")
.option("header", false)
.option("inferSchema", false)
.option("delimiter", "|")
.schema(StructType(fields))
.csv(line)
.toDF(old_column_string: _*)
.registerTempTable(tempTableName)
val result = spark.sql(s"select $new_column_string from $tempTableName") //reordering column order on how it has to be stored
//Note: writing to ORC needs Hive support. So, make sure the syntax is right
result.write.format("orc").save("D:\\Users\\bramasam\\Documents\\SCB\\ORCFile")
}
I am facing the following error,
java.lang.NullPointerException
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:135)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:133)
at org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:689)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:645)
at ConvertToOrc$$anonfun$main$1.apply(ConvertToOrc.scala:25)
at ConvertToOrc$$anonfun$main$1.apply(ConvertToOrc.scala:23)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
Answer 0 (score: 0)
What you should do is put the files in a directory and let Spark read the whole directory in one go; if needed, add a column to each row with the name of the file it came from, as in the sketch after the snippet below.
spark.read.textFile("D:\\Users\\Documents\\ORC\\*")
Answer 1 (score: 0)
See @samthebest's answer.
In your case you should pass the whole directory, for example:
spark.read.textFile("D:\\Users\\Documents\\ORC")
If you want to read directories recursively, see this answer:
spark.read.textFile("D:\\Users\\Documents\\ORC\\*\\*")