Spark SQL - Recursive reading inside folders

Asked: 2017-07-16 06:12:28

Tags: hive apache-spark-sql recursive-query parquet

I am trying to use a HiveContext so I can take advantage of some HiveQL window functions in Spark SQL. However, it fails to read the data files recursively from the folder tree (the folders are partitioned by year and month).

My folders (as listed on the driver):

data/outputOozie/22/year=2016
data/outputOozie/22/year=2016/month=10
data/outputOozie/22/year=2016/month=9
data/outputOozie/22/year=2016/month=10/1
data/outputOozie/22/year=2016/month=10/2
data/outputOozie/22/year=2016/month=10/3
data/outputOozie/22/year=2016/month=9/1
data/outputOozie/22/year=2016/month=9/2
data/outputOozie/22/year=2016/month=9/3
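
Note the layout: the Parquet data sits one directory level below the month=... partitions (the numeric subfolders). A minimal sketch of reading that layout with a glob, assuming Spark 1.6+ where the Parquet source accepts a basePath option (the paths are the ones listed above):

import org.apache.spark.sql.hive.HiveContext

// Glob one extra level below the month partitions; "basePath" keeps
// partition discovery anchored so year/month still become columns.
val base = "data/outputOozie/22"
val df = hiveContext.read
  .option("basePath", base)
  .parquet(s"$base/year=2016/month=*/*")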

Here is how I initiate the Hive context:

import java.net.URLDecoder

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf().setAppName("Extraction process for ").setIfMissing("spark.master", "local[*]")
val sc = SparkContext.getOrCreate(conf)
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
sc.hadoopConfiguration.set("hive.mapred.supports.subdirectories", "true")
// Instantiate the HiveContext directly rather than casting an existing
// SQLContext, which would throw a ClassCastException.
val hiveContext = new HiveContext(sc)
hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")
hiveContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
hiveContext.setConf("mapred.input.dir.recursive", "true")
hiveContext.setConf("hive.mapred.supports.subdirectories", "true")
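
For context, the window functions are the whole reason a HiveContext is needed here. A minimal sketch of the kind of HiveQL query it enables (the table and column names are hypothetical):

// Hypothetical table/column names; this only illustrates the HiveQL
// window-function support that motivates using a HiveContext.
val ranked = hiveContext.sql(
  """SELECT *,
    |       row_number() OVER (PARTITION BY year, month ORDER BY some_col) AS rn
    |FROM some_table""".stripMargin)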

Reading the files:

hiveContext.read.parquet(URLDecoder.decode(partitionLocation.get.toString, "UTF-8"))  ==> Exception: file not found
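
One way to narrow down the "file not found" is to check what the decoded string actually resolves to on the underlying file system before handing it to the reader. A debugging sketch, reusing the partitionLocation value from above:

import java.net.URLDecoder
import org.apache.hadoop.fs.{FileSystem, Path}

// Print what the decoded path resolves to; a typo or an unexpected
// scheme (file:// vs hdfs://) would show up here.
val decoded = URLDecoder.decode(partitionLocation.get.toString, "UTF-8")
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.listStatus(new Path(decoded)).foreach(status => println(status.getPath))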

But the same read works fine with a plain SQLContext:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
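
That is, the identical call goes through when issued against this context (same partitionLocation as above):

// Per the question, this call succeeds where the HiveContext one fails.
val df = sqlContext.read.parquet(
  URLDecoder.decode(partitionLocation.get.toString, "UTF-8"))
df.printSchema()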

Thanks for any suggestions!

0 Answers:

No answers yet.