I am trying to use HiveContext to take advantage of some window functions that HiveQL provides in Spark SQL. But it cannot read my data files recursively from the folder tree (folders partitioned by year and month).
My folders:
data/outputOozie/22/year=2016
data/outputOozie/22/year=2016/month=10
data/outputOozie/22/year=2016/month=9
data/outputOozie/22/year=2016/month=10/1
data/outputOozie/22/year=2016/month=10/2
data/outputOozie/22/year=2016/month=10/3
data/outputOozie/22/year=2016/month=9/1
data/outputOozie/22/year=2016/month=9/2
data/outputOozie/22/year=2016/month=9/3
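Note that the actual parquet files presumably sit inside the numbered subfolders, one level below the month= partition directories. To double-check what is on disk, I use a small listing sketch like the one below (my own debugging aid, not part of the job; it uses the SparkContext created further down and the dataset root above):

import org.apache.hadoop.fs.{FileSystem, Path}

// List everything under the dataset root recursively to confirm where the
// .parquet files actually live
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listFiles(new Path("data/outputOozie/22"), true) // true = recursive
while (files.hasNext) println(files.next().getPath)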
Here is how I initialize my Hive context:
val conf = new SparkConf().setAppName("Extraction process for ").setIfMissing("spark.master", "local[*]")
val sc = SparkContext.getOrCreate(conf)
// Hadoop-level settings: no _SUCCESS markers, no Parquet summary files,
// and recursive traversal of input directories
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
sc.hadoopConfiguration.set("hive.mapred.supports.subdirectories", "true")
val hiveContext = sqlContext.asInstanceOf[HiveContext]
// Mirror the recursive-read settings on the Hive context
hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")
hiveContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
hiveContext.setConf("mapred.input.dir.recursive", "true")
hiveContext.setConf("hive.mapred.supports.subdirectories", "true")
Reading the files:
hiveContext.read.parquet(URLDecoder.decode(partitionLocation.get.toString, "UTF-8")) ==> Exception: file not found
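For what it's worth, the workaround I have been experimenting with is to point the reader at the dataset root and glob down to the numbered subfolders. The basePath option and the glob pattern below are my own attempt, not part of the original job:

// Sketch: basePath tells partition discovery where the year=/month= columns
// start, and the trailing glob reaches the numbered subfolders with the files
val df = hiveContext.read
  .option("basePath", "data/outputOozie/22")
  .parquet("data/outputOozie/22/year=*/month=*/*")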
But it works fine with a plain SQLContext:
val sqlContext = new SQLContext(sc)
// Same compression and recursive-read settings as on the Hive context
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")
Thanks for any suggestions!