How to read a directory containing folders of JSON files: Spark Scala

Date: 2016-07-05 14:42:51

Tags: json scala apache-spark spark-dataframe

I keep getting the error below when reading from a directory (a folder containing JSON files). I used:

// sc : An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.jsonFile("s3://testData")
df.show()

Error:

java.io.IOException: No input paths specified in job
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:173)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:279)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)

My file system layout is as follows:

testData is a directory with 3 subfolders (00, 01, 02), each containing one file:

testData/00/temp1.json.gz 
testData/01/temp2.json.gz 
testData/02/temp3.json.gz  

I am using Spark 1.5. Is there something wrong with the way I am reading the data?

1 Answer:

Answer 0 (score: 0)

It is not the most efficient approach, but you can use sqlContext.jsonFile("s3://testData/*/*").
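
A minimal sketch of that approach, assuming the s3://testData layout shown in the question. It uses sqlContext.read.json, which replaced the deprecated jsonFile in Spark 1.4+, and both accept Hadoop-style glob patterns:

// sc: an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// The glob "*/*" matches testData/00/temp1.json.gz, testData/01/temp2.json.gz, etc.
// .json.gz files are decompressed transparently by Hadoop's gzip codec.
val df = sqlContext.read.json("s3://testData/*/*")

df.printSchema()
df.show()

Note that gzip is not a splittable codec, so each .json.gz file becomes a single partition; with many small files, listing and reading them from S3 can dominate the job time, which is why this is not the most efficient layout.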