Problem summary: Despite setting the required Hadoop configuration (see the attempts below), I am unable to read nested subdirectories from my Spark program. I get the error pasted at the bottom.
Any help is appreciated.
Version: Spark 2.2.0
Input directory layout:
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=1502939225073/part-00000-3a44cd00-e895-4a01-9ab9-946064b739d4-c000.parquet
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=1502939234036/part-00000-cbd47353-0590-4cc1-b10d-c18886df1c25-c000.parquet
...
Input directory argument passed:
/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/*/*
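For context, the glob is consumed roughly like this (an illustrative, spark-shell style sketch; the variable names are placeholders, not my exact code):
import org.apache.spark.sql.SparkSession
//Sketch only: pass the glob straight to the Parquet reader
val spark = SparkSession.builder().master("yarn").getOrCreate()
val inputGlob = "/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/*/*"
val parsedDF = spark.read.parquet(inputGlob)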
Attempt (1):
Set the parameter in code...
import org.apache.spark.sql.SparkSession
val sparkSession: SparkSession = SparkSession.builder().master("yarn").getOrCreate()
//Recursive glob support & log level
import sparkSession.implicits._
sparkSession.sparkContext.hadoopConfiguration.setBoolean("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", true)
The configuration did not show up in the Spark UI.
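Since the Spark UI may only surface Spark properties and not the Hadoop configuration, here is a diagnostic sketch for confirming what actually landed in the Hadoop configuration (Spark is documented to copy spark.hadoop.* entries into the Hadoop configuration with the prefix stripped, so I check both keys; this is illustrative, not part of my attempts):
//Diagnostic sketch: read the effective values back from the driver's Hadoop configuration
val hc = sparkSession.sparkContext.hadoopConfiguration
println(s"prefixed:   ${hc.get("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive")}")
println(s"unprefixed: ${hc.get("mapreduce.input.fileinputformat.input.dir.recursive")}")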
Attempt (2):
Passed the configuration from the CLI via spark-submit, and also set it in code (see below).
spark-submit --conf spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true \...
I do see the configuration in the Spark UI, but get the same error - it cannot traverse the directory structure.
Code:
//Spark Session
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val sparkSession: SparkSession = SparkSession.builder().master("yarn").getOrCreate()
//Recursive glob support
val conf = new SparkConf()
val cliRecursiveGlobConf = conf.get("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive")
import sparkSession.implicits._
sparkSession.sparkContext.hadoopConfiguration.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", cliRecursiveGlobConf)
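Separately, since the batch_id=... directories follow the Hive-style key=value convention, an alternative I have been considering (a sketch assuming standard Parquet partition discovery, not one of the attempts above) is to point the reader at the parent directory and let partition discovery handle the nesting:
//Sketch: rely on Parquet partition discovery instead of a recursive glob
val baseDir = "/user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format"
val flatDF = sparkSession.read.parquet(baseDir)
flatDF.printSchema() //batch_id should appear as a partition column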
Error & overall output:
The full error is at - https://gist.github.com/airawat/77fbdb821410a5a87dfd29ffaf60fdf9
17/08/18 15:59:29 INFO state.StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" java.io.FileNotFoundException: File /user/akhanolk/data/myq/parsed/myq-app-logs/to-be-compacted/flat-view-format/batch_id=*/* does not exist.