I am trying to pass a file with the spark-submit command and then read it across the nodes. The Spark code is:
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf().setAppName(jobName);
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
// System.out.println("The sqlfile received is ::" + sqlScript);

// Read the distributed file as an RDD, using the path resolved by SparkFiles
JavaRDD<String> fil = sparkContext.textFile(SparkFiles.get(sqlScript));

// Lines for debugging
String sparkFile = SparkFiles.get(sqlScript);
System.out.println("The path of sqlfile received from SparkFiles is ::" + sparkFile);
System.out.println("The root dir SparkFiles is ::" + SparkFiles.getRootDirectory());

List<String> strlst = fil.collect();
for (String str : strlst) {
    System.out.println("The string is :::");
    System.out.println(str);
}
The spark-submit command is:
spark-submit --class com.load.Test --master yarn --deploy-mode cluster --queue queueName --executor-memory 8G --conf spark.yarn.executor.memoryOverhead=2048 --conf spark.network.timeout=5000s --files /path/of/file.hql /path/of/jar/Load-0.0.2-SNAPSHOT.jar LoadJob file.hql
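For context, my understanding is that --files ships file.hql into each container's working directory and that SparkFiles.get resolves the node-local copy, so a plain local read should work. A minimal sketch of that local-read pattern (the class name and hard-coded file name are illustrative, not from my actual job):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkFilesLocalRead {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("sparkfiles-sketch"));
        // SparkFiles.get returns a path on this node's local filesystem for a
        // file shipped with --files; java.nio reads it without touching HDFS/maprfs.
        String localPath = SparkFiles.get("file.hql");
        List<String> lines = Files.readAllLines(Paths.get(localPath));
        lines.forEach(System.out::println);
        sc.stop();
    }
}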
Even though SparkFiles shows the path, the file cannot be read (the paths in the logs have been altered for confidentiality):
> Log Contents:
> The sqlfile received is ::file.hql
> The path of sqlfile received from SparkFiles is ::/tmp/hadoop-mapr/nm-local-dir/appcache/application_1495017060619XXX/file.hql
> The root dir SparkFiles is ::/tmp/hadoop-mapr/nm-local-dir/usercache/application_1495017060619_294966/spark-cd981514-4f
> End of LogType:stdout
>
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> maprfs:/tmp/hadoop-mapr/nm-local-dir/spark-cd981514-4f42-490a-aacd-3c4a9a417319/userFiles-7297d880-99a5-4a6e-8464-9c220bba77c3/file.hql
> at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:289)
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:317)
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)