I have built a recommendation system with Apache Spark. The dataset is currently stored locally in my project folder, and I now need to access those files from HDFS.
How can I read files from HDFS using Spark?
This is how I initialize my Spark session:
SparkContext context = new SparkContext(new SparkConf().setAppName("spark-ml").setMaster("local")
.set("fs.default.name", "hdfs://localhost:54310").set("fs.defaultFS", "hdfs://localhost:54310"));
Configuration conf = context.hadoopConfiguration();
conf.addResource(new Path("/usr/local/hadoop-3.1.2/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop-3.1.2/etc/hadoop/hdfs-site.xml"));
conf.set("fs.hdfs.impl", "org.apache.hadoop.hdfs.DistributedFileSystem");
conf.set("fs.hdfs.impl", "org.apache.hadoop.fs.LocalFileSystem");
this.session = SparkSession.builder().sparkContext(context).getOrCreate();
System.out.println(conf.getRaw("fs.default.name"));
System.out.println(context.getConf().get("fs.defaultFS"));
Both of these print hdfs://localhost:54310, which is the correct URI for my HDFS.
When I try to read a file from HDFS:
session.read().option("header", true).option("inferSchema", true).csv("hdfs://localhost:54310/recommendation_system/movies/ratings.csv").cache();
I get this error:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:54310/recommendation_system/movies/ratings.csv, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:730)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:86)
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:636)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:930)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:631)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:454)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:65)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:281)
at org.apache.hadoop.fs.Globber.glob(Globber.java:149)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:2034)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:257)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:253)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:253)
at scala.Option.getOrElse(Option.scala:138)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:251)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.java.JavaRDDLike.collect(JavaRDDLike.scala:361)
at org.apache.spark.api.java.JavaRDDLike.collect$(JavaRDDLike.scala:360)
at org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
at com.dastamn.sparkml.analytics.SparkManager.<init>(SparkManager.java:36)
at com.dastamn.sparkml.Main.main(Main.java:22)
How can I fix this?
Answer 0 (score: 1)
A couple of things about the pasted snippet:
1. When Hadoop properties have to be set as part of the SparkConf, they must be prefixed with spark.hadoop.; in this case the key fs.default.name needs to be set as spark.hadoop.fs.default.name, and likewise for the other properties (see the sketch after this list).
2. The argument to the csv function does not need to spell out the HDFS endpoint; Spark will figure it out from the default filesystem property, since it is already set:
session.read().option("header", true).option("inferSchema", true).csv("/recommendation_system/movies/ratings.csv").cache();
If the default filesystem property is not set as part of the Hadoop configuration, the full URI is required so that Spark/Hadoop can figure out which filesystem to use. (Also, the object named conf is never actually used.)
3. In the case above, it looks like Hadoop could not find a filesystem implementation for the hdfs:// URI prefix and fell back to the default filesystem, which here is the local one (it processes the path with RawLocalFileSystem). Make sure the jar that provides org.apache.hadoop.hdfs.DistributedFileSystem is on the classpath so that the FS object for HDFS can be instantiated.
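To make these points concrete, here is a minimal sketch (not taken from the original answer): it assumes the NameNode from the question really listens at hdfs://localhost:54310, that the HDFS client jars are on the classpath, and that the dataset path from the question is used; everything else is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hadoop settings passed through SparkConf need the "spark.hadoop." prefix (point 1).
SparkConf sparkConf = new SparkConf()
        .setAppName("spark-ml")
        .setMaster("local[*]")
        .set("spark.hadoop.fs.defaultFS", "hdfs://localhost:54310");

SparkSession session = SparkSession.builder().config(sparkConf).getOrCreate();

// Sanity check: this should print org.apache.hadoop.hdfs.DistributedFileSystem.
// If it prints a local filesystem class instead, the HDFS client classes are
// missing from the classpath (point 3).
Configuration hadoopConf = session.sparkContext().hadoopConfiguration();
FileSystem fs = FileSystem.get(hadoopConf);
System.out.println(fs.getClass().getName());

// With fs.defaultFS pointing at HDFS, a plain path is enough; no
// hdfs://host:port prefix is needed (point 2).
Dataset<Row> ratings = session.read()
        .option("header", true)
        .option("inferSchema", true)
        .csv("/recommendation_system/movies/ratings.csv")
        .cache();

Note that fs.default.name is the older, deprecated name for fs.defaultFS, so setting only the spark.hadoop.fs.defaultFS form is usually enough.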
Answer 1 (score: 0)
Here is the configuration that solved the problem:
SparkContext context = new SparkContext(new SparkConf().setAppName("spark-ml").setMaster("local[*]")
.set("spark.hadoop.fs.default.name", "hdfs://localhost:54310").set("spark.hadoop.fs.defaultFS", "hdfs://localhost:54310")
.set("spark.hadoop.fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName())
.set("spark.hadoop.fs.hdfs.server", org.apache.hadoop.hdfs.server.namenode.NameNode.class.getName())
.set("spark.hadoop.conf", org.apache.hadoop.hdfs.HdfsConfiguration.class.getName()));
this.session = SparkSession.builder().sparkContext(context).getOrCreate();
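For completeness, a hypothetical usage example built on that session (reusing the ratings path from the question; printSchema and count are only there to confirm the read succeeds):

Dataset<Row> ratings = session.read()
        .option("header", true)
        .option("inferSchema", true)
        .csv("hdfs://localhost:54310/recommendation_system/movies/ratings.csv")
        .cache();
ratings.printSchema();
System.out.println("Row count: " + ratings.count());

Since spark.hadoop.fs.defaultFS is set in this configuration, the shorter path "/recommendation_system/movies/ratings.csv" should work as well.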