Unable to load a file from HDFS into a Spark DataFrame

Date: 2016-07-31 19:46:40

Tags: scala apache-spark hdfs spark-dataframe

I have a CSV file stored in my local Windows HDFS (hdfs://localhost:54310), under the path /tmp/home/. I want to load this file from HDFS into a Spark DataFrame, so I tried this:

val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()

and then:

val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import spark.implicits._

spark.sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(path)
  .show()

But it fails at runtime with the following exception stack trace:

Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)

C:/test/sampleApp/ is the path where my sample project lives, but I have specified an HDFS path.

Also, the exact same path works perfectly fine with a plain RDD:
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file

I found and tried this, but no luck :(

What am I missing? Why is Spark looking at my local file system and not HDFS?

I am using Spark 2.0 with Scala 2.11 on hadoop-hdfs 2.7.2.

Edit: I tried downgrading to Spark 1.6.2 and was able to get it to work, so I think this is a bug in Spark 2.0.
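
For reference, the working read under Spark 1.6.2 looked roughly like the following. This is only a sketch: it assumes the com.databricks:spark-csv package is on the classpath and reuses the same masterName/appName values from the question.

// Spark 1.6.x style: build a SparkContext, wrap it in a SQLContext,
// then read the CSV through the spark-csv data source.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setMaster(masterName).setAppName(appName)
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://localhost:54310/tmp/home/mycsv.csv")
  .show()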

1 Answer:

Answer 0: (score: 0)

Just to close the loop: this appears to be an issue in Spark 2.0, and a ticket has been filed.

https://issues.apache.org/jira/browse/SPARK-15899
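
A workaround often suggested for this symptom on Windows (an assumption here, not taken from the ticket's resolution) is to point spark.sql.warehouse.dir at a well-formed file: URI when building the SparkSession, for example:

// Sketch of a commonly reported workaround: give the warehouse dir a proper
// file:/// URI so the session catalog does not build a "file:C:/..." path.
// The directory itself (C:/tmp/spark-warehouse) is an arbitrary example.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master(masterName)
  .appName(appName)
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
  .getOrCreate()

With the warehouse directory set explicitly, the hdfs:// read from the question should no longer trip over the malformed local path.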