Problem reading a CSV with Spark

Date: 2016-11-02 10:16:28

Tags: scala apache-spark

I'm seeing strange behavior in the spark-shell: if the working directory is /, I cannot read a CSV file, but from any other directory it works:

trehiou@cds-stage-ms4 ~> docker run -it --rm -v /data/spark-test:/mnt/data localhost:5000/spark-master:2.0.1 sh
/ # /app/spark-2.0.1/bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/02 10:01:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/02 10:01:59 WARN spark.SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://172.17.0.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1478080919699).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.option("header", "true").option("inferSchema", "true").csv("file:///mnt/data/test.csv").printSchema()
java.io.IOException: No FileSystem for scheme: null
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:115)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
  at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
  at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
  at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
  at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
  at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
  at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:413)
  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:349)
  ... 48 elided

/ # cd /mnt/data/
/mnt/data # /app/spark-2.0.1/bin/spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
16/11/02 10:02:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/11/02 10:02:26 WARN spark.SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://172.17.0.1:4040
Spark context available as 'sc' (master = local[*], app id = local-1478080946728).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.option("header", "true").option("inferSchema", "true").csv("file:///mnt/data/test.csv").printSchema()
root
 |-- 2MASS: string (nullable = true)
 |-- RAJ2000: double (nullable = true)
 |-- DEJ2000: double (nullable = true)
 |-- errHalfMaj: double (nullable = true)
 |-- errHalfMin: double (nullable = true)
 |-- errPosAng: integer (nullable = true)
 |-- Jmag: double (nullable = true)
 |-- Hmag: double (nullable = true)
 |-- Kmag: double (nullable = true)
 |-- e_Jmag: double (nullable = true)
 |-- e_Hmag: double (nullable = true)
 |-- e_Kmag: double (nullable = true)
 |-- Qfl: string (nullable = true)
 |-- Rfl: integer (nullable = true)
 |-- X: integer (nullable = true)
 |-- MeasureJD: double (nullable = true)

I would like to understand why this happens, because it makes no sense. In this example the working directory happens to be where the file I am reading is stored, but I also tested with other random paths and it works as well. Could this be a bug I should file a report for?

EDIT: the behavior is exactly the same when reading from HDFS.

EDIT 2: I get exactly the same behavior when launching a job via spark-submit.

1 Answer:

Answer 0 (score: 1)

The default value of spark.sql.warehouse.dir is {working-dir}/spark-warehouse. When you run from the root directory /, this resolves to //spark-warehouse, and when hadoopPath.getFileSystem(hadoopConf) is called on such a path (in SessionCatalog.makeQualifiedPath), Hadoop cannot recognize the scheme.
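This matches the stack trace above (Path.getFileSystem leading into FileSystem.getFileSystemClass): a path that starts with "//" is parsed by Hadoop as a URI whose authority is "spark-warehouse" and whose scheme is null. A minimal sketch of that failure outside Spark, assuming only hadoop-common on the classpath (the object name SchemeDemo is just for illustration):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object SchemeDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()

    // A single leading slash leaves both scheme and authority null,
    // so Hadoop falls back to the default (local) filesystem.
    val good = new Path("/spark-warehouse")
    println(good.getFileSystem(conf)) // resolves to the default (local) filesystem

    // A double leading slash is parsed as a URI authority with a null scheme,
    // which FileSystem cannot resolve.
    val bad = new Path("//spark-warehouse")
    println(bad.toUri.getAuthority)   // "spark-warehouse"
    bad.getFileSystem(conf)           // java.io.IOException: No FileSystem for scheme: null
  }
}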

The workaround is simple: start your spark-shell with this parameter overridden to some sensible value that starts with file:/, for example:

/app/spark-2.0.1/bin/spark-shell --conf spark.sql.warehouse.dir=file:/tmp/spark-warehouse
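The same override can also be applied from application code, which is relevant for the spark-submit case mentioned in the second edit. A minimal sketch, reusing the CSV path from the question (the object name CsvSchemaApp is only illustrative):

import org.apache.spark.sql.SparkSession

object CsvSchemaApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-schema")
      // Point the warehouse at an explicit file: URI so the working
      // directory of the driver no longer matters.
      .config("spark.sql.warehouse.dir", "file:/tmp/spark-warehouse")
      .getOrCreate()

    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("file:///mnt/data/test.csv")
      .printSchema()

    spark.stop()
  }
}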

This unpleasant behavior is probably related to this open issue, but I'm not sure; it may also be a separate problem, since it does work fine for non-root values.