Question

We have a custom file system class that extends hadoop.fs.FileSystem. The URI scheme for this file system is abfs:///. An external Hive table has been created over this data.

CREATE EXTERNAL TABLE testingCustomFileSystem (a string, b int, c double) PARTITIONED BY dt
STORED AS PARQUET
LOCATION 'abfs://<host>:<port>/user/name/path/to/data/'

After logging in through beeline, I can query the table and it returns results.

Now I am trying to load the same table into a Spark DataFrame using spark.table('testingCustomFileSystem'), and it throws the following exception:

    java.io.IOException: No FileSystem for scheme: abfs
  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2586)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
  at org.apache.spark.sql.execution.datasources.CatalogFileIndex$$anonfun$2.apply(CatalogFileIndex.scala:77)
  at org.apache.spark.sql.execution.datasources.CatalogFileIndex$$anonfun$2.apply(CatalogFileIndex.scala:75)
  at scala.collection.immutable.Stream.map(Stream.scala:418)

The jar containing CustomFileSystem (which defines the abfs:// scheme) is loaded on the classpath and is available.

How does spark.table parse the Hive table definition from the metastore and resolve the URI?

Answer 1

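After looking into the configuration options in Spark, I happened to notice that setting the following Hadoop configuration resolves the problem.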

hadoopConfiguration.set("fs.abfs.impl",<fqcn of the FileSystemImplementation>)
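For context, the exception comes from Hadoop's scheme-to-class lookup, which (as far as I understand it) checks the `fs.<scheme>.impl` key first, then implementations registered through the service loader, and fails otherwise. A jar on the classpath is presumably not enough unless one of those two registrations exists. The sketch below imitates that lookup in plain Scala without Hadoop; `resolveFileSystemClass`, the map-based configuration, and `com.example.fs.CustomFileSystem` are hypothetical stand-ins, not Hadoop's API.

```scala
import java.io.IOException

object SchemeLookupSketch {
  // Hypothetical imitation of the lookup order; not Hadoop's real code.
  // 1. explicit "fs.<scheme>.impl" configuration key
  // 2. implementations registered via a service file
  // 3. otherwise fail with the message seen in the stack trace
  def resolveFileSystemClass(
      scheme: String,
      conf: Map[String, String],
      serviceRegistered: Map[String, String] = Map.empty): String =
    conf.get(s"fs.$scheme.impl")
      .orElse(serviceRegistered.get(scheme))
      .getOrElse(throw new IOException(s"No FileSystem for scheme: $scheme"))

  def main(args: Array[String]): Unit = {
    val withoutImpl = Map.empty[String, String]
    // Hypothetical class name standing in for the real fqcn.
    val withImpl = Map("fs.abfs.impl" -> "com.example.fs.CustomFileSystem")

    // No key and no service registration: the lookup fails.
    try resolveFileSystemClass("abfs", withoutImpl)
    catch { case e: IOException => println(e.getMessage) }

    // With the key set, the lookup returns the configured class name.
    println(resolveFileSystemClass("abfs", withImpl))
  }
}
```

This is why setting `fs.abfs.impl` is enough to get past the error: the first step of the lookup succeeds before any fallback is consulted.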

In Spark, this setting is applied when the SparkSession is created (here only appName and master are set), like:

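import org.apache.spark.sql.SparkSession

// SparkSession.Builder exposes appName/master; setAppName/setMaster belong to SparkConf.
val spark = SparkSession
            .builder()
            .appName("Name")
            .master("yarn")
            .getOrCreate()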

spark.sparkContext
     .hadoopConfiguration.set("fs.abfs.impl",<fqcn of the FileSystemImplementation>)

That worked!
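An alternative that may be worth trying: Spark copies properties prefixed with `spark.hadoop.` into the Hadoop configuration, so the same key can be supplied while the session is being built (or through `--conf` on spark-submit) instead of being set afterwards. In this sketch the class name is a hypothetical placeholder for the real fqcn, and `enableHiveSupport()` is an assumption about how the session reaches the Hive metastore.

```scala
import org.apache.spark.sql.SparkSession

object AbfsTableExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Name")
      .master("yarn")
      // Hypothetical class name; substitute the real FileSystem implementation.
      .config("spark.hadoop.fs.abfs.impl", "com.example.fs.CustomFileSystem")
      // Assumption: Hive support is needed to read the metastore table.
      .enableHiveSupport()
      .getOrCreate()

    // The table location is resolved with the Hadoop configuration set above.
    val df = spark.table("testingCustomFileSystem")
    df.printSchema()

    spark.stop()
  }
}
```

Setting the key at build time means it is in place before the first query touches the table location, which avoids depending on when the Hadoop configuration is mutated.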

spark.table fails with java.io.IOException: No FileSystem for scheme: abfs
