Spark reading files from Blob Storage: "java.io.IOException: No FileSystem for scheme: https"

Asked: 2019-09-23 10:42:13

Tags: apache-spark apache-spark-sql azure-blob-storage apache-spark-xml

Currently, I am using the azure-storage-blob and hadoop-azure packages to download files from Blob Storage to the local machine.

...
String url = "https://blob_storage_url";

String filename = url.replaceFirst("https.*/", "");

// Setup the cloud storage account
String storageConnectionString = "...";
CloudStorageAccount account = CloudStorageAccount.parse(storageConnectionString);

// Create a blob service client
CloudBlobClient blobClient = account.createCloudBlobClient();

// Get a reference to a container
CloudBlobContainer container = blobClient.getContainerReference(containerName);

for (ListBlobItem blobItem : container.listBlobs(filename)) {
    // If the item is a blob, not a virtual directory
    if (blobItem instanceof CloudBlockBlob) {
        // Download the file
        CloudBlockBlob retrievedBlob = (CloudBlockBlob) blobItem;
        retrievedBlob.downloadToFile(filename);
    }
}
...

The downloaded files are XML files, and I then have to process the content of each one. For that I use the spark-xml_2.11 (com.databricks.spark.xml) package.

StructType schema = new StructType()
    .add("attr1", DataTypes.StringType, false)
    .add("attr2", DataTypes.IntegerType, false)
    ... other_structFields_or_structTypes;

Dataset<Row> dataset = sparkSession.read()
    .format("com.databricks.spark.xml")
    .schema(schema)
    .load(filename);

The load() method expects a path to data backed by a local or distributed file system. So is there an option to load the files directly from Blob Storage?

I found this guide: https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html. However, its first option, mounting an Azure Blob Storage container to DBFS, requires a Databricks cluster.

For the second option, "Access Azure Blob Storage directly", I set the account access key before reading the file:

sparkSession.sparkContext().hadoopConfiguration().set(
    "fs.azure.account.key.<my-storage-account-name>.blob.core.windows.net",
    "<my-storage-account-access-key>"
);

StructType schema = new StructType()
    .add("attr1", DataTypes.StringType, false)
    .add("attr2", DataTypes.IntegerType, false)
    ... other_structFields_or_structTypes;

Dataset<Row> dataset = sparkSession.read()
    .format("com.databricks.spark.xml")
    .schema(schema)
    .load(filename); // I also tried with the full URL

But the following exception was thrown:

"java.io.IOException: No FileSystem for scheme: https". 
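For context, Hadoop file systems are selected by URI scheme, and the hadoop-azure connector addresses blobs as wasb:// or wasbs:// URIs of the form wasbs://<container>@<account>.blob.core.windows.net/<path> rather than as https:// URLs. The following is only a minimal sketch of that conversion, assuming the URL has the usual https://<account>.blob.core.windows.net/<container>/<path> layout; the class name WasbsPath, the helper toWasbs, and the sample account and container names are hypothetical and not part of any library.

```java
import java.net.URI;

// Hypothetical helper (not part of hadoop-azure or spark-xml).
class WasbsPath {

    // Converts https://<account>.blob.core.windows.net/<container>/<blobPath>
    // into    wasbs://<container>@<account>.blob.core.windows.net/<blobPath>.
    static String toWasbs(String httpsUrl) {
        URI uri = URI.create(httpsUrl);
        String host = uri.getHost();
        String path = uri.getPath();            // "/<container>/<blobPath>"
        int slash = (path == null) ? -1 : path.indexOf('/', 1);
        if (host == null || slash < 0) {
            throw new IllegalArgumentException(
                "Expected https://<account>.blob.core.windows.net/<container>/<blob>: " + httpsUrl);
        }
        String container = path.substring(1, slash);
        String blobPath = path.substring(slash); // keeps the leading "/"
        return "wasbs://" + container + "@" + host + blobPath;
    }

    public static void main(String[] args) {
        // Sample values are placeholders, not a real storage account.
        System.out.println(toWasbs(
            "https://myaccount.blob.core.windows.net/mycontainer/data/file.xml"));
    }
}
```

The resulting string is what would be passed to load() in place of the https URL; whether Spark can then resolve it depends on the connector being available at runtime.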

I also tried changing the scheme to wasbs, but a similar exception was thrown:

"java.io.IOException: No FileSystem for scheme: wasbs".
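A "No FileSystem for scheme: wasbs" error typically means Hadoop could not map the scheme to an implementation class, for example because the hadoop-azure jar is not visible to Spark at runtime or its file system registration was lost when building a fat jar. One thing that may be worth trying is naming the implementation classes explicitly. The sketch below only collects the relevant settings in a map; the class WasbSettings and its method forAccount are hypothetical names, and it assumes the hadoop-azure classes org.apache.hadoop.fs.azure.NativeAzureFileSystem and its nested Secure class are on the classpath of the driver and executors.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that gathers the Hadoop properties for wasb/wasbs access.
class WasbSettings {

    static Map<String, String> forAccount(String accountName, String accessKey) {
        Map<String, String> settings = new LinkedHashMap<>();
        // Explicit scheme-to-class mappings (classes shipped in hadoop-azure).
        settings.put("fs.wasb.impl",
            "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
        settings.put("fs.wasbs.impl",
            "org.apache.hadoop.fs.azure.NativeAzureFileSystem$Secure");
        // Same account key property as in the question.
        settings.put("fs.azure.account.key." + accountName + ".blob.core.windows.net",
            accessKey);
        return settings;
    }

    public static void main(String[] args) {
        // Placeholder values; substitute the real account name and key.
        forAccount("<my-storage-account-name>", "<my-storage-account-access-key>")
            .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The entries could then be applied with `WasbSettings.forAccount(name, key).forEach(sparkSession.sparkContext().hadoopConfiguration()::set);` before calling load() with a wasbs:// path. This is a sketch under the stated assumptions, not a confirmed fix for this setup.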

Any suggestions or comments?

0 answers:

No answers yet