Question

I am trying to read a file from Azure Data Lake Gen2 using pyspark. A sample URL for the file is https://dummmy_store_name.dfs.core.windows.net/gen2/test_folder/error.txt.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('abc').getOrCreate()
client_id = "<client_id>"
client_secret_key = "<secret_key>"
refresh_url = "https://login.microsoftonline.com/<tenant_id>/oauth2/token"
spark_context = spark.sparkContext

spark_context._jsc.hadoopConfiguration().set("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
spark_context._jsc.hadoopConfiguration().set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
spark_context._jsc.hadoopConfiguration().set("dfs.adls.oauth2.client.id", client_id)
spark_context._jsc.hadoopConfiguration().set("dfs.adls.oauth2.credential", client_secret_key)
spark_context._jsc.hadoopConfiguration().set("dfs.adls.oauth2.refresh.url", refresh_url)


tf = spark.read.text("https://dummmy_store_name.dfs.core.windows.net/gen2/test_folder/error.txt")
print('\n\n=========\n\n\n {} \n\n\n========='.format(tf.count()))

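I have already tried this code (with minor changes) on a file in Azure Data Lake Gen1, and it works fine there. The difference is that Gen1 file URLs start with adl, e.g. adl://<store_name>.datalake.......... Because I added the packages that support the adl file system to spark-submit, that version of the code ran fine: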

spark-submit --master local[1] --packages com.microsoft.azure:azure-data-lake-store-sdk:2.0.11,org.apache.hadoop:hadoop-azure-datalake:3.0.0-alpha2,com.databricks:spark-xml_2.11:0.4.1 ./spark_azure_gen2.py

When I run spark-submit for the Gen2 code, I get the error below. Which packages should I include to support the https file system?

java.io.IOException: No FileSystem for scheme: https
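
One possible direction, not verified against this setup: Hadoop does not ship a FileSystem for the https scheme, and Gen2 is normally addressed through the ABFS connector in hadoop-azure using abfss:// URIs of the form abfss://<container>@<account>.dfs.core.windows.net/<path>, with fs.azure.* OAuth settings in place of the dfs.adls.* ones. The sketch below only builds the path and the settings. The helper names https_to_abfss and abfs_oauth_conf are made up for illustration, and it assumes the first path segment of the URL (gen2 here) is the container.

```python
from urllib.parse import urlparse


def https_to_abfss(https_url):
    """Rewrite https://<account>.dfs.core.windows.net/<container>/<path>
    as abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    parsed = urlparse(https_url)
    if parsed.scheme != "https" or not parsed.netloc.endswith(".dfs.core.windows.net"):
        raise ValueError("not an ADLS Gen2 DFS endpoint URL: " + https_url)
    # Assumption: the first path segment is the container (file system).
    container, _, path = parsed.path.lstrip("/").partition("/")
    if not container:
        raise ValueError("URL has no container segment: " + https_url)
    return "abfss://{}@{}/{}".format(container, parsed.netloc, path)


def abfs_oauth_conf(account_host, client_id, client_secret, tenant_id):
    """Build the per-account ABFS OAuth (client credentials) Hadoop settings."""
    suffix = "." + account_host
    return {
        "fs.azure.account.auth.type" + suffix: "OAuth",
        "fs.azure.account.oauth.provider.type" + suffix:
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id" + suffix: client_id,
        "fs.azure.account.oauth2.client.secret" + suffix: client_secret,
        "fs.azure.account.oauth2.client.endpoint" + suffix:
            "https://login.microsoftonline.com/{}/oauth2/token".format(tenant_id),
    }


url = "https://dummmy_store_name.dfs.core.windows.net/gen2/test_folder/error.txt"
abfss_path = https_to_abfss(url)
conf = abfs_oauth_conf("dummmy_store_name.dfs.core.windows.net",
                       "<client_id>", "<secret_key>", "<tenant_id>")
print(abfss_path)
```

In the script above, each pair in conf would be passed to spark_context._jsc.hadoopConfiguration().set(key, value) and the file read with spark.read.text(abfss_path). That presumably also requires a hadoop-azure build that includes the ABFS connector (Hadoop 3.2.0 or later) on the classpath instead of hadoop-azure-datalake; the exact package versions are left open here.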

0 answers: