I am trying to read an XML file in a PySpark3 Jupyter notebook (running in Azure).
I have this code:
df = spark.read.load("wasb:///data/test/Sample Data.xml")
But I keep getting the error java.io.IOException: Could not read footer for file:
An error occurred while calling o616.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task 0.3 in stage 10.0 (TID 43, wn2-xxxx.cloudapp.net, executor 2): java.io.IOException: Could not read footer for file: FileStatus{path=wasb://xxxx.blob.core.windows.net/data/test/Sample Data.xml; isDirectory=false; length=6947; replication=0; blocksize=0; modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}
From the length I can tell it is reaching the file - it matches the XML file's size - but it gets stuck after that?
Any ideas?
Thanks.
Answer 0 (score: 1)
The error occurs because spark.read.load without an explicit format defaults to Parquet, so Spark tries to read a Parquet footer from your XML file and fails. You need to use the spark-xml data source instead. Please refer to the two blog posts below; I think they can fully answer your question.
The code is as follows.
from pyspark.sql import SparkSession

session = SparkSession.builder.getOrCreate()
session.conf.set(
    "fs.azure.account.key.<storage-account-name>.blob.core.windows.net",
    "<your-storage-account-access-key>"
)
# OR a SAS token for a container:
# session.conf.set(
#     "fs.azure.sas.<container-name>.blob.core.windows.net",
#     "<sas-token>"
# )

# your Sample Data.xml file is in the virtual directory `data/test`
df = session.read.format("com.databricks.spark.xml") \
    .options(rowTag="book") \
    .load("wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/data/test/")
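If the read still returns an empty or wrong schema, the usual culprit is a rowTag that does not match the element name in your file. Before running the Spark job, you can sanity-check the tag locally with the standard library. This sketch uses a hypothetical catalog/book XML structure; substitute the actual contents of your Sample Data.xml.

```python
# Local sanity check (plain Python, no Spark): confirm the element name you
# plan to pass as rowTag actually exists in the document.
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the contents of "Sample Data.xml".
sample = """<catalog>
  <book id="bk101"><title>XML Guide</title></book>
  <book id="bk102"><title>Spark Notes</title></book>
</catalog>"""

root = ET.fromstring(sample)
rows = root.findall(".//book")  # same element name you pass as rowTag
print(len(rows))  # 2 -> rowTag="book" would yield 2 rows in Spark
```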
If you are using Azure Databricks, I think the code will work as expected. Otherwise, you may need to install the com.databricks.spark.xml library on your Apache Spark cluster.
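On an HDInsight Jupyter notebook, one way to pull the library in is the %%configure magic, which passes Spark configuration to the session. The exact package coordinates below (Scala version and spark-xml version) are an assumption; pick the ones matching your cluster's Spark and Scala versions.

```
%%configure -f
{ "conf": { "spark.jars.packages": "com.databricks:spark-xml_2.11:0.9.0" } }
```

Run this in its own cell before any Spark code, since -f forcibly restarts the session with the new configuration.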
Hope it helps.