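I am trying to read an XML file into a DataFrame using Spark.
I am following the guide on GitHub.
For some reason, the column for the id attribute (_id) is null.
I am testing the code on this xml file.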
%pyspark
from pyspark.sql import SQLContext
from pyspark.sql.types import *
AWS_ACCESS_KEY_ID = "*********************"
AWS_SECRET_ACCESS_KEY = "*************************"
sc._jsc.hadoopConfiguration().set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY_ID)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_ACCESS_KEY)
sqlContext = SQLContext(sc)
customSchema = StructType([
    StructField("_id", StringType(), True),
    StructField("author", StringType(), True),
    # StructField("description", StringType(), True),
    StructField("genre", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("publish_date", StringType(), True),
    StructField("title", StringType(), True),
])
df = (
    sqlContext.read
    .format('com.databricks.spark.xml')
    .options(rowTag='book')
    .load('s3n://######/###/######/books.xml', schema=customSchema)
)
df.show()
+----+--------------------+---------------+-----+------------+--------------------+
| _id|              author|          genre|price|publish_date|               title|
+----+--------------------+---------------+-----+------------+--------------------+
|null|Gambardella, Matthew|       Computer|44.95|  2000-10-01|XML Developer's G...|
|null|          Ralls, Kim|        Fantasy| 5.95|  2000-12-16|       Midnight Rain|
|null|         Corets, Eva|        Fantasy| 5.95|  2000-11-17|     Maeve Ascendant|
|null|         Corets, Eva|        Fantasy| 5.95|  2001-03-10|     Oberon's Legacy|
|null|         Corets, Eva|        Fantasy| 5.95|  2001-09-10|  The Sundered Grail|
|null|    Randall, Cynthia|        Romance| 4.95|  2000-09-02|         Lover Birds|
|null|      Thurman, Paula|        Romance| 4.95|  2000-11-02|       Splish Splash|
|null|       Knorr, Stefan|         Horror| 4.95|  2000-12-06|     Creepy Crawlies|
|null|        Kress, Peter|Science Fiction| 6.95|  2000-11-02|        Paradox Lost|
|null|        O'Brien, Tim|       Computer|36.95|  2000-12-09|Microsoft .NET: T...|
|null|        O'Brien, Tim|       Computer|36.95|  2000-12-01|MSXML3: A Compreh...|
|null|         Galos, Mike|       Computer|49.95|  2001-04-16|Visual Studio 7: ...|
+----+--------------------+---------------+-----+------------+--------------------+
This is part of the XML file:
<?xml version="1.0"?>
<catalog>
  <book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>
      An in-depth look at creating applications
      with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage,
      and query XML data in the database.
    </description>
  </book>
</catalog>
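To narrow the problem down outside Spark, here is a minimal sketch that parses a trimmed copy of the sample with Python's standard library and flattens each book row into a dict. It only illustrates how a column name such as _id can arise from an attribute plus a prefix; flatten_row and its attribute_prefix parameter are hypothetical helpers written for this post, not part of spark-xml. As far as I know spark-xml exposes a comparable setting as the attributePrefix option, and its default may differ between versions, so that part is an assumption to check against the version in use.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample file shown above (description omitted).
SAMPLE = """<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
</book>
</catalog>"""


def flatten_row(element, attribute_prefix="_"):
    """Flatten one row element into a dict (hypothetical helper).

    Attribute names get attribute_prefix prepended; child elements
    keep their tag name as the key.
    """
    row = {attribute_prefix + name: value
           for name, value in element.attrib.items()}
    for child in element:
        row[child.tag] = (child.text or "").strip()
    return row


root = ET.fromstring(SAMPLE)

# With "_" as the prefix, the id attribute lands under the key "_id".
rows = [flatten_row(book) for book in root.iter("book")]
print(rows[0]["_id"])         # bk101

# With a different prefix, a lookup for "_id" finds nothing,
# which is the same symptom as the null column above.
at_rows = [flatten_row(book, attribute_prefix="@") for book in root.iter("book")]
print(at_rows[0].get("_id"))  # None
print(at_rows[0]["@id"])      # bk101
```

This confirms that id really is an attribute of book in the file, and that a schema field named _id can only match if the prefix in effect is "_". If the prefix were something else, the field name in customSchema would not line up with the parsed column and the value would come back as null.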