Reading XML into a DataFrame with Spark

Time: 2018-07-16 09:56:30

Tags: xml apache-spark dataframe pyspark

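I am trying to read an XML file into a DataFrame using Spark.

I am following the guide on GitHub.

For some reason, the column for the id attribute comes back as null for every row.

I am testing the code on this XML file.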

%pyspark
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

AWS_ACCESS_KEY_ID = "*********************"
AWS_SECRET_ACCESS_KEY = "*************************"

# Configure S3 (s3n) access on the Hadoop configuration of the existing SparkContext `sc`.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY_ID)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", AWS_SECRET_ACCESS_KEY)

sqlContext = SQLContext(sc)

# Explicit schema; "_id" is meant to hold the id attribute of <book>.
customSchema = StructType([
    StructField("_id", StringType(), True),
    StructField("author", StringType(), True),
    # StructField("description", StringType(), True),
    StructField("genre", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("publish_date", StringType(), True),
    StructField("title", StringType(), True),
])

# Each <book> element becomes one row.
df = sqlContext.read \
    .format('com.databricks.spark.xml') \
    .options(rowTag='book') \
    .load('s3n://######/###/######/books.xml', schema=customSchema)

df.show()

+----+--------------------+---------------+-----+------------+--------------------+
| _id|              author|          genre|price|publish_date|               title|
+----+--------------------+---------------+-----+------------+--------------------+
|null|Gambardella, Matthew|       Computer|44.95|  2000-10-01|XML Developer's G...|
|null|          Ralls, Kim|        Fantasy| 5.95|  2000-12-16|       Midnight Rain|
|null|         Corets, Eva|        Fantasy| 5.95|  2000-11-17|     Maeve Ascendant|
|null|         Corets, Eva|        Fantasy| 5.95|  2001-03-10|     Oberon's Legacy|
|null|         Corets, Eva|        Fantasy| 5.95|  2001-09-10|  The Sundered Grail|
|null|    Randall, Cynthia|        Romance| 4.95|  2000-09-02|         Lover Birds|
|null|      Thurman, Paula|        Romance| 4.95|  2000-11-02|       Splish Splash|
|null|       Knorr, Stefan|         Horror| 4.95|  2000-12-06|     Creepy Crawlies|
|null|        Kress, Peter|Science Fiction| 6.95|  2000-11-02|        Paradox Lost|
|null|        O'Brien, Tim|       Computer|36.95|  2000-12-09|Microsoft .NET: T...|
|null|        O'Brien, Tim|       Computer|36.95|  2000-12-01|MSXML3: A Compreh...|
|null|         Galos, Mike|       Computer|49.95|  2001-04-16|Visual Studio 7: ...|
+----+--------------------+---------------+-----+------------+--------------------+
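One way to narrow this down is to let spark-xml infer the schema instead of passing customSchema, and then look at which column name the id attribute ends up with. The sketch below assumes that the reader's attributePrefix option (default "_") is what maps id to _id. The helper name read_books_inferred and its parameters are hypothetical, and whether this explains the nulls seen here is not confirmed.

```python
def read_books_inferred(spark, path, attribute_prefix="_"):
    """Read <book> rows with schema inference so attribute columns can be inspected.

    `spark` is an existing SparkSession or SQLContext, and the spark-xml
    package must already be on the classpath. Hypothetical diagnostic helper.
    """
    return (
        spark.read
        .format("com.databricks.spark.xml")
        .option("rowTag", "book")
        # Attributes are exposed as columns named <prefix><attribute name>.
        .option("attributePrefix", attribute_prefix)
        .load(path)
    )
```

Calling read_books_inferred(sqlContext, path).printSchema() with the same path as above shows whether an _id field is inferred at all. If it is missing from the printed schema, the problem lies in how the attribute is read rather than in customSchema.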

Here is part of the XML file:

<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>
         An in-depth look at creating applications
         with XML.This manual describes Oracle XML DB, and how you can use it to store, generate, manipulate, manage,
         and query XML data in the database.
       </description>
   </book>

</catalog>
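As a sanity check outside Spark, the excerpt can be parsed with Python's standard library to confirm that id is an attribute of the book element rather than a child element. This is a minimal sketch: SAMPLE is a trimmed copy of the excerpt (description omitted), and describe_books is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the excerpt above (the <description> element is omitted).
SAMPLE = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
   </book>
</catalog>
"""


def describe_books(xml_text):
    """Return, for each <book>, its attributes and the tags of its child elements."""
    root = ET.fromstring(xml_text)
    return [
        {
            "attributes": dict(book.attrib),
            "children": [child.tag for child in book],
        }
        for book in root.iter("book")
    ]


for info in describe_books(SAMPLE):
    print(info)
```

Here id shows up under "attributes" and not under "children", so the file itself carries the value and the nulls come from how the reader handles the attribute.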

0 answers:

No answers