Complex custom schema for XML processing in Spark

Date: 2018-02-02 11:22:17

Tags: xml scala apache-spark dataset

I am trying to write a custom schema for Spark to load an XML file. In my case I need to access two tags: us-related-documents and the us-provisional-application tag under it.

Here is my data structure:

    StructField("us-related-documents", StructType(
      List(StructField("us-provisional-application", StructType(
        List(StructField("document-id",
          ArrayType(StructType(
            List(
              StructField("doc-number", StringType, nullable = true),
              StructField("country", StringType, nullable = true),
              StructField("kind", StringType, nullable = true),
              StructField("date", LongType, nullable = true),
              StructField("name", StringType, nullable = true)
            )
          ))))
        )))
    ))

Here is the code with which I am accessing one tag:

     |-- us-related-documents: struct (nullable = true)
     |    |-- related-publication: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- document-id: struct (nullable = true)
     |    |    |    |    |-- country: string (nullable = true)
     |    |    |    |    |-- date: long (nullable = true)
     |    |    |    |    |-- doc-number: long (nullable = true)
     |    |    |    |    |-- kind: string (nullable = true)
     |    |-- us-provisional-application: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- document-id: struct (nullable = true)
     |    |    |    |    |-- country: string (nullable = true)
     |    |    |    |    |-- date: long (nullable = true)
     |    |    |    |    |-- doc-number: long (nullable = true)

When I try to add another tag to the above schema, it fails. How can I access the remaining tags?

2 Answers:

Answer 0 (score: 1)

  

> Here is the code with which I am accessing one tag

First of all, that is the schema of the dataset, not code that accesses anything. You should use the select() method of the Dataset object, or Spark SQL, to access the data.

For example:

    df.select($"us-related-documents.related-publication")
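To get at the fields nested inside the arrays, the same idea extends with explode(). A minimal sketch, assuming a SparkSession named spark (for the $ syntax and SQL) and the printed schema from the question; the alias app and the view name patents are just illustrative:

    import org.apache.spark.sql.functions.explode
    import spark.implicits._

    // Flatten the us-provisional-application array so each document-id
    // becomes its own row, then select its nested fields.
    val provisional = df
      .select(explode($"us-related-documents.us-provisional-application").as("app"))
      .select($"app.document-id.country", $"app.document-id.doc-number")

    // Or register a view and reach the same data with Spark SQL;
    // hyphenated names need backticks there.
    df.createOrReplaceTempView("patents")
    spark.sql("SELECT `us-related-documents`.`related-publication` FROM patents")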

I am not sure why you want to recreate the StructType objects of the schema, but that code does not match the printed schema:

- us-provisional-application must be an ArrayType, not a StructType.
- document-id is a struct, not an array.
- Your Scala code is actually closer to the related-publication struct anyway, so that may be the first mistake.
- doc-number is a long, not a string.
- You are also missing the element struct of the schema, and no struct field is called name.
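Putting those fixes together, a corrected StructType might look like the sketch below. This is only a reconstruction from the printed schema (note that kind is printed only under related-publication) and has not been tested against the actual XML:

    import org.apache.spark.sql.types._

    // document-id layout taken from the printed schema; "kind" is only
    // shown under related-publication, so it is added conditionally.
    def documentId(withKind: Boolean) = StructType(
      List(
        StructField("country", StringType, nullable = true),
        StructField("date", LongType, nullable = true),
        StructField("doc-number", LongType, nullable = true)
      ) ++ (if (withKind) List(StructField("kind", StringType, nullable = true)) else Nil)
    )

    // Both tags are arrays whose elements each hold one document-id struct.
    val usRelatedDocuments = StructField("us-related-documents", StructType(List(
      StructField("related-publication", ArrayType(StructType(List(
        StructField("document-id", documentId(withKind = true), nullable = true)
      ))), nullable = true),
      StructField("us-provisional-application", ArrayType(StructType(List(
        StructField("document-id", documentId(withKind = false), nullable = true)
      ))), nullable = true)
    )), nullable = true)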

After fixing those issues, let us know what your errors actually are.

Answer 1 (score: -2)

You have to use the spark-xml library (the separate com.databricks:spark-xml package) to parse the XML file:

    import org.apache.spark.sql.SQLContext
    import com.databricks.spark.xml._

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .option("rowTag", "book")
      .xml("books.xml")

    val selectedData = df.select("author", "_id")
    selectedData.write
      .option("rootTag", "books")
      .option("rowTag", "book")
      .xml("newbooks.xml")

Alternatively, you can specify the data source format explicitly:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("books.xml")

    val selectedData = df.select("author", "_id")
    selectedData.write
      .format("com.databricks.spark.xml")
      .option("rootTag", "books")
      .option("rowTag", "book")
      .save("newbooks.xml")

You can manually specify the schema when reading data:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

    val sqlContext = new SQLContext(sc)
    val customSchema = StructType(Array(
      StructField("_id", StringType, nullable = true),
      StructField("author", StringType, nullable = true),
      StructField("description", StringType, nullable = true),
      StructField("genre", StringType, nullable = true),
      StructField("price", DoubleType, nullable = true),
      StructField("publish_date", StringType, nullable = true),
      StructField("title", StringType, nullable = true)))

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .schema(customSchema)
      .load("books.xml")

    val selectedData = df.select("author", "_id")
    selectedData.write
      .format("com.databricks.spark.xml")
      .option("rootTag", "books")
      .option("rowTag", "book")
      .save("newbooks.xml")
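Applied back to the question's data, the same pattern would combine spark-xml with a corrected nested schema. A rough sketch, where patents.xml and the "us-patent-grant" row tag are only guesses (the question never shows the XML root element), and usRelatedDocuments is the StructField sketched in the previous answer:

    import org.apache.spark.sql.types.StructType

    // Hypothetical row tag and file name; usRelatedDocuments comes from
    // the corrected-schema sketch in the first answer.
    val patentDf = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "us-patent-grant")
      .schema(StructType(List(usRelatedDocuments)))
      .load("patents.xml")

    patentDf.select("us-related-documents.us-provisional-application").printSchema()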