Complex custom schema for XML processing in Spark

Date: 2018-02-02 11:22:17

Tags: xml scala apache-spark dataset

I am trying to write a custom schema for Spark to load an XML file. In my case I need to access two tags: us-related-documents and the us-provisional-application tag under it.

Here is my data structure:

    StructField("us-related-documents", StructType(
      List(StructField("us-provisional-application", StructType(
        List(StructField("document-id",
          ArrayType(StructType(
            List(
              StructField("doc-number", StringType, nullable = true),
              StructField("country", StringType, nullable = true),
              StructField("kind", StringType, nullable = true),
              StructField("date", LongType, nullable = true),
              StructField("name", StringType, nullable = true)
            )
          ))))
        )))
    ))

Here is the code with which I am accessing one tag:

     |-- us-related-documents: struct (nullable = true)
     |    |-- related-publication: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- document-id: struct (nullable = true)
     |    |    |    |    |-- country: string (nullable = true)
     |    |    |    |    |-- date: long (nullable = true)
     |    |    |    |    |-- doc-number: long (nullable = true)
     |    |    |    |    |-- kind: string (nullable = true)
     |    |-- us-provisional-application: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- document-id: struct (nullable = true)
     |    |    |    |    |-- country: string (nullable = true)
     |    |    |    |    |-- date: long (nullable = true)
     |    |    |    |    |-- doc-number: long (nullable = true)

When I try to add another tag to the above schema, it fails. How can I access the remaining tags?

2 Answers:

Answer 0 (score: 1)

  

> Here is the code with which I am accessing one tag

First of all, that is the schema of the dataset, not code that accesses anything. You should use the select() method of the Dataset object, or Spark SQL, to access the data.

For example:

    df.select($"us-related-documents.related-publication")
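To get at the fields nested inside the arrays, the same idea extends with explode(). A minimal sketch, assuming a SparkSession named spark (for the $ syntax and SQL) and the printed schema from the question; the alias app and the view name patents are just illustrative:

    import org.apache.spark.sql.functions.explode
    import spark.implicits._

    // Flatten the us-provisional-application array so each document-id
    // becomes its own row, then select its nested fields.
    val provisional = df
      .select(explode($"us-related-documents.us-provisional-application").as("app"))
      .select($"app.document-id.country", $"app.document-id.doc-number")

    // Or register a view and reach the same data with Spark SQL;
    // hyphenated names need backticks there.
    df.createOrReplaceTempView("patents")
    spark.sql("SELECT `us-related-documents`.`related-publication` FROM patents")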

I am not sure why you want to recreate the StructType objects of the schema, but that code does not match the printed schema:

- us-provisional-application must be an ArrayType, not a StructType.
- document-id is a struct, not an array.
- Your Scala code is actually closer to the related-publication struct anyway, so that may be the first mistake.
- doc-number is a long, not a string.
- You are also missing the element struct of the schema, and no struct field is called name.
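Putting those fixes together, a corrected StructType might look like the sketch below. This is only a reconstruction from the printed schema (note that kind is printed only under related-publication) and has not been tested against the actual XML:

    import org.apache.spark.sql.types._

    // document-id layout taken from the printed schema; "kind" is only
    // shown under related-publication, so it is added conditionally.
    def documentId(withKind: Boolean) = StructType(
      List(
        StructField("country", StringType, nullable = true),
        StructField("date", LongType, nullable = true),
        StructField("doc-number", LongType, nullable = true)
      ) ++ (if (withKind) List(StructField("kind", StringType, nullable = true)) else Nil)
    )

    // Both tags are arrays whose elements each hold one document-id struct.
    val usRelatedDocuments = StructField("us-related-documents", StructType(List(
      StructField("related-publication", ArrayType(StructType(List(
        StructField("document-id", documentId(withKind = true), nullable = true)
      ))), nullable = true),
      StructField("us-provisional-application", ArrayType(StructType(List(
        StructField("document-id", documentId(withKind = false), nullable = true)
      ))), nullable = true)
    )), nullable = true)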

After fixing those issues, let us know what your errors actually are.

Answer 1 (score: -2)

You have to use the spark-xml library (the separate com.databricks:spark-xml package) to parse the XML file:

    import org.apache.spark.sql.SQLContext
    import com.databricks.spark.xml._

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .option("rowTag", "book")
      .xml("books.xml")

    val selectedData = df.select("author", "_id")
    selectedData.write
      .option("rootTag", "books")
      .option("rowTag", "book")
      .xml("newbooks.xml")

Alternatively, you can specify the data source format explicitly:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("books.xml")

    val selectedData = df.select("author", "_id")
    selectedData.write
      .format("com.databricks.spark.xml")
      .option("rootTag", "books")
      .option("rowTag", "book")
      .save("newbooks.xml")

You can manually specify the schema when reading data:

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

    val sqlContext = new SQLContext(sc)
    val customSchema = StructType(Array(
      StructField("_id", StringType, nullable = true),
      StructField("author", StringType, nullable = true),
      StructField("description", StringType, nullable = true),
      StructField("genre", StringType, nullable = true),
      StructField("price", DoubleType, nullable = true),
      StructField("publish_date", StringType, nullable = true),
      StructField("title", StringType, nullable = true)))

    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .schema(customSchema)
      .load("books.xml")

    val selectedData = df.select("author", "_id")
    selectedData.write
      .format("com.databricks.spark.xml")
      .option("rootTag", "books")
      .option("rowTag", "book")
      .save("newbooks.xml")
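Applied back to the question's data, the same pattern would combine spark-xml with a corrected nested schema. A rough sketch, where patents.xml and the "us-patent-grant" row tag are only guesses (the question never shows the XML root element), and usRelatedDocuments is the StructField sketched in the previous answer:

    import org.apache.spark.sql.types.StructType

    // Hypothetical row tag and file name; usRelatedDocuments comes from
    // the corrected-schema sketch in the first answer.
    val patentDf = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "us-patent-grant")
      .schema(StructType(List(usRelatedDocuments)))
      .load("patents.xml")

    patentDf.select("us-related-documents.us-provisional-application").printSchema()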