I am trying to write a custom schema for Spark to load XML files. In my case I need to access two tags under the us-related-documents tag, namely related-publication and us-provisional-application. Here is my data structure:
us-related-documents: struct (nullable = true)
 |-- related-publication: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- document-id: struct (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |    |    |-- date: long (nullable = true)
 |    |    |    |-- doc-number: long (nullable = true)
 |    |    |    |-- kind: string (nullable = true)
 |-- us-provisional-application: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- document-id: struct (nullable = true)
 |    |    |    |-- country: string (nullable = true)
 |    |    |    |-- date: long (nullable = true)
 |    |    |    |-- doc-number: long (nullable = true)
Below is my code where I access one tag:
StructField("us-related-documents", StructType(
List(StructField("us-provisional-application",StructType(
List(StructField("document-id",
ArrayType(StructType(
List(
StructField("doc-number", StringType, nullable = true),
StructField("country", StringType, nullable = true),
StructField("kind", StringType, nullable = true),
StructField("date", LongType, nullable = true),
StructField("name", StringType, nullable = true)
)
))))
)))
))
When I try to add another tag to the above schema, it fails. How do I access the remaining tags?
Answer 0 (score: 1)
"Below is my code where I access one tag"

First of all, that is the dataset schema, not code that accesses anything. You should use the Dataset object's select() method or Spark SQL to access the data.
For example:
df.select($"us-related-documents.related-publication")
I'm not sure why you are rebuilding the schema's StructType object, but that code does not match the printed schema:

- us-provisional-application must be an ArrayType, not a StructType.
- document-id is a struct, not an array. Your Scala code is actually closer to the related-publication struct, so that may be the first mistake.
- doc-number is a long, not a string.
- You are also missing the element struct of the schema, and no struct field is called name.

Once you fix those problems, let us know what your error actually is.
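For reference, here is a sketch of what the StructType could look like once those fixes are applied. It is reconstructed purely from the printSchema output above and has not been tested against the actual data:

import org.apache.spark.sql.types._

// document-id as printed: a struct, not an array; doc-number is a long.
// The printed us-provisional-application document-id omits kind, so reusing
// one struct for both branches is a simplification.
val documentId = StructType(List(
  StructField("country", StringType, nullable = true),
  StructField("date", LongType, nullable = true),
  StructField("doc-number", LongType, nullable = true),
  StructField("kind", StringType, nullable = true)
))

// Both tags are arrays of structs, each wrapping a document-id struct
val usRelatedDocuments = StructField("us-related-documents", StructType(List(
  StructField("related-publication",
    ArrayType(StructType(List(StructField("document-id", documentId, nullable = true)))),
    nullable = true),
  StructField("us-provisional-application",
    ArrayType(StructType(List(StructField("document-id", documentId, nullable = true)))),
    nullable = true)
)), nullable = true)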
Answer 1 (score: -2)
You have to use the spark-xml library to parse the XML file:
import org.apache.spark.sql.SQLContext
import com.databricks.spark.xml._

val sqlContext = new SQLContext(sc)

// Read the XML, one row per <book> element (requires the spark-xml package on
// the classpath, e.g. spark-shell --packages com.databricks:spark-xml_2.11:0.4.1)
val df = sqlContext.read
  .option("rowTag", "book")
  .xml("books.xml")

val selectedData = df.select("author", "_id")

// Write the selection back out as XML, wrapping rows in a <books> root element
selectedData.write
  .option("rootTag", "books")
  .option("rowTag", "book")
  .xml("newbooks.xml")
Alternatively you can specify the format to use instead:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("books.xml")

val selectedData = df.select("author", "_id")
selectedData.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "books")
  .option("rowTag", "book")
  .save("newbooks.xml")
You can manually specify the schema when reading data:
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

val sqlContext = new SQLContext(sc)

val customSchema = StructType(Array(
  StructField("_id", StringType, nullable = true),
  StructField("author", StringType, nullable = true),
  StructField("description", StringType, nullable = true),
  StructField("genre", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true),
  StructField("publish_date", StringType, nullable = true),
  StructField("title", StringType, nullable = true)))

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .schema(customSchema)
  .load("books.xml")

val selectedData = df.select("author", "_id")
selectedData.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "books")
  .option("rowTag", "book")
  .save("newbooks.xml")