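I am very new to spark-xml and am finding it difficult to prepare a custom schema for my object. I would appreciate any help. Here is what I have tried.
I am using Spark 1.4.7 and spark-xml version 0.3.5.
Test.java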
StructType customSchema = new StructType(new StructField[]{
    new StructField("id", DataTypes.StringType, true, Metadata.empty()),
    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
    DataTypes.createStructField("names", DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("test", DataTypes.createArrayType(DataTypes.StringType),
            true)}), true)
});

final JavaRDD<Row> map = spoofRDD()
    .map(book -> RowFactory.create(
        book.getId(),
        book.getName(),
        book.getNames()));
final DataFrame df = sqlContext.createDataFrame(map, customSchema);
df.show();
df.printSchema();

private JavaRDD<Book> spoofRDD() {
    Book book1 = Book.builder().id("1").name("Name1")
        .names(new String[]{"1", "2"}).build();
    List<Book> books = new ArrayList<>();
    books.add(book1);
    return javaSparkContext.parallelize(books);
}
My POJO class, Book.java
private final String id;
private final String name;
private final String[] names;
My expected XML
<books>
    <book>
        <id>1</id>
        <name>Name1</name>
        **<parent>**
            <names>1</names>
            <names>2</names>
        **</parent>**
    </book>
    <book>
        <id>2</id>
        <name>Name2</name>
        **<parent>**
            <names>1</names>
            <names>2</names>
        **</parent>**
    </book>
</books>
So, as you can see, I want the names tags nested inside a parent tag. How can I modify my customSchema to achieve this?
Answer 0 (score: 1)
The correct schema for the desired XML output is:
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- parent: struct (nullable = true)
| |-- names: array (nullable = true)
| | |-- element: long (containsNull = true)
whereas your current schema is:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- names: struct (nullable = true)
| |-- test: array (nullable = true)
| | |-- element: string (containsNull = true)
So the only things you need to change are the field names, from test to names and from names to parent, and the value type of the array contents.
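new StructType(new StructField[]{
    new StructField("id", DataTypes.StringType, true, Metadata.empty()),
    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
    // "names" is renamed to "parent" and "test" to "names".
    // StringType is kept because Book stores its values as String; use LongType
    // (and convert the data) to match the long types in the schema above.
    DataTypes.createStructField("parent", DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("names", DataTypes.createArrayType(DataTypes.StringType),
            true)}), true)
})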
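To see why a struct named parent that holds an array named names gives the nesting in the expected XML, here is a minimal plain-Java sketch. It is not spark-xml's implementation: NestedXmlSketch, render and sampleBook are hypothetical names, a Map stands in for a struct, a List stands in for an array, and no XML escaping is done.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NestedXmlSketch {
    // Illustrative only: a Map is rendered as one element with child elements,
    // a List is rendered as the same tag repeated once per item.
    static String render(String tag, Object value) {
        StringBuilder sb = new StringBuilder();
        if (value instanceof List) {
            for (Object item : (List<?>) value) {
                sb.append(render(tag, item));
            }
        } else if (value instanceof Map) {
            sb.append('<').append(tag).append('>');
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sb.append(render(String.valueOf(e.getKey()), e.getValue()));
            }
            sb.append("</").append(tag).append('>');
        } else {
            sb.append('<').append(tag).append('>')
              .append(value)
              .append("</").append(tag).append('>');
        }
        return sb.toString();
    }

    // Same shape as the target schema: id, name, parent: struct<names: array>.
    static Map<String, Object> sampleBook() {
        Map<String, Object> parent = new LinkedHashMap<>();
        parent.put("names", Arrays.asList("1", "2"));
        Map<String, Object> book = new LinkedHashMap<>();
        book.put("id", "1");
        book.put("name", "Name1");
        book.put("parent", parent);
        return book;
    }

    public static void main(String[] args) {
        System.out.println(render("book", sampleBook()));
    }
}
```

The struct name becomes the wrapping tag and the array field name becomes the repeated inner tag, which is why the outer field has to be called parent and the inner one names.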
The real problem is the data. Since parent must be a struct, the getNames output should be wrapped in a Row:
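.map(book -> RowFactory.create(
    book.getId(),
    book.getName(),
    // Cast to Object so the String[] becomes one array field instead of being spread as varargs.
    RowFactory.create((Object) book.getNames())));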
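One detail to watch when wrapping the array in a Row: RowFactory.create takes Object... values, so a String[] passed as the only argument is treated as the varargs array itself, not as a single field. The sketch below shows this Java behaviour without needing Spark; VarargsRowDemo and its create method are hypothetical stand-ins with the same varargs shape, not Spark code.

```java
import java.util.Arrays;

public class VarargsRowDemo {
    // Hypothetical stand-in with the same signature shape as RowFactory.create(Object... values).
    static Object[] create(Object... values) {
        return values;
    }

    public static void main(String[] args) {
        String[] names = {"1", "2"};

        // The String[] is used as the varargs array itself: two separate fields.
        Object[] spread = create(names);
        System.out.println(spread.length); // 2

        // Casting to Object wraps the whole array as a single field.
        Object[] wrapped = create((Object) names);
        System.out.println(wrapped.length); // 1
        System.out.println(Arrays.toString((String[]) wrapped[0])); // [1, 2]
    }
}
```

With the cast, the inner Row has exactly one field (the array), which matches a parent struct with a single names field.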