Custom schema with a nested parent node in spark-xml

Time: 2018-03-27 18:24:29

Tags: apache-spark apache-spark-sql apache-spark-dataset apache-spark-xml

I am new to spark-xml and I am finding it difficult to prepare a custom schema for my object. I would appreciate any help. Below is what I have tried.

I am using Spark 1.4.7 and spark-xml version 0.3.5.

Test.Java

StructType customSchema = new StructType(new StructField[]{
    new StructField("id", DataTypes.StringType, true, Metadata.empty()),
    new StructField("name", DataTypes.StringType, true, Metadata.empty()),

    DataTypes.createStructField("names", DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("test", DataTypes.createArrayType(DataTypes.StringType),
            true)}), true)
});

final JavaRDD<Row> map = spoofRDD()
    .map(book -> RowFactory.create(
        book.getId(),
        book.getName(),
        book.getNames()));

final DataFrame df = sqlContext.createDataFrame(map, customSchema);
df.show();
df.printSchema();



private JavaRDD<Book> spoofRDD() {
    Book book1 = Book.builder().id("1").name("Name1")
        .names(new String[]{"1", "2"}).build();
    List<Book> books = new ArrayList<>();
    books.add(book1);

    return javaSparkContext.parallelize(books);
}

My POJO class, Book.Java

private final String id;
private final String name;
private final String[] names;

My expected XML

<books>
<book>
    <id>1</id>
    <name>Name1</name>
    <parent>
        <names>1</names>
        <names>2</names>
    </parent>
</book>
<book>
    <id>2</id>
    <name>Name2</name>
    <parent>
        <names>1</names>
        <names>2</names>
    </parent>
</book>
</books>

As you can see, I want the names tags nested inside a parent tag. How can I modify my customSchema to achieve this?

1 answer:

Answer 0: (score: 1)

The correct schema for the desired XML output is:

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- parent: struct (nullable = true)
 |    |-- names: array (nullable = true)
 |    |    |-- element: long (containsNull = true)

whereas your current schema is:

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- names: struct (nullable = true)
 |    |-- test: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

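So the only things you need to change in your code are the field names (test becomes names, and names becomes parent) and the value type of the array content.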

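new StructType(new StructField[]{
  new StructField("id", DataTypes.StringType, true, Metadata.empty()),
  new StructField("name", DataTypes.StringType, true, Metadata.empty()),

  // Outer field renamed from "names" to "parent", inner field from "test" to "names".
  // The element type stays StringType here because Book.names is a String[];
  // switch to LongType only if the data actually holds longs.
  DataTypes.createStructField("parent", DataTypes.createStructType(new StructField[]{
    DataTypes.createStructField("names", DataTypes.createArrayType(DataTypes.StringType),
        true)}), true)
})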

The real problem is the data. Since parent must be a struct, the output of getNames should be wrapped in a Row:

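.map(book -> RowFactory.create(
    book.getId(),
    book.getName(),
    // Cast to Object so the String[] becomes the single field of the nested Row
    // instead of being spread across RowFactory.create's varargs.
    RowFactory.create((Object) book.getNames())));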
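
One detail worth checking when wrapping the array: RowFactory.create takes `Object... values`, and Java treats a `String[]` passed to a varargs `Object...` parameter as the varargs array itself. The nested Row would then have one field per name rather than a single array field. The sketch below shows the difference in plain Java. The `create` helper is a hypothetical stand-in with the same varargs shape, used only so the example runs without Spark on the classpath.

```java
public class VarargsRowDemo {

    // Hypothetical stand-in for a varargs factory such as RowFactory.create(Object... values).
    // It returns the values it received so the number of fields can be inspected.
    public static Object[] create(Object... values) {
        return values;
    }

    public static void main(String[] args) {
        String[] names = {"1", "2"};

        // The String[] is used directly as the varargs array: two fields, "1" and "2".
        Object[] spread = create(names);

        // The cast makes the array a single argument: one field holding the whole array.
        Object[] wrapped = create((Object) names);

        System.out.println("without cast: " + spread.length + " fields");
        System.out.println("with cast:    " + wrapped.length + " field");
    }
}
```

For a struct with a single array field, as in the parent schema above, the cast form is the one that matches the schema.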