Databricks spark-xml returns null when reading tags that end with "/>"

Date: 2018-07-11 18:21:46

Tags: apache-spark xml-parsing databricks

I'm using the latest version of spark-xml (0.4.1) with Scala 2.11. When I read XML containing tags that end with "/>" (self-closing tags), the corresponding row values come back as null. See the following example:

XML:

<Clients>
    <Client ID="1" name="teste1" age="10">
        <Operation ID="1" name="operation1">
        </Operation>
        <Operation ID="2" name="operation2">
        </Operation>
    </Client>
    <Client ID="2" name="teste2" age="20"/>
    <Client ID="3" name="teste3" age="30">
        <Operation ID="1" name="operation1">
        </Operation>
        <Operation ID="2" name="operation2">
        </Operation>
    </Client>
</Clients>
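For context, a self-closing tag such as <Client ID="2" name="teste2" age="20"/> is, per the XML specification, equivalent to an open/close pair with no children, and its attributes are readable by any conforming parser. The following minimal sketch (the class and helper names are my own, for illustration) uses the JDK's built-in DOM parser on a trimmed-down version of the input and recovers the attributes of the self-closing element without trouble, which suggests the null row comes from spark-xml's handling rather than from malformed input:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SelfClosingDemo {

    // Trimmed-down input: one explicit open/close Client, one self-closing Client.
    static final String XML =
            "<Clients>"
            + "<Client ID=\"1\" name=\"teste1\" age=\"10\"></Client>"
            + "<Client ID=\"2\" name=\"teste2\" age=\"20\"/>"
            + "</Clients>";

    // Parse the XML and return "ID name age" for each Client element.
    static List<String> parseClients(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList clients = doc.getElementsByTagName("Client");
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < clients.getLength(); i++) {
            Element c = (Element) clients.item(i);
            rows.add(c.getAttribute("ID") + " "
                    + c.getAttribute("name") + " "
                    + c.getAttribute("age"));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        // Both forms of Client yield their attributes.
        for (String row : parseClients(XML)) {
            System.out.println(row); // prints "1 teste1 10", then "2 teste2 20"
        }
    }
}
```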

Dataframe:

+----+------+----+--------------------+
| _ID| _name|_age|           Operation|
+----+------+----+--------------------+
|   1|teste1|  10|[[1,operation1], ...|
|null|  null|null|                null|
+----+------+----+--------------------+

Code:

Dataset<Row> clients = sparkSession.sqlContext().read()
        .format("com.databricks.spark.xml")
        .option("rowTag", "Client")
        .schema(getSchemaClient())
        .load(dirtorio);

clients.show(10);

public StructType getSchemaClient() {
    return new StructType(
            new StructField[] {
                    new StructField("_ID", DataTypes.StringType, true, Metadata.empty()),
                    new StructField("_name", DataTypes.StringType, true, Metadata.empty()),
                    new StructField("_age", DataTypes.StringType, true, Metadata.empty()),
                    new StructField("Operation", DataTypes.createArrayType(this.getSchemaOperation()), true, Metadata.empty()) });
}

public StructType getSchemaOperation() {
    return new StructType(new StructField[] {
            new StructField("_ID", DataTypes.StringType, true, Metadata.empty()),
            new StructField("_name", DataTypes.StringType, true, Metadata.empty()),
    });
}

1 Answer:

Answer 0 (score: 0)

Version 0.5.0 was just released, and it fixes the handling of self-closing tags. It should resolve this issue. See https://github.com/databricks/spark-xml/pull/352
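Upgrading means bumping the dependency coordinates. Assuming a Scala 2.11 build (which matches the question), the Maven dependency would look roughly like this; adjust the artifact suffix to your Scala version:

```xml
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.11</artifactId>
    <version>0.5.0</version>
</dependency>
```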