PySpark XML processing: schema type error

Time: 2019-04-08 16:52:02

Tags: xml pyspark azure-databricks

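I am using the Spark XML library (HyukjinKwon:spark-xml:0.1.1-s_2.11) to process a large XML file. Processing fails with an AnalysisException for a few records. The exception is raised by a select against the inferred schema, and the sample records below reproduce it.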

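Below are the XML input and the schema that is generated for it. Because the input XML can sometimes look like Sample 2, my select on the DataFrame fails with an AnalysisException.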

Sample 1: works fine

XML 1:
<AuthorList CompleteYN="Y">
    <Author ValidYN="Y">
        <LastName>H</LastName>
        <ForeName>L</ForeName>
        <Initials>L</Initials>
        <AffiliationInfo>
            <Affiliation>Aff1</Affiliation>
        </AffiliationInfo>
        <AffiliationInfo>
            <Affiliation>Aff2</Affiliation>
        </AffiliationInfo>
    </Author>
</AuthorList>

Schema: 
root
 |-- AuthorList: struct (nullable = true)
 |    |    |    |-- Author: struct (nullable = true)
 |    |    |    |    |-- AffiliationInfo: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- Affiliation: string (nullable = true)


Sample 2: doesn't work

XML 2:
<AuthorList CompleteYN="Y">
    <Author ValidYN="Y">
        <LastName>H</LastName>
        <ForeName>L</ForeName>
        <Initials>L</Initials>
        <AffiliationInfo>
            <Affiliation>Aff1</Affiliation>
        </AffiliationInfo>
        <AffiliationInfo>
            <Affiliation>Aff2</Affiliation>
        </AffiliationInfo>
    </Author>
    <Author ValidYN="Y">
        <LastName>H</LastName>
        <ForeName>L</ForeName>
        <Initials>L</Initials>
        <AffiliationInfo>
            <Affiliation>Aff4</Affiliation>
        </AffiliationInfo>
    </Author>
</AuthorList>

Schema: 

root
 |-- AuthorList: struct (nullable = true)
 |    |    |    |-- Author: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- AffiliationInfo: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- Affiliation: string (nullable = true)


Exception: AnalysisException: "cannot resolve '`AuthorList`.`Author`.`AffiliationInfo`['Affiliation']' due to data type mismatch: argument 2 requires integral type, however, ''Affiliation'' is of string type.;;

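I would like to generate or change the schema so that it also supports records like XML 2. I am not sure what is wrong with the generated schema. Any input is appreciated.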

0 answers:

No answers yet