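I am using the Spark XML library (HyukjinKwon:spark-xml:0.1.1-s_2.11) to process a large XML file. Processing fails with an AnalysisException for a few records. The exception is raised by a select against the inferred schema, and the sample records below reproduce it.
Below are the XML samples and the schema generated for each. Because the input XML can sometimes look like Sample 2, my select on the DataFrame fails with an AnalysisException.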
Sample 1: works fine
XML 1:
<AuthorList CompleteYN="Y">
  <Author ValidYN="Y">
    <LastName>H</LastName>
    <ForeName>L</ForeName>
    <Initials>L</Initials>
    <AffiliationInfo>
      <Affiliation>Aff1</Affiliation>
    </AffiliationInfo>
    <AffiliationInfo>
      <Affiliation>Aff2</Affiliation>
    </AffiliationInfo>
  </Author>
</AuthorList>
Schema:
root
 |-- AuthorList: struct (nullable = true)
 |    |-- Author: struct (nullable = true)
 |    |    |-- AffiliationInfo: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Affiliation: string (nullable = true)
Sample 2: does not work
XML 2:
<AuthorList CompleteYN="Y">
  <Author ValidYN="Y">
    <LastName>H</LastName>
    <ForeName>L</ForeName>
    <Initials>L</Initials>
    <AffiliationInfo>
      <Affiliation>Aff1</Affiliation>
    </AffiliationInfo>
    <AffiliationInfo>
      <Affiliation>Aff2</Affiliation>
    </AffiliationInfo>
  </Author>
  <Author ValidYN="Y">
    <LastName>H</LastName>
    <ForeName>L</ForeName>
    <Initials>L</Initials>
    <AffiliationInfo>
      <Affiliation>Aff4</Affiliation>
    </AffiliationInfo>
  </Author>
</AuthorList>
Schema:
root
 |-- AuthorList: struct (nullable = true)
 |    |-- Author: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- AffiliationInfo: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- Affiliation: string (nullable = true)
Exception: AnalysisException: "cannot resolve '`AuthorList`.`Author`.`AffiliationInfo`['Affiliation']' due to data type mismatch: argument 2 requires integral type, however, ''Affiliation'' is of string type.;;
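I would like to generate or change the schema so that it also supports records like XML 2. I am not sure what is wrong with the generated schema. Any input is appreciated.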
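To make the shape difference between the two samples concrete outside of Spark, here is a minimal sketch in plain Python. It uses only the standard library and not spark-xml. The helper name `affiliations_per_author` is made up for illustration, and the XML literals are trimmed copies of the two samples above. It treats Author as a repeated element in both cases, which is the shape I would like a single schema to cover:

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two samples: one <Author> versus two <Author> elements.
XML_1 = """<AuthorList CompleteYN="Y">
  <Author ValidYN="Y">
    <LastName>H</LastName>
    <AffiliationInfo><Affiliation>Aff1</Affiliation></AffiliationInfo>
    <AffiliationInfo><Affiliation>Aff2</Affiliation></AffiliationInfo>
  </Author>
</AuthorList>"""

XML_2 = """<AuthorList CompleteYN="Y">
  <Author ValidYN="Y">
    <LastName>H</LastName>
    <AffiliationInfo><Affiliation>Aff1</Affiliation></AffiliationInfo>
    <AffiliationInfo><Affiliation>Aff2</Affiliation></AffiliationInfo>
  </Author>
  <Author ValidYN="Y">
    <LastName>H</LastName>
    <AffiliationInfo><Affiliation>Aff4</Affiliation></AffiliationInfo>
  </Author>
</AuthorList>"""


def affiliations_per_author(xml_text):
    """Return one list of affiliation strings per <Author> element.

    <Author> is always handled as a list, even when it occurs only once,
    so both samples produce the same nested shape (list of lists).
    """
    root = ET.fromstring(xml_text)
    return [
        [aff.text for aff in author.findall("AffiliationInfo/Affiliation")]
        for author in root.findall("Author")
    ]


print(affiliations_per_author(XML_1))
print(affiliations_per_author(XML_2))
```

With Author always handled as a list, Sample 1 gives a single inner list and Sample 2 gives two, so the same access path works for both records. In the inferred Spark schemas above, Author is a struct in one case and an array in the other, which is where the two records diverge.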