I have the following schema:
root
|-- DataPartition: long (nullable = true)
|-- TimeStamp: string (nullable = true)
|-- _action: string (nullable = true)
|-- env:Data: struct (nullable = true)
| |-- _type: string (nullable = true)
| |-- al:FundamentalAnalytic: struct (nullable = true)
| | |-- _analyticItemInstanceKey: long (nullable = true)
| | |-- _financialPeriodEndDate: string (nullable = true)
| | |-- _financialPeriodType: string (nullable = true)
| | |-- _isYearToDate: boolean (nullable = true)
| | |-- _lineItemId: long (nullable = true)
| | |-- al:AnalyticConceptCode: string (nullable = true)
| | |-- al:AnalyticConceptId: long (nullable = true)
| | |-- al:AnalyticIsEstimated: boolean (nullable = true)
| | |-- al:AnalyticValue: struct (nullable = true)
| | | |-- _VALUE: double (nullable = true)
| | | |-- _currencyId: long (nullable = true)
| | |-- al:AuditID: string (nullable = true)
| | |-- al:FinancialPeriodTypeId: long (nullable = true)
| | |-- al:FundamentalSeriesId: struct (nullable = true)
| | | |-- _VALUE: long (nullable = true)
| | | |-- _objectType: string (nullable = true)
| | | |-- _objectTypeId: long (nullable = true)
| | |-- al:InstrumentId: long (nullable = true)
| | |-- al:IsAnnual: boolean (nullable = true)
| | |-- al:TaxonomyId: long (nullable = true)
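This schema comes from an XML file whose structure changes frequently. I only want to process tags that contain env:Data.sr:Source.*, so I wrote the code below: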
val dfType = dfContentItem.
select(getDataPartition($"DataPartition").
as("DataPartition"),
$"TimeStamp".as("TimeStamp"),
$"env:Data.sr:Source.*",
getFFActionParent($"_action")
.as("FFAction|!|")
).filter($"env:Data.sr:Source._organizationId".isNotNull)
dfType.show(false)
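But this only works when sr:Source is found in the schema; otherwise I get this exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: No such struct field sr:Source in _type, cr:TRFCoraxData, fun:Fundamental, md:Identifier, md:Relationship;
To skip such files I added a null check on sr:Source, but it does not work for me: I get the same error for that check as well.
Basically, what I need is this: if env:Data.sr:Source.* is null, I want to skip processing, and processing should start again with the next tag.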
Answer 0 (score: 0)
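org.apache.spark.sql.AnalysisException is usually thrown when a query is malformed, so I'm fairly sure this happens because you are trying to filter on null in those cases.
Error handling in Scala is usually done with Option (good article on it).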
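To make the Option idea concrete, here is a minimal plain-Scala sketch with no Spark involved. The object OptionSketch and the method describeOrg are hypothetical names used only for illustration; the point is that Option(null) evaluates to None, so the missing case is handled without an explicit null check.

```scala
// Hypothetical illustration, not part of the original answer.
object OptionSketch {
  // Wraps a possibly-null id: Option(null) is None, Option("x") is Some("x").
  def describeOrg(organizationId: String): String =
    Option(organizationId)
      .map(id => s"organization $id")   // runs only when a value is present
      .getOrElse("no organization")     // fallback for the null case
}
```

Calling OptionSketch.describeOrg(null) takes the fallback branch instead of throwing a NullPointerException.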
Try:
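import org.apache.spark.sql.functions.udf

// Option(null) is None, so a missing organization id becomes None instead of failing.
// The String input type is an assumption; adjust it to the column's actual type.
def handleNulls(organizationId: String): Option[Boolean] =
  Option(organizationId).map(_ => true)
val betterNullsUdf = udf[Option[Boolean], String](handleNulls)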
val dfType = dfContentItem.
select(getDataPartition($"DataPartition").
as("DataPartition"),
$"TimeStamp".as("TimeStamp"),
betterNullsUdf($"env:Data.sr:Source._organizationId"), // a UDF takes single columns, not a .* expansion
getFFActionParent($"_action")
.as("FFAction|!|")
).filter($"env:Data.sr:Source._organizationId".isNotNull)
dfType.show(false)
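One caveat: Spark raises AnalysisException while it resolves column names, before any row-level filter or UDF runs, so a field that is absent from the schema probably has to be detected up front. A minimal sketch of such a check, assuming a hypothetical helper hasField (not part of Spark or of the original post) that walks the schema through nested structs only:

```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper, for illustration only.
object SchemaCheck {
  // Returns true when every name in `path` resolves, descending through structs.
  def hasField(schema: StructType, path: Seq[String]): Boolean = path match {
    case Seq() => true
    case head +: tail =>
      schema.fields.find(_.name == head) match {
        // Nested struct: keep walking with the rest of the path.
        case Some(StructField(_, nested: StructType, _, _)) => hasField(nested, tail)
        // Leaf field: valid only if nothing is left to resolve.
        case Some(_) => tail.isEmpty
        case None => false
      }
  }
}
```

The select could then be guarded with something like if (SchemaCheck.hasField(dfContentItem.schema, Seq("env:Data", "sr:Source"))), skipping the file when the check fails.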