Checking for a null condition before adding new columns in a Spark job (Scala)

Asked: 2018-04-10 03:12:56

Tags: scala apache-spark apache-spark-xml

I have the following schema:

root
 |-- DataPartition: long (nullable = true)
 |-- TimeStamp: string (nullable = true)
 |-- _action: string (nullable = true)
 |-- env:Data: struct (nullable = true)
 |    |-- _type: string (nullable = true)
 |    |-- al:FundamentalAnalytic: struct (nullable = true)
 |    |    |-- _analyticItemInstanceKey: long (nullable = true)
 |    |    |-- _financialPeriodEndDate: string (nullable = true)
 |    |    |-- _financialPeriodType: string (nullable = true)
 |    |    |-- _isYearToDate: boolean (nullable = true)
 |    |    |-- _lineItemId: long (nullable = true)
 |    |    |-- al:AnalyticConceptCode: string (nullable = true)
 |    |    |-- al:AnalyticConceptId: long (nullable = true)
 |    |    |-- al:AnalyticIsEstimated: boolean (nullable = true)
 |    |    |-- al:AnalyticValue: struct (nullable = true)
 |    |    |    |-- _VALUE: double (nullable = true)
 |    |    |    |-- _currencyId: long (nullable = true)
 |    |    |-- al:AuditID: string (nullable = true)
 |    |    |-- al:FinancialPeriodTypeId: long (nullable = true)
 |    |    |-- al:FundamentalSeriesId: struct (nullable = true)
 |    |    |    |-- _VALUE: long (nullable = true)
 |    |    |    |-- _objectType: string (nullable = true)
 |    |    |    |-- _objectTypeId: long (nullable = true)
 |    |    |-- al:InstrumentId: long (nullable = true)
 |    |    |-- al:IsAnnual: boolean (nullable = true)
 |    |    |-- al:TaxonomyId: long (nullable = true)
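
For context, a DataFrame with a schema like this is typically produced by the spark-xml data source. A rough sketch of how it might be loaded follows; the rowTag value and file path are placeholders, not taken from the original post.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("xml-to-df").getOrCreate()

// Read the XML with the spark-xml package; rowTag and path are guesses here.
val dfContentItem = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "ContentItem")
  .load("path/to/input.xml")

dfContentItem.printSchema()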

Now, this XML file changes frequently, and I only want to process the records that contain env:Data.sr:Source.*. For that I wrote the code below:

val dfType = dfContentItem.
    select(getDataPartition($"DataPartition").
        as("DataPartition"), 
        $"TimeStamp".as("TimeStamp"), 
        $"env:Data.sr:Source.*", 
        getFFActionParent($"_action")
        .as("FFAction|!|")
    ).filter($"env:Data.sr:Source._organizationId".isNotNull)
dfType.show(false)

But this only works when sr:Source is present in the schema; otherwise I get the following exception:


Exception in thread "main" org.apache.spark.sql.AnalysisException: No such struct field sr:Source in _type, cr:TRFCoraxData, fun:Fundamental, md:Identifier, md:Relationship;

To ignore this I added a null check on sr:Source, but that does not work for me; I get the same error for that check as well.

Basically, what I need is: when env:Data.sr:Source.* is null, I want to skip processing that record and move on to processing the next tag.

1 Answer:

Answer 0 (score: 0)

org.apache.spark.sql.AnalysisException is usually thrown when something is wrong with the query itself, so I am fairly sure this happens because you are trying to filter on null in those cases.

Error handling in Scala is usually done with Option (there is a good article on it). Try:

import org.apache.spark.sql.functions.udf

// Wrap the possibly-null value in an Option: None when the id is missing,
// Some(true) when it is present.
def handleNulls(organizationId: String): Option[Boolean] =
  Option(organizationId).map(_ => true)

val betterNullsUdf = udf[Option[Boolean], String](handleNulls)

val dfType = dfContentItem.
    select(getDataPartition($"DataPartition").
        as("DataPartition"),
        $"TimeStamp".as("TimeStamp"),
        // a UDF takes concrete columns, not a struct expansion like sr:Source.*;
        // the alias name here is arbitrary
        betterNullsUdf($"env:Data.sr:Source._organizationId").as("hasOrganizationId"),
        getFFActionParent($"_action")
        .as("FFAction|!|")
    ).filter($"env:Data.sr:Source._organizationId".isNotNull)
dfType.show(false)
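
Note that the UDF only deals with null values inside rows that already contain the sr:Source struct. The AnalysisException in the question is raised at analysis time, before any row is processed, because sr:Source is missing from the schema altogether, so the select itself fails. Below is a minimal sketch of guarding on the schema first; it reuses dfContentItem and the UDFs from the question, and the helper name hasSourceStruct is made up for illustration.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

// True only if env:Data exists in the schema and contains an sr:Source struct.
def hasSourceStruct(df: DataFrame): Boolean =
  df.schema.fields.find(_.name == "env:Data").exists { envData =>
    envData.dataType match {
      case st: StructType => st.fieldNames.contains("sr:Source")
      case _              => false
    }
  }

val maybeDfType: Option[DataFrame] =
  if (hasSourceStruct(dfContentItem))
    Some(
      dfContentItem.select(
        getDataPartition($"DataPartition").as("DataPartition"),
        $"TimeStamp".as("TimeStamp"),
        $"env:Data.sr:Source.*",
        getFFActionParent($"_action").as("FFAction|!|")
      ).filter($"env:Data.sr:Source._organizationId".isNotNull)
    )
  else
    None // sr:Source is not in this file's schema, so skip it and move on

maybeDfType.foreach(_.show(false))

Returning an Option here keeps the "skip processing and continue with the next tag" behaviour the question asks for, without triggering the analysis error.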