Not able to resolve explode due to data type mismatch in Spark while parsing an XML file

Date: 2018-04-19 09:18:10

Tags: scala apache-spark spark-dataframe apache-spark-xml

I have a DataFrame with the following schema:

root
 |-- DataPartition: long (nullable = true)
 |-- TimeStamp: string (nullable = true)
 |-- _organizationId: long (nullable = true)
 |-- _segmentId: long (nullable = true)
 |-- seg:BusinessSegments: struct (nullable = true)
 |    |-- seg:BusinessSegment: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _hierarchicalCode: long (nullable = true)
 |    |    |    |-- _industryId: long (nullable = true)
 |    |    |    |-- _ranking: long (nullable = true)
 |-- seg:GeographicSegments: struct (nullable = true)
 |    |-- seg:GeographicSegment: struct (nullable = true)
 |    |    |-- _geographyId: long (nullable = true)
 |    |    |-- seg:IsSubtracted: boolean (nullable = true)
 |    |    |-- seg:Sequence: long (nullable = true)
 |-- seg:IsCorporate: boolean (nullable = true)
 |-- seg:IsElimination: boolean (nullable = true)
 |-- seg:IsOperatingSegment: boolean (nullable = true)
 |-- seg:IsOther: boolean (nullable = true)
 |-- seg:IsShariaCompliant: boolean (nullable = true)
 |-- seg:PredecessorSegments: struct (nullable = true)
 |    |-- seg:PredecessorSegment: long (nullable = true)
 |-- seg:SegmentLocalLanguageLabel: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _languageId: long (nullable = true)
 |-- seg:SegmentName: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- _languageId: long (nullable = true)
 |-- seg:SegmentType: string (nullable = true)
 |-- seg:SegmentTypeId: long (nullable = true)
 |-- seg:ValidFromPeriodEndDate: string (nullable = true)
 |-- _action: string (nullable = true)

Now I want to get the seg:BusinessSegments.seg:BusinessSegment values out of this schema.

But my problem occurs when I do this using explode:
val GeographicSegmentchildDF = parentDF.select(
  $"DataPartition".as("DataPartition"),
  $"TimeStamp".as("TimeStamp"),
  $"_organizationId",
  $"_segmentId",
  explode($"seg:GeographicSegments.seg:GeographicSegment").as("GeographicSegments"),
  $"_action")

val GeographicSegmentchildArrayDF = GeographicSegmentchildDF.select(
  getDataPartition($"DataPartition").as("DataPartition"),
  $"TimeStamp".as("TimeStamp"),
  $"_organizationId".as("OrganizationId"),
  $"_segmentId".as("SegmentId"),
  $"GeographicSegments.*",
  getFFActionChild($"_action").as("FFAction|!|"))

So in the first line I am exploding the column, and in the next line I am expanding the result with $"GeographicSegments.*".

This is what I am doing, and the error I get looks like this:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'explode(seg:GeographicSegments.seg:GeographicSegment)' due to data type mismatch:

I know why this happens: in this schema seg:GeographicSegment comes back as a struct rather than an array, and that is why explode fails.

So the real problem is that I do not have a fixed schema.

When the XML file contains two records, seg:GeographicSegment becomes an array and my code works fine, but when there is only one record it is inferred as a struct and my code fails.
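For comparison, when the file contains more than one seg:GeographicSegment record, the same subtree is inferred with the array shape that seg:BusinessSegment shows above (this printout is reconstructed from that shape, not copied from a real run):

 |-- seg:GeographicSegments: struct (nullable = true)
 |    |-- seg:GeographicSegment: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- _geographyId: long (nullable = true)
 |    |    |    |-- seg:IsSubtracted: boolean (nullable = true)
 |    |    |    |-- seg:Sequence: long (nullable = true)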

How can I handle this in my code? Do I have to put a condition in while parsing the schema, or is there some other way to handle it?

Here is a case where this approach does not work:

//if the column type is long (a single value) then wrap it in array(), else explode it directly
val columnTypePredecessorSegments = parentDF.select($"seg:PredecessorSegments.seg:PredecessorSegment")
  .schema.map(_.dataType).head.toString().startsWith("LongType")

val PredecessorSegmentschildDF = if (columnTypePredecessorSegments) {
  parentDF.select(
    $"DataPartition".as("DataPartition"),
    $"TimeStamp".as("TimeStamp"),
    $"_organizationId",
    $"_segmentId",
    explode(array($"seg:PredecessorSegments.seg:PredecessorSegment")).as("PredecessorSegments"),
    $"_action")
} else {
  parentDF.select(
    $"DataPartition".as("DataPartition"),
    $"TimeStamp".as("TimeStamp"),
    $"_organizationId",
    $"_segmentId",
    explode($"seg:PredecessorSegments.seg:PredecessorSegment").as("PredecessorSegments"),
    $"_action")
}

val PredecessorSegmentsDFFinalChilddDF = PredecessorSegmentschildDF.select(
  getDataPartition($"DataPartition").as("DataPartition"),
  $"TimeStamp".as("TimeStamp"),
  $"_organizationId".as("OrganizationId"),
  $"_segmentId".as("SuccessorSegment"),
  $"PredecessorSegments.*",
  getFFActionChild($"_action").as("FFAction|!|"))
PredecessorSegmentsDFFinalChilddDF.show(false)
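A note on why this case presumably fails (an observation, not from the original post): seg:PredecessorSegment is a plain long here, so after either explode branch the PredecessorSegments column is LongType, and the final select cannot star-expand it:

// Presumably the failing step: star expansion (.*) only works on struct
// columns, and PredecessorSegments is LongType after the explode above.
PredecessorSegmentschildDF.select($"PredecessorSegments.*") // AnalysisException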

1 Answer:

Answer 0 (score: 1):

"When the XML file contains two records, seg:GeographicSegment becomes an array and my code works fine, but when there is only one record it is inferred as a struct and my code fails."


Then you need to check the data type of the column before using explode:

//checking for struct or array type in that column
val columnType = parentDF.select($"seg:GeographicSegments.seg:GeographicSegment")
  .schema.map(_.dataType).head.toString().startsWith("StructType")

import org.apache.spark.sql.functions._
//if the column is a struct, wrap it in array() so that explode always gets an
//array; if it is already an array, just explode it directly
val GeographicSegmentchildDF = if (columnType) {
  parentDF.select(
    $"DataPartition".as("DataPartition"),
    $"TimeStamp".as("TimeStamp"),
    $"_organizationId",
    $"_segmentId",
    explode(array($"seg:GeographicSegments.seg:GeographicSegment")).as("GeographicSegments"),
    $"_action")
} else {
  parentDF.select(
    $"DataPartition".as("DataPartition"),
    $"TimeStamp".as("TimeStamp"),
    $"_organizationId",
    $"_segmentId",
    explode($"seg:GeographicSegments.seg:GeographicSegment").as("GeographicSegments"),
    $"_action")
}

val GeographicSegmentchildArrayDF = GeographicSegmentchildDF.select(
  getDataPartition($"DataPartition").as("DataPartition"),
  $"TimeStamp".as("TimeStamp"),
  $"_organizationId".as("OrganizationId"),
  $"_segmentId".as("SegmentId"),
  $"GeographicSegments.*",
  getFFActionChild($"_action").as("FFAction|!|"))
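As a follow-up, the string comparison on dataType.toString() and the duplicated select can be folded into one place. Here is a minimal sketch of a reusable helper, assuming only the Spark APIs already used above (explodeStructOrArray is a made-up name, not part of the original answer):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, explode}
import org.apache.spark.sql.types.{DataType, StructType}

// Explode a column that spark-xml may have inferred either as a single struct
// (one record in the file) or as an array of structs (several records).
// Pattern matching on the DataType is a little more robust than comparing
// toString() prefixes.
def explodeStructOrArray(dt: DataType, c: Column): Column = dt match {
  case _: StructType => explode(array(c)) // single record: wrap in a 1-element array
  case _             => explode(c)        // already an array: explode directly
}

val segmentType = parentDF
  .select($"seg:GeographicSegments.seg:GeographicSegment")
  .schema.head.dataType

val exploded = parentDF.select(
  $"DataPartition", $"TimeStamp", $"_organizationId", $"_segmentId",
  explodeStructOrArray(segmentType, $"seg:GeographicSegments.seg:GeographicSegment")
    .as("GeographicSegments"),
  $"_action")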

I hope the answer is helpful.