Renaming nested fields in an array of StructType in Spark 2.2 using Scala

Date: 2018-10-30 08:50:39

Tags: scala apache-spark apache-spark-sql user-defined-functions

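I am trying to rename the fields inside a column that is an array of structs. Below are the schema I have and the schema I want. Only part of the schema is shown; the DataFrame has n other columns.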

Input schema
 |-- n other columns
 |-- state_playback_segmentInfo: array (nullable = true)  
 |    |-- element: struct (containsNull = true)
 |    |    |-- isAd: boolean (nullable = true)
 |    |    |-- queryParameters: string (nullable = true)
 |    |    |-- sequenceNumber: integer (nullable = true)
 |    |    |-- segmentUrl: string (nullable = true)
 |    |    |-- sizeBytes: integer (nullable = true)
 |    |    |-- downloadDurationMs: integer (nullable = true)
 |    |    |-- ipAddress: string (nullable = true)
 |    |    |-- location: string (nullable = true)

Output schema
 |-- n other columns
 |-- state__playback__segmentInfo: array (nullable = true)  
 |    |-- element: struct (containsNull = true)
 |    |    |-- state__playback__segmentInfo__isAd: boolean (nullable = true)
 |    |    |-- state__playback__segmentInfo__queryParameters: string (nullable = true)
 |    |    |-- state__playback__segmentInfo__sequenceNumber: integer (nullable = true)
 |    |    |-- state__playback__segmentInfo__segmentUrl: string (nullable = true)
 |    |    |-- state__playback__segmentInfo__sizeBytes: integer (nullable = true)
 |    |    |-- state__playback__segmentInfo__downloadDurationMs: integer (nullable = true)
 |    |    |-- state__playback__segmentInfo__ipAddress: string (nullable = true)
 |    |    |-- state__playback__segmentInfo__location: string (nullable = true)

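I have already written a function that flattens the nested StructType fields of a DataFrame; see the code below. It does not descend into array columns.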

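import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Flattens nested StructType fields into top-level columns whose names join the path with `delimeter`.
// Pass prefix = null for the root schema. `codePrefix` carries the dotted selection path during recursion.
def flattenDF(schema: StructType, delimeter: String, prefix: String, codePrefix: String = null): Array[Column] = {
  schema.fields.flatMap { structField =>
    // Dotted path used to select the field, e.g. "a.b.c"; falls back to `prefix` on the first call.
    val selPrefix = if (codePrefix == null) prefix else codePrefix
    val codeColName = if (selPrefix == null) structField.name else selPrefix + "." + structField.name
    // Flattened output name, e.g. "a__b__c" when delimeter is "__".
    val colName = if (prefix == null) structField.name else prefix + delimeter + structField.name

    structField.dataType match {
      case st: StructType => flattenDF(st, delimeter, colName, codeColName)
      case _ => Array(col(codeColName).alias(colName))
    }
  }
}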

Any help with this case, or pointers to references, would be appreciated.
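One possible direction, offered as a sketch rather than a confirmed answer: build the target type by walking the column's existing type and prefixing every struct field name, then cast the column to that type. This assumes that Spark's cast between two struct types with the same number and order of fields matches the fields by position, so only the names change. The names `RenameNested`, `prefixFields`, and `renameArrayStructFields` are hypothetical helpers introduced here for illustration; they are not part of the Spark API.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, DataType, StructType}

object RenameNested {

  // Hypothetical helper: returns a copy of `dataType` in which every struct
  // field name is prefixed with `prefix` + `delimiter`. Arrays are descended
  // into; all other types are returned unchanged.
  def prefixFields(dataType: DataType, prefix: String, delimiter: String): DataType =
    dataType match {
      case st: StructType =>
        StructType(st.fields.map { f =>
          val newName = prefix + delimiter + f.name
          // Nested structs inherit the already prefixed name as their prefix.
          f.copy(name = newName, dataType = prefixFields(f.dataType, newName, delimiter))
        })
      case ArrayType(elementType, containsNull) =>
        ArrayType(prefixFields(elementType, prefix, delimiter), containsNull)
      case other => other
    }

  // Hypothetical helper: casts `column` to the renamed type (assumed to be a
  // positional cast, so the data is untouched) and then renames the column.
  def renameArrayStructFields(
      df: DataFrame,
      column: String,
      newName: String,
      delimiter: String): DataFrame = {
    val targetType = prefixFields(df.schema(column).dataType, newName, delimiter)
    df.withColumn(column, col(column).cast(targetType))
      .withColumnRenamed(column, newName)
  }
}
```

For the schema in the question, the call would look like `RenameNested.renameArrayStructFields(df, "state_playback_segmentInfo", "state__playback__segmentInfo", "__")`. Because the new type is derived from the existing schema, the field list does not have to be written out by hand, and no UDF is involved.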

0 answers:

No answers yet