A general way to change the nullable property of all elements of any Spark SQL StructType in Scala

Time: 2017-10-04 23:26:57

Tags: scala apache-spark apache-spark-sql

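Is there a general way to change the nullable property of all elements of any given StructType? The StructType may be nested.

I see that @eliasah marked this as a duplicate of "Spark Dataframe column nullable property change". The two questions are different, though: the answer there does not handle a hierarchical/nested StructType and only works for a single level.

For example: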

 root
 |-- user_id: string (nullable = false)
 |-- name: string (nullable = false)
 |-- system_process: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- timestamp: long (nullable = false)
 |    |    |-- process: string (nullable = false)
 |-- type: string (nullable = false)
 |-- user_process: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- timestamp: long (nullable = false)
 |    |    |-- process: string (nullable = false)

I want to change nullable to true for all elements, so the result should be:

 root
 |-- user_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- system_process: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- timestamp: long (nullable = true)
 |    |    |-- process: string (nullable = true)
 |-- type: string (nullable = true)
 |-- user_process: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- timestamp: long (nullable = true)
 |    |    |-- process: string (nullable = true)

Here is a sample JSON schema of the StructType, for convenient testing:

val jsonSchema="""{"type":"struct","fields":[{"name":"user_id","type":"string","nullable":false,"metadata":{}},{"name":"name","type":"string","nullable":false,"metadata":{}},{"name":"system_process","type":{"type":"array","elementType":{"type":"struct","fields":[{"name":"timestamp","type":"long","nullable":false,"metadata":{}},{"name":"process_id","type":"string","nullable":false,"metadata":{}}]},"containsNull":false},"nullable":false,"metadata":{}},{"name":"type","type":"string","nullable":false,"metadata":{}},{"name":"user_process","type":{"type":"array","elementType":{"type":"struct","fields":[{"name":"timestamp","type":"long","nullable":false,"metadata":{}},{"name":"process_id","type":"string","nullable":false,"metadata":{}}]},"containsNull":false},"nullable":false,"metadata":{}}]}"""
import org.apache.spark.sql.types.{DataType, StructType}
DataType.fromJson(jsonSchema).asInstanceOf[StructType].printTreeString()
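
To check a candidate solution without reading the printed tree by eye, a small helper along these lines could walk the schema and report whether every nullability flag is set. This is only a sketch: the names `NullabilityCheck` and `allNullable` are hypothetical and not part of Spark, and it assumes that only struct, array and map types carry such flags.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not part of Spark.
object NullabilityCheck {
  // True only if every nullable / containsNull / valueContainsNull flag
  // reachable from dt is true.
  def allNullable(dt: DataType): Boolean = dt match {
    case s: StructType =>
      s.fields.forall(f => f.nullable && allNullable(f.dataType))
    case a: ArrayType =>
      a.containsNull && allNullable(a.elementType)
    case m: MapType =>
      m.valueContainsNull && allNullable(m.keyType) && allNullable(m.valueType)
    case _ =>
      true // atomic types carry no nullability flag of their own
  }
}
```

On the sample schema above, `NullabilityCheck.allNullable(DataType.fromJson(jsonSchema))` returns false, since every flag in it is false.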

1 Answer:

Answer 0 (score: 1)

I finally worked out two solutions:

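  1. A string-replacement trick: flip the flags in the schema's JSON string first, then create the StructType instance from that JSON string

    // Also flips containsNull and valueContainsNull, so array elements and map values become nullable too
    DataType.fromJson(schema.json.replaceAll("\"(nullable|containsNull|valueContainsNull)\":false", "\"$1\":true")).asInstanceOf[StructType]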
    
  2. The recursive approach

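      import org.apache.spark.sql.types.{ArrayType, StructType}

      // Rebuild the schema with every field nullable, descending into
      // nested structs and arrays of structs.
      def updateFieldsToNullable(structType: StructType): StructType = {
        StructType(structType.map(f => f.dataType match {
          case d: ArrayType =>
            val element = d.elementType match {
              case s: StructType => updateFieldsToNullable(s)
              case _ => d.elementType
            }
            // containsNull = true so that the array elements become nullable as well
            f.copy(nullable = true, dataType = ArrayType(element, containsNull = true))
          case s: StructType => f.copy(nullable = true, dataType = updateFieldsToNullable(s))
          case _ => f.copy(nullable = true)
        })
        )
      }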
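
The recursive function above descends into structs and arrays of structs only. A more general sketch could recurse over `DataType` itself, so that arrays nested inside arrays and `MapType` values are covered too, and set `containsNull` and `valueContainsNull` along the way. The names `NullableSchema`, `toNullable` and `toNullableSchema` are hypothetical, and this is one possible shape under those assumptions rather than a definitive implementation.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not part of Spark.
object NullableSchema {
  // Recursively rebuild a DataType with every nullability flag set to true.
  def toNullable(dt: DataType): DataType = dt match {
    case s: StructType =>
      StructType(s.fields.map(f =>
        f.copy(dataType = toNullable(f.dataType), nullable = true)))
    case a: ArrayType =>
      ArrayType(toNullable(a.elementType), containsNull = true)
    case m: MapType =>
      // MapType has no flag for keys; only the value side has one.
      MapType(toNullable(m.keyType), toNullable(m.valueType), valueContainsNull = true)
    case other =>
      other // atomic types are returned unchanged
  }

  def toNullableSchema(schema: StructType): StructType =
    toNullable(schema).asInstanceOf[StructType]
}

// Usage on a small schema with an array of arrays and a map
val nested = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("events", ArrayType(
    ArrayType(StructType(Seq(StructField("ts", LongType, nullable = false))),
      containsNull = false),
    containsNull = false), nullable = false),
  StructField("attrs", MapType(StringType, LongType, valueContainsNull = false),
    nullable = false)
))
NullableSchema.toNullableSchema(nested).printTreeString()
```

Matching on `DataType` rather than on `StructField` keeps the struct, array and map cases in one place, so each container type needs only a single branch.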