I was using the approach suggested here to flatten a Spark schema when I hit an edge case -
import org.apache.spark.sql.types._

val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  )))
))
writerSchema.printTreeString()
root
|-- f1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- f2: array (nullable = true)
| | | | |-- element: long (containsNull = true)
Running the flattening method against this schema prints only:
f1
instead of
f1
f1.f2
as I expected.
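For context, here is a rough sketch of the kind of recursive flattener in question (my own reconstruction, not necessarily the exact code linked above; the helper name flattenSchema is made up). It only descends into StructType and ArrayType(StructType), so the ArrayType(ArrayType(...)) branch stops at f1:

import org.apache.spark.sql.types._

// Sketch of a typical schema-flattening helper (assumed, not the linked code).
// It collects dotted field paths, recursing into structs and arrays of structs only.
def flattenSchema(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.toSeq.flatMap { field =>
    val name = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case st: StructType               => flattenSchema(st, name)
      case ArrayType(st: StructType, _) => name +: flattenSchema(st, name)
      case _                            => Seq(name) // ArrayType(ArrayType(...)) ends up here
    }
  }

flattenSchema(writerSchema).foreach(println) // prints only: f1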
Questions -
Is writerSchema even a valid Spark schema? Can ArrayType objects be nested directly inside one another like this?

Answer (score: 0):
If you want to handle data like this:
val json = """{"f1": [{"f2": [1, 2, 3] }, {"f2": [4,5,6]}, {"f2": [7,8,9]}, {"f2": [10,11,12]}]}"""
a valid schema would be:
val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  ))
))
root
|-- f1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f2: array (nullable = true)
| | | |-- element: long (containsNull = true)
For data like this, you should not nest one ArrayType directly inside another ArrayType.
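As a quick sanity check: running the flattening sketch from the question against this corrected schema now descends into the struct and returns both paths:

flattenSchema(writerSchema).foreach(println)
// f1
// f1.f2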
So, suppose you have a DataFrame inputDF:
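For example, a minimal sketch of how such a DataFrame could be built from the sample JSON above, assuming a local SparkSession named spark (the session setup here is an assumption, adapt it to your environment):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("flatten-demo").getOrCreate()
import spark.implicits._

// Parse the sample JSON string with the corrected writerSchema.
val json = """{"f1": [{"f2": [1, 2, 3] }, {"f2": [4,5,6]}, {"f2": [7,8,9]}, {"f2": [10,11,12]}]}"""
val inputDF = spark.read.schema(writerSchema).json(Seq(json).toDS)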
inputDF.printSchema
root
|-- f1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f2: array (nullable = true)
| | | |-- element: long (containsNull = true)
inputDF.show(false)
+-------------------------------------------------------------------------------------------------------+
|f1 |
+-------------------------------------------------------------------------------------------------------+
|[[WrappedArray(1, 2, 3)], [WrappedArray(4, 5, 6)], [WrappedArray(7, 8, 9)], [WrappedArray(10, 11, 12)]]|
+-------------------------------------------------------------------------------------------------------+
To flatten this DataFrame, we can explode the array columns (f1 and f2):
First, flatten the column 'f1':
import org.apache.spark.sql.functions.{col, explode}
val semiFlattenDF = inputDF.select(explode(col("f1"))).select(col("col.*"))
semiFlattenDF.printSchema
root
|-- f2: array (nullable = true)
| |-- element: long (containsNull = true)
semiFlattenDF.show
+------------+
| f2|
+------------+
| [1, 2, 3]|
| [4, 5, 6]|
| [7, 8, 9]|
|[10, 11, 12]|
+------------+
Now flatten the column 'f2' and name the resulting column 'value':
val fullyFlattenDF = semiFlattenDF.select(explode(col("f2")).as("value"))
So now the DataFrame is fully flattened:
fullyFlattenDF.printSchema
root
|-- value: long (nullable = true)
fullyFlattenDF.show
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
+-----+
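As a side note, the two explode steps can also be chained into a single expression that produces the same result:

import org.apache.spark.sql.functions.{col, explode}

val fullyFlattenDF = inputDF
  .select(explode(col("f1")).as("f1"))        // one row per struct in the f1 array
  .select(explode(col("f1.f2")).as("value"))  // one row per long in the nested f2 array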