I was using the approach suggested here to flatten a Spark schema when I hit an edge case -
import org.apache.spark.sql.types._

val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  )))
))
writerSchema.printTreeString()
root
|-- f1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- f2: array (nullable = true)
| | | | |-- element: long (containsNull = true)
Running the flattening method against this schema prints only:
f1
instead of
f1
f1.f2
as I expected.
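For context, here is a rough sketch of the kind of recursive flattener in question (my own reconstruction, not necessarily the exact code linked above; the helper name flattenSchema is made up). It only descends into StructType and ArrayType(StructType), so the ArrayType(ArrayType(...)) branch stops at f1:

import org.apache.spark.sql.types._

// Sketch of a typical schema-flattening helper (assumed, not the linked code).
// It collects dotted field paths, recursing into structs and arrays of structs only.
def flattenSchema(schema: StructType, prefix: String = ""): Seq[String] =
  schema.fields.toSeq.flatMap { field =>
    val name = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case st: StructType               => flattenSchema(st, name)
      case ArrayType(st: StructType, _) => name +: flattenSchema(st, name)
      case _                            => Seq(name) // ArrayType(ArrayType(...)) ends up here
    }
  }

flattenSchema(writerSchema).foreach(println) // prints only: f1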
Questions -
Is writerSchema even a valid Spark schema? Can ArrayType objects be nested directly inside one another like this?

Answer (score: 0):
If you want to handle data like this:
val json = """{"f1": [{"f2": [1, 2, 3] }, {"f2": [4,5,6]}, {"f2": [7,8,9]}, {"f2": [10,11,12]}]}"""
a valid schema would be:
val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  ))
))
root
|-- f1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f2: array (nullable = true)
| | | |-- element: long (containsNull = true)
For data like this, you should not nest one ArrayType directly inside another ArrayType.
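As a quick sanity check: running the flattening sketch from the question against this corrected schema now descends into the struct and returns both paths:

flattenSchema(writerSchema).foreach(println)
// f1
// f1.f2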
So, suppose you have a DataFrame inputDF:
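For example, a minimal sketch of how such a DataFrame could be built from the sample JSON above, assuming a local SparkSession named spark (the session setup here is an assumption, adapt it to your environment):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("flatten-demo").getOrCreate()
import spark.implicits._

// Parse the sample JSON string with the corrected writerSchema.
val json = """{"f1": [{"f2": [1, 2, 3] }, {"f2": [4,5,6]}, {"f2": [7,8,9]}, {"f2": [10,11,12]}]}"""
val inputDF = spark.read.schema(writerSchema).json(Seq(json).toDS)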
inputDF.printSchema
root
|-- f1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f2: array (nullable = true)
| | | |-- element: long (containsNull = true)
inputDF.show(false)
+-------------------------------------------------------------------------------------------------------+
|f1 |
+-------------------------------------------------------------------------------------------------------+
|[[WrappedArray(1, 2, 3)], [WrappedArray(4, 5, 6)], [WrappedArray(7, 8, 9)], [WrappedArray(10, 11, 12)]]|
+-------------------------------------------------------------------------------------------------------+
To flatten this DataFrame, we can explode the array columns (f1 and f2):
First, flatten the column 'f1':
import org.apache.spark.sql.functions.{col, explode}
val semiFlattenDF = inputDF.select(explode(col("f1"))).select(col("col.*"))
semiFlattenDF.printSchema
root
|-- f2: array (nullable = true)
| |-- element: long (containsNull = true)
semiFlattenDF.show
+------------+
| f2|
+------------+
| [1, 2, 3]|
| [4, 5, 6]|
| [7, 8, 9]|
|[10, 11, 12]|
+------------+
Now flatten the column 'f2' and name the resulting column 'value':
val fullyFlattenDF = semiFlattenDF.select(explode(col("f2")).as("value"))
So now the DataFrame is fully flattened:
fullyFlattenDF.printSchema
root
|-- value: long (nullable = true)
fullyFlattenDF.show
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
+-----+
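As a side note, the two explode steps can also be chained into a single expression that produces the same result:

import org.apache.spark.sql.functions.{col, explode}

val fullyFlattenDF = inputDF
  .select(explode(col("f1")).as("f1"))        // one row per struct in the f1 array
  .select(explode(col("f1.f2")).as("value"))  // one row per long in the nested f2 array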