Is this a valid Spark Schema?

Date: 2018-01-20 01:32:11

Tags: scala apache-spark schema

I was flattening a Spark Schema using the approach suggested here when I ran into an edge case -

import org.apache.spark.sql.types._

val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  )))
))

writerSchema.printTreeString()

root
 |-- f1: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- f2: array (nullable = true)
 |    |    |    |    |-- element: long (containsNull = true)

This prints the following output - just f1, instead of

f1
f1.f2
as I had expected.

Questions -

  1. Is writerSchema a valid Spark schema?
  2. How should ArrayType objects be handled when flattening a schema?
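For context, a schema-flattening helper in the style of the linked approach, extended to descend through ArrayType, might look like the sketch below. The names flattenSchema and unwrap are illustrative, not Spark API; the helper returns the dotted paths of the leaf fields.

```scala
import org.apache.spark.sql.types._

// Illustrative helper (not part of the Spark API): strip any number of
// nested ArrayType wrappers to reach the underlying element type.
def unwrap(dt: DataType): DataType = dt match {
  case ArrayType(inner, _) => unwrap(inner)
  case other               => other
}

// Collect the dotted paths of all leaf fields, descending through structs
// and through arrays (including arrays of arrays) of structs.
def flattenSchema(schema: StructType, prefix: String = null): Array[String] =
  schema.fields.flatMap { f =>
    val name = if (prefix == null) f.name else s"$prefix.${f.name}"
    unwrap(f.dataType) match {
      case st: StructType => flattenSchema(st, name)
      case _              => Array(name)
    }
  }

val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  )))
))

flattenSchema(writerSchema)  // Array("f1.f2")
```

Without the unwrap step, the ArrayType(ArrayType(...)) field matches none of the struct cases and the recursion stops at f1, which is consistent with the output described above.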

1 Answer:

Answer 0 (score: 0)

If you want to process data like this

val json = """{"f1": [{"f2": [1, 2, 3] }, {"f2": [4,5,6]}, {"f2": [7,8,9]}, {"f2": [10,11,12]}]}"""

a valid schema would be

val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  ))
))

writerSchema.printTreeString()

root
 |-- f1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- f2: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)

You should not put an ArrayType inside another ArrayType here.
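As a sketch of how such a dataframe could be built from the json string above (assuming Spark 2.2+, where spark.read.json accepts a Dataset[String]; the session setup is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("flatten-example").getOrCreate()
import spark.implicits._

val json = """{"f1": [{"f2": [1, 2, 3] }, {"f2": [4,5,6]}, {"f2": [7,8,9]}, {"f2": [10,11,12]}]}"""

val writerSchema = StructType(Seq(
  StructField("f1", ArrayType(
    StructType(Seq(
      StructField("f2", ArrayType(LongType))
    ))
  ))
))

// Applying the schema explicitly avoids relying on inference for the nested types.
val inputDF = spark.read.schema(writerSchema).json(Seq(json).toDS())
```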

So, assuming you have a dataframe inputDF:

inputDF.printSchema
root
 |-- f1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- f2: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)

inputDF.show(false)
+-------------------------------------------------------------------------------------------------------+
|f1                                                                                                     |
+-------------------------------------------------------------------------------------------------------+
|[[WrappedArray(1, 2, 3)], [WrappedArray(4, 5, 6)], [WrappedArray(7, 8, 9)], [WrappedArray(10, 11, 12)]]|
+-------------------------------------------------------------------------------------------------------+

To flatten this dataframe, we can explode the array columns (f1 and f2):

First, flatten the column 'f1':

val semiFlattenDF = inputDF.select(explode(col("f1"))).select(col("col.*"))

semiFlattenDF.printSchema
root
 |-- f2: array (nullable = true)
 |    |-- element: long (containsNull = true)

semiFlattenDF.show
+------------+
|          f2|
+------------+
|   [1, 2, 3]|
|   [4, 5, 6]|
|   [7, 8, 9]|
|[10, 11, 12]|
+------------+

Now flatten the column 'f2' and name the resulting column 'value':

val fullyFlattenDF = semiFlattenDF.select(explode(col("f2")).as("value"))

So the DataFrame is now fully flattened:

fullyFlattenDF.printSchema
root
 |-- value: long (nullable = true)

fullyFlattenDF.show
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
|   10|
|   11|
|   12|
+-----+