I am doing some row-level computation in Scala / Spark. I have a dataframe created from JSON -
{"available":false,"createTime":"2016-01-08","dataValue":{"names_source":{"first_names":["abc", "def"],"last_names_id":[123,456]},"another_source_array":[{"first":"1.1","last":"ONE"}],"another_source":"TableSources","location":"GMP", "timestamp":"2018-02-11"},"deleteTime":"2016-01-08"}
You can create a dataframe directly from this JSON. My schema looks like this -
root
|-- available: boolean (nullable = true)
|-- createTime: string (nullable = true)
|-- dataValue: struct (nullable = true)
| |-- another_source: string (nullable = true)
| |-- another_source_array: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- first: string (nullable = true)
| | | |-- last: string (nullable = true)
| |-- location: string (nullable = true)
| |-- names_source: struct (nullable = true)
| | |-- first_names: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- last_names_id: array (nullable = true)
| | | |-- element: long (containsNull = true)
| |-- timestamp: string (nullable = true)
|-- deleteTime: string (nullable = true)
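For reference, a minimal sketch of loading the sample record directly (assuming Spark 2.2+ with a SparkSession named spark; jsonString is just the record shown above) -
import spark.implicits._

// Hypothetical setup, not from the original post: load the single sample record into a DataFrame.
val jsonString = """{"available":false,"createTime":"2016-01-08","dataValue":{"names_source":{"first_names":["abc","def"],"last_names_id":[123,456]},"another_source_array":[{"first":"1.1","last":"ONE"}],"another_source":"TableSources","location":"GMP","timestamp":"2018-02-11"},"deleteTime":"2016-01-08"}"""

val df = spark.read.json(Seq(jsonString).toDS())
df.printSchema()   // prints the schema shown above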
I am reading all the columns individually using a readSchema and writing them out with a writeSchema. Of the two complex columns, I am able to handle one but not the other.
Below is part of my read schema -
.add("names_source", StructType(
StructField("first_names", ArrayType.apply(StringType)) ::
StructField("last_names_id", ArrayType.apply(DoubleType)) ::
Nil
))
.add("another_source_array", ArrayType(StructType(
StructField("first", StringType) ::
StructField("last", StringType) ::
Nil
)))
And here is part of my write schema -
.add("names_source", StructType.apply(Seq(
StructField("first_names", StringType),
StructField("last_names_id", DoubleType))
))
.add("another_source_array", ArrayType(StructType.apply(Seq(
StructField("first", StringType),
StructField("last", StringType))
)))
During processing, I use a method that looks up all the columns by index. Below is my function code -
def myMapRedFunction(df: DataFrame, spark: SparkSession): DataFrame = {
  val columnIndex = dataIndexingSchema.fieldNames.zipWithIndex.toMap
  val myRDD = df.rdd
    .map(row => {
      Row(
        row.getAs[Boolean](columnIndex("available")),
        parseDate(row.getAs[String](columnIndex("create_time"))),
        ??I Need help here??
        row.getAs[String](columnIndex("another_source")),
        anotherSourceArrayFunction(row.getSeq[Row](columnIndex("another_source_array"))),
        row.getAs[String](columnIndex("location")),
        row.getAs[String](columnIndex("timestamp")),
        parseDate(row.getAs[String](columnIndex("delete_time")))
      )
    }).distinct
  spark.createDataFrame(myRDD, dataWriteSchema)
}
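parseDate is not defined anywhere in the post; a minimal stand-in, assuming the yyyy-MM-dd strings from the sample data and DateType columns in the write schema, could be -
import java.sql.Date

// Hypothetical helper, not from the original post: convert a "yyyy-MM-dd" string to
// java.sql.Date so the value matches a DateType column; nulls are passed through.
def parseDate(value: String): Date =
  if (value == null) null else Date.valueOf(value)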
The another_source_array column is handled by the anotherSourceArrayFunction method to make sure we get the schema as per the requirement. I need a similar function for the names_source column. Below is the function I am using for the another_source_array column.
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema

def anotherSourceArrayFunction(data: Seq[Row]): Seq[Row] = {
  if (data == null) {
    data
  } else {
    data.map(r => {
      val first = r.getAs[String]("first").toUpperCase()
      val last = r.getAs[String]("last")
      new GenericRowWithSchema(Array(first, last), StructType(
        StructField("first", StringType) ::
        StructField("last", StringType) ::
        Nil
      ))
    })
  }
}
In short, I need something like this, where I can keep the names_source column structure as a struct:
names_source:struct<first_names:array<string>,last_names_id:array<bigint>>
another_source_array:array<struct<first:string,last:string>>
The above are the column schemas I finally need. I am able to get another_source_array correctly; it is names_source that I need help with. I think my write schema for that column is correct, but I am not sure. Ultimately I need names_source:struct<first_names:array<string>,last_names_id:array<bigint>> as the column schema.
Note: I am able to get the another_source_array column without any trouble. I have kept that function here for better understanding.
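For completeness, a helper in the same style as anotherSourceArrayFunction, but for the names_source struct, might look like the sketch below (namesSourceFunction is a hypothetical name, not from the post, and it assumes last_names_id was read as long, matching the printed schema); the answer that follows shows this extra step is not strictly needed -
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types._

// Hypothetical sketch: re-wrap the nested struct so it carries the desired
// struct<first_names:array<string>,last_names_id:array<bigint>> schema.
def namesSourceFunction(data: Row): Row = {
  if (data == null) {
    data
  } else {
    val firstNames = data.getAs[Seq[String]]("first_names")
    val lastNamesId = data.getAs[Seq[Long]]("last_names_id")
    new GenericRowWithSchema(Array[Any](firstNames, lastNamesId), StructType(
      StructField("first_names", ArrayType(StringType)) ::
      StructField("last_names_id", ArrayType(LongType)) ::
      Nil
    ))
  }
}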
Answer 0 (score: 4)
From all the code you have tried, what I can see is that you are trying to flatten the struct column dataValue into separate columns.
If my assumption is correct, then you don't have to go through that complexity. You can simply do the following -
val myRDD = df.rdd
  .map(row => {
    Row(
      row.getAs[Boolean]("available"),
      parseDate(row.getAs[String]("createTime")),
      // nested fields are pulled straight out of the dataValue struct by name
      row.getAs[Row]("dataValue").getAs[Row]("names_source"),
      row.getAs[Row]("dataValue").getAs[String]("another_source"),
      row.getAs[Row]("dataValue").getAs[Seq[Row]]("another_source_array"),
      row.getAs[Row]("dataValue").getAs[String]("location"),
      row.getAs[Row]("dataValue").getAs[String]("timestamp"),
      parseDate(row.getAs[String]("deleteTime"))
    )
  }).distinct
import org.apache.spark.sql.types._

val dataWriteSchema = StructType(Seq(
  StructField("available", BooleanType, true),
  StructField("createTime", DateType, true),
  StructField("names_source", StructType(Seq(
    StructField("first_names", ArrayType(StringType), true),
    StructField("last_names_id", ArrayType(LongType), true)
  )), true),
  StructField("another_source", StringType, true),
  StructField("another_source_array", ArrayType(StructType(Seq(
    StructField("first", StringType),
    StructField("last", StringType)
  ))), true),
  StructField("location", StringType, true),
  StructField("timestamp", StringType, true),
  StructField("deleteTime", DateType, true)
))
spark.createDataFrame(myRDD, dataWriteSchema).show(false)
Alternatively, you can use .* on the struct column to get the elements of the struct column as separate columns -
import org.apache.spark.sql.functions._
df.select(col("available"), col("createTime"), col("dataValue.*"), col("deleteTime")).show(false)
With this approach you will have to change the string date columns to DateType yourself.
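A minimal sketch of that cast, assuming the yyyy-MM-dd format from the sample data, could be -
import org.apache.spark.sql.functions._

// Sketch: same flattening via select, with the string date columns cast to DateType using to_date.
df.select(
  col("available"),
  to_date(col("createTime")).as("createTime"),
  col("dataValue.*"),
  to_date(col("deleteTime")).as("deleteTime")
).show(false)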
In both cases you will get the output
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|available|createTime|names_source |another_source|another_source_array|location|timestamp |deleteTime|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
|false |2016-01-08|[WrappedArray(abc, def),WrappedArray(123, 456)]|TableSources |[[1.1,ONE]] |GMP |2018-02-11|2016-01-08|
+---------+----------+-----------------------------------------------+--------------+--------------------+--------+----------+----------+
I hope the answer is helpful.