Question

I have a DataFrame whose schema, as printed by df.printSchema, looks like this:

root
|-- ts: timestamp (nullable = true)
|-- geoip: struct (nullable = true)
|    |-- city: string (nullable = true)
|    |-- continent: string (nullable = true)
|    |-- location: struct (nullable = true)
|    |    |-- lat: float (nullable = true)
|    |    |-- lon: float (nullable = true)

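I know that with, for example, df = df.withColumn("error", lit(null).cast(StringType)) I can add a null field named error, of type String, right under root. How can I add the same field under the geoip struct or under the location struct?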

I also tried this with location, with no luck.

Answer 1

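TL;DR You have to map over the rows in the dataset somehow.

map Operator (Most Flexible)

Using the map operation gives you the most flexibility, since you have full control over the final structure of the rows.

map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U] (Scala-specific) Returns a new Dataset that contains the result of applying func to each element.

In your case it could look as follows: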

// Create a sample dataset to work with
scala> val df = Seq("timestamp").
  toDF("ts").
  withColumn("geoip", struct(lit("Warsaw") as "city", lit("Europe") as "continent"))
df: org.apache.spark.sql.DataFrame = [ts: string, geoip: struct<city: string, continent: string>]

scala> df.show
+---------+---------------+
|       ts|          geoip|
+---------+---------------+
|timestamp|[Warsaw,Europe]|
+---------+---------------+

scala> df.printSchema
root
 |-- ts: string (nullable = true)
 |-- geoip: struct (nullable = false)
 |    |-- city: string (nullable = false)
 |    |-- continent: string (nullable = false)

val newDF = df.
  as[(String, (String, String))].  // <-- convert to typed Dataset as it makes map easier
  map { case (ts, (city, continent)) =>
    (ts, (city, continent, "New field with some value")) }. // <-- add new column
  toDF("timestamp", "geoip") // <-- name the top-level fields

scala> newDF.printSchema
root
 |-- timestamp: string (nullable = true)
 |-- geoip: struct (nullable = true)
 |    |-- _1: string (nullable = true)
 |    |-- _2: string (nullable = true)
 |    |-- _3: string (nullable = true)

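That is not pretty, since the names of the fields inside the struct are lost.

Let's define the schema with the proper names. That is where you can use StructType together with StructFields (you could also use a set of case classes, but I leave that to you as a home exercise).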

import org.apache.spark.sql.types._
val geoIP = StructType(
  $"city".string ::
  $"continent".string ::
  $"new_field".string ::
  Nil
)
val mySchema = StructType(
  $"timestamp".string ::
  $"geoip".struct(geoIP) ::
  Nil
)

scala> mySchema.printTreeString
root
 |-- timestamp: string (nullable = true)
 |-- geoip: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- continent: string (nullable = true)
 |    |-- new_field: string (nullable = true)

Apply the new schema to get the proper names.

val properNamesDF = spark.createDataFrame(newDF.rdd, mySchema)
scala> properNamesDF.show(truncate = false)
+---------+-----------------------------------------+
|timestamp|geoip                                    |
+---------+-----------------------------------------+
|timestamp|[Warsaw,Europe,New field with some value]|
+---------+-----------------------------------------+

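How to add a field to a "struct of struct"

If you feel adventurous, you may want to treat StructType as a collection type and reshape it using Scala's Collection API and the copy constructor.

It does not matter how deep you want to go or which level of the "struct of structs" you want to modify. Just think of a StructType as a collection of StructFields, which in turn can be StructTypes.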

val oldSchema = newDF.schema
val names = Seq("city", "continent", "new_field")
val geoipFields = oldSchema("geoip").
  dataType.
  asInstanceOf[StructType].
  zip(names).
  map { case (field, name) => field.copy(name = name) }
val myNewSchema = StructType(
  $"timestamp".string :: 
  $"geoip".struct(StructType(geoipFields)) :: Nil)
val properNamesDF = spark.createDataFrame(newDF.rdd, myNewSchema)
scala> properNamesDF.printSchema
root
 |-- timestamp: string (nullable = true)
 |-- geoip: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- continent: string (nullable = true)
 |    |-- new_field: string (nullable = true)
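To illustrate the "StructType is a collection of StructFields" idea at arbitrary depth, here is a minimal sketch of a recursive helper. The name addField and its path parameter are assumptions made for this example, not Spark API. Note that it only rewrites the schema; as in the answer's own code, the rows must already carry a value for the new field before the schema is passed to createDataFrame.

```scala
import org.apache.spark.sql.types._

object SchemaSketch {
  // Hypothetical helper: appends `newField` to the struct reached by following
  // `path` from the top of `schema`. An empty path means the top level itself.
  def addField(schema: StructType, path: List[String], newField: StructField): StructType =
    path match {
      case Nil =>
        // Reached the target struct: append the new field at the end.
        StructType(schema.fields :+ newField)
      case head :: tail =>
        StructType(schema.fields.map {
          // Descend only into the struct field whose name matches the path.
          case f @ StructField(`head`, nested: StructType, _, _) =>
            f.copy(dataType = addField(nested, tail, newField))
          case other => other
        })
    }
}
```

For the schema in the question, SchemaSketch.addField(df.schema, List("geoip", "location"), StructField("error", StringType)) would describe a schema with error next to lat and lon. If the path does not match any struct field, the schema comes back unchanged.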

withColumn Operator with struct Function

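You can use the withColumn operator together with the struct function.

withColumn(colName: String, col: Column): DataFrame Returns a new Dataset by adding a column or replacing the existing column that has the same name.

struct(cols: Column*): Column Creates a new struct column.

The code could look as follows: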

val anotherNewDF = df.
  withColumn("geoip", // <-- use the same column name so you hide the existing one
    struct(
      $"geoip.city", // <-- reference existing column to copy the values
      $"geoip.continent",
      lit("new value") as "new_field")) // <-- new field with fixed value

scala> anotherNewDF.printSchema
root
 |-- ts: string (nullable = true)
 |-- geoip: struct (nullable = false)
 |    |-- city: string (nullable = false)
 |    |-- continent: string (nullable = false)
 |    |-- new_field: string (nullable = false)

Per @shj's comment, you can use the wildcard to avoid re-listing the columns, which makes this very flexible, e.g.

val anotherNewDF = df
  .withColumn("geoip",
    struct(
      $"geoip.*", // <-- the wildcard here
      lit("new value") as "new_field"))
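The same rebuilding approach can be applied one level deeper, which is what the question asks for with geoip.location. The following is a sketch under the assumption that the DataFrame has the nested shape from the question (geoip.city, geoip.continent, geoip.location.lat, geoip.location.lon); the object and function names are hypothetical. The inner struct is rebuilt first and then placed back into the outer one.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, struct}
import org.apache.spark.sql.types.StringType

object NestedStructSketch {
  // Rebuilds geoip so that geoip.location gains a nullable string field "error".
  def addErrorToLocation(df: DataFrame): DataFrame =
    df.withColumn("geoip",
      struct(
        col("geoip.city"),
        col("geoip.continent"),
        struct(
          col("geoip.location.lat"),
          col("geoip.location.lon"),
          lit(null).cast(StringType).as("error") // the new nested field
        ).as("location")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-struct-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample data shaped like the schema in the question (ts kept as a string).
    val df = Seq(("timestamp", "Warsaw", "Europe", 52.23f, 21.01f))
      .toDF("ts", "city", "continent", "lat", "lon")
      .select($"ts",
        struct($"city", $"continent", struct($"lat", $"lon").as("location")).as("geoip"))

    addErrorToLocation(df).printSchema()
    spark.stop()
  }
}
```

Every enclosing struct has to be rebuilt on the way down, so this gets verbose for deep schemas; that is the price of staying with withColumn and struct.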

Answer 2

You can also simply do:

df = df.withColumn("geoip", struct($"geoip.*", lit("This is fine.").alias("error")))

This adds an "error" field to the "geoip" struct.
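On Spark 3.1 or later, Column.withField offers a shorter way to get the same result without rebuilding the struct by hand. The sketch below assumes such a Spark version; the object and function names are hypothetical. A dotted field name addresses a nested struct, so "location.error" targets geoip.location.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.StringType

object WithFieldSketch {
  // Assumes Spark 3.1+, where Column.withField(fieldName, col) is available.
  // Adds a nullable string field "error" directly under geoip.
  def addErrorToGeoip(df: DataFrame): DataFrame =
    df.withColumn("geoip",
      col("geoip").withField("error", lit(null).cast(StringType)))

  // Same idea one level deeper: the dotted name walks into geoip.location.
  def addErrorToLocation(df: DataFrame): DataFrame =
    df.withColumn("geoip",
      col("geoip").withField("location.error", lit(null).cast(StringType)))
}
```

On older Spark versions this method does not exist, and the struct(...) form shown in this answer remains the way to go.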

Answer 3

The same way, but by referencing the column.

df = df("location").withColumn("error", lit(null).cast(StringType))
