Spark SQL nested withColumn

Asked: 2017-06-29 17:44:57

Tags: scala apache-spark dataframe udf

I have a DataFrame with multiple columns, some of which are structs. Something like this:

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)
 |-- abc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- def: struct (nullable = true)
 |    |    |    |-- a: string (nullable = true)
 |    |    |    |-- b: integer (nullable = true)
 |    |    |    |-- c: string (nullable = true)

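I want to apply a UserDefinedFunction to the column baz and replace baz with the result, but I cannot figure out how to do that. Here is an example of the desired output (note that baz is now an int):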

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: integer (nullable = true)
 |-- abc: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- def: struct (nullable = true)
 |    |    |    |-- a: string (nullable = true)
 |    |    |    |-- b: integer (nullable = true)
 |    |    |    |-- c: string (nullable = true)

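It looks like DataFrame.withColumn only works on top-level columns, not on nested ones. I am using Scala for this.

Can someone help me out with this?

Thanks

2 answers:

Answer 0: (score: 16)

That's easy: just use a dot to select a field inside a nested struct, e.g. $"foo.baz":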

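// Assumes a SparkSession named `spark` is in scope (as in spark-shell)
import org.apache.spark.sql.functions.{col, struct, udf}
import spark.implicits._

case class Foo(bar:String,baz:String)
case class Record(foo:Foo)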

val df = Seq(
   Record(Foo("Hi","There"))
).toDF()


df.printSchema

root
 |-- foo: struct (nullable = true)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)


val myUDF = udf((s:String) => {
  // do something with s
  s.toUpperCase
})


df
.withColumn("udfResult",myUDF($"foo.baz"))
.show

+----------+---------+
|       foo|udfResult|
+----------+---------+
|[Hi,There]|    THERE|
+----------+---------+

If you want to add the result of the UDF to the existing struct foo, i.e. to get:

root
 |-- foo: struct (nullable = false)
 |    |-- bar: string (nullable = true)
 |    |-- baz: string (nullable = true)
 |    |-- udfResult: string (nullable = true)

there are two options:

Using withColumn:

df
.withColumn("udfResult",myUDF($"foo.baz"))
.withColumn("foo",struct($"foo.*",$"udfResult"))
.drop($"udfResult")

Using select:

df
.select(struct($"foo.*",myUDF($"foo.baz").as("udfResult")).as("foo"))

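EDIT: To replace an existing attribute in the struct with the result of the UDF. Unfortunately, this does not work, because withColumn treats "foo.baz" as the name of a new top-level column instead of addressing the nested field: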

df
.withColumn("foo.baz",myUDF($"foo.baz")) 

But it can be done like this:

// get all columns except foo.baz
val structCols = df.select($"foo.*")
    .columns
    .filter(_!="baz")
    .map(name => col("foo."+name))

df.withColumn(
    "foo",
    struct((structCols:+myUDF($"foo.baz").as("baz")):_*)
)
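The snippet above moves baz to the end of the struct. If the original field order matters, the same idea can be wrapped in a small helper that rebuilds the struct in schema order. This is only a sketch: the names StructFieldSketch, replaceField, bazLength and bazToLength are hypothetical and not part of Spark, and it assumes field names that contain no dots. As with the struct(...) calls above, the rebuilt foo is never null, even for rows where the original foo was null.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct, udf}
import org.apache.spark.sql.types.StructType

object StructFieldSketch {

  // Rebuilds the top-level struct column `structName`, swapping in `newValue`
  // for `fieldName` and keeping every other field in its original position.
  def replaceField(df: DataFrame, structName: String, fieldName: String, newValue: Column): DataFrame = {
    val fieldNames = df.schema(structName).dataType match {
      case st: StructType => st.fieldNames
      case other =>
        throw new IllegalArgumentException(s"$structName is not a struct: $other")
    }
    val rebuilt = fieldNames.map { name =>
      if (name == fieldName) newValue.as(name)
      else col(s"$structName.$name").as(name)
    }
    df.withColumn(structName, struct(rebuilt: _*))
  }

  // Hypothetical UDF for illustration only: maps baz to its length, so baz
  // becomes an integer field as in the question. Option keeps nulls as nulls.
  val bazLength = udf((s: String) => Option(s).map(_.length))

  def bazToLength(df: DataFrame): DataFrame =
    replaceField(df, "foo", "baz", bazLength(col("foo.baz")))
}
```

Calling StructFieldSketch.bazToLength(df).printSchema on the example DataFrame should then show baz as an integer field in its original position inside foo.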

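Answer 1: (score: 1)

You can do this with the struct function, as Raphael Roth has already demonstrated in his answer above. There is an easier way, though, using the Make Structs Easy* library. The library adds a withField method to the Column class that lets you add or replace columns inside a StructType column, in much the same way that the withColumn method on the DataFrame class lets you add or replace columns in a DataFrame. For your specific use case, you could do something like this: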

import org.apache.spark.sql.functions._
import com.github.fqaiser94.mse.methods._

// generate some fake data
case class Foo(bar: String, baz: String)
case class Record(foo: Foo, arrayOfFoo: Seq[Foo])

val df = Seq(
   Record(Foo("Hello", "World"), Seq(Foo("Blue", "Red"), Foo("Green", "Yellow")))
).toDF

df.printSchema

// root
//  |-- foo: struct (nullable = true)
//  |    |-- bar: string (nullable = true)
//  |    |-- baz: string (nullable = true)
//  |-- arrayOfFoo: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- bar: string (nullable = true)
//  |    |    |-- baz: string (nullable = true)

df.show(false)

// +--------------+------------------------------+
// |foo           |arrayOfFoo                    |
// +--------------+------------------------------+
// |[Hello, World]|[[Blue, Red], [Green, Yellow]]|
// +--------------+------------------------------+

// example user defined function that capitalizes a given string
val myUdf = udf((s: String) => s.toUpperCase)

// capitalize value of foo.baz
df.withColumn("foo", $"foo".withField("baz", myUdf($"foo.baz"))).show(false)

// +--------------+------------------------------+
// |foo           |arrayOfFoo                    |
// +--------------+------------------------------+
// |[Hello, WORLD]|[[Blue, Red], [Green, Yellow]]|
// +--------------+------------------------------+

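I noticed you had a follow-up question about replacing a Column nested inside a struct that is itself nested inside an array. This can also be done by combining the functions provided by the Make Structs Easy library with those provided by the spark-hofs library, as follows: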

import za.co.absa.spark.hofs._

// capitalize the value of baz in each element of arrayOfFoo
df.withColumn("arrayOfFoo", transform($"arrayOfFoo", foo => foo.withField("baz", myUdf(foo.getField("baz"))))).show(false)

// +--------------+------------------------------+
// |foo           |arrayOfFoo                    |
// +--------------+------------------------------+
// |[Hello, World]|[[Blue, RED], [Green, YELLOW]]|
// +--------------+------------------------------+
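
On newer Spark versions the same result may be reachable without either library: Spark 3.1 added a built-in Column.withField, and Spark 3.0 added transform to org.apache.spark.sql.functions. The sketch below assumes Spark 3.1 or later. The names BuiltInWithFieldSketch and upperCaseBaz are hypothetical, and the built-in upper function stands in for the UDF used above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, transform, upper}

object BuiltInWithFieldSketch {

  // Upper-cases foo.baz and the baz field of every element of arrayOfFoo,
  // using only functions that ship with Spark 3.1+ (assumed version).
  def upperCaseBaz(df: DataFrame): DataFrame =
    df
      // replace baz inside the top-level struct foo
      .withColumn("foo", col("foo").withField("baz", upper(col("foo.baz"))))
      // replace baz inside each struct element of the array
      .withColumn(
        "arrayOfFoo",
        transform(col("arrayOfFoo"), e => e.withField("baz", upper(e.getField("baz"))))
      )
}
```

BuiltInWithFieldSketch.upperCaseBaz(df).show(false) can then be used in place of the two library-based calls above.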

*Full disclosure: I am the author of the Make Structs Easy library referenced in this answer.