I have a DataFrame with multiple columns, some of which are structs. Something like this:
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
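I want to apply a UserDefinedFunction to the nested column baz, replacing baz with the UDF's result, but I can't figure out how to do that. Here is an example of the desired output (note that baz is now an int):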
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: integer (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
It looks like DataFrame.withColumn only works on top-level columns, not on nested ones. I'm using Scala for this.
Can someone help me with this?
Thanks
Answer 0 (score: 16)
That's easy: just use a dot to select the nested field, e.g. $"foo.baz":
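import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

case class Foo(bar:String,baz:String)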
case class Record(foo:Foo)
val df = Seq(
Record(Foo("Hi","There"))
).toDF()
df.printSchema
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
val myUDF = udf((s:String) => {
// do something with s
s.toUpperCase
})
df
.withColumn("udfResult",myUDF($"foo.baz"))
.show
+----------+---------+
| foo|udfResult|
+----------+---------+
|[Hi,There]| THERE|
+----------+---------+
If you want to add the UDF's result to the existing struct foo, i.e. to get:
root
|-- foo: struct (nullable = false)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
| |-- udfResult: string (nullable = true)
there are two options:
Using withColumn:
df
.withColumn("udfResult",myUDF($"foo.baz"))
.withColumn("foo",struct($"foo.*",$"udfResult"))
.drop($"udfResult")
Using select (note that this keeps only the columns listed in the select):
df
.select(struct($"foo.*",myUDF($"foo.baz").as("udfResult")).as("foo"))
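Edit: to replace an existing attribute in the struct with the UDF's result, this unfortunately does not work (it adds a top-level column literally named foo.baz instead of modifying the struct):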
df
.withColumn("foo.baz",myUDF($"foo.baz"))
But it can be done like this:
// get all columns except foo.baz
val structCols = df.select($"foo.*")
.columns
.filter(_!="baz")
.map(name => col("foo."+name))
df.withColumn(
"foo",
struct((structCols:+myUDF($"foo.baz").as("baz")):_*)
)
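The snippet above appends baz as the last field of the rebuilt struct, and its UDF still returns a string. The following is a minimal sketch, not part of the original answer, of a variant that keeps the original field order and produces an integer baz as the question asks. The object name ReplaceNestedField, the helper replaceField and the UDF bazLength (string length, chosen purely for illustration) are all assumptions.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct, udf}
import org.apache.spark.sql.types.StructType

object ReplaceNestedField {
  // Hypothetical example data, mirroring the case classes used above.
  case class Foo(bar: String, baz: String)
  case class Record(foo: Foo)

  // Illustrative UDF (an assumption): string length, so baz becomes an integer.
  // Returning Option keeps the UDF null-safe.
  val bazLength = udf((s: String) => Option(s).map(_.length))

  // Rebuilds the struct column `structName`, swapping `fieldName` for
  // `replacement` while keeping the other fields in their original order.
  def replaceField(df: DataFrame, structName: String, fieldName: String,
                   replacement: Column): DataFrame = {
    val fieldNames =
      df.schema(structName).dataType.asInstanceOf[StructType].fieldNames
    val rebuilt = fieldNames.map { name =>
      if (name == fieldName) replacement.as(name)
      else col(s"$structName.$name").as(name)
    }
    df.withColumn(structName, struct(rebuilt: _*))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("replace-nested-field")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Record(Foo("Hi", "There"))).toDF()
    replaceField(df, "foo", "baz", bazLength(col("foo.baz"))).printSchema()
    spark.stop()
  }
}
```

One caveat with any struct(...) rebuild: a row whose foo is null comes back as a non-null struct with null fields, so add an explicit null check if that distinction matters.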
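Answer 1 (score: 1)
You can do this with the struct function, as Raphael Roth demonstrated in the answer above. There is a simpler way, though, using the Make Structs Easy* library. The library adds a withField method to the Column class, which lets you add or replace fields inside a StructType column in much the same way that the DataFrame class's withColumn method lets you add or replace columns in a DataFrame. For your specific use case, you could do something like this: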
import org.apache.spark.sql.functions._
import com.github.fqaiser94.mse.methods._
// generate some fake data
case class Foo(bar: String, baz: String)
case class Record(foo: Foo, arrayOfFoo: Seq[Foo])
val df = Seq(
Record(Foo("Hello", "World"), Seq(Foo("Blue", "Red"), Foo("Green", "Yellow")))
).toDF
df.printSchema
// root
// |-- foo: struct (nullable = true)
// | |-- bar: string (nullable = true)
// | |-- baz: string (nullable = true)
// |-- arrayOfFoo: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- bar: string (nullable = true)
// | | |-- baz: string (nullable = true)
df.show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, World]|[[Blue, Red], [Green, Yellow]]|
// +--------------+------------------------------+
// example user defined function that capitalizes a given string
val myUdf = udf((s: String) => s.toUpperCase)
// capitalize value of foo.baz
df.withColumn("foo", $"foo".withField("baz", myUdf($"foo.baz"))).show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, WORLD]|[[Blue, Red], [Green, Yellow]]|
// +--------------+------------------------------+
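Since Spark 3.1, the Column class ships a built-in withField method, so the same replacement can be written without an extra dependency. Below is a minimal sketch assuming Spark 3.1 or later. The object name BuiltInWithField, the helper upperCaseBaz and the UDF toUpper are illustrative names, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object BuiltInWithField {
  // Hypothetical example data, mirroring the case classes used above.
  case class Foo(bar: String, baz: String)
  case class Record(foo: Foo)

  // Illustrative null-safe UDF that upper-cases a string.
  val toUpper = udf((s: String) => if (s == null) null else s.toUpperCase)

  // Replaces foo.baz in place using Column.withField (requires Spark 3.1+).
  def upperCaseBaz(df: DataFrame): DataFrame =
    df.withColumn("foo", col("foo").withField("baz", toUpper(col("foo.baz"))))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("built-in-withField")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Record(Foo("Hello", "World"))).toDF()
    upperCaseBaz(df).show(false)
    spark.stop()
  }
}
```

On older Spark versions this method is not available on Column, which is the gap the library in this answer fills.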
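I noticed you had a follow-up question about replacing a column nested inside a struct that is itself nested inside an array. This can also be done by combining the functions provided by the Make Structs Easy library with those provided by the spark-hofs library, as follows: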
import za.co.absa.spark.hofs._
// capitalize the value of foo.baz in each element of arrayOfFoo
df.withColumn("arrayOfFoo", transform($"arrayOfFoo", foo => foo.withField("baz", myUdf(foo.getField("baz"))))).show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, World]|[[Blue, RED], [Green, YELLOW]]|
// +--------------+------------------------------+
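On Spark 3.1 or later, the array case can likely be handled with built-ins alone: functions.transform (Scala API since Spark 3.0) together with Column.withField. The sketch below is an assumption-laden illustration rather than the answer's method. It uses the built-in upper function instead of a UDF to keep the example simple, and the object and helper names are hypothetical. A Scala UDF could be substituted inside the lambda, but it is worth testing that on your Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, transform, upper}

object TransformArrayOfStructs {
  // Hypothetical example data, mirroring the case classes used above.
  case class Foo(bar: String, baz: String)
  case class Record(arrayOfFoo: Seq[Foo])

  // Upper-cases baz inside every struct element of arrayOfFoo.
  def upperCaseBazInArray(df: DataFrame): DataFrame =
    df.withColumn(
      "arrayOfFoo",
      transform(
        col("arrayOfFoo"),
        foo => foo.withField("baz", upper(foo.getField("baz")))
      )
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transform-array-of-structs")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Record(Seq(Foo("Blue", "Red"), Foo("Green", "Yellow")))).toDF()
    upperCaseBazInArray(df).show(false)
    spark.stop()
  }
}
```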
*Full disclosure: I am the author of the Make Structs Easy library referenced in this answer.