Spark higher-order functions: transform output struct

Time: 2019-09-15 11:13:21

Tags: apache-spark struct apache-spark-sql higher-order-functions complextype

How can I use the Spark higher-order function `transform` to turn an array of structs into an array of structs again?

The dataset:

case class Foo(thing1:String, thing2:String, thing3:String)
case class Baz(foo:Foo, other:String)
case class Bar(id:Int, bazes:Seq[Baz])
import spark.implicits._
val df = Seq(Bar(1, Seq(Baz(Foo("first", "second", "third"), "other"), Baz(Foo("1", "2", "3"), "else")))).toDF
df.printSchema
df.show(false)

I want to concatenate thing1, thing2 and thing3, but keep the `other` property of each element of `bazes`.

A simple:

scala> df.withColumn("cleaned", expr("transform(bazes, x -> x)")).printSchema
root
 |-- id: integer (nullable = false)
 |-- bazes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- foo: struct (nullable = true)
 |    |    |    |-- thing1: string (nullable = true)
 |    |    |    |-- thing2: string (nullable = true)
 |    |    |    |-- thing3: string (nullable = true)
 |    |    |-- other: string (nullable = true)
 |-- cleaned: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- foo: struct (nullable = true)
 |    |    |    |-- thing1: string (nullable = true)
 |    |    |    |-- thing2: string (nullable = true)
 |    |    |    |-- thing3: string (nullable = true)
 |    |    |-- other: string (nullable = true)

will only copy the contents over.

The desired concat operation:

df.withColumn("cleaned", expr("transform(bazes, x -> concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3))")).show(false)

unfortunately drops all the values of the `other` column:

+---+----------------------------------------------------+-------------------------------+
|id |bazes                                               |cleaned                        |
+---+----------------------------------------------------+-------------------------------+
|1  |[[[first, second, third], other], [[1, 2, 3], else]]|[first::second::third, 1::2::3]|
+---+----------------------------------------------------+-------------------------------+

How can these be retained? Trying to keep a tuple:

df.withColumn("cleaned", expr("transform(bazes, x -> (concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3), x.other))")).printSchema

fails with:

.AnalysisException: cannot resolve 'named_struct('col1', concat(namedlambdavariable().`foo`.`thing1`, '::', namedlambdavariable().`foo`.`thing2`, '::', namedlambdavariable().`foo`.`thing3`), NamePlaceholder(), namedlambdavariable().`other`)' due to data type mismatch: Only foldable string expressions are allowed to appear at odd position, got: NamePlaceholder; line 1 pos 22;

Edit

Desired output:

  • a new column with the contents

    [[first::second::third, other], [1::2::3, else]]

    that retains the `other` column

1 answer:

Answer 0 (score: 3)

  

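This way you can achieve the desired output. You cannot access the `other` value directly through the tuple syntax, because `foo` and `other` sit at the same level of the hierarchy, so `other` has to be accessed separately. As the NamePlaceholder in the error message suggests, Spark could not derive a field name for x.other inside the lambda; building the element with struct(...) and wrapping x.other in a cast avoids this, and the resulting fields get the default names col1 and col2.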

scala>  df.withColumn("cleaned", expr("transform(bazes, x -> struct(concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),cast(x.other as string)))")).show(false)
+---+----------------------------------------------------+------------------------------------------------+
|id |bazes                                               |cleaned                                         |
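+---+----------------------------------------------------+------------------------------------------------+
|1  |[[[first, second, third], other], [[1, 2, 3], else]]|[[first::second::third, other], [1::2::3, else]]|
+---+----------------------------------------------------+------------------------------------------------+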
  
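The same transformation can probably also be written with the Scala Column API instead of a SQL string. The following is only a sketch under assumptions: it relies on `org.apache.spark.sql.functions.transform`, which is available from Spark 3.0 onwards (the SQL string form above also works on 2.4), and the object name `TransformColumnApiSketch`, the helper `withCleaned`, the field name `joined` and the local master setting are hypothetical choices made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat, lit, struct, transform}

case class Foo(thing1: String, thing2: String, thing3: String)
case class Baz(foo: Foo, other: String)
case class Bar(id: Int, bazes: Seq[Baz])

object TransformColumnApiSketch {
  // Hypothetical helper: maps every element of `bazes` to a two-field struct.
  // Requires Spark 3.0+ for functions.transform (assumption, not from the original post).
  def withCleaned(df: DataFrame): DataFrame =
    df.withColumn(
      "cleaned",
      transform(col("bazes"), x =>
        struct(
          // Join the three nested strings with "::" as in the question.
          concat(
            x.getField("foo").getField("thing1"), lit("::"),
            x.getField("foo").getField("thing2"), lit("::"),
            x.getField("foo").getField("thing3")
          ).as("joined"),
          // Carry the sibling field over unchanged, with an explicit name.
          x.getField("other").as("other")
        )
      )
    )

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transform-column-api-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Bar(1, Seq(
        Baz(Foo("first", "second", "third"), "other"),
        Baz(Foo("1", "2", "3"), "else")
      ))
    ).toDF

    withCleaned(df).show(false)
    spark.stop()
  }
}
```

Giving each struct member an explicit alias keeps the field names under your control rather than leaving them to Spark's defaults.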

printSchema

scala>  df.withColumn("cleaned", expr("transform(bazes, x -> struct(concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),cast(x.other as string)))")).printSchema
root
 |-- id: integer (nullable = false)
 |-- bazes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- foo: struct (nullable = true)
 |    |    |    |-- thing1: string (nullable = true)
 |    |    |    |-- thing2: string (nullable = true)
 |    |    |    |-- thing3: string (nullable = true)
 |    |    |-- other: string (nullable = true)
 |-- cleaned: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- col1: string (nullable = true)
 |    |    |-- col2: string (nullable = true)
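The schema above ends up with the default field names col1 and col2. If meaningful names are preferred, one option is to spell out the names with the SQL function `named_struct`, which takes alternating name and value arguments. This is a minimal sketch, not part of the original answer: the field names `joined` and `other`, the object name `NamedStructSketch`, the helper `withNamedStruct` and the local master setting are assumptions chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

case class Foo(thing1: String, thing2: String, thing3: String)
case class Baz(foo: Foo, other: String)
case class Bar(id: Int, bazes: Seq[Baz])

object NamedStructSketch {
  // Hypothetical helper: same transformation as in the answer, but each
  // struct field gets an explicit literal name via named_struct.
  def withNamedStruct(df: DataFrame): DataFrame =
    df.withColumn(
      "cleaned",
      expr(
        """transform(bazes, x -> named_struct(
          |  'joined', concat(x.foo.thing1, '::', x.foo.thing2, '::', x.foo.thing3),
          |  'other', x.other
          |))""".stripMargin
      )
    )

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("named-struct-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Bar(1, Seq(
        Baz(Foo("first", "second", "third"), "other"),
        Baz(Foo("1", "2", "3"), "else")
      ))
    ).toDF

    val cleaned = withNamedStruct(df)
    cleaned.printSchema()
    cleaned.show(false)
    spark.stop()
  }
}
```

Because the names are given as string literals, Spark does not have to infer a field name for x.other, so the cast used in the answer should not be needed in this form.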

Let me know if you have any other questions about this.