Replacing values in a Spark DataFrame with a deeply nested schema

Date: 2019-12-08 17:25:29

Tags: apache-spark pyspark apache-spark-sql pyspark-sql

I'm new to pyspark. I'm trying to understand how to access a Parquet file with multiple levels of nested structs and arrays. I need to replace some values in a DataFrame (with a nested schema) with null. I've seen this solution work well with structs, but I'm not sure how to apply it to arrays.

My schema looks like this:

|-- unitOfMeasure: struct
|    |-- raw: struct
|    |    |-- id: string
|    |    |-- codingSystemId: string
|    |    |-- display: string
|    |-- standard: struct
|    |    |-- id: string
|    |    |-- codingSystemId: string
|-- Id: string
|-- actions: array
|    |-- element: struct
|    |    |-- action: string
|    |    |-- actionDate: string
|    |    |-- actor: struct
|    |    |    |-- actorId: string
|    |    |    |-- aliases: array
|    |    |    |    |-- element: struct
|    |    |    |    |    |-- value: string
|    |    |    |    |    |-- type: string
|    |    |    |    |    |-- assigningAuthority: string
|    |    |    |-- fullName: string

What I want to do is replace unitOfMeasure.raw.id with null, actions.element.action with null, and actions.element.actor.aliases.element.value with null, while keeping the rest of the DataFrame unchanged.

Is there any way to do this?

1 Answer:

Answer 0: (score: 1)

For array columns this is a bit more involved than for struct fields. One option is to explode the array so you can access and update the nested structs, and then rebuild the original array column after the update.
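For illustration, here is a rough sketch of that explode-and-rebuild route, nulling only the outer action field. It assumes the Id column from your schema uniquely identifies a row; the nested aliases array would need yet another pass, which is one reason I prefer the transform approach below.

from pyspark.sql import functions as F

# Explode the array to one row per element, keeping the element position.
exploded = df.select("Id", F.posexplode("actions").alias("pos", "elem"))

# Rebuild each element with the `action` field set to a null string.
updated = exploded.withColumn(
    "elem",
    F.struct(
        F.lit(None).cast("string").alias("action"),
        F.col("elem.actionDate").alias("actionDate"),
        F.col("elem.actor").alias("actor"),
    ),
)

# Re-assemble the array in its original element order and join it back.
rebuilt = (
    updated.groupBy("Id")
    .agg(F.sort_array(F.collect_list(F.struct("pos", "elem"))).alias("tmp"))
    .withColumn("actions", F.col("tmp.elem"))
    .drop("tmp")
)

result = df.drop("actions").join(rebuilt, on="Id", how="left")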

However, I prefer using the higher-order function transform introduced in Spark >= 2.4. Here is an example:

Input DF:

 |-- actions: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- action: string (nullable = true)
 |    |    |-- actionDate: string (nullable = true)
 |    |    |-- actor: struct (nullable = true)
 |    |    |    |-- actorId: long (nullable = true)
 |    |    |    |-- aliases: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- assigningAuthority: string (nullable = true)
 |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- fullName: string (nullable = true)

+--------------------------------------------------------------+
|actions                                                       |
+--------------------------------------------------------------+
|[[action_name1, 2019-12-08, [2, [[aa, t1, v1]], full_name1]]] |
|[[action_name2, 2019-12-09, [3, [[aaa, t2, v2]], full_name2]]]|
+--------------------------------------------------------------+

We pass transform a lambda function that selects all the struct fields and replaces actions.action and actions.actor.aliases.value with null (cast to string so the field types are preserved):

transform_expr = """transform (actions, x -> 
                               struct(null as action, 
                                      x.actionDate as actionDate, 
                                      struct(x.actor.actorId as actorId, 
                                             transform(x.actor.aliases, y -> 
                                                       struct(null as value, 
                                                              y.type as type, 
                                                              y.assigningAuthority as assigningAuthority)
                                                       ) as aliases,
                                            x.actor.fullName as fullName
                                      ) as actor
                                ))"""

df.withColumn("actions", expr(transform_expr)).show(truncate=False)

Output DF:

+------------------------------------------------+
|actions                                         |
+------------------------------------------------+
|[[, 2019-12-08, [2, [[, t1, aa]], full_name1]]] |
|[[, 2019-12-09, [3, [[, t2, aaa]], full_name2]]]|
+------------------------------------------------+
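As for unitOfMeasure.raw.id, a struct field doesn't need transform at all; you can simply rebuild the struct. A minimal sketch, assuming the schema shown in the question:

from pyspark.sql import functions as F

# Rebuild unitOfMeasure, nulling only raw.id and keeping standard as-is.
df = df.withColumn(
    "unitOfMeasure",
    F.struct(
        F.struct(
            F.lit(None).cast("string").alias("id"),
            F.col("unitOfMeasure.raw.codingSystemId").alias("codingSystemId"),
            F.col("unitOfMeasure.raw.display").alias("display"),
        ).alias("raw"),
        F.col("unitOfMeasure.standard").alias("standard"),
    ),
)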