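I have two tables; example schemas are below. The keys of table A are nested inside a list column in table B. I want to join table A and table B on table A's key to produce table C. The matching rows from table A should appear in table C as a nested structure alongside table B's keyAs list. How can I do this with PySpark? Thanks!
Table A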
root
|-- item1: string (nullable = true)
|-- item2: long (nullable = true)
|-- keyA: string (nullable = true)
Table B
root
|-- item1: string (nullable = true)
|-- item2: long (nullable = true)
|-- keyB: string (nullable = true)
|-- keyAs: array (nullable = true)
| |-- element: string (containsNull = true)
Table C
root
|-- item1: string (nullable = true)
|-- item2: long (nullable = true)
|-- keyB: string (nullable = true)
|-- keyAs: array (nullable = true)
| |-- element: string (containsNull = true)
|-- valueAs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- item1: string (nullable = true)
| | |-- item2: long (nullable = true)
| | |-- keyA: string (nullable = true)
Answer 0 (score: 1)
To join A and B, you first need to explode B.keyAs so that each key gets its own row, like this:
from pyspark.sql.functions import explode

tableB.withColumn('keyA', explode('keyAs')).join(tableA, 'keyA')
To create the nested structure, see this answer.