pyspark: Joining tables on a nested key

Time: 2017-11-13 16:49:32

Tags: apache-spark pyspark spark-dataframe pyspark-sql

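I have two tables; their sample schemas are shown below. The keys of table A are nested inside a list column of table B. I want to join table A and table B on table A's key to produce table C. In table C, the matching rows of table A should appear as a nested structure (valueAs) next to table B's keyAs list. How can I do this with pyspark? Thanks!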

Table A

root 
|-- item1: string (nullable = true) 
|-- item2: long (nullable = true) 
|-- keyA: string (nullable = true) 

Table B

root 
|-- item1: string (nullable = true) 
|-- item2: long (nullable = true) 
|-- keyB: string (nullable = true) 
|-- keyAs: array (nullable = true) 
| |-- element: string (containsNull = true)

Table C

root 
|-- item1: string (nullable = true) 
|-- item2: long (nullable = true) 
|-- keyB: string (nullable = true) 
|-- keyAs: array (nullable = true) 
| |-- element: string (containsNull = true) 
|-- valueAs: array (nullable = true) 
| |-- element: struct (containsNull = true) 
| | |-- item1: string (nullable = true) 
| | |-- item2: long (nullable = true) 
| | |-- keyA: string (nullable = true)

1 answer:

Answer 0 (score: 1)

To join A and B, you first need to explode B.keyAs so that each element of the array gets its own row, like this:

from pyspark.sql.functions import explode
tableB.withColumn('keyA', explode('keyAs')).join(tableA, 'keyA')

To create the nested structure, see this answer.