How to transform a nested DataFrame schema in PySpark

Time: 2018-02-15 23:08:33

Tags: python apache-spark dataframe pyspark apache-spark-sql

I have a DataFrame with the following schema:

root
|-- _1: struct (nullable = true)
|    |-- key: string (nullable = true)
|-- _2: struct (nullable = true)
|    |-- value: long (nullable = true)

I want to transform the DataFrame so that it has the following schema:

root
|-- _1: struct (nullable = true)
|    |-- key: string (nullable = true)
|    |-- value: long (nullable = true)

2 Answers:

Answer 0: (score: 2)

Use struct:

pyspark.sql.functions.struct(*cols)

Creates a new struct column.

from pyspark.sql.functions import struct, col
from pyspark.sql import Row

df = spark.createDataFrame([Row(_1=Row(key="a"), _2=Row(value=1))])

result = df.select(struct(col("_1.key"), col("_2.value")).alias("_1"))

This gives:

result.printSchema()
# root
#  |-- _1: struct (nullable = false)
#  |    |-- key: string (nullable = true)
#  |    |-- value: long (nullable = true)

result.show()
# +-----+
# |   _1|
# +-----+
# |[a,1]|
# +-----+

Answer 1: (score: 2)

If your dataframe has the following schema:

root
 |-- _1: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |-- _2: struct (nullable = true)
 |    |-- value: long (nullable = true)

then you can use * to select all the elements of the struct columns into separate columns, and then use the built-in struct function to combine them back into a single struct field:

from pyspark.sql import functions as F
df.select(F.struct("_1.*", "_2.*").alias("_1"))

You should get the desired output schema:

root
 |-- _1: struct (nullable = false)
 |    |-- key: string (nullable = true)
 |    |-- value: long (nullable = true)

Update

If all the columns in the original dataframe are structs, a more general form of the code above can apply the same .* expansion to every name in df.columns instead of listing the columns by hand.
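
A minimal sketch of that more general form, assuming every top-level column of the dataframe is a struct and that field names do not collide across columns. The helper names star_exprs and merge_struct_columns are hypothetical, chosen here for illustration, and are not part of PySpark:

```python
def star_exprs(column_names):
    """Turn top-level struct column names into "name.*" expansion strings."""
    return [name + ".*" for name in column_names]


def merge_struct_columns(df, alias="_1"):
    """Combine the fields of every top-level struct column into one struct.

    Assumption: every column in df is a struct, and no field name
    appears in more than one of those structs.
    """
    # Imported inside the function so that star_exprs can be used
    # and checked without a Spark installation.
    from pyspark.sql import functions as F

    # Same pattern as F.struct("_1.*", "_2.*"), built from df.columns.
    return df.select(F.struct(*star_exprs(df.columns)).alias(alias))
```

For the two-column dataframe in the question, merge_struct_columns(df) builds the same expression as F.struct("_1.*", "_2.*").alias("_1"), so it would be expected to yield the same schema as the hand-written version.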