I have the following schema for df1:
root
|-- userid: string (nullable = true)
|-- name: string (nullable = true)
|-- applications: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- applicationid: string (nullable = true)
| | |-- createdat: string (nullable = true)
| | |-- source_name: string (nullable = true)
| | |-- accounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- applicationcreditreportaccountid: string (nullable = true)
| | | | |-- account_name: integer (nullable = true)
| | | | |-- account_department: string (nullable = true)
And the following schema for df2:
root
|-- userid: string (nullable = true)
|-- name: string (nullable = true)
|-- applications: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- applicationid: string (nullable = true)
| | |-- updatedat: string (nullable = true)
| | |-- accounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- applicationcreditreportaccountid: string (nullable = true)
| | | | |-- account_value: integer (nullable = true)
| | | | |-- account_department: string (nullable = true)
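I want to merge the schemas of the two DataFrames. Any field that does not exist in one of them should be added there and filled with None.
I have tried the code below, and it works for a flat structure.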
from pyspark.sql import functions as f

def schema_combine(df_left, df_right):
    # Map each top-level column name to its data type
    left_types = {x.name: x.dataType for x in df_left.schema}
    right_types = {x.name: x.dataType for x in df_right.schema}
    # First go over left-unique fields: add them to the right side as nulls,
    # cast to the column's own type so both sides end up with matching types
    for l_name, l_type in left_types.items():
        if l_name not in right_types:
            df_right = df_right.withColumn(l_name, f.lit(None).cast(l_type))
    # Now go over right-unique fields: add them to the left side as nulls
    for r_name, r_type in right_types.items():
        if r_name not in left_types:
            df_left = df_left.withColumn(r_name, f.lit(None).cast(r_type))
    return [df_left, df_right]
I expect to get the following result:
root
|-- userid: string (nullable = true)
|-- name: string (nullable = true)
|-- applications: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- applicationid: string (nullable = true)
| | |-- createdat: string (nullable = true)
| | |-- updatedat: string (nullable = true)
| | |-- source_name: string (nullable = true)
| | |-- accounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- applicationcreditreportaccountid: string (nullable = true)
| | | | |-- account_name: integer (nullable = true)
| | | | |-- account_value: string (nullable = true)
| | | | |-- account_department: string (nullable = true)
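Is it possible to merge the schemas of nested data types, i.e. an array of Struct type? Any suggestions would be helpful.
Note: I have n struct fields of Struct type, so this needs to be written dynamically.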
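To make the goal concrete, here is a minimal sketch of a recursive schema merge. It is not a tested PySpark solution: it works on the plain dict form of a schema (the shape that `df.schema.jsonValue()` produces), and the names `merge_types`, `merge_fields`, `field` and `array_of` are hypothetical helpers made up for this illustration. It assumes that when the same field has different primitive types on both sides, the left type is kept, and that fields found only on the right are appended after the left ones.

```python
def merge_fields(left_fields, right_fields):
    """Union two lists of struct fields by name, recursing into shared ones."""
    right_by_name = {fld["name"]: fld for fld in right_fields}
    merged, seen = [], set()
    for fld in left_fields:
        other = right_by_name.get(fld["name"])
        if other is None:
            # Left-only field: it will be null on the other side
            merged.append({**fld, "nullable": True})
        else:
            merged.append({**fld,
                           "type": merge_types(fld["type"], other["type"]),
                           "nullable": fld["nullable"] or other["nullable"]})
        seen.add(fld["name"])
    # Right-only fields are appended at the end (assumed ordering)
    merged.extend({**fld, "nullable": True}
                  for fld in right_fields if fld["name"] not in seen)
    return merged


def merge_types(left, right):
    """Merge two type descriptions given in Spark's JSON (dict or str) form."""
    both_dicts = isinstance(left, dict) and isinstance(right, dict)
    if both_dicts and left.get("type") == right.get("type"):
        if left["type"] == "struct":
            return {"type": "struct",
                    "fields": merge_fields(left["fields"], right["fields"])}
        if left["type"] == "array":
            return {"type": "array",
                    "elementType": merge_types(left["elementType"],
                                               right["elementType"]),
                    "containsNull": left.get("containsNull", True)
                                    or right.get("containsNull", True)}
    # Primitive types, or a mismatch: keep the left type (assumption)
    return left


# Hypothetical helpers to build a cut-down version of the two schemas
def field(name, dtype):
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def array_of(fields):
    return {"type": "array", "containsNull": True,
            "elementType": {"type": "struct", "fields": fields}}


left = {"type": "struct", "fields": [
    field("userid", "string"),
    field("applications", array_of([
        field("applicationid", "string"),
        field("createdat", "string"),
        field("accounts", array_of([field("account_name", "integer")])),
    ])),
]}
right = {"type": "struct", "fields": [
    field("userid", "string"),
    field("applications", array_of([
        field("applicationid", "string"),
        field("updatedat", "string"),
        field("accounts", array_of([field("account_value", "integer")])),
    ])),
]}

merged = merge_types(left, right)
```

With PySpark, the inputs would presumably be `df1.schema.jsonValue()` and `df2.schema.jsonValue()`, and `StructType.fromJson(merged)` should turn the result back into a schema object. Conforming the actual data to that schema is a separate step. One possible route is a JSON round trip such as `spark.read.schema(merged_schema).json(df1.toJSON())`, which has not been tried here and costs an extra serialization pass.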