How to merge different DataFrames with complex schemas

Time: 2019-09-02 14:57:26

Tags: arrays python-3.x struct pyspark apache-spark-sql

I have the following schema for df1:

root
|-- userid: string (nullable = true)
|-- name: string (nullable = true)
|-- applications: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- applicationid: string (nullable = true)
|    |    |-- createdat: string (nullable = true)
|    |    |-- source_name: string (nullable = true)
|    |    |-- accounts: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- applicationcreditreportaccountid: string (nullable = true)
|    |    |    |    |-- account_name: integer (nullable = true)
|    |    |    |    |-- account_department: string (nullable = true)

And the following schema for df2:

root
|-- userid: string (nullable = true)
|-- name: string (nullable = true)
|-- applications: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- applicationid: string (nullable = true)
|    |    |-- updatedat: string (nullable = true)
|    |    |-- accounts: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- applicationcreditreportaccountid: string (nullable = true)
|    |    |    |    |-- account_value: integer (nullable = true)
|    |    |    |    |-- account_department: string (nullable = true)

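I want to merge the schemas of the two DataFrames. Any field that exists in only one of them should be filled with None in the other.

I have tried the following code, which works for flat structures: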

from pyspark.sql import functions as f
from pyspark.sql.types import StringType


def schema_combine(df_left, df_right):
    # Compare top-level fields only, as (name, type, nullable) tuples
    left_fields = set((x.name, x.dataType, x.nullable) for x in df_left.schema)
    right_fields = set((x.name, x.dataType, x.nullable) for x in df_right.schema)
    # First go over left-unique fields and add them to the right side as null columns
    for l_name, l_type, l_nullable in left_fields.difference(right_fields):
        df_right = df_right.withColumn(l_name, f.lit(None).cast(StringType()))
    # Now go over right-unique fields and add them to the left side as null columns
    for r_name, r_type, r_nullable in right_fields.difference(left_fields):
        df_left = df_left.withColumn(r_name, f.lit(None).cast(StringType()))
    return [df_left, df_right]

I expect the following result:

root
|-- userid: string (nullable = true)
|-- name: string (nullable = true)
|-- applications: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- applicationid: string (nullable = true)
|    |    |-- createdat: string (nullable = true)
|    |    |-- updatedat: string (nullable = true)
|    |    |-- source_name: string (nullable = true)
|    |    |-- accounts: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- applicationcreditreportaccountid: string (nullable = true)
|    |    |    |    |-- account_name: integer (nullable = true)
|    |    |    |    |-- account_value: string (nullable = true)
|    |    |    |    |-- account_department: string (nullable = true)

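Is it possible to merge schemas that contain nested data types, i.e. arrays of struct type? Any suggestion would be helpful.

Note: I have n struct fields nested inside structs, so this needs to be written dynamically.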
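To make the goal concrete, here is a minimal sketch of the recursive merge I have in mind. It is not a tested solution, only an illustration. It works on plain dicts in the JSON form that `df.schema.jsonValue()` returns, so it runs without a Spark session. The names `merge_types`, `merge_fields`, `field` and `array_of` are hypothetical helpers made up for this example, and the two sample schemas are shortened versions of df1 and df2.

```python
def merge_fields(left_fields, right_fields):
    """Merge two lists of struct fields, keeping left order, then right-only fields."""
    right_by_name = {fld["name"]: fld for fld in right_fields}
    left_names = {fld["name"] for fld in left_fields}
    merged = []
    for fld in left_fields:
        other = right_by_name.get(fld["name"])
        if other is None:
            # Field exists only on the left: it must be nullable in the merged schema.
            merged.append({**fld, "nullable": True})
        else:
            merged.append({
                **fld,
                "type": merge_types(fld["type"], other["type"]),
                "nullable": fld["nullable"] or other["nullable"],
            })
    for fld in right_fields:
        if fld["name"] not in left_names:
            # Field exists only on the right: append it as nullable.
            merged.append({**fld, "nullable": True})
    return merged


def merge_types(left, right):
    """Recursively merge two Spark data types given in their JSON (dict/str) form."""
    if isinstance(left, dict) and isinstance(right, dict) \
            and left.get("type") == right.get("type"):
        if left["type"] == "struct":
            return {"type": "struct",
                    "fields": merge_fields(left["fields"], right["fields"])}
        if left["type"] == "array":
            return {"type": "array",
                    "elementType": merge_types(left["elementType"],
                                               right["elementType"]),
                    "containsNull": left["containsNull"] or right["containsNull"]}
    if left == right:
        return left
    # This sketch does not try to reconcile conflicting types (e.g. integer vs string).
    raise TypeError("cannot merge {!r} with {!r}".format(left, right))


# Hypothetical helpers to build small sample schemas for the demo.
def field(name, dtype):
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


def array_of(fields):
    return {"type": "array", "containsNull": True,
            "elementType": {"type": "struct", "fields": fields}}


left = {"type": "struct", "fields": [
    field("userid", "string"),
    field("applications", array_of([
        field("applicationid", "string"),
        field("createdat", "string"),
        field("accounts", array_of([field("account_name", "integer")])),
    ])),
]}
right = {"type": "struct", "fields": [
    field("userid", "string"),
    field("applications", array_of([
        field("applicationid", "string"),
        field("updatedat", "string"),
        field("accounts", array_of([field("account_value", "integer")])),
    ])),
]}

merged = merge_types(left, right)
apps = merged["fields"][1]["type"]["elementType"]["fields"]
print([fld["name"] for fld in apps])
```

With real DataFrames, I assume the inputs would be `df1.schema.jsonValue()` and `df2.schema.jsonValue()`, and the merged dict could be turned back into a schema with `StructType.fromJson(merged)`. What I still do not know is the best way to make each DataFrame conform to that schema. One idea I have not verified is to round-trip each nested column through `to_json` and `from_json` with the merged type, on Spark versions where both functions accept arrays of structs.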

0 answers:

No answers yet.