Flattening a complex JSON schema with PySpark

Time: 2019-07-09 14:35:40

Tags: pyspark

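I am trying to flatten a complex JSON structure that contains nested arrays and struct elements. I want to do this with a generic function that works for any JSON file, whatever its schema.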

Below is part of the sample JSON structure that I want to flatten:

root
 |-- Data: struct (nullable = true)
 |    |-- Record: struct (nullable = true)
 |    |    |-- FName: string (nullable = true)
 |    |    |-- LName: long (nullable = true)
 |    |    |-- Address: struct (nullable = true)
 |    |    |    |-- Applicant: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- Id: long (nullable = true)
 |    |    |    |    |    |-- Type: string (nullable = true)
 |    |    |    |    |    |-- Option: long (nullable = true)
 |    |    |    |-- Location: string (nullable = true)
 |    |    |    |-- Town: long (nullable = true)
 |    |    |-- IsActive: boolean (nullable = true)
 |-- Id: string (nullable = true)

This is the flattened schema I expect to get:

root
 |-- Data_Record_FName: string (nullable = true)
 |-- Data_Record_LName: long (nullable = true)
 |-- Data_Record_Address_Applicant_Id: long (nullable = true)
 |-- Data_Record_Address_Applicant_Type: string (nullable = true)
 |-- Data_Record_Address_Applicant_Option: long (nullable = true)
 |-- Data_Record_Address_Location: string (nullable = true)
 |-- Data_Record_Address_Town: long (nullable = true)
 |-- Data_Record_IsActive: boolean (nullable = true)
 |-- Id: string (nullable = true)

I am using the following code, which was suggested in this thread:

How to flatten a struct in a Spark dataframe?

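from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    # Flatten struct columns one nesting level per pass, for `layers` passes.
    # Only struct columns are expanded; array columns are passed through as-is.
    flat_cols = []
    nested_cols = []
    flat_df = []

    # First pass: split the top-level columns into struct and non-struct columns.
    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])

    # Keep the flat columns and promote each struct field to a column
    # named <parent>_<field>.
    flat_df.append(nested_df.select(flat_cols[0] +
                               [col(nc+'.'+c).alias(nc+'_'+c)
                                for nc in nested_cols[0]
                                for c in nested_df.select(nc+'.*').columns])
                  )
    # Remaining passes: repeat the same step on the previous result.
    for i in range(1, layers):
        print(flat_cols[i-1])  # debug output: flat columns found in the previous pass
        flat_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:6] == 'struct'])

        flat_df.append(flat_df[i-1].select(flat_cols[i] +
                                [col(nc+'.'+c).alias(nc+'_'+c)
                                    for nc in nested_cols[i]
                                    for c in flat_df[i-1].select(nc+'.*').columns])
        )

    return flat_df[-1]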

my_flattened_df = flatten_df(jsonDF, 10)
my_flattened_df.printSchema()   

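However, it does not work for array elements. With the code above I get the output below. Can you help? How can I modify this code so that it also flattens arrays?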

root
 |-- Data_Record_FName: string (nullable = true)
 |-- Data_Record_LName: long (nullable = true)
 |-- Data_Record_Address_Applicant: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Id: long (nullable = true)
 |    |    |-- Type: string (nullable = true)
 |    |    |-- Option: long (nullable = true)
 |-- Data_Record_Address_Location: string (nullable = true)
 |-- Data_Record_Address_Town: long (nullable = true)
 |-- Data_Record_IsActive: boolean (nullable = true)
 |-- Id: string (nullable = true)

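This is not a duplicate, because there is no existing post about a generic function that flattens a complex JSON schema containing arrays.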
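
For reference, here is a minimal sketch of one possible way to cover arrays as well. It is not taken from the linked thread. The idea is to loop until no struct or array column is left: a struct column is expanded into `<parent>_<field>` columns, and an array column is replaced by one row per element with `explode_outer` (available in `pyspark.sql.functions` since Spark 2.3). The names `first_complex_field` and `flatten` and the `sep` parameter are hypothetical, introduced only for this illustration. The sketch leaves map columns untouched and does not guard against name collisions (for example, a top-level `Data_Id` next to a nested `Data.Id`).

```python
def first_complex_field(schema):
    """Find the first top-level struct or array column.

    `schema` is the plain-dict form returned by `df.schema.jsonValue()`.
    Returns (name, kind, child_names), or None if nothing is left to flatten.
    """
    for field in schema["fields"]:
        ftype = field["type"]
        # Primitive types are plain strings ("string", "long", ...);
        # complex types are dicts with their own "type" key.
        if isinstance(ftype, dict):
            if ftype["type"] == "struct":
                children = [f["name"] for f in ftype["fields"]]
                return field["name"], "struct", children
            if ftype["type"] == "array":
                return field["name"], "array", []
    return None


def flatten(df, sep="_"):
    """Flatten all struct and array columns of a Spark DataFrame (sketch)."""
    # Imported lazily so the helper above can be tried without Spark installed.
    from pyspark.sql.functions import col, explode_outer

    while True:
        step = first_complex_field(df.schema.jsonValue())
        if step is None:
            return df
        name, kind, children = step
        if kind == "array":
            # One output row per array element; explode_outer keeps rows
            # whose array is null or empty (the element becomes null).
            df = df.withColumn(name, explode_outer(col(f"`{name}`")))
        else:
            # Replace the struct column, in place, by one column per field.
            cols = []
            for c in df.columns:
                if c == name:
                    cols.extend(
                        col(f"`{name}`.`{child}`").alias(f"{name}{sep}{child}")
                        for child in children
                    )
                else:
                    cols.append(col(f"`{c}`"))
            df = df.select(*cols)
```

With the DataFrame from the question this would be called as `flatten(jsonDF).printSchema()`. Note that exploding changes the row count: each row is repeated once per array element, and a row containing several independent arrays is multiplied by the product of their lengths, so this may not be what you want for wide schemas with many arrays.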

0 answers:

No answers