我正在尝试使用通用函数展平包含嵌套数组和结构元素的复杂JSON结构,该函数应适用于具有任何模式的任何JSON文件。
下面是我想弄平的示例JSON结构的一部分
root
|-- Data: struct (nullable = true)
| |-- Record: struct (nullable = true)
| | |-- FName: string (nullable = true)
| | |-- LName: long (nullable = true)
| | |-- Address: struct (nullable = true)
| | | |-- Applicant: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Id: long (nullable = true)
| | | | | |-- Type: string (nullable = true)
| | | | | |-- Option: long (nullable = true)
| | | |-- Location: string (nullable = true)
| | | |-- Town: long (nullable = true)
| | |-- IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
到
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant_Id: long (nullable = true)
|-- Data_Record_Address_Applicant_Type: string (nullable = true)
|-- Data_Record_Address_Applicant_Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
我正在以下线程中建议使用以下代码
How to flatten a struct in a Spark dataframe?
def flatten_df(nested_df, layers):
flat_cols = []
nested_cols = []
flat_df = []
flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])
flat_df.append(nested_df.select(flat_cols[0] +
[col(nc+'.'+c).alias(nc+'_'+c)
for nc in nested_cols[0]
for c in nested_df.select(nc+'.*').columns])
)
for i in range(1, layers):
print (flat_cols[i-1])
flat_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:6] != 'struct'])
nested_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:6] == 'struct'])
flat_df.append(flat_df[i-1].select(flat_cols[i] +
[col(nc+'.'+c).alias(nc+'_'+c)
for nc in nested_cols[i]
for c in flat_df[i-1].select(nc+'.*').columns])
)
return flat_df[-1]
my_flattened_df = flatten_df(jsonDF, 10)
my_flattened_df.printSchema()
但是它不适用于数组元素。通过上面的代码,我得到如下输出。你能帮忙吗?如何修改这段代码,使其也包含数组。
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Id: long (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
这不是重复的,因为没有关于泛型函数来展平包含数组的复杂JSON模式的帖子。