I am new to pyspark. Here is the schema I get from MongoDB, from df.printSchema():
root
 |-- machine_id: string (nullable = true)
 |-- profiles: struct (nullable = true)
 |    |-- node_a: struct (nullable = true)
 |    |    |-- profile_1: struct (nullable = true)
 |    |    |    |-- duration: string (nullable = true)
 |    |    |    |-- log_count: string (nullable = true)
 |    |    |    |-- log_att: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- count: string (nullable = true)
 |    |    |    |    |    |-- log_content: string (nullable = true)
 |    |    |-- profile_2: struct (nullable = true)
 |    |    |    |-- duration: string (nullable = true)
 |    |    |    |-- log_count: string (nullable = true)
 |    |    |    |-- log_att: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- count: string (nullable = true)
 |    |    |    |    |    |-- log_content: string (nullable = true)
 |    |    |-- profile_3: struct (nullable = true)
 |    |    |-- profile_4: struct (nullable = true)
 |    |    |-- ...
 |    |-- node_b: struct (nullable = true)
 |    |    |-- profile_1: struct (nullable = true)
 |    |    |    |-- duration: string (nullable = true)
 |    |    |    |-- log_count: string (nullable = true)
 |    |    |    |-- log_att: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- count: string (nullable = true)
 |    |    |    |    |    |-- log_content: string (nullable = true)
 |    |    |-- profile_2: struct (nullable = true)
 |    |    |    |-- duration: string (nullable = true)
 |    |    |    |-- log_count: string (nullable = true)
 |    |    |    |-- log_att: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- count: string (nullable = true)
 |    |    |    |    |    |-- log_content: string (nullable = true)
 |    |    |-- profile_3: struct (nullable = true)
 |    |    |-- profile_4: struct (nullable = true)
 |    |    |-- ...
For each machine I have 2 nodes, and for each node I have many profiles. I need to bucket the duration for every profile, e.g. for profile_1, count the records with 1 <= duration < 2. Which DataFrame API can I use for this? The only approach I could think of is:
1. Flatten node_a and node_b: new_df = df.selectExpr(flatten(df.schema, None, 2))
2. Build a new dataframe for each node: df_a = new_df.selectExpr("machine_id", "node_a") and df_b = new_df.selectExpr("machine_id", "node_b")
3. Flatten df_a and df_b again, so that I end up with 2 dataframes with the schema below (a rough sketch of these steps follows the schema):
 |-- machine_id: string (nullable = true)
 |-- profile_1: struct (nullable = true)
 |    |-- duration: string (nullable = true)
 |    |-- log_count: string (nullable = true)
 |    |-- log_att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- count: string (nullable = true)
 |    |    |    |-- log_content: string (nullable = true)
 |-- profile_2: struct (nullable = true)
 |    |-- duration: string (nullable = true)
 |    |-- log_count: string (nullable = true)
 |    |-- log_att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- count: string (nullable = true)
 |    |    |    |-- log_content: string (nullable = true)
 |-- profile_3: struct (nullable = true)
 |-- profile_4: struct (nullable = true)
 |-- ...
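Roughly what I have in mind, written out by hand for node_a only (just a sketch: the flatten() helper from step 1 is not shown here, and duration has to be cast because it is stored as a string):

from pyspark.sql import functions as F

# step 2: one dataframe per node (sketched for node_a only)
df_a = df.select("machine_id", F.col("profiles.node_a").alias("node_a"))

# step 3: pull every profile struct up to the top level
flat_a = df_a.select(
    "machine_id",
    F.col("node_a.profile_1").alias("profile_1"),
    F.col("node_a.profile_2").alias("profile_2"),
    # ... one line per profile
)

# the per-profile count I am after, e.g. count(1 <= duration < 2) for profile_1
dur = F.col("profile_1.duration").cast("double")
print(flat_a.filter((dur >= 1) & (dur < 2)).count())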
I think this is a very clumsy way to do it. Is there a smarter way?
Answer 0 (score: 0)
Well... I finally found a new way to solve this problem. I am not sure whether it is a good approach, but it is certainly better than the clumsy one:
import re

def flatten(schema, prefix=None):
    for field in schema.fields:
        dtype = field.dataType
        field_name = field.name
        name = prefix + '.' + field_name if prefix else field_name
        # recurse into the wrapper structs: "profiles" and node_a / node_b
        # (note: the schema above names the nodes node_a / node_b, not machine_a / machine_b)
        if field_name == "profiles" or re.search(r'node_[ab]', field_name):
            flatten(dtype, prefix=name)
        # once a profile_N struct is reached, collect the dotted paths of its sub-fields
        elif re.search(r'profile_\d+', name):
            sub_names = []  # one list of dotted column paths per profile
            for sub_name in dtype.names:
                sub_names.append(name + '.' + sub_name)
            print(sub_names)
            create_new_table(sub_names)  # my own helper, defined elsewhere (not shown)
    return
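Presumably it is invoked on the top-level schema of the DataFrame from the question, something like this (just a usage sketch; df is the DataFrame loaded from MongoDB):

# hypothetical invocation on the DataFrame loaded from MongoDB
flatten(df.schema)

Each printed list then holds dotted column paths such as profiles.node_a.profile_1.duration, which can be selected or filtered on directly.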