I am new to pyspark. Here is the schema I get from MongoDB, from df.printSchema():
root
 |-- machine_id: string (nullable = true)
 |-- profiles: struct (nullable = true)
 |    |-- node_a: struct (nullable = true)
 |    |    |-- profile_1: struct (nullable = true)
 |    |    |    |-- duration: string (nullable = true)
 |    |    |    |-- log_count: string (nullable = true)
 |    |    |    |-- log_att: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- count: string (nullable = true)
 |    |    |    |    |    |-- log_content: string (nullable = true)
 |    |    |-- profile_2: struct (nullable = true)
 |    |    |    |-- duration: string (nullable = true)
 |    |    |    |-- log_count: string (nullable = true)
 |    |    |    |-- log_att: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- count: string (nullable = true)
 |    |    |    |    |    |-- log_content: string (nullable = true)
 |    |    |-- profile_3: struct (nullable = true)
 |    |    |-- profile_4: struct (nullable = true)
 |    |    |-- ...
 |    |-- node_b: struct (nullable = true)
 |    |    |-- profile_1: struct (nullable = true)
 |    |    |    |-- duration: string (nullable = true)
 |    |    |    |-- log_count: string (nullable = true)
 |    |    |    |-- log_att: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- count: string (nullable = true)
 |    |    |    |    |    |-- log_content: string (nullable = true)
 |    |    |-- profile_2: struct (nullable = true)
 |    |    |    |-- duration: string (nullable = true)
 |    |    |    |-- log_count: string (nullable = true)
 |    |    |    |-- log_att: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- count: string (nullable = true)
 |    |    |    |    |    |-- log_content: string (nullable = true)
 |    |    |-- profile_3: struct (nullable = true)
 |    |    |-- profile_4: struct (nullable = true)
 |    |    |-- ...
For each machine I have 2 nodes, and for each node I have many profiles. I need to bucket the duration for every profile, e.g. for profile_1, count the records with 1 <= duration < 2. Which DataFrame API can I use for this? The only approach I could think of is:
1. Flatten node_a and node_b: new_df = df.selectExpr(flatten(df.schema, None, 2))
2. Build a new dataframe for each node: df_a = new_df.selectExpr("machine_id", "node_a") and df_b = new_df.selectExpr("machine_id", "node_b")
3. Flatten df_a and df_b again, so that I end up with 2 dataframes with the schema below (a rough sketch of these steps follows the schema):
 |-- machine_id: string (nullable = true)
 |-- profile_1: struct (nullable = true)
 |    |-- duration: string (nullable = true)
 |    |-- log_count: string (nullable = true)
 |    |-- log_att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- count: string (nullable = true)
 |    |    |    |-- log_content: string (nullable = true)
 |-- profile_2: struct (nullable = true)
 |    |-- duration: string (nullable = true)
 |    |-- log_count: string (nullable = true)
 |    |-- log_att: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- count: string (nullable = true)
 |    |    |    |-- log_content: string (nullable = true)
 |-- profile_3: struct (nullable = true)
 |-- profile_4: struct (nullable = true)
 |-- ...
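Roughly what I have in mind, written out by hand for node_a only (just a sketch: the flatten() helper from step 1 is not shown here, and duration has to be cast because it is stored as a string):

from pyspark.sql import functions as F

# step 2: one dataframe per node (sketched for node_a only)
df_a = df.select("machine_id", F.col("profiles.node_a").alias("node_a"))

# step 3: pull every profile struct up to the top level
flat_a = df_a.select(
    "machine_id",
    F.col("node_a.profile_1").alias("profile_1"),
    F.col("node_a.profile_2").alias("profile_2"),
    # ... one line per profile
)

# the per-profile count I am after, e.g. count(1 <= duration < 2) for profile_1
dur = F.col("profile_1.duration").cast("double")
print(flat_a.filter((dur >= 1) & (dur < 2)).count())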
I think this is a very clumsy way to do it. Is there a smarter way?
Answer 0 (score: 0)
Well... I finally found a new way to solve this problem. I am not sure whether it is a good approach, but it is certainly better than the clumsy one:
import re

def flatten(schema, prefix=None):
    for field in schema.fields:
        dtype = field.dataType
        field_name = field.name
        name = prefix + '.' + field_name if prefix else field_name
        # recurse into the wrapper structs: "profiles" and node_a / node_b
        # (note: the schema above names the nodes node_a / node_b, not machine_a / machine_b)
        if field_name == "profiles" or re.search(r'node_[ab]', field_name):
            flatten(dtype, prefix=name)
        # once a profile_N struct is reached, collect the dotted paths of its sub-fields
        elif re.search(r'profile_\d+', name):
            sub_names = []  # one list of dotted column paths per profile
            for sub_name in dtype.names:
                sub_names.append(name + '.' + sub_name)
            print(sub_names)
            create_new_table(sub_names)  # my own helper, defined elsewhere (not shown)
    return
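Presumably it is invoked on the top-level schema of the DataFrame from the question, something like this (just a usage sketch; df is the DataFrame loaded from MongoDB):

# hypothetical invocation on the DataFrame loaded from MongoDB
flatten(df.schema)

Each printed list then holds dotted column paths such as profiles.node_a.profile_1.duration, which can be selected or filtered on directly.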