Question

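I'm trying to drop some nested columns from a Spark dataframe using PySpark. I found the following answer for Scala, which seems to do exactly what I want, but I'm not familiar with Scala and don't know how to write it in Python.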

https://stackoverflow.com/a/39943812/5706548

I would really appreciate some help.

Thanks,

Answer 1

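The approach I found with PySpark is to first convert the nested column to JSON, then parse the converted JSON with a new nested schema from which the unwanted columns have been filtered out.

Suppose I have the following schema, and I want to drop d and e (a.b.d, a.e) from the dataframe: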

root
 |-- a: struct (nullable = true)
 |    |-- b: struct (nullable = true)
 |    |    |-- c: long (nullable = true)
 |    |    |-- d: string (nullable = true)
 |    |-- e: struct (nullable = true)
 |    |    |-- f: long (nullable = true)
 |    |    |-- g: string (nullable = true)
 |-- h: string (nullable = true)

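I used the following approach:

1. Create a new schema for a by excluding d and e. A quick way to do this is to manually pick the fields you want from df.select("a").schema and build a new schema from them using StructType. Alternatively, you can do it programmatically by traversing the schema tree and excluding the unwanted fields, for example: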

from pyspark.sql.types import StructField, StructType

def exclude_nested_field(schema, unwanted_fields, parent=""):
    """Rebuild schema, skipping fields whose dotted path is in unwanted_fields."""
    new_schema = []

    for field in schema:
        # Build the fully qualified name, e.g. "a.b.d"
        full_field_name = field.name
        if parent:
            full_field_name = parent + "." + full_field_name

        if full_field_name not in unwanted_fields:
            if isinstance(field.dataType, StructType):
                # Recurse into nested structs so deeper fields can be excluded too
                inner_schema = exclude_nested_field(field.dataType, unwanted_fields, full_field_name)
                new_schema.append(StructField(field.name, inner_schema))
            else:
                new_schema.append(StructField(field.name, field.dataType))

    return StructType(new_schema)

new_schema = exclude_nested_field(df.select("a").schema, ["a.b.d", "a.e"])

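2. Convert the a column to JSON: F.to_json("a") (F here is pyspark.sql.functions).
3. Parse the JSON-converted a column from step 2 with the new schema found in step 1: F.from_json("a_json", new_schema["a"].dataType). Because new_schema was built from df.select("a").schema, its top-level field is a itself, so it is the struct type of a that matches the JSON.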
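
To make the expected result concrete, here is a plain-Python sketch (no Spark involved) that applies the same dotted-path exclusion to a single row represented as a dict. The function name prune_record and the sample values are made up for illustration; in Spark, the equivalent effect comes from from_json keeping only the fields named in the schema it is given.

```python
import json

def prune_record(record, unwanted_fields, parent=""):
    """Return a copy of a nested dict without the dotted paths in unwanted_fields."""
    pruned = {}
    for key, value in record.items():
        full_name = f"{parent}.{key}" if parent else key
        if full_name in unwanted_fields:
            continue  # drop this field and everything beneath it
        if isinstance(value, dict):
            pruned[key] = prune_record(value, unwanted_fields, full_name)
        else:
            pruned[key] = value
    return pruned

# Hypothetical row matching the schema above (values invented for the example).
row = {"a": {"b": {"c": 1, "d": "x"}, "e": {"f": 2, "g": "y"}}, "h": "z"}

# Round-trip through JSON, as the Spark approach does, then prune.
row_json = json.dumps(row)
result = prune_record(json.loads(row_json), ["a.b.d", "a.e"])
print(result)  # {'a': {'b': {'c': 1}}, 'h': 'z'}
```

The untouched column h and the surviving field a.b.c come through unchanged, which is the shape the pruned dataframe should have.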

Answer 2

Although I don't have a solution for PySpark, maybe it's easier to translate this into Python. Consider a dataframe df with the schema:

root
 |-- employee: struct (nullable = false)
 |    |-- name: string (nullable = false)
 |    |-- age: integer (nullable = false)

Then, if you want to drop name, for example, you can do:

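import org.apache.spark.sql.functions.struct
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark

val fieldsToKeep = df.select($"employee.*").columns
  .filter(_ != "name") // the nested column you want to drop
  .map(n => "employee." + n)

// overwrite column with subset of fields
df
  .withColumn("employee", struct(fieldsToKeep.head, fieldsToKeep.tail: _*))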

Answer 3

A PySpark version of Raphael's Scala answer.

This runs at a certain depth, discards everything above that depth, and filters on the level directly below it.

def remove_columns(df, root):
    # Names of the fields directly under root, e.g. root="level1.level2.*"
    cols = df.select(root).columns
    fields_filter = filter(lambda x: x[0] != "$", cols)  # use your own lambda here
    # root[:-1] strips the trailing "*", leaving the "level1.level2." prefix
    fieldsToKeep = list(map(lambda x: root[:-1] + x, fields_filter))
    return df.select(fieldsToKeep)

df = remove_columns(raw_df, root="level1.level2.*")
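
The path handling in remove_columns can be checked without Spark. The sketch below isolates that step: root[:-1] strips the trailing * from level1.level2.*, leaving the prefix level1.level2., which is then prepended to every child name that passes the filter. The helper name build_paths and the sample child names are hypothetical.

```python
def build_paths(root, child_names, keep=lambda name: name[0] != "$"):
    """Turn child names under a 'parent.*' root into full dotted paths."""
    prefix = root[:-1]  # "level1.level2.*" -> "level1.level2."
    return [prefix + name for name in child_names if keep(name)]

paths = build_paths("level1.level2.*", ["id", "$meta", "value"])
print(paths)  # ['level1.level2.id', 'level1.level2.value']
```

Because the resulting list goes to df.select, the returned dataframe holds only the kept children as top-level columns; level1 and anything beside it are not carried over.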

Answer 4

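PySpark version:

from pyspark.sql.functions import struct

def drop_col(df, col_nm, delete_col_nm):
    # Child field names of the struct column, minus the one to delete
    fields_to_keep = filter(lambda x: x != delete_col_nm, df.select("{}.*".format(col_nm)).columns)
    # Qualify each remaining child with its parent, e.g. "employee.age"
    fields_to_keep = list(map(lambda x: "{}.{}".format(col_nm, x), fields_to_keep))
    # Overwrite the struct column with a struct built from the kept children
    return df.withColumn(col_nm, struct(fields_to_keep))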
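
As a rough illustration of what drop_col hands to struct, the sketch below reproduces only the name-building step in plain Python, using the employee struct from Answer 2. The helper fields_to_keep is hypothetical and not part of PySpark.

```python
def fields_to_keep(struct_name, child_names, delete_name):
    """List the qualified child columns of a struct, minus the one to delete."""
    return ["{}.{}".format(struct_name, name)
            for name in child_names if name != delete_name]

kept = fields_to_keep("employee", ["name", "age"], "name")
print(kept)  # ['employee.age']
```

With these names, struct rebuilds employee from employee.age alone, so the call would be drop_col(df, "employee", "name"). As far as I can tell this only handles direct children of a top-level struct: passing a dotted name such as a.b to withColumn adds a new top-level column with that literal name instead of replacing the nested field.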

Dropping nested columns of a DataFrame using PySpark
