Question

I have a DataFrame df that is created by reading a JSON file as follows:

df = spark.read.json("/myfiles/file1.json")

df.dtypes shows the following columns and data types:

id - string
Name - struct
address - struct
Phone - struct
start_date - string
years_with_company - int
highest_education - string
department - string
reporting_hierarchy - struct

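I want to extract only the non-struct columns and create a DataFrame from them. For example, my resulting DataFrame should contain only id, start_date, highest_education, and department.

Here is the code I have. It only partially works, because the result holds just the values of the last non-struct column, department. I want to collect all columns of non-struct type and then convert them into a DataFrame: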

from pyspark.sql.types import StructType

names = df.schema.names

for col_name in names:
    if isinstance(df.schema[col_name].dataType, StructType):
        print("Skipping struct column %s" % col_name)
    else:
        df1 = df.select(col_name).collect()

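I am pretty sure this is not the best way to do it and that I am missing something I cannot quite put my finger on, so I would appreciate your help. Thank you.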

Answer 1

Use a list comprehension:

from pyspark.sql.types import StructType

cols_filtered = [
    c for c in df.schema.names
    if not isinstance(df.schema[c].dataType, StructType)
]

Or,

# Thank you @pault for the suggestion!
# df.dtypes reports struct columns as 'struct<...>', so match on the prefix
cols_filtered = [c for c, t in df.dtypes if not t.startswith('struct')]

Now you can pass the result to df.select.

df2 = df.select(*cols_filtered)
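
For context, the loop in the question reassigns df1 on every pass, so only the last non-struct column survives, and collect() returns a list of Row objects rather than a DataFrame. Building the list of column names first and selecting once avoids both problems. The filtering step can be tried without a Spark session, as in the minimal sketch below. The helper name non_struct_columns and the field lists inside the struct<...> strings are assumptions made up for illustration; only the column names and top-level types come from the question.

```python
# Hypothetical helper, not part of PySpark: it takes (name, type) pairs in
# the shape that df.dtypes returns and keeps names whose type is not a struct.
def non_struct_columns(dtypes):
    return [name for name, type_str in dtypes if not type_str.startswith("struct")]


# Hand-written stand-in for df.dtypes; the struct field lists are invented.
sample_dtypes = [
    ("id", "string"),
    ("Name", "struct<first:string,last:string>"),
    ("address", "struct<city:string,zip:string>"),
    ("Phone", "struct<home:string,work:string>"),
    ("start_date", "string"),
    ("years_with_company", "int"),
    ("highest_education", "string"),
    ("department", "string"),
    ("reporting_hierarchy", "struct<manager:string>"),
]

cols_filtered = non_struct_columns(sample_dtypes)
print(cols_filtered)
# ['id', 'start_date', 'years_with_company', 'highest_education', 'department']
```

With a real DataFrame the same helper would be called as df.select(*non_struct_columns(df.dtypes)). Note that years_with_company is kept as well, since int is not a struct type; drop it from the list explicitly if it is not wanted.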
