Question

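I am new to PySpark.

I have read a parquet file. I only want to keep columns that have at least 10 values.

I have used describe to get the count of non-null records for each column.

How do I now extract the names of the columns that have fewer than 10 values, and then drop those columns before writing to a new file?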

df = spark.read.parquet(file)

col_count = df.describe().filter('summary == "count"')

Answer 1

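You can convert it into a dictionary and then filter out the keys (column names) based on their values (count < 10; the count is a StringType(), which needs to be converted to int in the Python code):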

# here is what you have so far which is a dataframe
col_count = df.describe().filter('summary == "count"')

# exclude the 1st column(`summary`) from the dataframe and save it to a dictionary
colCountDict = col_count.select(col_count.columns[1:]).first().asDict()

# find column names (k) with int(v) < 10
bad_cols = [ k for k,v in colCountDict.items() if int(v) < 10 ]

# drop bad columns
df_new = df.drop(*bad_cols)
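
As an alternative that does not go through describe(), the counts can be computed with the SQL count() aggregate, which skips nulls and accepts any column type. The sketch below is only an illustration under that assumption; the helper names sparse_columns and drop_sparse_columns are hypothetical and not part of PySpark.

```python
def sparse_columns(counts, min_count=10):
    """Return the names whose non-null count is below min_count."""
    # Counts may arrive as strings (describe() yields StringType), so cast to int.
    return [name for name, n in counts.items() if int(n) < min_count]


def drop_sparse_columns(df, min_count=10):
    """Drop every column of a Spark DataFrame that has fewer than min_count non-null values."""
    # SQL count(col) ignores nulls; backticks guard names containing dots or spaces.
    exprs = ["count(`{0}`) AS `{0}`".format(c) for c in df.columns]
    # The aggregate produces a single row; turn it into {column name: count}.
    counts = df.selectExpr(*exprs).first().asDict()
    return df.drop(*sparse_columns(counts, min_count))
```

Usage would then look like df_new = drop_sparse_columns(df, 10), followed by the write to the new file.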

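Some notes:

df.describe()

df.summary()

Use @pault's approach.
You need to drop() the columns rather than select() them, because describe()/summary() only include numeric and string columns; selecting columns from the list that df.describe() produces would lose columns of TimestampType(), ArrayType(), and similar types.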
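
To see which columns would be lost by selecting from the describe() output, one can compare its column list with the DataFrame's own. This is a minimal sketch; columns_skipped_by_describe is a hypothetical helper name, and it assumes no column of df is itself called summary.

```python
def columns_skipped_by_describe(df):
    """List the columns of df that df.describe() leaves out of its output."""
    # describe() adds a leading `summary` column; ignore it when comparing.
    described = set(df.describe().columns) - {"summary"}
    # Preserve the original column order of df.
    return [c for c in df.columns if c not in described]
```

Any column returned here (timestamps, arrays, and so on) never appears in bad_cols, so df.drop(*bad_cols) keeps it, whereas a select() built from the described columns would silently discard it.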
