Question

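I have a Spark DataFrame with about 60 million rows. I want to create a single-row DataFrame that holds the maximum value of each individual column.

I tried the following options, but each has its own drawback: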

df.select(col_list).describe().filter("summary = 'max'").show()

- This query does not return the string columns, so the result has fewer columns than my original DataFrame.
df.select(max(col1).alias(col1), max(col2).alias(col2), max(col3).alias(col3), ...).show()

- This query works, but it is impractical when I have 700-odd columns.

Can someone suggest a better syntax?

Answer 1

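This code works regardless of how many columns there are or what mix of data types they have.

Note: the OP suggested in her comments that, for string columns, the first non-null value should be taken when aggregating.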
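
To make the rule in the note concrete, here is a minimal plain-Python sketch (no Spark involved) of the intended result: the maximum for non-string columns and the first non-null value for string columns. The function name summarize is hypothetical and only serves as an illustration; the rows match the example data used in this answer.

```python
# Same sample rows as the example DataFrame in this answer.
rows = [
    ("Alice", 10, 5, None, 50),
    ("Bob", 15, 15, "Simon", 10),
    ("Jack", 5, 1, "Timo", 3),
]


def summarize(rows):
    """Return one tuple: first non-null value for string columns, max otherwise."""
    result = []
    for column in zip(*rows):  # transpose rows into columns
        non_null = [v for v in column if v is not None]
        if not non_null:
            result.append(None)  # column is entirely null
        elif isinstance(non_null[0], str):
            result.append(non_null[0])  # string column: first non-null value
        else:
            result.append(max(non_null))  # other column: maximum value
    return tuple(result)


summary = summarize(rows)
print(summary)  # ('Alice', 15, 15, 'Simon', 50)
```

The Spark code in this answer applies the same rule with max() and first(..., ignorenulls=True).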

# Import relevant functions
from pyspark.sql.functions import max, first, col

# Take an example DataFrame
values = [('Alice',10,5,None,50),('Bob',15,15,'Simon',10),('Jack',5,1,'Timo',3)]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4','col5'])
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice|  10|   5| null|  50|
|  Bob|  15|  15|Simon|  10|
| Jack|   5|   1| Timo|   3|
+-----+----+----+-----+----+

# Lists all columns in the DataFrame
seq_of_columns = df.columns
print(seq_of_columns)
    ['col1', 'col2', 'col3', 'col4', 'col5']

# Using List comprehensions to create a list of columns of String DataType
string_columns = [i[0] for i in df.dtypes if i[1]=='string']
print(string_columns)
    ['col1', 'col4']

# Keep the remaining (non-string) columns; a list comprehension preserves the original column order
non_string_columns = [c for c in seq_of_columns if c not in string_columns]
print(non_string_columns)
    ['col2', 'col3', 'col5']

For details on first() and its ignorenulls parameter, see the PySpark documentation.

# Aggregating both string and non-string columns
df = df.select(*[max(col(c)).alias(c) for c in non_string_columns],*[first(col(c),ignorenulls = True).alias(c) for c in string_columns])
# Restore the original column order
df = df.select(seq_of_columns)
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice|  15|  15|Simon|  50|
+-----+----+----+-----+----+

How to find the maximum value of all columns in a Spark DataFrame
