Get the maximum length of each column in a DataFrame

Date: 2019-08-12 16:00:47

Tags: apache-spark apache-spark-sql

I have a Spark DataFrame like this:

+-----------------+---------------+----------+-----------+
|     column1     |    column2    | column3  |  column4  |
+-----------------+---------------+----------+-----------+
| a               | bbbbb         | cc       | >dddddddd |
| >aaaaaaaaaaaaaa | bb            | c        | dddd      |
| aa              | >bbbbbbbbbbbb | >ccccccc | ddddd     |
| aaaaa           | bbbb          | ccc      | d         |
+-----------------+---------------+----------+-----------+

I want to find the length of the longest element in each column, to get a result like this:

+---------+-----------+
| column  | maxLength |
+---------+-----------+
| column1 |        14 |
| column2 |        12 |
| column3 |         7 |
| column4 |         8 |
+---------+-----------+

I know how to do this one column at a time, but I don't know how to tell Spark to do it for all columns at once.

I am using Spark with Scala.
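For reference, the per-column version is along these lines (a minimal sketch, assuming `df` is the DataFrame above and the usual `org.apache.spark.sql.functions` imports):

```scala
import org.apache.spark.sql.functions.{col, length, max}

// Longest string in a single, hard-coded column:
// aggregate max(length(column1)) and pull the scalar out of the result row.
val maxLen1: Int = df
  .agg(max(length(col("column1"))))
  .head()
  .getInt(0)
```

The question is how to repeat this for every column without writing it out by hand.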

1 answer:

Answer 0 (score: 1)

You can achieve this with the agg function combined with the max and length functions:

import org.apache.spark.sql.functions.{col, length, max}
import spark.implicits._  // for toDF, assuming a SparkSession named spark

val x = df.columns.map { colName =>
  // one aggregation job per column
  (colName, df.agg(max(length(col(colName)))).head().getAs[Integer](0))
}.toSeq.toDF("column", "maxLength")

Output:

+-------+---------+
|column |maxLength|
+-------+---------+
|column1|14       |
|column2|13       |
|column3|8        |
|column4|9        |
+-------+---------+

Another approach, which computes all the maxima in a single pass instead of running one job per column, is

df.select(df.columns.map(c => max(length(col(c))).as(s"max_${c}")): _*)

Output:

+-----------+-----------+-----------+-----------+
|max_column1|max_column2|max_column3|max_column4|
+-----------+-----------+-----------+-----------+
|14         |13         |8          |9          |
+-----------+-----------+-----------+-----------+
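If you need the second approach's single-row result in the same (column, maxLength) shape as the first, you can collect that one row and zip it with the column names on the driver. A sketch, again assuming a SparkSession named `spark`:

```scala
import org.apache.spark.sql.functions.{col, length, max}
import spark.implicits._  // for toDF

// One pass over the data: a single row holding max(length(...)) per column.
val row = df.select(df.columns.map(c => max(length(col(c))).as(c)): _*).head()

// Pivot the row into (column, maxLength) pairs on the driver;
// length() returns IntegerType, so getInt is safe here.
val result = df.columns.zipWithIndex
  .map { case (c, i) => (c, row.getInt(i)) }
  .toSeq
  .toDF("column", "maxLength")
```

Since the collected row has one cell per column, this stays cheap no matter how large the DataFrame is.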