How to find the maximum string length of a column in a Spark dataframe?

时间:2019-05-11 15:05:31

标签: scala apache-spark apache-spark-sql

I have a dataframe. I need to compute the maximum length of the String values in a column, and print both that value and its length.

I have written the code below, but its output is only the maximum length, not the corresponding value. How to get max length of string column from dataframe using scala? did help me write the following query.

 df.agg(max(length(col("city")))).show()

3 Answers:

Answer 0 (score: 3)

Use the row_number() window function, ordered by length('city) desc.

Then add a length('city) column to the dataframe and keep only the row where row_number is 1.

Example:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"))
       .toDF("city","num","country")

val win = Window.orderBy(length('city).desc)

df.withColumn("str_len",length('city))
  .withColumn("rn", row_number().over(win))
  .filter('rn===1)
  .show(false)

+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1  |US     |3      |1  |
+----+---+-------+-------+---+

(or)

In spark-sql:

df.createOrReplaceTempView("lpl")
spark.sql("select * from (select *, length(city) str_len, row_number() over (order by length(city) desc) rn from lpl) q where q.rn = 1")
  .show(false)
+----+---+-------+-------+---+
|city|num|country|str_len| rn|
+----+---+-------+-------+---+
| ABC|  1|     US|      3|  1|
+----+---+-------+-------+---+

Update:

To find both the min and the max:

val win_desc=Window.orderBy(length('city).desc)
val win_asc=Window.orderBy(length('city).asc)
df.withColumn("str_len",length('city))
  .withColumn("rn", row_number().over(win_desc))
  .withColumn("rn1",row_number().over(win_asc))
  .filter('rn===1 || 'rn1 === 1)
  .show(false)

Result:

+----+---+-------+-------+---+---+
|city|num|country|str_len|rn |rn1|
+----+---+-------+-------+---+---+
|A   |1  |US     |1      |3  |1  | // min-length string
|ABC |1  |US     |3      |1  |3  | // max-length string
+----+---+-------+-------+---+---+

Answer 1 (score: 1)

The solution using a window function will not work if multiple rows share the same length, because after ordering it keeps only the first row and filters out the rest.

An alternative is to create a new column holding the length of each string, find its max element, and then filter the dataframe on that maximum value.

import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._

val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"), ("DEF", 2, "US"))
       .toDF("city","num","country")

val dfWithLength = df.withColumn("city_length", length($"city")).cache()

dfWithLength.show()

+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
|   A|  1|     US|          1|
|  AB|  1|     US|          2|
| ABC|  1|     US|          3|
| DEF|  2|     US|          3|
+----+---+-------+-----------+

val Row(maxValue: Int) = dfWithLength.agg(max("city_length")).head()

dfWithLength.filter($"city_length" === maxValue).show()

+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| ABC|  1|     US|          3|
| DEF|  2|     US|          3|
+----+---+-------+-----------+

Answer 2 (score: 0)

To find the maximum string length of a string column using pyspark:

from pyspark.sql.functions import length, col, max

df2 = df.withColumn("len_Description", length(col("Description"))) \
        .groupBy().max("len_Description")
df2.show()