我有一个数据框。我需要计算一列中String值的最大长度,并打印该值及其长度。
我已经写了下面的代码,但是这里的输出只是最大长度,而不是它的对应值。 How to get max length of string column from dataframe using scala?确实帮助我完成了以下查询。
df.agg(max(length(col("city")))).show()
答案 0 :(得分:3)
按 row_number()
顺序使用length('city) desc
窗口功能。
然后仅过滤出 first row_number
列,并将length('city)
列添加到数据框。
Ex:
val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"))
.toDF("city","num","country")
val win=Window.orderBy(length('city).desc)
df.withColumn("str_len",length('city))
.withColumn("rn", row_number().over(win))
.filter('rn===1)
.show(false)
+----+---+-------+-------+---+
|city|num|country|str_len|rn |
+----+---+-------+-------+---+
|ABC |1 |US |3 |1 |
+----+---+-------+-------+---+
(或)
In spark-sql:
df.createOrReplaceTempView("lpl")
spark.sql("select * from (select *,length(city)str_len,row_number() over (order by length(city) desc)rn from lpl)q where q.rn=1")
.show(false)
+----+---+-------+-------+---+
|city|num|country|str_len| rn|
+----+---+-------+-------+---+
| ABC| 1| US| 3| 1|
+----+---+-------+-------+---+
更新:
查找最小值,最大值:
val win_desc=Window.orderBy(length('city).desc)
val win_asc=Window.orderBy(length('city).asc)
df.withColumn("str_len",length('city))
.withColumn("rn", row_number().over(win_desc))
.withColumn("rn1",row_number().over(win_asc))
.filter('rn===1 || 'rn1 === 1)
.show(false)
结果:
+----+---+-------+-------+---+---+
|city|num|country|str_len|rn |rn1|
+----+---+-------+-------+---+---+
|A |1 |US |1 |3 |1 | //min value of string
|ABC |1 |US |3 |1 |3 | //max value of string
+----+---+-------+-------+---+---+
答案 1 :(得分:1)
如果您有多行共享相同的长度,则使用window函数的解决方案将不起作用,因为它会在订购后过滤掉第一行。
另一种方法是使用字符串的长度创建一个新列,找到它的max元素,然后根据获得的最大值过滤数据帧。
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import spark.implicits._
val df=Seq(("A",1,"US"),("AB",1,"US"),("ABC",1,"US"), ("DEF", 2, "US"))
.toDF("city","num","country")
val dfWithLength = df.withColumn("city_length", length($"city")).cache()
dfWithLength.show()
+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| A| 1| US| 1|
| AB| 1| US| 2|
| ABC| 1| US| 3|
| DEF| 2| US| 3|
+----+---+-------+-----------+
val Row(maxValue: Int) = dfWithLength.agg(max("city_length")).head()
dfWithLength.filter($"city_length" === maxValue).show()
+----+---+-------+-----------+
|city|num|country|city_length|
+----+---+-------+-----------+
| ABC| 1| US| 3|
| DEF| 2| US| 3|
+----+---+-------+-----------+
答案 2 :(得分:0)
使用 pyspark 在字符串列上找到最大字符串长度
from pyspark.sql.functions import length, col, max
df2 = df.withColumn("len_Description",length(col("Description"))).groupBy().max("len_Description")