Calculating the average string length of a column in a DataFrame

Time: 2016-06-27 16:46:54

Tags: scala apache-spark dataframe spark-dataframe

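I am lost on how to calculate the average string length of any column in a DataFrame using Scala. For numeric columns I was able to do this easily with the following: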

val avgDF = df.dtypes.filter(x => x._2 == "DoubleType").map(ct => avg(col(ct._1))).toList

2 Answers:

Answer 0 (score: 4)

import org.apache.spark.sql.functions._

// length() gives the length of each string value; mean() averages it over all rows
val avgDF = df.agg(mean(length(col("yourColumn"))))
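The question asks about any string column rather than one named column, so the same built-in `length` can be combined with the `dtypes` filter from the question. The following is only a sketch: `avgStringLengths` and the `avg_len_` alias prefix are hypothetical names, not part of the original answer, and it assumes column names contain no dots.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{avg, col, length}

// Hypothetical helper (not from the original answer): computes the average
// string length of every StringType column in a single aggregation.
def avgStringLengths(df: DataFrame): DataFrame = {
  // dtypes yields (columnName, typeName) pairs; keep only the string columns
  val exprs = df.dtypes.collect {
    case (name, "StringType") => avg(length(col(name))).alias(s"avg_len_$name")
  }
  require(exprs.nonEmpty, "DataFrame has no StringType columns")
  // agg takes one leading expression followed by varargs
  df.agg(exprs.head, exprs.tail: _*)
}
```

Calling `avgStringLengths(df).show()` would print one row with one average per string column. Because the built-in `length` returns null for null input and `avg` skips nulls, null values do not cause an error here.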

Answer 1 (score: 0)

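import org.apache.spark.sql.functions.{avg, col, udf}

// UDF that returns the length of a string value (it does not handle nulls)
val findLength = udf { (colValue: String) => colValue.length }

// For every StringType column, print the average string length
myData.dtypes.filter(x => x._2 == "StringType").foreach { f =>
  myData.select(avg(findLength(col(f._1)))).show()
}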

Sample data

Name|Age|email
Hari|12|hary@h0otmail.ocm
Hari|12|hary@h0otmail.ocm
Hari|12|hary@h0otmail.ocm
Hari|12|hary@h0otmail.ocm
Hasasasi|12|hary@h0otmail.in

Output

+-------------------+
|AVG(scalaUDF(Name))|
+-------------------+
|                4.8|
+-------------------+


+--------------------+
|AVG(scalaUDF(email))|
+--------------------+
|                16.8|
+--------------------+
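
The two averages can be cross-checked without Spark by averaging the string lengths of the sample rows directly. This is only a sanity check in plain Scala; `AvgLengthCheck` and `avgLength` are hypothetical names, not part of the original answer.

```scala
object AvgLengthCheck {
  // Name and email values copied from the sample data above
  val names: Seq[String] = Seq("Hari", "Hari", "Hari", "Hari", "Hasasasi")
  val emails: Seq[String] = Seq(
    "hary@h0otmail.ocm", "hary@h0otmail.ocm", "hary@h0otmail.ocm",
    "hary@h0otmail.ocm", "hary@h0otmail.in"
  )

  // Arithmetic mean of the string lengths (assumes a non-empty sequence)
  def avgLength(values: Seq[String]): Double =
    values.map(_.length).sum.toDouble / values.size

  def main(args: Array[String]): Unit = {
    println(avgLength(names))  // (4 * 4 + 8) / 5 = 4.8
    println(avgLength(emails)) // (4 * 17 + 16) / 5 = 16.8
  }
}
```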