How do I run a UDF on every column of a DataFrame?

Asked: 2018-09-06 20:08:45

Tags: scala apache-spark apache-spark-sql

I have a UDF:

val TrimText = (s: AnyRef) => {
    // does some logic, returns a String
}

And a DataFrame:

var df = spark.read.option("sep", ",").option("header", "true").csv(root_path + "/" + file)

I would like to apply TrimText to every value in every column of the DataFrame.

The problem, however, is that I have a dynamic number of columns. I know I can get the list of columns with df.columns, but I am not sure how that helps me here. How can I solve this?

TL;DR question: how to execute a UDF on every column of a DataFrame when the number of columns is unknown.


Attempting:

df.columns.foldLeft( df )( (accDF, c) =>
  accDF.withColumn(c, TrimText(col(c)))
)

throws this error:

error: type mismatch;
 found   : String
 required: org.apache.spark.sql.Column
accDF.withColumn(c, TrimText(col(c)))
assuming TrimText returns a String and expects its input to be the value in a column, so that it would normalize every value in every row of the entire DataFrame. (The mismatch arises because TrimText is a plain Scala function returning a String, while withColumn requires a Column as its second argument.)

3 answers:

Answer 0 (score: 1)

You can traverse the list of columns with foldLeft, iteratively applying withColumn with your UDF to the DataFrame.
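This can be sketched as follows. A minimal sketch: the body of `trimText` (whitespace trimming) and the helper name `trimAllColumns` are assumptions, since the question elides the real logic. Wrapping the plain function with `udf` is what resolves the type-mismatch error above, because `udf` turns a `String => String` function into one that accepts and returns a `Column`:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical stand-in for the question's TrimText body: trim whitespace,
// passing nulls through unchanged.
val trimText = (s: String) => if (s == null) null else s.trim

// Wrapping the plain function with `udf` yields a Column => Column function,
// which is what withColumn expects as its second argument.
val trimTextUdf = udf(trimText)

// foldLeft threads the DataFrame through withColumn once per column name,
// so the number of columns does not need to be known in advance.
def trimAllColumns(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (accDF, c) =>
    accDF.withColumn(c, trimTextUdf(col(c)))
  }
```

Each step of the fold replaces one column with its trimmed version; the accumulator is the progressively rewritten DataFrame.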

Answer 1 (score: 0)

val a = sc.parallelize(Seq(("1 "," 2"),(" 3","4"))).toDF()
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def TrimText(s: Column): Column = {
  // does some logic, returns a Column; here, trim the value
  trim(s)
}
a.select(a.columns.map(c => TrimText(col(c))):_*).show

Answer 2 (score: 0)

>> I would like to perform TrimText on every value in every column in the dataframe.
>> I have a dynamic number of columns.

Why a UDF, when the SQL function trim is available to do the trimming? See the code below:

import org.apache.spark.sql.functions._

spark.udf.register("TrimText", (x:String) =>  ..... )

val df2 = sc.parallelize(List(
  (26, true, 60000.00),
  (32, false, 35000.00)
)).toDF("age", "education", "income")

val cols2 = df2.columns.toSet
df2.createOrReplaceTempView("table1")

val query = "select " + buildcolumnlst(cols2) + " from table1 "
println(query)
val dfresult = spark.sql(query)
dfresult.show()

def buildcolumnlst(myCols: Set[String]) = {
  myCols.map(x => "TrimText(" + x + ")" + " as " + x).mkString(",") 
}

Result:

select trim(age) as age,trim(education) as education,trim(income) as income from table1 
+---+---------+-------+
|age|education| income|
+---+---------+-------+
| 26|     true|60000.0|
| 32|    false|35000.0|
+---+---------+-------+