Apply a UDF to columns in a Spark DataFrame, depending on the column

Date: 2018-01-23 13:12:45

Tags: scala apache-spark spark-dataframe user-defined-functions

I have a Spark DataFrame containing string data that I want to map to numeric data. A simplified version looks like this:

+--------------------+-------+----------+-------------------------+
|     participantUUID|001_Age|002_Gender|003_Where did you grow up|
+--------------------+-------+----------+-------------------------+
|010A0550-4324-490...|     23|    Female|                In a town|
|031C5411-FE42-429...|     56|      Male|                In a town|
|038688FF-B5DA-484...|     32|    Female|                In a town|
|05F8E1AF-AFDD-441...|     54|    Female|          Multiple places|
|068B213C-3303-41E...|     23|    Female|                In a town|
|11A9A444-3E93-468...|     39|    Female|                In a town|

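There are many columns, so rather than writing out the mapping for each column by hand, I would like to apply it across the entire DataFrame, column by column.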

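The mapping from string to number differs from column to column. For example, in one column the strings "Poor", "Fair", "Good", "Very good" would score 1, 2, 3, 4; in another column the scores might be 4, 3, 2, 1. So I want to write a UDF that takes the column header and the string value as parameters, and then apply it to the DataFrame columns with foldLeft, as follows: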

val calculateScore = udf((columnName: String, answerText: String) => (columnName, answerText) match {

      case ("002_Gender", "Female") => 0
      case ("002_Gender", "Male") => 1
      case ("002_Gender", "Other") => 2

      case ("003_Where did you grow up", "In a village") => 0 
      case ("003_Where did you grow up", "In a town") => 1
      case ("003_Where did you grow up", "Multiple places") => 2
      case _ => -1
    })

val columnNames = Seq("001_Age", "002_Gender", "003_Where did you grow up")

val newDF: DataFrame = columnNames.foldLeft(baseDF)(
      (baseDF, c) =>
        baseDF.withColumn(c.concat("_numeric"), calculateScore(baseDF(c), baseDF(c)))
    )

However, this does not return the correct result. Every score comes out as -1, which means the UDF never matches any of the specific cases:

+--------------------+----------------+----------+------------------+-------------------------+---------------------------------+
|     participantUUID|assessmentNumber|002_Gender|002_Gender_numeric|003_Where did you grow up|003_Where did you grow up_numeric|
+--------------------+----------------+----------+------------------+-------------------------+---------------------------------+
|010A0550-4324-490...|               0|    Female|                -1|                In a town|                               -1|
|031C5411-FE42-429...|               0|      Male|                -1|                In a town|                               -1|
|038688FF-B5DA-484...|               0|    Female|                -1|                In a town|                               -1|
|05F8E1AF-AFDD-441...|               0|    Female|                -1|          Multiple places|                               -1|
|068B213C-3303-41E...|               0|    Female|                -1|                In a town|                               -1|

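I think this is due to the syntax of the calculateScore UDF call, which should take the column name as a string plus the answer text and return an Int, evaluated row by row within the column. In other words, the foldLeft statement has the form: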

val newDF: DataFrame = columnNames.foldLeft[DataFrame](baseDF)(
      (acc, c) =>
        acc.withColumn(c, col(c))
    )

So calculateScore(baseDF(c), baseDF(c)) should return an object of type Column, but something is clearly going wrong.

Any ideas would be much appreciated, thanks!

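NB: I have looked at "Apply UDF to multiple columns in Spark Dataframe", but I don't like the idea of using a var DF, since in my view it goes against the principle of immutable programming in Scala!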

2 answers:

Answer 0: (score: 0)

You are passing the exact same argument to the UDF twice, so the column value is passed as both parameters and falls through to the default case _ branch.

You need to pass lit(c) as the first argument.
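To make the suggestion above concrete, here is a minimal sketch of the question's foldLeft with lit(c) as the first argument. The object name LitColumnNameSketch, the helper addNumericColumns, and the local SparkSession setup are assumptions added for illustration; the two-row sample data is borrowed from the other answer. The UDF and column names come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lit, udf}

// Hypothetical wrapper object, only here so the sketch is self-contained.
object LitColumnNameSketch {

  // Same two-argument UDF as in the question.
  val calculateScore = udf((columnName: String, answerText: String) =>
    (columnName, answerText) match {
      case ("002_Gender", "Female") => 0
      case ("002_Gender", "Male") => 1
      case ("002_Gender", "Other") => 2
      case ("003_Where did you grow up", "In a village") => 0
      case ("003_Where did you grow up", "In a town") => 1
      case ("003_Where did you grow up", "Multiple places") => 2
      case _ => -1
    })

  // lit(c) turns the column *name* into a constant string column,
  // while acc(c) supplies the per-row answer text.
  def addNumericColumns(df: DataFrame, columnNames: Seq[String]): DataFrame =
    columnNames.foldLeft(df) { (acc, c) =>
      acc.withColumn(c + "_numeric", calculateScore(lit(c), acc(c)))
    }

  def main(args: Array[String]): Unit = {
    // Assumed local session; in spark-shell a session named spark already exists.
    val spark = SparkSession.builder().master("local[1]").appName("lit-sketch").getOrCreate()
    import spark.implicits._

    val baseDF = Seq(("Female", "In a town"), ("Male", "Multiple places"))
      .toDF("002_Gender", "003_Where did you grow up")

    addNumericColumns(baseDF, Seq("002_Gender", "003_Where did you grow up")).show(false)
    spark.stop()
  }
}
```

With lit(c) in place, the first element of the matched tuple is the column name rather than the answer text, so the specific cases can match. The accumulator stays a val throughout, which keeps the fold free of mutable state.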

Answer 1: (score: 0)

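// Assumes a SparkSession named spark is in scope, as in spark-shell
import org.apache.spark.sql.functions.udf
import spark.implicits._

val baseDF = Seq(("Female","In a town"),("Male","Multiple places")).toDF("002_Gender","003_Where did you grow up")
baseDF.show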
+----------+-------------------------+
|002_Gender|003_Where did you grow up|
+----------+-------------------------+
|    Female|                In a town|
|      Male|          Multiple places|
+----------+-------------------------+

def calculateScore(columnName: String) = udf((answerText: String) => (columnName, answerText) match {

  case ("002_Gender", "Female") => 0
  case ("002_Gender", "Male") => 1
  case ("002_Gender", "Other") => 2

  case ("003_Where did you grow up", "In a village") => 0 
  case ("003_Where did you grow up", "In a town") => 1
  case ("003_Where did you grow up", "Multiple places") => 2
  case _ => -1
})

val columnNames = Seq("002_Gender", "003_Where did you grow up")

val newDF = columnNames.foldLeft(baseDF)(
    (baseDF, c) =>
      baseDF.withColumn(c.concat("_numeric"), calculateScore(c)(baseDF(c)))
   )
 newDF.show
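As a possible variation on the approach above, the per-column mappings can be kept as plain data and turned into built-in when/otherwise expressions, which swaps the UDF for native column expressions. This is only a sketch under assumptions: the object WhenMappingSketch, the scores lookup table, and the helpers scoreColumn and addNumericColumns are hypothetical names introduced here, not part of either answer.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lit, when}

// Hypothetical wrapper object, only here so the sketch is self-contained.
object WhenMappingSketch {

  // Assumed lookup table: column name -> (answer text -> score).
  val scores: Map[String, Map[String, Int]] = Map(
    "002_Gender" -> Map("Female" -> 0, "Male" -> 1, "Other" -> 2),
    "003_Where did you grow up" ->
      Map("In a village" -> 0, "In a town" -> 1, "Multiple places" -> 2)
  )

  // Builds nested when(...).otherwise(...) expressions; unmatched or null
  // answers end up at the innermost fallback, -1.
  def scoreColumn(c: Column, mapping: Map[String, Int]): Column =
    mapping.foldLeft(lit(-1)) { case (fallback, (answer, score)) =>
      when(c === answer, score).otherwise(fallback)
    }

  // Folds over the lookup table, adding one "<name>_numeric" column per entry.
  def addNumericColumns(df: DataFrame): DataFrame =
    scores.foldLeft(df) { case (acc, (name, mapping)) =>
      acc.withColumn(name + "_numeric", scoreColumn(acc(name), mapping))
    }

  def main(args: Array[String]): Unit = {
    // Assumed local session; in spark-shell a session named spark already exists.
    val spark = SparkSession.builder().master("local[1]").appName("when-sketch").getOrCreate()
    import spark.implicits._

    val baseDF = Seq(("Female", "In a town"), ("Male", "Multiple places"))
      .toDF("002_Gender", "003_Where did you grow up")

    addNumericColumns(baseDF).show(false)
    spark.stop()
  }
}
```

The design choice here is that adding or reversing a scale (for example 4, 3, 2, 1 instead of 1, 2, 3, 4) only means editing the Map, and the DataFrame is still threaded through foldLeft without a var.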