Pattern matching on a range in Scala with a Spark udf

Date: 2018-01-04 16:18:49

Tags: scala apache-spark pattern-matching user-defined-functions

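I have a Spark DataFrame containing strings that I am matching to numeric scores using a Likert scale. Different question Ids map to different scores. I am trying to pattern match on a range in Scala inside an Apache Spark udf, using this question as a guide: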

How can I pattern match on a range in Scala?

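But I get a compilation error when I use a range rather than a simple OR statement, i.e.

31 | 32 | 33 | 34 works fine

31 to 35 doesn't compile. Am I getting the syntax wrong somewhere?

Also, in the final case _, I would like to map to a String rather than an Int, case _ => "None", but this gives an error: java.lang.UnsupportedOperationException: Schema for type Any is not supported

Presumably this is an issue specific to Spark, since returning Any is perfectly possible in native Scala?

Here is my code: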

def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {

      case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => 4 //this is fine
      case ((31 | 32 | 33 | 34 | 35), "Occasionally") => 3
      case ((31 | 32 | 33 | 34 | 35), "Often") => 2
      case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => 1
      case ((x if 41 until 55 contains x), "None of the time") => 1 //this line won't compile
      case _ => 0 //would like to map to "None"
    })

The udf is then used on the Spark DataFrame as follows:

val df3 = df.withColumn("NumericScore", calculateScore(df("QuestionId"), df("AnswerText")))

2 Answers:

Answer 0: (score: 2)

The guard expression should be placed after the pattern:

def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
  case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => 4 
  case ((31 | 32 | 33 | 34 | 35), "Occasionally") => 3
  case ((31 | 32 | 33 | 34 | 35), "Often") => 2
  case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => 1
  case (x, "None of the time") if 41 until 55 contains x => 1
  case _ => 0 //would like to map to "None"
})
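The same guard placement can be tried in plain Scala, without Spark. Below is a minimal sketch, assuming a hypothetical helper named score that mirrors the udf body (the object name RangeGuardSketch is also made up for illustration). It uses an inclusive range, 31 to 35, in place of the list of alternatives, and shows that until excludes its upper bound:

```scala
// Hypothetical stand-in for the udf body, so the match can be run without Spark.
object RangeGuardSketch {
  def score(questionId: Int, answerText: String): Int =
    (questionId, answerText) match {
      // The guard follows the whole pattern; `to` includes both 31 and 35.
      case (x, "Occasionally") if (31 to 35).contains(x) => 3
      // `until` excludes the upper bound, so 55 itself does not match.
      case (x, "None of the time") if (41 until 55).contains(x) => 1
      case _ => 0
    }

  def main(args: Array[String]): Unit = {
    println(score(33, "Occasionally"))     // prints 3
    println(score(54, "None of the time")) // prints 1
    println(score(55, "None of the time")) // prints 0
  }
}
```

Binding the Id to x and testing it in the guard is what makes the range usable: a pattern position only accepts patterns, not arbitrary expressions such as 31 to 35.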

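Answer 1: (score: 2)

If you want to map the last case, i.e. case _, to the String "None", then all of the cases should return a String.

The following udf function should work for you: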

def calculateScore  = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
  case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => "4" //this is fine
  case ((31 | 32 | 33 | 34 | 35), "Occasionally") => "3"
  case ((31 | 32 | 33 | 34 | 35), "Often") => "2"
  case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => "1"
  case (x, "None of the time") if (x >= 41 && x < 55) => "1"
  case _ => "None"
})

If you want to map the last case, case _, to None, then you need to change the other return types to a child of Option, because None is a child of Option.

The following code should also work for you:

def calculateScore = udf((questionId: Int, answerText: String) => (questionId, answerText) match {
  case ((31 | 32 | 33 | 34 | 35), "Rarely /<br>Never") => Some(4)
  case ((31 | 32 | 33 | 34 | 35), "Occasionally") => Some(3)
  case ((31 | 32 | 33 | 34 | 35), "Often") => Some(2)
  case ((31 | 32 | 33 | 34 | 35), "Almost always /<br>Always") => Some(1)
  case (x, "None of the time") if (x >= 41 && x < 55) => Some(1)
  case _ => None
})

As a final note, your error message java.lang.UnsupportedOperationException: Schema for type Any is not supported clearly states that a udf function with return type Any is not supported. The return types of all the cases in the match should be consistent.
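To see where the Any comes from, the two shapes can be compared in plain Scala. This is only a sketch: the object ReturnTypeSketch and the functions mixed and consistent are hypothetical names, and the matches are cut down to a single Int argument for brevity.

```scala
// Hypothetical, Spark-free illustration of consistent vs. mixed branch types.
object ReturnTypeSketch {
  // One branch yields an Int, the other a String. The only declared type that
  // fits both here is the very general Any, which Spark cannot map to a schema.
  def mixed(questionId: Int): Any = questionId match {
    case x if (31 to 35).contains(x) => 4
    case _ => "None"
  }

  // Every branch is an Option[Int] (Some(...) or None), so the result type
  // stays specific.
  def consistent(questionId: Int): Option[Int] = questionId match {
    case x if (31 to 35).contains(x) => Some(4)
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    println(mixed(99))      // prints None (the String)
    println(consistent(33)) // prints Some(4)
    println(consistent(99)) // prints None (the Option value)
  }
}
```

Plain Scala accepts both functions, which matches the observation in the question; the restriction only appears once Spark has to derive a column type from the udf's return type.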