UDF: too many arguments for method in Scala

Time: 2018-08-01 22:56:30

Tags: scala function apache-spark dataframe user-defined-functions

I have a UDF that calculates the distance between two coordinates.

import org.apache.spark.sql.functions._
import scala.math._

def  calculateDistance(la1:Double, lo1:Double,la2:Double,lo2:Double): Double   =>  udf(
{

val  R = 6373.0
val  lat1 = toRadians(la1)
val  lon1 = toRadians(lo1)
val  lat2 = toRadians(la2)
val  lon2 = toRadians(lo2)

val  dlon = lon2 - lon1
val  dlat = lat2 - lat1

val  a = pow(sin(dlat / 2),2) + cos(lat1) * cos(lat2) * pow(sin(dlon / 2),2)
val  c = 2 * atan2(sqrt(a), sqrt(1 - a))

val  distance = R * c
}
)

This is the DataFrame schema.

dfcity: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Name: string, LAT: double ... 10 more fields]
root
|-- SCITY: string (nullable = true)
|-- LAT: double (nullable = true)
|-- LON: double (nullable = true)
|-- ADD: integer (nullable = true)
|-- CODEA: integer (nullable = true)
|-- CODEB: integer (nullable = true)
|-- TCITY: string (nullable = true)
|-- TLAT: double (nullable = true)
|-- TLON: double (nullable = true)
|-- TADD: integer (nullable = true)
|-- TCODEA: integer (nullable = true)
|-- TCODEB: integer (nullable = true)

When I try to use withColumn:

val dfcitydistance = dfcity.withColumn("distance", calculateDistance($"LAT", $"LON",$"TLAT", $"TLON"))
it generates this error:
6: error: too many arguments for method calculateDistance: (distance: Double)

What is wrong with the code that passes the columns to the UDF? Please advise. Thank you very much.

2 answers:

Answer 0: (score: 1)

It should be

val calculateDistance = udf((la1:Double, lo1:Double,la2:Double,lo2:Double) => {
  ...
})

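What you have defined right now is a method that takes local Double values and returns an empty UDF, rather than a UDF that can be applied to columns.

Answer 1: (score: 1)

Your code has two problems: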

def calculateDistance(la1:Double, lo1:Double, la2:Double, lo2:Double): Double => udf( {
  // ...
  val distance = R * c
} )
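  1. To create a UDF, wrap the entire Scala function as the argument to the udf method.
  2. In Scala, the last expression in a function body is what the function returns. The line val distance = R * c is a definition, so the body evaluates to Unit. Either append a line containing just distance, or simply replace the definition with R * c.

Your UDF should look like this: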
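
The second point can be checked without Spark. The following is a minimal sketch in plain Scala; the names LastExpressionDemo, endsWithVal and endsWithValue are made up for illustration and do not appear in the question.

```scala
object LastExpressionDemo {
  // The body ends with a `val` definition, so the block evaluates to Unit.
  def endsWithVal(r: Double, c: Double): Unit = {
    val distance = r * c
  }

  // The body ends with the expression itself, so the block evaluates to a Double.
  def endsWithValue(r: Double, c: Double): Double = {
    r * c
  }
}
```

Only the second form gives the caller a number to work with, which is why the UDF body has to end in R * c (or distance) rather than in the val line.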

val calculateDistance = udf( (la1:Double, lo1:Double, la2:Double, lo2:Double) => {
  // ...
  R * c
} )
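
For reference, here is one way the elided body might be filled in. This is a sketch, not the answerer's exact code: the formula and the radius 6373.0 come from the question, while the object name Haversine and the method name distanceKm are hypothetical. It is written as a plain Scala method so it can be tried without a Spark session.

```scala
import scala.math._

object Haversine {
  // Earth radius in kilometres, the value used in the question.
  val R = 6373.0

  // Great-circle distance in km between two points given in degrees.
  def distanceKm(la1: Double, lo1: Double, la2: Double, lo2: Double): Double = {
    val lat1 = toRadians(la1)
    val lon1 = toRadians(lo1)
    val lat2 = toRadians(la2)
    val lon2 = toRadians(lo2)

    val dlon = lon2 - lon1
    val dlat = lat2 - lat1

    val a = pow(sin(dlat / 2), 2) + cos(lat1) * cos(lat2) * pow(sin(dlon / 2), 2)
    val c = 2 * atan2(sqrt(a), sqrt(1 - a))

    // Last expression of the body: this is the returned value.
    R * c
  }
}

// Assumption: with Spark on the classpath and
// org.apache.spark.sql.functions.udf imported, the method could be wrapped as
//   val calculateDistance = udf(Haversine.distanceKm _)
// and then applied to columns as in the question's withColumn call.
```

Keeping the arithmetic in an ordinary method makes it easy to check on its own before it is wrapped in udf.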