I have a UDF that calculates the distance between two coordinates.
import org.apache.spark.sql.functions._
import scala.math._
def calculateDistance(la1:Double, lo1:Double,la2:Double,lo2:Double): Double => udf(
{
val R = 6373.0
val lat1 = toRadians(la1)
val lon1 = toRadians(lo1)
val lat2 = toRadians(la2)
val lon2 = toRadians(lo2)
val dlon = lon2 - lon1
val dlat = lat2 - lat1
val a = pow(sin(dlat / 2),2) + cos(lat1) * cos(lat2) * pow(sin(dlon / 2),2)
val c = 2 * atan2(sqrt(a), sqrt(1 - a))
val distance = R * c
}
)
This is the DataFrame schema.
dfcity: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [Name: string, LAT: double ... 10 more fields]
root
|-- SCITY: string (nullable = true)
|-- LAT: double (nullable = true)
|-- LON: double (nullable = true)
|-- ADD: integer (nullable = true)
|-- CODEA: integer (nullable = true)
|-- CODEB: integer (nullable = true)
|-- TCITY: string (nullable = true)
|-- TLAT: double (nullable = true)
|-- TLON: double (nullable = true)
|-- TADD: integer (nullable = true)
|-- TCODEA: integer (nullable = true)
|-- TCODEB: integer (nullable = true)
When I try to use withColumn:
val dfcitydistance = dfcity.withColumn("distance", calculateDistance($"LAT", $"LON",$"TLAT", $"TLON"))
it produces this error:
6: error: too many arguments for method calculateDistance: (distance: Double)
What is wrong with the way I pass the columns to the UDF? Please advise. Thank you very much.
Answer 0 (score: 1)
It should be:
val calculateDistance = udf((la1:Double, lo1:Double,la2:Double,lo2:Double) => {
...
})
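What you have defined now is a function that takes local objects (plain Double values rather than columns) and returns an empty UDF.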
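To make the elided body concrete, here is a minimal sketch of the whole thing, assuming a local SparkSession and one made-up sample row. The names DistanceUdfSketch and haversineKm and the sample coordinates are hypothetical and not from the question; the formula and the radius 6373.0 are copied from it.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object DistanceUdfSketch {
  // Plain Scala version of the formula from the question (hypothetical helper name).
  def haversineKm(la1: Double, lo1: Double, la2: Double, lo2: Double): Double = {
    val R = 6373.0 // Earth radius in km, as used in the question
    val lat1 = math.toRadians(la1)
    val lat2 = math.toRadians(la2)
    val dlat = lat2 - lat1
    val dlon = math.toRadians(lo2) - math.toRadians(lo1)
    val a = math.pow(math.sin(dlat / 2), 2) +
      math.cos(lat1) * math.cos(lat2) * math.pow(math.sin(dlon / 2), 2)
    val c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    R * c // the last expression is the return value
  }

  // udf(...) wraps a function value, so the result can be applied to Columns.
  val calculateDistance = udf(
    (la1: Double, lo1: Double, la2: Double, lo2: Double) => haversineKm(la1, lo1, la2, lo2)
  )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distance-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample row; only the four coordinate columns matter here.
    val dfcity = Seq(("Paris", 48.8566, 2.3522, "London", 51.5074, -0.1278))
      .toDF("SCITY", "LAT", "LON", "TCITY", "TLAT", "TLON")

    val dfcitydistance =
      dfcity.withColumn("distance", calculateDistance($"LAT", $"LON", $"TLAT", $"TLON"))
    dfcitydistance.show()

    spark.stop()
  }
}
```

Keeping the math in an ordinary method and wrapping it separately is just one possible layout; it lets the formula be checked without starting Spark.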
Answer 1 (score: 1)
Your code has two problems:
def calculateDistance(la1:Double, lo1:Double, la2:Double, lo2:Double): Double => udf( {
// ...
val distance = R * c
} )
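The last statement, val distance = R * c, is an assignment, so the block returns Unit. You should either append a line containing just distance, or simply replace the assignment with the expression R * c. Your UDF should look like this: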
val calculateDistance = udf( (la1:Double, lo1:Double, la2:Double, lo2:Double) => {
// ...
R * c
} )
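The point about Unit can be checked in plain Scala, without Spark. The following is a small sketch with hypothetical names (UnitBlockSketch, endsWithVal, endsWithExpr) and stand-in numbers in place of R * c:

```scala
object UnitBlockSketch {
  // A block whose last statement is a val definition evaluates to Unit.
  val endsWithVal: Unit = { val distance = 2.0 * 3.0 }

  // A block whose last line is an expression evaluates to that expression.
  val endsWithExpr: Double = { val distance = 2.0 * 3.0; distance }

  def main(args: Array[String]): Unit = {
    println(endsWithVal)
    println(endsWithExpr)
  }
}
```

The explicit type annotations are only there to make the compiler confirm each block's type; annotating the first one as Double would not compile.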