Question

This morning we upgraded Spark from 2.2.0 to 2.3.0, and I ran into a very strange problem.

I have a UDF that calculates the distance between two points:

private static UDF4<Double, Double, Double, Double, Double> calcDistance =
        (UDF4<Double, Double, Double, Double, Double>) (lat, lon, meanLat, meanLon) ->
                GeoUtils.calculateDistance(lat, lon, meanLat, meanLon);

The UDF is registered like this:

spark.udf().register("calcDistance", calcDistance, DataTypes.DoubleType);
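GeoUtils.calculateDistance is not shown in the question. As a minimal sketch, assuming it returns the great-circle (haversine) distance in kilometres between two points given in decimal degrees, it might look roughly like the following. The formula, the unit, and the Earth-radius constant are all assumptions made for illustration, not the author's actual implementation.

```java
// Hypothetical stand-in for the GeoUtils class referenced by the UDF.
// Assumption: inputs are decimal degrees, result is kilometres.
final class GeoUtils {
    // Mean Earth radius in kilometres (assumed constant for this sketch).
    private static final double EARTH_RADIUS_KM = 6371.0088;

    private GeoUtils() {
    }

    // Haversine distance between (lat, lon) and (meanLat, meanLon).
    static double calculateDistance(double lat, double lon,
                                    double meanLat, double meanLon) {
        double dLat = Math.toRadians(meanLat - lat);
        double dLon = Math.toRadians(meanLon - lon);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        double a = sinLat * sinLat
                + Math.cos(Math.toRadians(lat)) * Math.cos(Math.toRadians(meanLat))
                * sinLon * sinLon;
        // Clamp to 1.0 to guard against rounding error near antipodal points.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }
}
```

Note that the UDF receives boxed Double values and all four input columns are nullable, so a method taking primitive double parameters like this one would throw a NullPointerException on unboxing if any of the inputs is null.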

I have a DataFrame with the following structure (it is the result of joining two DataFrames on the hpan field):

root
|-- hpan: string (nullable = true)
|-- atmid: string (nullable = true)
|-- reqamt: long (nullable = true)
|-- mcc_code: string (nullable = true)
|-- utime: string (nullable = true)
|-- udate: string (nullable = true)
|-- address_city: string (nullable = true)
|-- latitude: double (nullable = true)
|-- longitude: double (nullable = true)
|-- gmt_msk_offset: integer (nullable = true)
|-- utimeWithTZ: timestamp (nullable = true)
|-- weekDay: integer (nullable = true)
|-- location_type: string (nullable = true)
|-- mean_lat: double (nullable = true)
|-- mean_lon: double (nullable = true)

What I want is to add a column holding the distance between (latitude, longitude) and (mean_lat, mean_lon):

svWithCoordsTzAndDistancesDF.withColumn("distance",
    callUDF("calcDistance",col("latitude"), col("longitude"), 
            col("mean_lat"), col("mean_lon")));

This worked fine on Spark 2.2 but started failing on 2.3.

This is the exception:

Exception in thread "main" org.apache.spark.sql.AnalysisException:
  Resolved attribute(s) 'mean_lon,'mean_lat,'longitude,'latitude missing
  from gmt_msk_offset#147,utime#3,longitude#146,address_city#141,udate#29,mean_lon#371,weekDay#230,reqamt#4L,latitude#145,mean_lat#369,location_type#243,hpan#1,utimeWithTZ#218,mcc_code#14,atmid#9
  in operator 'Project [hpan#1, atmid#9, reqamt#4L,
  mcc_code#14, utime#3, udate#29, address_city#141, latitude#145,
  longitude#146, gmt_msk_offset#147, utimeWithTZ#218, weekDay#230,
  location_type#243, mean_lat#369, mean_lon#371,
  'calcDistance('latitude, 'longitude, 'mean_lat, 'mean_lon) AS distance#509].
  Attribute(s) with the same name appear in the operation:
  mean_lon,mean_lat,longitude,latitude. Please check if the right
  attribute(s) are used.;;

I tried adding aliases to the columns passed to the UDF, like this:

svWithCoordsTzAndDistancesDF.withColumn("distance",
        callUDF("calcDistance",col("latitude").as("a"), col("longitude").as("b"), col("mean_lat").as("c"), col("mean_lon").as("d")));

or wrapping the columns in a Scala Seq:

svWithCoordsTzAndDistancesDF.withColumn("distance",
        callUDF("calcDistance",JavaConverters.collectionAsScalaIterableConverter(Arrays.asList
                (col("latitude"), col("longitude"), col("mean_lat"), col("mean_lon")))
                .asScala()
                .toSeq()));

Neither attempt solved the problem.

Does anyone know a workaround for this?

The transformation flow looks like this:

ParentDF -> childDF1(as parentDF.groupBy().agg(mean())), childDF2(parentDF.filter('condition')) -> svWithCoordsTzAndDistancesDF (join childDF1 and childDF2).

I think the problem may lie in the execution plan built for this flow...

Answer 1

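This is some kind of magic. When I reference the columns through the DataFrame itself and add select("*"), it works. If anyone can explain why, I would be very grateful.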

df = df.select("*")
       .withColumn("distance", callUDF("calcDistance",
                df.col("latitude"),
                df.col("longitude"),
                df.col("mean_lat"),
                df.col("mean_lon")))
      .toDF();

AnalysisException on callUDF() inside withColumn()
