pyspark - Error in haversine formula

Time: 2018-04-22 05:16:20

Tags: pyspark haversine

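I am trying to implement a haversine_distance calculator in pyspark. I am reusing Python code that I have used before for the same purpose, so this is what I did:

1. Implemented the haversine_distance function as a udf.
2. Used it on my dataframe to calculate the distance between two latitude/longitude points.
3. Checked the resulting values, which look correct.

But when I try to put a where clause on dist_km, I get an error: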

File "<stdin>", line 13, in haversine_distance
TypeError: a float is required

Sample values from the dataframe upd_join_final66_df:

+------------+------------+--------+---------+---------+
|delv_lat_upd|delv_lng_upd|LATITUDE|LONGITUDE|  dist_km|
+------------+------------+--------+---------+---------+
|     33.7871|     -84.382| 33.7602| -84.5398|14.893161|
|     33.7871|     -84.382| 33.7602| -84.5398|14.893161|
|     33.7874|    -84.3822| 33.7602| -84.5398|14.881769|
|     33.7874|    -84.3822| 33.7602| -84.5398|14.881769|
|     33.7874|    -84.3822| 33.7602| -84.5398|14.881769|
|     33.7874|    -84.3822| 33.7602| -84.5398|14.881769|
|     33.7874|    -84.3822| 33.7602| -84.5398|14.881769|
|     33.7872|    -84.3822| 33.7602| -84.5398| 14.87728|
|     33.7872|    -84.3822| 33.7602| -84.5398| 14.87728|
|     33.7872|    -84.3822| 33.7602| -84.5398| 14.87728|
|     33.7872|    -84.3822| 33.7602| -84.5398| 14.87728|
|     33.7872|    -84.3822| 33.7602| -84.5398| 14.87728|
|     33.7871|    -84.3821| 33.7602| -84.5398|14.884104|
|     33.7871|    -84.3821| 33.7602| -84.5398|14.884104|
|     33.7869|     -84.382| 33.7602| -84.5398|14.888724|
|     33.7871|    -84.3825| 33.7602| -84.5398|14.847878|
|     33.7869|    -84.3818| 33.7602| -84.5398|14.906844|
|     33.7869|    -84.3818| 33.7602| -84.5398|14.906844|
|     33.7869|    -84.3818| 33.7602| -84.5398|14.906844|
|     33.7869|    -84.3818| 33.7602| -84.5398|14.906844|
+------------+------------+--------+---------+---------+

Code:

from math import radians, cos, sin, asin, sqrt, atan2, pi

def haversine_distance(lat1, lon1, lat2, lon2):

    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """

    deg2rad = pi/180.0

    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = (lon2 - lon1) 
    dlat = (lat2 - lat1) 

    a = sin(dlat/2.0)**2.0 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2.0
    c = 2.0 * asin(sqrt(a))

    #c = 2.0 * atan2(sqrt(a), sqrt(1.0-a))

    r = 6372.8 # Radius of earth in kilometers. Use 3956 for miles #No rounding R = 3959.87433 (miles), 6372.8(km)

    return c * r

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

haversine_distance_udf = udf(haversine_distance, FloatType())

upd_join_final66_df = upd_join_final66_df.withColumn(
    'dist_km',
    haversine_distance_udf(
        upd_join_final66_df['LATITUDE'], upd_join_final66_df['LONGITUDE'],
        upd_join_final66_df['delv_lat_upd'], upd_join_final66_df['delv_lng_upd']))

upd_join_final66_df.registerTempTable("fac66")
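
For reference, the function above computes the standard haversine great-circle distance. The symbols below are chosen here for illustration and do not appear in the original post: \varphi is latitude and \lambda is longitude, both in radians, and r is the earth radius (6372.8 km in the code).

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
   + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
c &= 2\,\arcsin\!\left(\sqrt{a}\right) \\
d &= r\,c
\end{aligned}
```

The commented-out line in the code, c = 2.0 * atan2(sqrt(a), sqrt(1.0-a)), is an equivalent way to compute the same c.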

When I run the following command as a check, there is no error:

spark.sql("select delv_lat_upd, delv_lng_upd, LATITUDE, LONGITUDE, dist_km \
from fac66 \
").show()

When I try to query on dist_km, I get this error:

An error occurred while calling o4882.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1041.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1041.0 (TID 81684, 10.0.0.15, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 171, in main
    process()
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 166, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 220, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 138, in dump_stream
    for obj in iterator:
  File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 209, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 70, in <lambda>
    return lambda *a: f(*a)
  File "<stdin>", line 13, in haversine_distance
TypeError: a float is required

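2 answers:

Answer 0 (score: 0)

The problem was caused by null values in the data, which break the formula. A null column value reaches the Python UDF as None, and math.radians raises a TypeError when it is given None. The unfiltered show() probably succeeded because it only had to evaluate the first 20 rows, none of which contained a null.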
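
The failure can be reproduced outside Spark. The sketch below assumes only that a null coordinate arrives in the UDF as None; the helper name to_radians_or_error is hypothetical and exists just to show which exception type math.radians raises.

```python
from math import radians


def to_radians_or_error(value):
    """Return radians(value), or the exception type name if conversion fails.

    Hypothetical helper for illustration; not part of the original post.
    """
    try:
        return radians(value)
    except TypeError as exc:
        return type(exc).__name__


print(to_radians_or_error(33.7871))  # a float is accepted
print(to_radians_or_error(None))     # prints "TypeError"
```

The exact wording of the message depends on the Python version; the traceback in the question shows the form "a float is required".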

Answer 1 (score: 0)

from math import radians, cos, sin, asin, sqrt, atan2

def haversine_distance(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance in kilometers between two points
    on the earth (specified in decimal degrees).
    Returns None if any coordinate is null.
    """
    # A null column value arrives here as None, and radians(None) raises TypeError
    if lat1 is None or lon1 is None or lat2 is None or lon2 is None:
        return None

    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2.0) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2.0) ** 2
    c = 2.0 * asin(sqrt(a))

    # equivalent: c = 2.0 * atan2(sqrt(a), sqrt(1.0 - a))

    r = 6372.8  # Radius of earth in kilometers. Use 3956 for miles (unrounded: 3959.87433 miles, 6372.8 km)

    return c * r
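
A null-safe version like this can be checked in plain Python before it is wrapped in a udf. The sketch below restates the same logic in a compact function (the name haversine_km is hypothetical) and compares it with the first row of the sample table in the question.

```python
from math import radians, cos, sin, asin, sqrt

EARTH_RADIUS_KM = 6372.8  # same radius as in the question's code


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, or None if any coordinate is None."""
    if any(v is None for v in (lat1, lon1, lat2, lon2)):
        return None
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2.0) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * asin(sqrt(a))


# First row of the sample table: LATITUDE, LONGITUDE, delv_lat_upd, delv_lng_upd
d = haversine_km(33.7602, -84.5398, 33.7871, -84.382)
print(d)  # about 14.89, in line with the dist_km column

# A missing coordinate now yields None instead of raising TypeError
print(haversine_km(None, -84.5398, 33.7871, -84.382))  # None
```

Returning None means Spark stores a null in dist_km for those rows, so a where clause on dist_km simply skips them.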