Vectorizing a PySpark great-circle distance calculation

Asked: 2018-03-14 19:30:59

Tags: python pyspark vectorization databricks

I have 2 datasets:

df: {'key': int, 'latitude': num, 'longitude': num}
cent: {'latitude': num, 'longitude': num, 'income': num}

I basically need to compute the distance between every row of df and every row of cent, and add up the income whenever the distance is <= 5000 meters.

The output needs to be, for each row of df, the sum of income over every point within 5000 meters.

I wrote a non-vectorized solution, shown below. It seems to work, but it takes forever. I need help vectorizing it; I don't even know where to start.

I'm doing all of this with pyspark in databricks.

Thanks!

from math import sin, cos, sqrt, atan2, radians

def calcdist(lat1, lon1, lat2, lon2):  # computes the distance in meters between two points
  R = 6371  # Earth radius in kilometers
  lat1 = radians(lat1)
  lon1 = radians(lon1)
  lat2 = radians(lat2)
  lon2 = radians(lon2)
  dlon = lon2 - lon1
  dlat = lat2 - lat1
  # haversine formula
  a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
  c = 2 * atan2(sqrt(a), sqrt(1 - a))
  distance = R * c * 1000  # kilometers to meters
  return distance

rdd = []

for row in df.toLocalIterator():
  # accumulators for the current df row
  x = 0  # income weighted by population
  y = 0  # total population
  z = 0  # total houses
  for row2 in cent.toLocalIterator():
    d = calcdist(lat1=float(row.lat), lon1=float(row.lon),
                 lat2=float(row2.latitude), lon2=float(row2.longitude))
    if d <= 5000:
      x = x + row2.income * row2.pop
      y = y + row2.pop
      z = z + row2.houses
  rdd.append((row.key, x, y, z))

rdd = sc.parallelize(rdd)

df2 = rdd.toDF(['id', 'soma_faixa_renda', 'total_pop', 'total_houses'])

df2.show()

1 Answer:

Answer 0 (score: 0):

I assume you have 2 dataframes with the columns [id, latitude1, longitude1] and [latitude2, longitude2, income] (key doesn't seem like a good column name, as it may raise some errors). First, you can turn calcdist into a udf function to compute the distance:

import pyspark.sql.functions as fn
from pyspark.sql.types import DoubleType
calcdist_udf = fn.udf(calcdist, returnType=DoubleType())
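
Before plugging the udf into a join, it may help to sanity-check the plain Python function locally. At the equator one degree of latitude spans roughly 111.2 km, so the value below is just an illustrative check:

print(calcdist(0.0, 0.0, 1.0, 0.0))  # ~111194.9 meters, i.e. about 111.2 km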

Then, you can cross join the two dataframes and apply the created udf function to compute the distance:

df_join = df.crossJoin(cent)
df_dist = df_join.select(fn.col('id'), fn.col('income'),
                         calcdist_udf('latitude1', 'longitude1', 'latitude2', 'longitude2').alias('distance'))
df_dist.cache()  # cache first because of a bug in Spark
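
Since the question is about vectorization, it is worth noting that a Python udf is still evaluated row by row in a Python worker process. On Spark 2.1+ (an assumption about your Databricks runtime), the same haversine formula can be expressed with built-in column functions, which keeps the whole computation in the JVM. A minimal sketch; haversine_m and df_dist_native are hypothetical names:

def haversine_m(lat1, lon1, lat2, lon2):
    # same haversine formula as calcdist, built from native Spark column expressions
    dlat = fn.radians(lat2) - fn.radians(lat1)
    dlon = fn.radians(lon2) - fn.radians(lon1)
    a = fn.sin(dlat / 2) ** 2 + fn.cos(fn.radians(lat1)) * fn.cos(fn.radians(lat2)) * fn.sin(dlon / 2) ** 2
    return 2 * fn.atan2(fn.sqrt(a), fn.sqrt(1 - a)) * 6371 * 1000

df_dist_native = df_join.select(
    fn.col('id'), fn.col('income'),
    haversine_m(fn.col('latitude1'), fn.col('longitude1'),
                fn.col('latitude2'), fn.col('longitude2')).alias('distance'))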

Now you have a dataframe with the computed distance. Just filter out the rows above the distance threshold and aggregate as needed:

df_agg = (df_dist
          .where(df_dist['distance'] <= 5000)
          .groupby('id')
          .agg(fn.sum('income').alias('sum_income'),
               fn.count('income').alias('n_house')))
df_agg_pandas = df_agg.toPandas()  # save to a pandas dataframe
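
The select above only carries id, income, and distance, while the loop in the question also accumulates income * pop, pop, and houses. Here is a sketch of the same filter-then-aggregate pattern with those extra columns carried through; the pop and houses column names are taken from the question's loop, so treat them as assumptions about cent:

df_dist_full = df_join.select(
    fn.col('id'), fn.col('income'), fn.col('pop'), fn.col('houses'),
    calcdist_udf('latitude1', 'longitude1', 'latitude2', 'longitude2').alias('distance'))

df_agg_full = (df_dist_full
               .where(fn.col('distance') <= 5000)
               .groupby('id')
               .agg(fn.sum(fn.col('income') * fn.col('pop')).alias('soma_faixa_renda'),
                    fn.sum('pop').alias('total_pop'),
                    fn.sum('houses').alias('total_houses')))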