I have a lat_lon column in each of two PySpark dataframes (say DF1 and DF2).
DF1:
lat_lon
-84.412977,39.152501
-84.416946,39.153505
DF2:
lat_lon
-85.412977,39.152501
-85.416946,40.153505
I want to loop every element of DF1 against every element of DF2 and compute the distance with a function. Then, based on the distance, I save a count as the value corresponding to each row of DF1.
Example:
list1 = [[-84.412977, 39.152501], [-84.416946, 39.153505]]
list2 = [[-85.412977, 39.152501], [-85.416946, 40.153505]]
list4 = []
for i in range(len(list1)):
    count = 0
    for j in range(len(list2)):
        if Haversine(list1[i], list2[j]).meters < 500:
            count += 1
    list4.append(count)
How do I loop over the column of PySpark dataframe DF1 against DF2 and add the count variable (as in list4) to DF1, given that PySpark dataframes do not support indexing?
The Haversine function:
import math

class Haversine:
    '''
    Use the Haversine class to calculate the distance between
    two lat/lon coordinate pairs.
    Output distance is available in kilometers, meters, miles, and feet.
    Example usage: Haversine([lat1, lon1], [lat2, lon2]).feet
    (Note: each coordinate pair is unpacked as (lat, lon).)
    '''
    def __init__(self, coord1, coord2):
        lat1, lon1 = coord1
        lat2, lon2 = coord2
        R = 6371000  # radius of Earth in meters
        phi_1 = math.radians(lat1)
        phi_2 = math.radians(lat2)
        delta_phi = math.radians(lat2 - lat1)
        delta_lambda = math.radians(lon2 - lon1)
        a = math.sin(delta_phi / 2.0)**2 + \
            math.cos(phi_1) * math.cos(phi_2) * \
            math.sin(delta_lambda / 2.0)**2
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        self.meters = R * c                     # distance in meters
        self.km = self.meters / 1000.0          # distance in kilometers
        self.miles = self.meters * 0.000621371  # distance in miles
        self.feet = self.miles * 5280           # distance in feet
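Putting the class and the nested loop together gives a self-contained check of the pure-Python logic. This is a minimal sketch using the sample coordinates above; every pair there is over 100 km apart, so both counts come out 0.

```python
import math

class Haversine:
    '''Great-circle distance between two (lat, lon) pairs.'''
    def __init__(self, coord1, coord2):
        lat1, lon1 = coord1
        lat2, lon2 = coord2
        R = 6371000  # radius of Earth in meters
        phi_1 = math.radians(lat1)
        phi_2 = math.radians(lat2)
        delta_phi = math.radians(lat2 - lat1)
        delta_lambda = math.radians(lon2 - lon1)
        a = (math.sin(delta_phi / 2.0)**2
             + math.cos(phi_1) * math.cos(phi_2) * math.sin(delta_lambda / 2.0)**2)
        c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
        self.meters = R * c
        self.km = self.meters / 1000.0

list1 = [[-84.412977, 39.152501], [-84.416946, 39.153505]]
list2 = [[-85.412977, 39.152501], [-85.416946, 40.153505]]

# for each point in list1, count how many points in list2 lie within 500 m
list4 = []
for p1 in list1:
    list4.append(sum(1 for p2 in list2 if Haversine(p1, p2).meters < 500))
```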
I get an error when I use the suggested solution:
var=df1.crossJoin(df2)\
.withColumn("meters", haversine_udf(df1.lat, df1.lon, df2.lat, df2.lon))\
.filter("meters < 500")\
.groupBy(df1.lat, df1.lon)\
.count()
var.schema
StructType(List(StructField(lat,DoubleType,true),StructField(lon,DoubleType,true),StructField(count,LongType,false)))
var.select('count').show(1)
Py4JJavaError: An error occurred while calling o4346.showString.
: java.lang.RuntimeException: Invalid PythonUDF <lambda>(lat#126, lon#127, lat#131, lon#132), requires attributes from more than one child.
Answer 0 (score: 1)
First, let's create the dataframes:
df1 = spark.createDataFrame(list1, ['lat', 'long'])
df2 = spark.createDataFrame(list2, ['lat', 'long'])
You have to create a UDF from the Haversine function:
import pyspark.sql.functions as psf
from pyspark.sql.types import DoubleType
haversine_udf = psf.udf(lambda lat1, long1, lat2, long2: Haversine([lat1, long1], [lat2, long2]).meters, DoubleType())
Finally, to pair every element of df1 with every element of df2, you can use a crossJoin (which is expensive):
df1.crossJoin(df2)\
.withColumn("meters", haversine_udf(df1.lat, df1.long, df2.lat, df2.long))\
.filter("meters < 500")\
.groupBy(df1.lat, df1.long)\
.count()
If one of the dataframes is small, you can broadcast it, which copies it into the memory of every node:
df1.crossJoin(psf.broadcast(df2))
Answer 1 (score: 0)
This worked for me:
id1 = ds1.select('lat1', 'lon1').rdd.map(lambda l: (l[0], l[1]))  # (lat, lon) tuples from ds1
id2 = ds2.select('lat2', 'lon2').rdd.map(lambda l: (l[0], l[1]))  # (lat, lon) tuples from ds2
mm = id1.cartesian(id2)                                           # all (point1, point2) pairs
kk = mm.map(lambda l: Haversine(l[0], l[1]).meters)               # distance for every pair