Cross-loop operation over two PySpark dataframes

Asked: 2017-11-09 11:56:58

Tags: loops, pyspark

I have a lat_lon column in each of two PySpark dataframes (call them DF1 and DF2).

DF1:
lat_lon
-84.412977,39.152501
-84.416946,39.153505

DF2:
lat_lon
-85.412977,39.152501
-85.416946,40.153505

I want to cross-loop each element of DF1 against each element of DF2 and compute the distance between them with a function. Then, based on the distance, I save a count as the value corresponding to each row of DF1.

Example:

list1 = [[-84.412977, 39.152501], [-84.416946, 39.153505]]
list2 = [[-85.412977, 39.152501], [-85.416946, 40.153505]]

list4 = []
for i in range(len(list1)):
    count = 0
    for j in range(len(list2)):
        if Haversine(list1[i], list2[j]).meters < 500:
            count += 1
    list4.append(count)
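The loop above can be run end to end with a plain function standing in for the Haversine class defined further down (a minimal, self-contained sketch; the function name haversine_meters is mine):

```python
import math

def haversine_meters(coord1, coord2):
    # Great-circle distance between two (lat, lon) pairs, in meters.
    lat1, lon1 = coord1
    lat2, lon2 = coord2
    r = 6371000  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.atan2(math.sqrt(a), math.sqrt(1 - a))

list1 = [[-84.412977, 39.152501], [-84.416946, 39.153505]]
list2 = [[-85.412977, 39.152501], [-85.416946, 40.153505]]

# For each point in list1, count how many points in list2 lie within 500 m.
list4 = [sum(1 for q in list2 if haversine_meters(p, q) < 500) for p in list1]
```

With the sample coordinates every cross pair is roughly a degree apart (tens of kilometers), so every count here comes out zero.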

How can I loop the column of PySpark dataframe DF1 against DF2 and add the count variable (as in list4) to DF1, given that PySpark dataframes do not support indexing?

The Haversine class:

import math

class Haversine:
    '''
    Use the Haversine class to calculate the distance between
    two lat/lon coordinate pairs.
    Output distance is available in kilometers, meters, miles, and feet.
    Example usage: Haversine([lat1, lon1], [lat2, lon2]).feet
    (each pair is unpacked as (lat, lon), matching __init__ below)
    '''
    def __init__(self,coord1,coord2):
        lat1,lon1=coord1
        lat2,lon2=coord2

        R=6371000                               # radius of Earth in meters
        phi_1=math.radians(lat1)
        phi_2=math.radians(lat2)

        delta_phi=math.radians(lat2-lat1)
        delta_lambda=math.radians(lon2-lon1)

        a=math.sin(delta_phi/2.0)**2+\
           math.cos(phi_1)*math.cos(phi_2)*\
           math.sin(delta_lambda/2.0)**2
        c=2*math.atan2(math.sqrt(a),math.sqrt(1-a))

        self.meters=R*c                         # output distance in meters
        self.km=self.meters/1000.0              # output distance in kilometers
        self.miles=self.meters*0.000621371      # output distance in miles
        self.feet=self.miles*5280               # output distance in feet

if __name__ == "__main__":
    # Quick sanity check: points one degree of latitude apart are ~111.2 km apart.
    print(Haversine([39.0, -84.0], [40.0, -84.0]).km)

I get an error when I use the suggested solution:

var=df1.crossJoin(df2)\
    .withColumn("meters", haversine_udf(df1.lat, df1.lon, df2.lat, df2.lon))\
    .filter("meters < 500")\
    .groupBy(df1.lat, df1.lon)\
    .count()

var.schema
StructType(List(StructField(lat,DoubleType,true),StructField(lon,DoubleType,true),StructField(count,LongType,false)))


var.select('count').show(1)

Py4JJavaError: An error occurred while calling o4346.showString.
: java.lang.RuntimeException: Invalid PythonUDF <lambda>(lat#126, lon#127, lat#131, lon#132), requires attributes from more than one child.

2 Answers:

Answer 0 (score: 1):

First, let's create the dataframes:

df1 = spark.createDataFrame(list1, ['lat', 'long'])
df2 = spark.createDataFrame(list2, ['lat', 'long'])

You have to wrap the Haversine computation in a UDF:

import pyspark.sql.functions as psf
from pyspark.sql.types import DoubleType

haversine_udf = psf.udf(lambda lat1, long1, lat2, long2: Haversine([lat1, long1], [lat2, long2]).meters, DoubleType())

Finally, to pair every element of df1 with every element of df2, you can use crossJoin (which is expensive):

df1.crossJoin(df2)\
    .withColumn("meters", haversine_udf(df1.lat, df1.long, df2.lat, df2.long))\
    .filter("meters < 500")\
    .groupBy(df1.lat, df1.long)\
    .count()

If one of the dataframes is small, you can broadcast it, which copies it to the memory of every node:

df1.crossJoin(psf.broadcast(df2))

Answer 1 (score: 0):

This worked for me:

id1 = ds1.select('lat1', 'lon1').rdd.map(lambda l: (l[0], l[1]))
id2 = ds2.select('lat2', 'lon2').rdd.map(lambda l: (l[0], l[1]))

mm = id1.cartesian(id2)                              # every (df1 point, df2 point) pair
kk = mm.map(lambda l: Haversine(l[0], l[1]).meters)  # distance for each pair