So I have the following data frames (simplified):
df1 = propslat prosplong type
50 45 prosp1
34 -25 prosp2
df2 = complat complong type
29 58 competitor1
68 34 competitor2
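I want to do the following: for each prospect (740,000 prospects in total), calculate the distance between that prospect and every competitor, so in theory the output would look like this: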
df3 = d_p(x)_to_c1 d_p(x)_to_c2 d_p(x)_to_c3
234.34 895.34 324.5
Each row of the output is a new prospect.
My current code is below:
import geopy.distance
import pandas as pd

prospectsarray = []
prosparr = []
# One outer pass per prospect, one inner pass per competitor
for i, row in prospcords.iterrows():
    lat1 = row['prosplat']
    lon1 = row['prosplong']
    coords = [lat1, lon1]
    distancearr2 = []
    for x, row2 in compcords.iterrows():
        lat2 = row2['complat']
        lon2 = row2['complong']
        coords2 = [lat2, lon2]
        distance = geopy.distance.distance(coords, coords2).miles
        # Distances beyond 300 miles are recorded as 0
        if distance > 300:
            distance = 0
        distancearr2.append(distance)
    prosparr.append(distancearr2)
prospectsarray.extend(prosparr)
dfprosp = pd.DataFrame(prospectsarray)
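This achieves my goal, but it is extremely slow.
I tried the following optimization, but the output did not iterate correctly, and it still relies on the iteration I was trying to avoid.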
competitorlist = []
def distancecalc(df):
    distance_list = []
    for i in range(0, len(prospcords)):
        coords2 = [prospcords.iloc[i]['prosplat'], prospcords.iloc[i]['prosplong']]
        d = geopy.distance.distance(coords1, coords2).miles
        print(d)
        if d > 300:
            d = 0
        distance_list.append(d)
    competitorlist.append(distance_list)
for x, row2 in compcords.iterrows():
    lat2 = row2['complat']
    lon2 = row2['complong']
    coords1 = [lat2, lon2]
    distancecalc(prospcords)
    print(distance_list)
Answer 0: (score: 1)
My guess is that most of the execution time is spent in geopy.distance.distance(). You can confirm this with cProfile or another timing tool.
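To check where the time goes, something along the following lines may help. It is only a sketch: slow_distance and all_pairs are hypothetical stand-ins for the real geopy call and the nested loop in the question, and the sample coordinates are taken from the simplified data frames.

```python
import cProfile
import io
import math
import pstats


def slow_distance(a, b):
    # Hypothetical stand-in for an expensive call such as geopy.distance.distance()
    return math.hypot(a[0] - b[0], a[1] - b[1])


def all_pairs(prospects, competitors):
    # Same shape as the question's nested loop: one row per prospect
    return [[slow_distance(p, c) for c in competitors] for p in prospects]


prospects = [(50.0, 45.0), (34.0, -25.0)] * 500
competitors = [(29.0, 58.0), (68.0, 34.0)]

profiler = cProfile.Profile()
profiler.enable()
result = all_pairs(prospects, competitors)
profiler.disable()

# Collect the report into a string, sorted by cumulative time per function
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

If the distance function dominates the cumulative time column, swapping it for a cheaper one is likely to matter more than restructuring the loops.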
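According to the geopy documentation for distance, it calculates the geodesic distance between two points using an ellipsoidal model of the Earth. The algorithm appears to be very accurate: the documentation compares it with a deprecated algorithm that was "only accurate to 0.2 mm". My guess is that the geodesic distance is somewhat time-consuming to compute.
geopy also has a function great_circle (geopy.distance.great_circle), which uses a spherical model of the Earth. Because the Earth is not a true sphere, it has an "error of up to about 0.5%". So if the actual distance is 100 (miles/km), the result could be off by half a mile/km. Again, this is only a guess, but I suspect this algorithm is faster than the geodesic one.
If you can tolerate that potential error in your application, try great_circle() instead of distance().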
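To illustrate what a spherical model involves, here is a minimal haversine sketch using only the standard library. It is not geopy's implementation; the function names and the mean Earth radius of 3958.8 miles are assumptions made for this example, and the 300-mile cap mirrors the question's code.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius for the spherical model


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against rounding error for near-antipodal points
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def capped(distance, limit=300.0):
    # Same rule as the question: anything beyond the limit is recorded as 0
    return 0.0 if distance > limit else distance


# First prospect and first competitor from the simplified data frames
d = haversine_miles(50, 45, 29, 58)
print(d, capped(d))
```

A formula like this needs only a handful of trigonometric calls per pair, which is why a spherical model tends to be cheaper than an iterative ellipsoidal solution.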
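Answer 1: (score: 0)
First, you should be careful about the information you provide. The data frame column names you gave are not consistent with your code... Also, some explanation of what you are trying to do would be helpful.
Anyway, here is my solution: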
import pandas as pd
from geopy import distance

compCords = pd.DataFrame(
    {'compLat': [20.0, 13.0, 14.0], 'compLong': [-15.0, 5.0, -1.2]})
prospCords = pd.DataFrame(
    {'prospLat': [21.0, 12.1, 13.0], 'prospLong': [-14.0, 2.2, 2.0]})

def distanceCalc(compCoord):
    # return the list of result instead of using append() method
    propsDist = prospCords.apply(
        lambda row: distance.distance(
            compCoord, [
                row['prospLat'], row['prospLong']]).miles, axis=1)
    # clean data in a pandas Series
    return propsDist.apply(lambda d: 0. if d > 300 else d)

# Here too return the list through the output
compDist = compCords.apply(lambda row: distanceCalc(
    [row['compLat'], row['compLong']]), axis=1)
dfProsp = pd.DataFrame(compDist)
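Note: your problem is that when you use things like apply and functions, you should think in a "functional" way: pass most of what you need through the function's inputs and outputs rather than using tricks such as appending elements to a global list variable with append or extend, because those are "side effects", and side effects do not get along well with functional-style constructs such as the apply function (or "map", as it is usually called in functional programming).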
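The same idea can be sketched without pandas. In the example below every function returns its result instead of appending to a global list. The names capped_distances, distance_matrix and toy_distance are hypothetical, and toy_distance is a deliberately simple placeholder rather than a real geographic distance.

```python
def capped_distances(prospect, competitors, distance_fn, limit=300.0):
    # Build one output row and return it; nothing outside the function is modified
    distances = (distance_fn(prospect, c) for c in competitors)
    return [0.0 if d > limit else d for d in distances]


def distance_matrix(prospects, competitors, distance_fn, limit=300.0):
    # One row per prospect, one column per competitor
    return [capped_distances(p, competitors, distance_fn, limit) for p in prospects]


def toy_distance(a, b):
    # Placeholder metric (Manhattan distance), used only to keep the example self-contained
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


matrix = distance_matrix([(0, 0), (10, 10)], [(1, 1), (400, 0)], toy_distance)
print(matrix)
```

Because the distance function is passed in as an argument, a real one can be substituted later without touching the code that builds the rows.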
Answer 2: (score: 0)
This is the fastest solution I could come up with!
import numpy as np
import pandas as pd
from geopy.distance import great_circle

# Rows 0-232 hold the competitors, rows 234 onward hold the customers
compuid = np.array(df.iloc[0:233, 0])
complat = np.array(df.iloc[0:233, 3])
complong = np.array(df.iloc[0:233, 4])
custlat = np.array(df.iloc[234:, 3])
custlong = np.array(df.iloc[234:, 4])
ppmmasterlist = []
mergedlist = []
for x, y in np.nditer([custlat, custlong]):
    # Take coords1 from the customer NumPy arrays, iterating them in step as x, y
    coords1 = [x, y]
    # Instantiate the distance collection list, the list of scores greater
    # than 0, and the pipeline lists
    distcoll = []
    listGreaterThan0 = []
    ppmlist = []
    ppmdlist = []
    z = 0
    for p, q in np.nditer([complat, complong]):
        # Take coords2 from the competitor NumPy arrays, iterating them in step as p, q
        coords2 = [p, q]
        distance = great_circle(coords1, coords2).miles
        if distance >= 300:
            distance = 0
            di = 0
        elif distance < 300:
            # Score decays linearly from 1 at 0 miles to 0 at 300 miles
            di = ((300 - distance) / 300)
        distcoll.append(distance)
        distcoll.append(compuid[z])
        if di > 0:
            listGreaterThan0.append(di)
            listGreaterThan0.append(compuid[z])
        if z >= 220:
            ppmlist.append(di)
            ppmdlist.append(distance)
        z += 1
    sumval = [sum(ppmlist)]
    sumval1 = [sum(listGreaterThan0[::2])]
    mergedlist = ppmlist + sumval + ppmdlist + sumval1 + listGreaterThan0
    mergedlist.extend(distcoll)
    #print(mergedlist)
    #ppmmasterlist += [mergedlist]
    ppmmasterlist.append(mergedlist)
df5 = pd.DataFrame(ppmmasterlist)
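The scoring rule inside the inner loop may be easier to follow in isolation. The sketch below restates it as a small function; proximity_score and its cutoff parameter are names invented for this example, and the sample distances are made up.

```python
def proximity_score(distance_miles, cutoff=300.0):
    # At or beyond the cutoff the competitor contributes nothing
    if distance_miles >= cutoff:
        return 0.0
    # Otherwise the score falls linearly from 1.0 at 0 miles to 0.0 at the cutoff
    return (cutoff - distance_miles) / cutoff


sample_distances = [0.0, 150.0, 300.0, 450.0]  # made-up values in miles
scores = [proximity_score(d) for d in sample_distances]
total = sum(scores)
print(scores, total)
```

Summing the scores per customer corresponds to what the answer stores in sumval and sumval1.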