How can I use apply to optimize this code? (iteration count)

Time: 2018-12-01 22:38:48

Tags: python pandas

So I have the following dataframes (simplified):

    df1 = prosplat     prosplong     type
           50          45            prosp1
           34          -25           prosp2


    df2 = complat      complong      type
           29          58            competitor1
           68          34            competitor2

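I want to do the following: for every prospect (740k prospects in total), compute the distance between that prospect and each competitor, so in theory the output would look like this: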

    df3 = d_p(x)_to_c1         d_p(x)_to_c2      d_p(x)_to_c3
          234.34                895.34            324.5

Each row of the output is a new prospect.

My current code is as follows:

    prospectsarray=[]

    prosparr = []



    for i, row in prospcords.iterrows():
        lat1 = row['prosplat']
        lon1 = row['prosplong']
        coords = [lat1, lon1]
        distancearr2 = []

        for x, row2 in compcords.iterrows():
            lat2 = row2['complat']
            lon2 = row2['complong']
            coords2 = [lat2,lon2]
            distance = geopy.distance.distance(coords, coords2).miles
            if distance > 300:
                distance = 0

            distancearr2.append(distance)
        prosparr.append(distancearr2)
    prospectsarray.extend(prosparr)
    dfprosp = pd.DataFrame(prospectsarray)

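This does what I want, but it is very slow.

I tried the following optimization, but it does not produce the output I expect, and it still relies on the iteration I was trying to avoid.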

    competitorlist = []
    def distancecalc(df):
        distance_list = []
        for i in range(0, len(prospcords)):
            coords2 = [prospcords.iloc[i]['prosplat'],prospcords.iloc[i]['prosplong']]
            d = geopy.distance.distance(coords1,coords2).miles
            print(d)
            if d>300:
                d=0
            distance_list.append(d)
        competitorlist.append(distance_list)




    for x, row2 in compcords.iterrows():
        lat2 = row2['complat']
        lon2 = row2['complong']
        coords1 = [lat2,lon2]
        distancecalc(prospcords)
        print(distance_list)

3 Answers:

Answer 0: (score: 1)

My guess is that most of the execution time is spent in geopy.distance.distance(). You can confirm this with cProfile or another timing tool.
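As a rough illustration of that check, here is a minimal sketch using only the standard library. `slow_distance` is a hypothetical stand-in (a plain haversine calculation) so the sketch runs without geopy; in practice you would call `geopy.distance.distance(a, b).miles` in its place. The sample coordinates come from the question, repeated to give the profiler something to measure.

```python
import cProfile
import io
import math
import pstats


def slow_distance(a, b):
    """Hypothetical stand-in for geopy.distance.distance(a, b).miles."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * math.asin(math.sqrt(h))  # 3958.8: assumed Earth radius in miles


def all_pairs(prospects, competitors):
    """Nested loop with the same shape as the question's code."""
    return [[slow_distance(p, c) for c in competitors] for p in prospects]


prospects = [(50.0, 45.0), (34.0, -25.0)] * 500
competitors = [(29.0, 58.0), (68.0, 34.0)]

profiler = cProfile.Profile()
profiler.enable()
result = all_pairs(prospects, competitors)
profiler.disable()

# Collect the report in a string and show the ten most expensive entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

If the distance function dominates the cumulative time column, the loop overhead is not the main problem and a cheaper distance formula is the first thing to try.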

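According to the geopy documentation for distance, it computes the geodesic distance between two points using an ellipsoidal model of the Earth. The algorithm appears to be very accurate: they compare it with a deprecated algorithm that is "only accurate to 0.2 mm". My guess is that the geodesic calculation is somewhat time-consuming.

They also have a function great_circle (geopy.distance.great_circle), which uses a spherical model of the Earth. Because the Earth is not a true sphere, it has "an error of up to about 0.5%". So if the actual distance is 100 (miles/km), the result could be off by half a mile/km. Again, this is just a guess, but I suspect this algorithm is faster than the geodesic one.

If you can live with that potential error in your application, try great_circle() instead of distance().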
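If the spherical approximation is acceptable, one further option (not part of this answer, and named plainly as a swap-in) is to compute the great-circle distance yourself with the haversine formula and NumPy broadcasting, which removes the Python-level loops entirely. The sketch below assumes a mean Earth radius of 3958.8 miles; `haversine_matrix` is a hypothetical helper name, and the sample coordinates are the ones from the question.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_matrix(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distances in miles between every point in A and every point in B.

    Returns an array of shape (len(A), len(B)).
    """
    # Column vectors for A and row vectors for B, so broadcasting yields all pairs.
    lat_a = np.radians(np.asarray(lat_a, dtype=float))[:, None]
    lon_a = np.radians(np.asarray(lon_a, dtype=float))[:, None]
    lat_b = np.radians(np.asarray(lat_b, dtype=float))[None, :]
    lon_b = np.radians(np.asarray(lon_b, dtype=float))[None, :]
    h = (np.sin((lat_b - lat_a) / 2) ** 2
         + np.cos(lat_a) * np.cos(lat_b) * np.sin((lon_b - lon_a) / 2) ** 2)
    # Clip guards against rounding pushing h slightly outside [0, 1].
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))


# Sample coordinates taken from the question's simplified frames.
prosp_lat, prosp_lon = [50, 34], [45, -25]
comp_lat, comp_lon = [29, 68], [58, 34]

dist = haversine_matrix(prosp_lat, prosp_lon, comp_lat, comp_lon)
# Same rule as the question: anything beyond 300 miles is recorded as 0.
dist = np.where(dist > 300, 0.0, dist)
print(dist.shape)  # one row per prospect, one column per competitor
```

With 740k prospects the full matrix can get large, so it may be necessary to process the prospects in chunks and concatenate the results.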

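Answer 1: (score: 0)

First, you should be careful about the information you provide. The dataframe column names you gave do not match your code... Also, some explanation of what you are trying to do would help.

Anyway, here is my solution: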

    import pandas as pd
    from geopy import distance

    compCords = pd.DataFrame(
        {'compLat': [20.0, 13.0, 14.0], 'compLong': [-15.0, 5.0, -1.2]})
    prospCords = pd.DataFrame(
        {'prospLat': [21.0, 12.1, 13.0], 'prospLong': [-14.0, 2.2, 2.0]})


    def distanceCalc(compCoord):
        # Return the distances as a Series instead of appending to a global list
        prospDist = prospCords.apply(
            lambda row: distance.distance(
                compCoord, [row['prospLat'], row['prospLong']]).miles,
            axis=1)
        # Zero out distances beyond 300 miles
        return prospDist.apply(lambda d: 0. if d > 300 else d)


    # Here too, the result comes back through the return value.
    # Rows are competitors and columns are prospects; use compDist.T for one row per prospect.
    compDist = compCords.apply(
        lambda row: distanceCalc([row['compLat'], row['compLong']]), axis=1)

    dfProsp = pd.DataFrame(compDist)

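Note: your problem is that when you use something like apply with a function, you should think in a "functional" way: pass most of what you need through the function's inputs and outputs, rather than using tricks such as appending elements to a global list variable with append or extend. Those are "side effects", and side effects do not mix well with functions such as apply (or "map", as it is usually called in functional programming).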
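To make the contrast concrete, here is a small pure-Python sketch of the two styles. `toy_distance` is a hypothetical stand-in (plain Euclidean distance in degrees, not a real geographic distance), and the built-in `map` plays the role that `apply` plays in pandas.

```python
import math

competitors = [(29.0, 58.0), (68.0, 34.0)]
prospects = [(50.0, 45.0), (34.0, -25.0)]


def toy_distance(a, b):
    """Hypothetical stand-in for a real distance call."""
    return math.dist(a, b)


# Side-effect style: the function mutates a global list and returns nothing.
collected = []


def append_distances(prospect):
    collected.append([toy_distance(prospect, c) for c in competitors])


for p in prospects:
    append_distances(p)


# Functional style: the function returns its result and map() gathers the results.
def distances_for(prospect):
    return [toy_distance(prospect, c) for c in competitors]


rows = list(map(distances_for, prospects))
```

Both versions produce the same rows, but only the second one can be handed to `map` or `apply` directly, because its result travels through the return value.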

Answer 2: (score: 0)

This is the fastest solution I was able to come up with!

    import numpy as np
    import pandas as pd
    from geopy.distance import great_circle

    # df is the author's own frame (not shown): competitors first, customers after
    compuid = np.array(df.iloc[0:233, 0])
    complat = np.array(df.iloc[0:233, 3])
    complong = np.array(df.iloc[0:233, 4])
    custlat = np.array(df.iloc[234:, 3])
    custlong = np.array(df.iloc[234:, 4])

    ppmmasterlist = []
    mergedlist = []
    for x, y in np.nditer([custlat, custlong]):
        # Build coords1 from the current customer latitude/longitude (x, y)
        coords1 = [x, y]
        # Per-customer collectors: distances, scores greater than 0,
        # and the pipeline lists
        distcoll = []
        listGreaterThan0 = []
        ppmlist = []
        ppmdlist = []
        z = 0
        for p, q in np.nditer([complat, complong]):
            # Build coords2 from the current competitor latitude/longitude (p, q)
            coords2 = [p, q]
            distance = great_circle(coords1, coords2).miles
            if distance >= 300:
                distance = 0
                di = 0
            else:
                # Score falls linearly from 1 at 0 miles to 0 at 300 miles
                di = (300 - distance) / 300
                distcoll.append(distance)
                distcoll.append(compuid[z])
            if di > 0:
                listGreaterThan0.append(di)
                listGreaterThan0.append(compuid[z])
            if z >= 220:
                ppmlist.append(di)
                ppmdlist.append(distance)
            z += 1
        sumval = [sum(ppmlist)]
        sumval1 = [sum(listGreaterThan0[::2])]
        mergedlist = ppmlist + sumval + ppmdlist + sumval1 + listGreaterThan0
        mergedlist.extend(distcoll)
        # print(mergedlist)
        # ppmmasterlist += [mergedlist]
        ppmmasterlist.append(mergedlist)

    df5 = pd.DataFrame(ppmmasterlist)