Pandas optimization for computing data

Time: 2018-03-15 22:23:21

Tags: python pandas optimization

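I use pandas to count the number of people in a given age range living within some radius of a point. For this I have 2 DataFrames:

One (df1) contains town names, geographic coordinates, and the population broken down by age group: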

       LIBCOM                   POP_TOTALE  POP_0_2  POP_3_5  POP_6_10  POP_11_17  POP_18_24  latitude       longitude
COM
01001  L'Abergement-Clémenciat  767         29       18       57        90         36         46.1534255214  4.92611354223
01002  L'Abergement-de-Varey    239         13       9        13        23         7          46.0091878776  5.42801696363
01004  Ambérieu-en-Bugey        14020       651      539      884       1245       1297       45.9608475114  5.3729257777

The other (df2) contains the points around which I want to count people:

   CODE Groupe  nom                 longitude  latitude  pop_18_25_0km  pop_18_25_5km  pop_18_25_10km  pop_18_25_25km  pop_18_25_50km
0  107510200    POINT 1 - PARIS 16  2.26919    48.8472   0              118575         391308          870557          1082052
1  107510400    POINT 2 - PARIS 16  2.27929    48.85667  0              151231         399637          870857          1082565
2  107510700    POINT 3 - PARIS 16  2.26       48.84536  0              91434          349967          868646          1080285

I wrote the following program, but it is not well optimized. I use the geopy library and its vincenty function to compute distances.

from geopy.distance import vincenty  # removed in geopy 2.0, where geodesic replaces it

df2["pop_18_25_5km"] = 0
df2["pop_18_25_10km"] = 0
df2["pop_18_25_25km"] = 0
df2["pop_18_25_50km"] = 0
for index_df2, row_df2 in df2.iterrows():
    for index_pop, row_pop in df1.iterrows():
        try:
            distance = vincenty((str(row_pop["latitude"]), str(row_pop["longitude"])),
                                (str(row_df2["latitude"]), str(row_df2["longitude"]))).km
        except UnboundLocalError:
            distance = 0
        if distance < 5:
            df2.loc[index_df2, "pop_18_25_5km"] += row_pop["POP_18_24"]
        if distance < 10:
            df2.loc[index_df2, "pop_18_25_10km"] += row_pop["POP_18_24"]
        if distance < 25:
            df2.loc[index_df2, "pop_18_25_25km"] += row_pop["POP_18_24"]
        if distance < 50:
            df2.loc[index_df2, "pop_18_25_50km"] += row_pop["POP_18_24"]
df2.head()

I read the following article, but I don't know how to apply it to my problem: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6

Thanks in advance, and sorry for my French English!

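1 Answer:

Answer 0: (score: 0)

Some complexity proportional to the number of points is unavoidable, but there is still room to optimize by replacing iteration with vectorized operations.

Sorry for the rough code, but I hope you get the idea: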

from geopy.distance import vincenty  # removed in geopy 2.0, where geodesic replaces it

df2["pop_18_25_5km"] = 0
df2["pop_18_25_10km"] = 0
df2["pop_18_25_25km"] = 0
df2["pop_18_25_50km"] = 0

def dist(point, person):
    # distance in km between one row of df2 (point) and one row of df1 (person)
    try:
        return vincenty(
            (str(person["latitude"]), str(person["longitude"])),
            (str(point["latitude"]), str(point["longitude"]))
        ).km
    except UnboundLocalError:
        return float("inf")  # I think it's more appropriate than 0

# hopefully df2 is much smaller than df1
# also it would help to filter df1 to include only target population
for idx, point in df2.iterrows():
    # this is where most of the improvement comes from:
    # the explicit loop over df1 is replaced by a single row-wise apply
    # (axis=1 passes each row of df1 to dist)
    df1[point["nom"]] = df1.apply(lambda person: dist(point, person), axis=1)
    for distance in (5, 10, 25, 50):
        # .loc selects the rows within the radius and the population column
        df2.loc[idx, "pop_18_25_%skm" % distance] = df1.loc[df1[point["nom"]] < distance, "POP_18_24"].sum()
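
The row-wise apply above still calls a Python function once per town. As a rough sketch of what a fully vectorized version might look like, the block below swaps vincenty for the haversine (great-circle) formula, which is a spherical approximation and not what the question uses, so the distances differ slightly from geopy's. The names haversine_km and population_within are made up for illustration, and the code assumes latitude/longitude are numeric degrees.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, NumPy-broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    # Clip guards against rounding pushing a slightly outside [0, 1].
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def population_within(df1, df2, radii_km=(5, 10, 25, 50), pop_col="POP_18_24"):
    """Return a copy of df2 with one pop_18_25_<r>km column per radius."""
    # Shape (len(df2), len(df1)): distance from every point to every town.
    dist = haversine_km(
        df2["latitude"].to_numpy(dtype=float)[:, None],
        df2["longitude"].to_numpy(dtype=float)[:, None],
        df1["latitude"].to_numpy(dtype=float)[None, :],
        df1["longitude"].to_numpy(dtype=float)[None, :],
    )
    pop = df1[pop_col].to_numpy()
    out = df2.copy()
    for r in radii_km:
        # Boolean mask times the population vector sums the towns inside r.
        out["pop_18_25_%skm" % r] = (dist < r) @ pop
    return out
```

Calling df2 = population_within(df1, df2) would fill all four columns at once. The full distance matrix has len(df2) x len(df1) entries, so for a large df2 it may be necessary to process the points in chunks to keep memory under control.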

However, if you can safely assume a one-to-one relationship between communes (?) and points, it greatly reduces the complexity:

df1['point'] = some_magic()     # placeholder: the point assigned to each commune
df1['distance'] = more_magic()  # placeholder: .apply(), wrt df1['point']
df1[['point', 'distance', 'POP_18_24']].groupby(['point', 'distance']).sum()
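
To make the idea a bit more concrete, here is a minimal sketch under two assumptions that are not in the original: df1 already carries a point column holding the nom of the single point each commune belongs to (how that assignment is produced is left open, as in the answer), and distances use the haversine approximation instead of vincenty. The function names are hypothetical. With the assignment known, only one distance per commune is computed.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


def population_by_assigned_point(df1, df2, radii_km=(5, 10, 25, 50),
                                 pop_col="POP_18_24"):
    """Sum pop_col per point and radius; df1["point"] is an assumed column."""
    # Attach the coordinates of each commune's assigned point.
    towns = df1.merge(
        df2[["nom", "latitude", "longitude"]],
        left_on="point", right_on="nom", suffixes=("", "_point"),
    )
    # One distance per commune instead of one per (commune, point) pair.
    towns["distance"] = haversine_km(
        towns["latitude"].to_numpy(dtype=float),
        towns["longitude"].to_numpy(dtype=float),
        towns["latitude_point"].to_numpy(dtype=float),
        towns["longitude_point"].to_numpy(dtype=float),
    )
    # One groupby per radius; points with no commune inside a radius get 0.
    return pd.DataFrame({
        "pop_18_25_%skm" % r:
            towns.loc[towns["distance"] < r].groupby("point")[pop_col].sum()
        for r in radii_km
    }).fillna(0)
```

The result is indexed by point name, so it can be joined back onto df2 through the nom column. Points whose communes all lie beyond the largest radius do not appear in the result at all.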