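I am using pandas to count the number of people in a given age range within a certain radius around a point. For this I have two DataFrames.
The first (df1) contains town names, geographic coordinates, and population counts by age group: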
COM    LIBCOM                   POP_TOTALE  POP_0_2  POP_3_5  POP_6_10  POP_11_17  POP_18_24  latitude       longitude
01001  L'Abergement-Clémenciat         767       29       18        57         90         36  46.1534255214  4.92611354223
01002  L'Abergement-de-Varey           239       13        9        13         23          7  46.0091878776  5.42801696363
01004  Ambérieu-en-Bugey             14020      651      539       884       1245       1297  45.9608475114  5.3729257777
The second (df2) contains the points around which I want to count people:
CODE Groupe nom longitude latitude pop_18_25_0km pop_18_25_5km pop_18_25_10km pop_18_25_25km pop_18_25_50km
0 107510200 POINT 1 - PARIS 16 2.26919 48.8472 0 118575 391308 870557 1082052
1 107510400 POINT 2 - PARIS 16 2.27929 48.85667 0 151231 399637 870857 1082565
2 107510700 POINT 3 - PARIS 16 2.26 48.84536 0 91434 349967 868646 1080285
I wrote the following program, but it is not well optimized. I use the geopy library and its vincenty function to compute the distances.
from geopy.distance import vincenty

df2["pop_18_25_5km"] = 0
df2["pop_18_25_10km"] = 0
df2["pop_18_25_25km"] = 0
df2["pop_18_25_50km"] = 0
for index_df2, row_df2 in df2.iterrows():
    for index_pop, row_pop in df1.iterrows():
        try:
            distance = vincenty((str(row_pop["latitude"]), str(row_pop["longitude"])),
                                (str(row_df2["latitude"]), str(row_df2["longitude"]))).km
        except UnboundLocalError:
            distance = 0
        if distance < 5:
            df2.loc[index_df2, "pop_18_25_5km"] += row_pop["POP_18_24"]
        if distance < 10:
            df2.loc[index_df2, "pop_18_25_10km"] += row_pop["POP_18_24"]
        if distance < 25:
            df2.loc[index_df2, "pop_18_25_25km"] += row_pop["POP_18_24"]
        if distance < 50:
            df2.loc[index_df2, "pop_18_25_50km"] += row_pop["POP_18_24"]
df2.head()
I read the following article, but I don't see how to apply it to my problem: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
Thanks in advance, and sorry for my French English!
Answer 0 (score: 0)
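A cost proportional to (number of towns) x (number of points) is unavoidable, but there is still some room for optimization by replacing iteration with vectorized operations.
Sorry for the rough code, but I hope you get the idea: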
from geopy.distance import vincenty  # removed in geopy 2.0; geodesic replaces it there

df2["pop_18_25_5km"] = 0
df2["pop_18_25_10km"] = 0
df2["pop_18_25_25km"] = 0
df2["pop_18_25_50km"] = 0

def dist(point, person):
    try:
        return vincenty(
            (str(person["latitude"]), str(person["longitude"])),
            (str(point["latitude"]), str(point["longitude"]))
        ).km
    except UnboundLocalError:
        return float("inf")  # I think it's more appropriate than 0

# hopefully df2 is much smaller than df1
# also it would help to filter df1 to include only target population
for idx, point in df2.iterrows():
    # this is where most of the improvement comes from:
    # the explicit loop over df1 is replaced with a single row-wise apply
    df1[point["nom"]] = df1.apply(lambda person: dist(point, person), axis=1)
    for distance in (5, 10, 25, 50):
        # boolean mask on the distance column, then sum the target population
        df2.loc[idx, "pop_18_25_%skm" % distance] = df1.loc[df1[point["nom"]] < distance, "POP_18_24"].sum()
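Note that `df1.apply(..., axis=1)` still calls a Python function once per town. If an approximate distance is acceptable, a rough sketch of a fully vectorized variant could look like the following. It is not the geopy method used above: it swaps the ellipsoidal vincenty distance for the spherical haversine formula (typically within a fraction of a percent), and the names `haversine_km` and `population_within` are made up for this illustration.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs are degrees and may be broadcastable arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def population_within(df1, df2, radii_km=(5, 10, 25, 50), pop_col="POP_18_24"):
    """Return a copy of df2 with one pop_18_25_<r>km column per radius."""
    # Broadcasting gives a matrix of shape (len(df2), len(df1)):
    # one row per point, one column per town.
    dist = haversine_km(
        df2["latitude"].to_numpy()[:, None],
        df2["longitude"].to_numpy()[:, None],
        df1["latitude"].to_numpy()[None, :],
        df1["longitude"].to_numpy()[None, :],
    )
    pop = df1[pop_col].to_numpy()
    out = df2.copy()
    for r in radii_km:
        # Keep the population of towns closer than r km, zero elsewhere,
        # then sum over towns for each point.
        out[f"pop_18_25_{r}km"] = np.where(dist < r, pop, 0).sum(axis=1)
    return out
```

Calling `population_within(df1, df2)` returns a new frame rather than modifying `df2` in place. The full distance matrix holds len(df2) x len(df1) floats, so for a very large df2 it may be necessary to process the points in chunks.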
However, if you can safely assume a one-to-one relationship between communes (?) and points, the complexity drops considerably:
df1['point'] = some_magic()
df1['distance'] = more_magic() # .apply(), wrt df1['point']
df1[['point', 'distance', 'POP_18_24']].groupby(['point', 'distance']).sum()
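To make the placeholders a bit more concrete, here is one possible reading, offered only as a sketch: it assumes that each town counts toward its nearest point and nowhere else, and it uses a spherical haversine distance instead of geopy. The function name `nearest_point_population` and the output column names are invented for this example.

```python
import numpy as np
import pandas as pd


def nearest_point_population(df1, df2, radii_km=(5, 10, 25, 50), pop_col="POP_18_24"):
    """Sum pop_col per point, counting each town only for its nearest point (assumption)."""
    lat_t = np.radians(df1["latitude"].to_numpy())[:, None]
    lon_t = np.radians(df1["longitude"].to_numpy())[:, None]
    lat_p = np.radians(df2["latitude"].to_numpy())[None, :]
    lon_p = np.radians(df2["longitude"].to_numpy())[None, :]
    # Haversine distance matrix, shape (towns, points); 6371.0088 km is the mean Earth radius.
    a = (np.sin((lat_p - lat_t) / 2) ** 2
         + np.cos(lat_t) * np.cos(lat_p) * np.sin((lon_p - lon_t) / 2) ** 2)
    dist = 2 * 6371.0088 * np.arcsin(np.sqrt(a))

    nearest = dist.argmin(axis=1)  # position in df2 of the closest point, per town
    towns = pd.DataFrame({
        "point": df2.index.to_numpy()[nearest],
        "distance": dist[np.arange(len(df1)), nearest],
        pop_col: df1[pop_col].to_numpy(),
    })

    out = pd.DataFrame(index=df2.index)
    for r in radii_km:
        within = towns[towns["distance"] < r]
        # One groupby per radius; points with no town in range get 0.
        out[f"pop_18_25_{r}km"] = (
            within.groupby("point")[pop_col].sum().reindex(df2.index, fill_value=0)
        )
    return out
```

Under this assumption a town is never counted for two points, so the totals will differ from the all-pairs computation wherever the radii around different points overlap.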