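I want to get the latitude for about 100,000 entries in a pandas DataFrame. Since geopy only lets me query with a one-second delay between requests, I want to make sure I don't query duplicates (and most entries should be duplicates, since there aren't that many cities).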
import time
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="xxx")
df['lat'] = 0  # column that the loop below fills in
for x in range(1, len(df)):
    for y in range(1, x):
        if df['Location'][y] == df['Location'][x]:
            df['lat'][x] = df['lat'][y]
        else:
            location = geolocator.geocode(df['Location'][x])
            time.sleep(1.2)
            df.at[x, 'lat'] = location.latitude
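The idea is to check whether the location is already in the list and to query geopy only if it is not. Somehow it is extremely slow and does not seem to do what I intended. Any help or hints would be appreciated.
Answer 0 (score: 0)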
Imports (see the geopy documentation for information on the Nominatim geocoder):
import pandas as pd
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here") # specify your application name
Generate some data with locations:
d = ['New York, NY', 'Seattle, WA', 'Philadelphia, PA',
'Richardson, TX', 'Plano, TX', 'Wylie, TX',
'Waxahachie, TX', 'Washington, DC']
df = pd.DataFrame(d, columns=['Location'])
print(df)
           Location
0      New York, NY
1       Seattle, WA
2  Philadelphia, PA
3    Richardson, TX
4         Plano, TX
5         Wylie, TX
6    Waxahachie, TX
7    Washington, DC
Use a dictionary to geocode only the unique Location values (per this SO post), storing lat and lon as a tuple in a single column of the DataFrame:
locations = df['Location'].unique()
# Create dict of geoencodings
d = dict(zip(
    locations,
    pd.Series(locations)
    .apply(geolocator.geocode, timeout=10)  # assumes the 10 was meant as the request timeout in seconds
    .apply(lambda x: (x.latitude, x.longitude))  # get tuple of latitude and longitude
))
# Map dict to `Location` column
df['city_coord'] = df['Location'].map(d)
# Split single column of tuples into multiple (2) columns
df[['lat','lon']] = pd.DataFrame(df['city_coord'].tolist(), index=df.index)
print(df)
           Location                  city_coord        lat          lon
0      New York, NY   (40.7308619, -73.9871558)  40.730862   -73.987156
1       Seattle, WA  (47.6038321, -122.3300624)  47.603832  -122.330062
2  Philadelphia, PA   (39.9524152, -75.1635755)  39.952415   -75.163575
3    Richardson, TX   (32.9481789, -96.7297206)  32.948179   -96.729721
4         Plano, TX   (33.0136764, -96.6925096)  33.013676   -96.692510
5         Wylie, TX   (33.0151201, -96.5388789)  33.015120   -96.538879
6    Waxahachie, TX   (32.3865312, -96.8483311)  32.386531   -96.848331
7    Washington, DC   (38.8950092, -77.0365625)  38.895009   -77.036563
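One caveat with the approach above: geopy's geocode returns None when it finds no match, so the lambda would raise an AttributeError for any location that cannot be resolved. Below is a minimal sketch of a None-safe cache in plain Python. The names `build_coord_cache`, `fake_geocode`, and `Point` are hypothetical, and the stub stands in for the real geocoder so the example runs offline.

```python
from collections import namedtuple

# Stand-in for a geocoder result (assumption: only .latitude and
# .longitude are needed from it).
Point = namedtuple("Point", ["latitude", "longitude"])


def build_coord_cache(locations, geocode_fn):
    """Geocode each distinct location once; map misses to (None, None)."""
    cache = {}
    for name in locations:
        if name in cache:
            continue  # already looked up, so no second query
        result = geocode_fn(name)
        if result is None:
            cache[name] = (None, None)
        else:
            cache[name] = (result.latitude, result.longitude)
    return cache


# Hypothetical offline stub that records every query it receives.
calls = []


def fake_geocode(query):
    calls.append(query)
    known = {"Plano, TX": Point(33.0, -96.7)}
    return known.get(query)  # None for unknown places


cache = build_coord_cache(["Plano, TX", "Nowhere", "Plano, TX"], fake_geocode)
```

With the real service, `geolocator.geocode` (ideally rate limited) would take the place of the stub, and the resulting dictionary could be passed to `df['Location'].map(...)` exactly as in the answer.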
Answer 1 (score: 0)
Prepare the initial DataFrame:
import pandas as pd
df = pd.DataFrame({
    'some_meta': [1, 2, 3, 4],
    'city': ['london', 'paris', 'London', 'moscow'],
})
df['city_lower'] = df['city'].str.lower()
df
Out[1]:
   some_meta    city city_lower
0          1  london     london
1          2   paris      paris
2          3  London     london
3          4  moscow     moscow
Create a new DataFrame with the unique cities:
df_uniq_cities = df['city_lower'].drop_duplicates().to_frame()
df_uniq_cities
Out[2]:
  city_lower
0     london
1      paris
3     moscow
Run geopy's geocoding on the new DataFrame:
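The deduplication only helps if equivalent spellings map to the same key, which is why the answer lowercases the city first. As a rough plain-Python illustration of the same idea (the function name `unique_normalized` is hypothetical, and stripping whitespace is an extra assumption not made in the answer):

```python
def unique_normalized(cities):
    """Return the distinct normalized keys in first-seen order."""
    seen = {}
    for city in cities:
        key = city.strip().lower()  # 'London ' and 'london' share one key
        seen.setdefault(key, city)  # remember the first spelling seen
    return list(seen)


unique_keys = unique_normalized(['london', 'paris', 'London ', 'moscow'])
```

Only the three distinct keys would then need to be sent to the geocoder.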
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="specify_your_app_name_here")
from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
df_uniq_cities['location'] = df_uniq_cities['city_lower'].apply(geocode)
# Or, instead, do this to get a nice progress bar:
# from tqdm import tqdm
# tqdm.pandas()
# df_uniq_cities['location'] = df_uniq_cities['city_lower'].progress_apply(geocode)
df_uniq_cities
Out[3]:
  city_lower                                           location
0     london  (London, Greater London, England, SW1A 2DU, UK...
1      paris  (Paris, Île-de-France, France métropolitaine, ...
3     moscow  (Москва, Центральный административный округ, М...
Merge the initial DataFrame with the new one:
df_final = pd.merge(df, df_uniq_cities, on='city_lower', how='left')
df_final['lat'] = df_final['location'].apply(lambda location: location.latitude if location is not None else None)
df_final['long'] = df_final['location'].apply(lambda location: location.longitude if location is not None else None)
df_final
Out[4]:
   some_meta    city city_lower                                           location        lat       long
0          1  london     london  (London, Greater London, England, SW1A 2DU, UK...  51.507322  -0.127647
1          2   paris      paris  (Paris, Île-de-France, France métropolitaine, ...  48.856610   2.351499
2          3  London     london  (London, Greater London, England, SW1A 2DU, UK...  51.507322  -0.127647
3          4  moscow     moscow  (Москва, Центральный административный округ, М...  55.750446  37.617494
The key to solving the timeout problem is geopy's RateLimiter class. See the documentation for more details: https://geopy.readthedocs.io/en/1.18.1/#usage-with-pandas
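To illustrate what a minimum delay between calls means, here is a simplified sketch of the idea in plain Python. It is not geopy's implementation (the real class does more, for example retrying on errors). The names `rate_limited`, `fake_now`, and `fake_sleep` are hypothetical, and a fake clock is used so the demo finishes instantly.

```python
import time


def rate_limited(func, min_delay_seconds, clock=time.monotonic, sleep=time.sleep):
    """Wrap func so consecutive calls start at least min_delay_seconds apart."""
    last_call = [None]  # start time of the previous call, if any

    def wrapper(*args, **kwargs):
        if last_call[0] is not None:
            wait = min_delay_seconds - (clock() - last_call[0])
            if wait > 0:
                sleep(wait)  # pause only for the remaining part of the delay
        last_call[0] = clock()
        return func(*args, **kwargs)

    return wrapper


# Demo with a fake clock and fake sleep (assumptions for offline testing).
fake_now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


limited = rate_limited(str.upper, 1.0, clock=lambda: fake_now[0], sleep=fake_sleep)
results = [limited("london"), limited("paris")]
```

The first call goes through immediately and the second waits for the full delay, which mirrors how wrapping `geolocator.geocode` keeps `apply` from sending requests too quickly.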