I have a database from MaxMind that gives me location information for IPs. I wrote the functions below to retrieve the country and city from an IP:
import geoip2.database

def country(ipa):
    with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
        try:
            response = reader.city(ipa)
            response = response.country.iso_code
            return response
        except:
            return 'NA'

def city(ipa):
    with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
        try:
            response = reader.city(ipa)
            response = response.city.name
            return response
        except:
            return 'NA'
I run this once a minute, applying the functions to the raddr column of a pandas DataFrame:

df['country']=df['raddr'].apply(country)
df['city']=df['raddr'].apply(city)

The problem is that each pass takes more than 3 minutes to execute. I have about 150,000 rows, and I apply both functions to every one of them.
I want to complete this in under a minute. Any suggestions?
Answer 0 (score: 2)
Your functions are not optimized. Imagine having to open the database for every single row when you apply the function. Even MaxMind's own GitHub specifically notes that creating the Reader object is expensive:
>>> # This creates a Reader object. You should use the same object
>>> # across multiple requests as creation of it is expensive.
What you should do is pass the reader as an extra keyword argument to your functions:
def country(ipa, reader):
    try:
        response = reader.city(ipa)
        response = response.country.iso_code
        return response
    except:
        return 'NA'

def city(ipa, reader):
    try:
        response = reader.city(ipa)
        response = response.city.name
        return response
    except:
        return 'NA'
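Going a step further (a sketch of my own, not part of the answer above): since both fields come from the same reader.city() response, a single hypothetical locate helper can fetch country and city in one lookup, halving the number of database reads:

```python
def locate(ipa, reader):
    """Return (country_iso_code, city_name) from a single lookup.

    Sketch only: `locate` is a hypothetical helper that assumes the
    same geoip2 response attributes used in the answer above.
    """
    try:
        response = reader.city(ipa)
        return response.country.iso_code, response.city.name
    except Exception:
        return ('NA', 'NA')
```

Both result columns can then be filled in one pass, e.g. df['country'], df['city'] = zip(*df['raddr'].map(lambda ip: locate(ip, reader))).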
Then call your apply functions with the reader as the extra keyword argument:
with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
    df['country'] = df['raddr'].apply(country, reader=reader)
    df['city'] = df['raddr'].apply(city, reader=reader)
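If many rows share the same IP (likely in per-minute batches of connection data), a further optimization worth trying is to resolve each distinct address only once and map the results back. A minimal sketch, with a stand-in lookup table in place of the real geoip2 reader:

```python
import pandas as pd

# Stand-in for reader.city(); in practice this would call the shared
# geoip2 Reader object. FAKE_DB and its entries are illustrative only.
FAKE_DB = {'1.1.1.1': ('US', 'Los Angeles'),
           '8.8.8.8': ('US', 'Mountain View')}

def lookup(ipa):
    return FAKE_DB.get(ipa, ('NA', 'NA'))

df = pd.DataFrame({'raddr': ['1.1.1.1', '8.8.8.8', '1.1.1.1']})

# Resolve each distinct IP exactly once...
resolved = {ip: lookup(ip) for ip in df['raddr'].unique()}

# ...then map the cached results back onto every row.
df['country'] = df['raddr'].map(lambda ip: resolved[ip][0])
df['city'] = df['raddr'].map(lambda ip: resolved[ip][1])
```

With 150,000 rows but far fewer distinct addresses, this reduces the number of database reads from one per row to one per unique IP.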