Fast apply in pandas

Time: 2020-09-29 02:46:41

Tags: python pandas dataframe

I have a database from MaxMind that gives me location information for an IP address. I wrote the functions below to retrieve the city and country for an IP:

import geoip2.database
def country(ipa):
    with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
        try:
            response = reader.city(ipa)
            response = response.country.iso_code
            return response
        except:
            return 'NA'
        
def city(ipa):
    with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
        try:
            response = reader.city(ipa)
            response = response.city.name
            return response
        except:
            return 'NA'

I run this processing every minute and apply the functions to the raddr column of a pandas DataFrame:

df['country']=df['raddr'].apply(country)
df['city']=df['raddr'].apply(city)

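The problem is that each run takes more than 3 minutes. I have about 150,000 rows, and I apply both functions to every row.

I would like to get this done in under a minute. Any suggestions?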

1 Answer:

Answer 0: (score: 2)

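Your functions are not optimized. As written, they open the database again for every row that apply processes. Even MaxMind's GitHub page specifically notes that creating the Reader object is expensive: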

>>> # This creates a Reader object. You should use the same object
>>> # across multiple requests as creation of it is expensive.

What you should do is pass the reader to your functions as an additional keyword argument:

def country(ipa, reader):
    # Reuse the already-open reader instead of creating one per call.
    try:
        return reader.city(ipa).country.iso_code
    except Exception:
        # Lookup failed (e.g. address not in the database or not a valid IP).
        return 'NA'

def city(ipa, reader):
    # Same lookup as above, but return the city name.
    try:
        return reader.city(ipa).city.name
    except Exception:
        return 'NA'

Then call apply with the additional keyword argument:

with geoip2.database.Reader('/home/jupyter/GeoIP2-City.mmdb') as reader:
    df['country'] = df['raddr'].apply(country, reader=reader)
    df['city'] = df['raddr'].apply(city, reader=reader)
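
A possible further refinement, not part of the approach above: the two apply calls still look up every address twice, and repeated addresses are looked up again each time. The sketch below does one lookup per distinct IP and returns both fields together. The helper name lookup_locations is hypothetical, and it only assumes that reader has a city(ip) method returning an object with country.iso_code and city.name, as in the functions above.

```python
def lookup_locations(ips, reader):
    """Return {ip: (country_iso_code, city_name)}, one lookup per distinct IP.

    `reader` is assumed to be an already-open geoip2 Reader (or any object
    exposing the same `city(ip)` method); this helper is illustrative only.
    """
    locations = {}
    for ip in set(ips):  # deduplicate so repeated addresses cost one lookup
        try:
            response = reader.city(ip)
            locations[ip] = (response.country.iso_code, response.city.name)
        except Exception:
            # Same fallback as the functions above when the lookup fails.
            locations[ip] = ('NA', 'NA')
    return locations
```

Inside the same with block, this could be used as locations = lookup_locations(df['raddr'], reader), followed by df['country'] = df['raddr'].map(lambda ip: locations[ip][0]) and df['city'] = df['raddr'].map(lambda ip: locations[ip][1]). How much this helps depends on how many addresses repeat in raddr, so it is worth timing on your own data.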