Running a loop in parallel with Python

Asked: 2017-11-04 01:51:01

Tags: python parallel-processing multiprocessing multicore embarrassingly-parallel

I have a process that loops through a list of IP addresses and returns some information about each of them. A simple for loop works well; my problem is running it at scale, because of Python's Global Interpreter Lock (GIL).

My goal is to run this function in parallel and take full advantage of my 4 cores, so that running 100K of these doesn't take the 24 hours it would with a normal loop.

After reading other answers here, in particular How do I parallelize a simple Python loop?, I decided to use joblib. But when I run 10 records through it (example below), it takes over 10 minutes to run. That doesn't sound right; I know I'm doing something wrong or misunderstanding something. Any help is greatly appreciated!

import pandas as pd
import numpy as np
import os as os
from ipwhois import IPWhois
from joblib import Parallel, delayed
import multiprocessing

num_core = multiprocessing.cpu_count()

iplookup = ['174.192.22.197',\
            '70.197.71.201',\
            '174.195.146.248',\
            '70.197.15.130',\
            '174.208.14.133',\
            '174.238.132.139',\
            '174.204.16.10',\
            '104.132.11.82',\
            '24.1.202.86',\
            '216.4.58.18']

The normal for loop works fine!

asn=[]
asnid=[]
asncountry=[]
asndesc=[]
asnemail = []
asnaddress = []
asncity = []
asnstate = []
asnzip = []
asndesc2 = []
ipaddr=[]
b=1
totstolookup=len(iplookup)

for i in iplookup:
    i = str(i)
    print("Running #{} out of {}".format(b,totstolookup))
    try:
        obj=IPWhois(i,timeout=15)
        result=obj.lookup_whois()
        asn.append(result['asn'])
        asnid.append(result['asn_cidr'])
        asncountry.append(result['asn_country_code'])
        asndesc.append(result['asn_description'])
        try:
            asnemail.append(result['nets'][0]['emails'])
            asnaddress.append(result['nets'][0]['address'])
            asncity.append(result['nets'][0]['city'])
            asnstate.append(result['nets'][0]['state'])
            asnzip.append(result['nets'][0]['postal_code'])
            asndesc2.append(result['nets'][0]['description'])
            ipaddr.append(i)
        except:
            asnemail.append(0)
            asnaddress.append(0)
            asncity.append(0)
            asnstate.append(0)
            asnzip.append(0)
            asndesc2.append(0)
            ipaddr.append(i)
    except:
        pass
    b+=1
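As an aside, the twelve parallel lists above can be collapsed into a single record (dict) per IP, which makes the loop body much easier to reuse as a function for parallel workers. This is just a sketch: it assumes a whois result dict shaped like the `lookup_whois()` output used above, and keeps the original convention of storing 0 for missing network fields. The helper name `build_record` is hypothetical.

```python
def build_record(ip, result):
    """Flatten one whois result dict (shaped like IPWhois.lookup_whois()
    output) into a single flat record; missing net fields become 0,
    matching the original loop's behavior."""
    # 'nets' may be missing or empty; fall back to an empty dict
    net = (result.get('nets') or [{}])[0]
    return {
        'ipaddress': ip,
        'asn': result.get('asn'),
        'asnid': result.get('asn_cidr'),
        'asncountry': result.get('asn_country_code'),
        'asndesc': result.get('asn_description'),
        'emailcontact': net.get('emails', 0),
        'address': net.get('address', 0),
        'city': net.get('city', 0),
        'state': net.get('state', 0),
        'zip': net.get('postal_code', 0),
        'ipdescrip': net.get('description', 0),
    }
```

A list of such records can then be turned into a DataFrame in one step with `pd.DataFrame(records)`.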

Function passed to joblib to run on all cores!

def run_ip_process(iplookuparray):
    asn=[]
    asnid=[]
    asncountry=[]
    asndesc=[]
    asnemail = []
    asnaddress = []
    asncity = []
    asnstate = []
    asnzip = []
    asndesc2 = []
    ipaddr=[]
    b=1
    totstolookup=len(iplookuparray)

    for i in iplookuparray:
        i = str(i)
        print("Running #{} out of {}".format(b, totstolookup))
        try:
            obj = IPWhois(i, timeout=15)
            result = obj.lookup_whois()
            asn.append(result['asn'])
            asnid.append(result['asn_cidr'])
            asncountry.append(result['asn_country_code'])
            asndesc.append(result['asn_description'])
            try:
                asnemail.append(result['nets'][0]['emails'])
                asnaddress.append(result['nets'][0]['address'])
                asncity.append(result['nets'][0]['city'])
                asnstate.append(result['nets'][0]['state'])
                asnzip.append(result['nets'][0]['postal_code'])
                asndesc2.append(result['nets'][0]['description'])
                ipaddr.append(i)
            except:
                asnemail.append(0)
                asnaddress.append(0)
                asncity.append(0)
                asnstate.append(0)
                asnzip.append(0)
                asndesc2.append(0)
                ipaddr.append(i)
        except:
            pass
        b += 1

    ipdataframe = pd.DataFrame({'ipaddress': ipaddr,
                                'asn': asn,
                                'asnid': asnid,
                                'asncountry': asncountry,
                                'asndesc': asndesc,
                                'emailcontact': asnemail,
                                'address': asnaddress,
                                'city': asncity,
                                'state': asnstate,
                                'zip': asnzip,
                                'ipdescrip': asndesc2})

    return ipdataframe

Running the process on all cores via joblib
Parallel(n_jobs=num_core)(delayed(run_ip_process)(iplookuparray) for i in iplookup)
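One likely culprit in the call above: the generator expression passes the entire `iplookuparray` to every `delayed()` call, so each of the `len(iplookup)` tasks repeats all of the lookups instead of handling one share of them. A sketch of one way to divide the work instead, assuming `run_ip_process`, `iplookup`, and `num_core` as defined above (the chunking helper here is hypothetical, not joblib's own API):

```python
# Sketch: split the IP list into one contiguous chunk per core, then
# dispatch each chunk to run_ip_process exactly once, rather than
# passing the whole list to every delayed() call.
def chunk(seq, n):
    """Split seq into n roughly equal contiguous chunks."""
    k, m = divmod(len(seq), n)
    return [seq[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n)]

# chunks = chunk(iplookup, num_core)
# frames = Parallel(n_jobs=num_core)(
#     delayed(run_ip_process)(c) for c in chunks)
# ipdataframe = pd.concat(frames, ignore_index=True)
```

Note also that whois lookups are network-bound, not CPU-bound, so the GIL is not the real bottleneck here; the per-lookup timeout (15 s) and the remote server's rate limiting will dominate the runtime regardless of how many cores are used.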

0 Answers