Pandas and multiprocessing

Time: 2017-10-04 16:46:15

Tags: python pandas multiprocessing

I am using the FCC API to convert latitude/longitude coordinates into block group codes:

import pandas as pd
import numpy as np
import urllib.request
import time
import json

# getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='

getup1 = '&longitude='

getup2 = '&showall=false'

lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
 '33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
 '39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
 '32.7554883','42.331427','31.7775757','35.1495343']

long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
 '-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
 '-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
 '-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']

#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']

new_list = []

def block(x):
    for index,row in x.iterrows():
        #request url and read the output
        a = urllib.request.urlopen(getup + row['lat'] + getup1 + row['long'] + getup2).read()
        #load json output in to a form python can understand
        a1 = json.loads(a)
        #append output to an empty list.
        new_list.append(a1['Block']['FIPS'])

#call the function with latlong as the argument.        
block(latlong)

#print the list, note: it is important that function appends to the list
print(new_list)

This gives the following output:

['360610031001021', '060372074001033', '170318391001104', '482011000003087', 
 '421010005001010', '040131141001032', '480291101002041', '060730053003011', 
 '481130204003064', '060855010004004', '484530011001092', '180973910003057', 
 '120310010001023', '060750201001001', '390490040001005', '371190001005000', 
 '484391233002071', '261635172001069', '481410029001001', '471570042001018']

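The problem with this script is that I can only call the API once per row. The script takes about 5 minutes to run, which is not acceptable for the 1,000,000+ entries I plan to use it on.

I would like to use multiprocessing to parallelize this function and reduce its run time. I tried reading the multiprocessing documentation, but could not figure out how to run the function in parallel and append the output to an empty list.

FYI: I am using Python 3.6.

Any guidance would be great!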

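1 Answer:

Answer 0: (score: 1)

You don't have to implement the parallelism yourself. There are better libraries than urllib, for example requests [0] and some derivatives of it that use threads or futures [1]. I guess you need to check for yourself which one is fastest.

Because it has few dependencies, I like requests-futures best. Here is your code implemented with ten threads. The library even supports processes, if you believe or find that they are better in some way: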

import pandas as pd
import json
from concurrent.futures import ThreadPoolExecutor

from requests_futures.sessions import FuturesSession

#getup, getup1, and getup2 make up the url to the api
getup = 'http://data.fcc.gov/api/block/find?format=json&latitude='

getup1 = '&longitude='

getup2 = '&showall=false'

lat = ['40.7127837','34.0522342','41.8781136','29.7604267','39.9525839',
 '33.4483771','29.4241219','32.715738','32.7766642','37.3382082','30.267153',
 '39.768403','30.3321838','37.7749295','39.9611755','35.2270869',
 '32.7554883','42.331427','31.7775757','35.1495343']

long = ['-74.0059413','-118.2436849','-87.6297982','-95.3698028','-75.1652215',
 '-112.0740373','-98.4936282','-117.1610838','-96.7969879','-121.8863286',
 '-97.7430608','-86.158068','-81.655651','-122.4194155','-82.9987942',
 '-80.8431267','-97.3307658','-83.0457538','-106.4424559','-90.0489801']

#make lat and long in to a Pandas DataFrame
latlong = pd.DataFrame([lat,long]).transpose()
latlong.columns = ['lat','long']

def block(x):
    # one session backed by a pool of ten worker threads
    session = FuturesSession(executor=ThreadPoolExecutor(max_workers=10))
    futures = []
    for index, row in x.iterrows():
        # build the url and queue the request; session.get returns a future right away
        url = getup + row['lat'] + getup1 + row['long'] + getup2
        futures.append(session.get(url))
    new_list = []
    for future in futures:
        # wait for the response and load the json output in to a form python can understand
        a1 = json.loads(future.result().content)
        # append in submission order, so results line up with the input rows
        new_list.append(a1['Block']['FIPS'])
    return new_list

#call the function with latlong as the argument.        
new_list = block(latlong)

#print the list, note: block() now returns the list instead of appending to a global
print(new_list)
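
If you would rather not add a dependency, the same thread-pool idea can be sketched with the standard library alone. This is only a minimal sketch, not a tested replacement for the code above: the names `build_url`, `fetch_fips` and `block_parallel` are made up for illustration, and the `fetch` parameter is an assumption added so the network call can be swapped out. `ThreadPoolExecutor.map` returns results in input order, so the output list lines up with the input coordinates.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Same endpoint and query parameters as in the question.
BASE_URL = 'http://data.fcc.gov/api/block/find?format=json'


def build_url(lat, lon):
    """Assemble the request url for one latitude/longitude pair (hypothetical helper)."""
    return '{}&latitude={}&longitude={}&showall=false'.format(BASE_URL, lat, lon)


def fetch_fips(coords):
    """Request one coordinate pair and return its block FIPS code (hypothetical helper)."""
    lat, lon = coords
    with urllib.request.urlopen(build_url(lat, lon)) as response:
        payload = json.loads(response.read().decode('utf-8'))
    return payload['Block']['FIPS']


def block_parallel(lats, lons, fetch=fetch_fips, max_workers=10):
    """Run fetch over all coordinate pairs on a pool of worker threads.

    The results come back in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, zip(lats, lons)))
```

With the `lat` and `long` lists from the question, `new_list = block_parallel(lat, long)` would play the role of `block(latlong)`. Threads fit here because the work is mostly waiting on the network. For 1,000,000+ entries you would probably also want error handling per request and to stay within whatever rate limits the service imposes.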

[0] http://docs.python-requests.org/en/master/

[1] https://github.com/kennethreitz/grequests