Python: using multiprocessing.map

Asked: 2016-04-28 19:54:36

Tags: python multiprocess

I am using the multiprocessing module for parallel URL retrieval. My code looks like this:

import re
import urllib2
import multiprocessing

pat = re.compile(r"(?P<url>https?://[^\s]+)")

def resolve_url(text):
    # `text` looks like "I am at here. http://..." (i.e. it may contain a URL)
    missing = 0
    bad = 0
    url = 'before'
    long_url = 'after'
    match = pat.search(text)
    if not match:
        missing = 1
    else:
        url = match.group("url")
        try:
            # follow redirects and record the final (expanded) URL
            long_url = urllib2.urlopen(url).url
        except Exception:  # network/HTTP failure of any kind
            bad = 1
    return (url, long_url, missing, bad)

if __name__ == '__main__':
    pool = multiprocessing.Pool(100)
    resolved_urls = pool.map(resolve_url, checkin5)  # checkin5 is a list of texts

The problem is that my checkin5 list contains 600,000 elements, and this parallel job really takes time. I want to track how many elements have been resolved while it runs. With a simple for loop, I could do this:

import time

resolved_urls = []
now = time.time()
for i, element in enumerate(checkin5):
    resolved_urls.append(resolve_url(element))
    if i % 1000 == 0 and i:  # report timing for every block of 1000 elements
        print("from %d to %d: %2.5f seconds" % (i - 1000, i, time.time() - now))
        now = time.time()

But now I need it to be more efficient, so multiprocessing is necessary. I just don't know how to track progress in that case. Any ideas?

By the way, to check whether the approach above would also work here, I tried some toy code:

import multiprocessing
import time

def cal(x):
    res = x * x
    return res

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)

    # time the whole map in one go
    t0 = time.time()
    result_list = pool.map(cal, range(1000000))
    print(time.time() - t0)

    # try to report progress while iterating over the results
    t0 = time.time()
    for i, result in enumerate(pool.map(cal, range(1000000))):
        if i % 100000 == 0:
            print("%d elements have been calculated, %2.5f" % (i, time.time() - t0))
            t0 = time.time()

The output was:

0.465271949768
0 elements have been calculated, 0.45459
100000 elements have been calculated, 0.02211
200000 elements have been calculated, 0.02142
300000 elements have been calculated, 0.02118
400000 elements have been calculated, 0.01068
500000 elements have been calculated, 0.01038
600000 elements have been calculated, 0.01391
700000 elements have been calculated, 0.01174
800000 elements have been calculated, 0.01098
900000 elements have been calculated, 0.01319

Judging from these results, I think the single-process approach does not work here: it looks as if pool.map is called first, and enumerate only starts once the computation has finished and the complete list has been returned... Am I right?
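To double-check this, here is a further toy sketch (with a hypothetical slow function standing in for real per-item work); it also suggests that pool.map only returns once every element has been processed:

import multiprocessing
import time

def slow(x):
    time.sleep(0.01)  # stand-in for per-item work such as a URL fetch
    return x * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    t0 = time.time()
    results = pool.map(slow, range(200))  # blocks until all 200 items are done
    print("pool.map returned after %2.5f seconds" % (time.time() - t0))
    # only at this point could enumerate(results) begin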

2 Answers:

Answer 0 (score: 2)

You should be able to do this with Pool.imap or Pool.imap_unordered, depending on whether or not you care about the ordering of the results. They are both non-blocking...

resolved_urls = []
pool = multiprocessing.Pool(100)
res = pool.imap(resolve_url, checkin5)  # returns an iterator that yields results as they complete

for x in res:
    resolved_urls.append(x)
    print('finished one')
    # ... whatever counting/tracking code you want here
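For instance, to keep the per-1000 timing report from the question (a sketch that reuses resolve_url and checkin5 from above):

import time

now = time.time()
resolved_urls = []
for i, x in enumerate(pool.imap(resolve_url, checkin5)):
    resolved_urls.append(x)
    if i % 1000 == 0 and i:
        print('from %d to %d: %2.5f seconds' % (i - 1000, i, time.time() - now))
        now = time.time()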

Answer 1 (score: 1)

First of all, I believe @danf1024 has the answer. This note addresses the slowdown when switching from pool.map to pool.imap.

Here is a small experiment:

from multiprocessing import Pool


def square(x):
    return x * x


N = 10 ** 4
l = list(range(N))


def test_map(n=N):
    list(Pool().map(square, l))

# In [3]: %timeit -n10 q.test_map()
# 10 loops, best of 3: 14.2 ms per loop


def test_imap(n=N):
    list(Pool().imap(square, l))

# In [4]: %timeit -n10 q.test_imap()
# 10 loops, best of 3: 232 ms per loop


def test_imap1(n=N):
    list(Pool(processes=1).imap(square, l))

# In [5]: %timeit -n10 q.test_imap1()
# 10 loops, best of 3: 191 ms per loop


def test_map_naive(n=N):
    # cast map to list in python3
    list(map(square, l))

# In [6]: %timeit -n10 q.test_map_naive()
# 10 loops, best of 3: 1.2 ms per loop

Because squaring is a cheap operation compared to, say, downloading and parsing a web page, parallelization pays off when each worker process can receive large, uninterrupted chunks of input. That is not the case with imap, which performs very poorly on my 4 cores. Interestingly, restricting the pool to a single process makes imap faster, because the contention between workers is removed.
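One likely way to narrow that gap (an assumption worth verifying on your own machine) is imap's chunksize argument: map batches the input into large chunks automatically, while imap dispatches one element at a time by default. Extending the experiment above:

def test_imap_chunked(n=N):
    # pass an explicit chunksize so imap hands out work in large
    # batches, as map does internally, rather than element by element
    list(Pool().imap(square, l, chunksize=1000))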

However, as you move toward more expensive operations, the difference between imap and map becomes smaller and smaller.
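A quick way to check that (a sketch with a hypothetical slow_square, where the sleep stands in for network latency):

import time

def slow_square(x):
    time.sleep(0.001)  # simulate expensive per-item work, e.g. fetching a URL
    return x * x

def test_map_slow(n=10**3):
    list(Pool().map(slow_square, range(n)))

def test_imap_slow(n=10**3):
    list(Pool().imap(slow_square, range(n)))

# When the per-item cost dominates, the two should take roughly the
# same time, so imap's incremental results come essentially for free.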