Strange behaviour of Python threads

Date: 2019-03-25 10:48:49

Tags: python multithreading

I have some code that I want to parallelize in Python using threads. The function is:

def sanity(url):
    global count
    count += 1
    if count % 1000 == 0:
        print(count)
    try:
        if 'media' in url[:10]:
            url = "http://dummy.s3.amazonaws.com" + url
        req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
        ret = urllib.request.urlopen(req)
        allurls.append(url)
        return 1
    except (urllib.error.HTTPError, urllib.error.URLError,
            http.client.HTTPException, ValueError) as e:
        print(e, url)
        allurls.append(url)
        errors.append(url)
        return 0

I have a list of URLs, and I have to run the above function for each URL, so I used threads. The code is:

start = 0
arr = [0, 1000, 2000, ...15000]
for i in arr:
    threads = [threading.Thread(target=sanity, args=(url, errors, allurls,)) for url in urls[start:i]]
    [thread.start() for thread in threads]
    [thread.join() for thread in threads]
    if i == 0:
        start = 0
    else:
        start = i + 1

The code above uses threads to run the function over all the URLs in parallel. However, the results are different on every run, and they do not match the results of the serial version. What could be the problem?

Thanks for your help!

1 answer:

Answer 0: (score: 0)

I would limit the use of parallelization to the I/O-bound call to urllib.request.urlopen. One benefit is not having to deal with global or thread-local objects.
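Avoiding shared globals also sidesteps a likely source of the nondeterminism in the question: `count += 1` on a module-level global is a read-modify-write sequence, so concurrent threads can interleave and lose increments. A minimal sketch with a `threading.Lock` guarding the counter (the thread and loop counts are illustrative):

```python
import threading

count = 0
lock = threading.Lock()

def task():
    global count
    for _ in range(100_000):
        # bare `count += 1` is a read-modify-write and can lose updates
        # when threads interleave; the lock makes the update atomic
        with lock:
            count += 1

threads = [threading.Thread(target=task) for _ in range(4)]
[t.start() for t in threads]
[t.join() for t in threads]
print(count)  # always 400000 with the lock held
```

Without the `with lock:` block, the final value can come out below 400000, which is the same kind of run-to-run variation the question describes.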

The example below uses concurrent.futures. It is written as a standalone module that could easily accept argparse input, such as your list of URLs. You could also wrap the ThreadPoolExecutor in a function.

from concurrent.futures import ThreadPoolExecutor, as_completed
import http.client
import urllib.error
import urllib.request

def sanity(url):
    """Attempt an HTTP request"""
    if 'media' in url[:10]:
        url = "http://dummy.s3.amazonaws.com" + url
    req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
    ret = urllib.request.urlopen(req)
    return ret

URLS = ('collection', 'of', 'strings')
allurls = []
errors = []

# set `max_workers` to your preferred upper limit of threads
with ThreadPoolExecutor(max_workers=2) as executor:
    pool = {executor.submit(sanity, url): url for url in URLS}

    # perform error handling as each future completes;
    # `future.result()` re-raises any exception from the worker thread
    for future in as_completed(pool):
        url = pool[future]
        try:
            future.result()
        except (urllib.error.HTTPError, urllib.error.URLError,
                http.client.HTTPException, ValueError) as e:
            print(e, url)
            # append the URL to lists as appropriate
            errors.append(url)
        allurls.append(url)

# do something with `allurls` and `errors`
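Wrapping the ThreadPoolExecutor in a function, as suggested above, might look like the sketch below. The `check_urls` name and the injected `fetch` parameter are my own choices, added so the function can be exercised with a stub instead of real network calls:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def check_urls(urls, fetch, max_workers=2):
    """Run `fetch` over `urls` in a thread pool.

    Returns (allurls, errors): every URL attempted, plus the subset
    whose fetch raised. Worker exceptions surface when `future.result()`
    is called, so they are handled here rather than inside the workers.
    """
    allurls, errors = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        pool = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(pool):
            url = pool[future]
            try:
                future.result()
            except Exception as exc:  # narrow to the urllib/http errors in real use
                print(exc, url)
                errors.append(url)
            allurls.append(url)
    return allurls, errors

# stub fetch so the sketch runs without network access
def fake_fetch(url):
    if url == 'bad':
        raise ValueError('no such host')
    return url

allurls, errors = check_urls(['a', 'bad', 'b'], fake_fetch)
```

Because `as_completed` yields futures in completion order, `allurls` is no longer in input order; sort it or key results by URL if ordering matters.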