Question

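I am trying to download a few thousand images using Python with the multiprocessing and requests libraries. Things start off fine, but about 100 images in, everything locks up and I have to kill the processes. I am using Python 2.7.6. Here is the code: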

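import re
import requests
import shutil
from multiprocessing import Pool
from urlparse import urlparse

# IMG_DST and list_of_image_urls are assumed to be defined elsewhere
# (their values are not shown in the question).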

def get_domain_name(s):
    domain_name = urlparse(s).netloc 
    new_s = re.sub('\:', '_', domain_name)  #replace colons
    return new_s

def grab_image(url):
    response = requests.get(url, stream=True, timeout=2)
    if response.status_code == 200:
        img_name = get_domain_name(url)
        with open(IMG_DST + img_name + ".jpg", 'wb') as outf:
            shutil.copyfileobj(response.raw, outf)
        del response

def main():                                        
    with open(list_of_image_urls, 'r') as f:                 
        urls = f.read().splitlines()                                                             
    urls.sort()                                    
    pool = Pool(processes=4, maxtasksperchild=2)   
    pool.map(grab_image, urls)                     
    pool.close()                                   
    pool.join()

if __name__ == "__main__":
    main()

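Edit: After changing the multiprocessing import to multiprocessing.dummy to use threads instead of processes, I still ran into the same problem. It turns out I was sometimes hitting a motion JPEG stream instead of a single image, which caused the lock-ups. To deal with this, I used a context manager and created a FileTooBigException. I haven't implemented a check to make sure I have actually downloaded an image file, or done some other housekeeping, but I thought the following code might be useful to someone: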

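import os
import socket
from contextlib import closing

# CHUNK_SIZE, LIMIT_SIZE and IMG_DST are module-level constants assumed to be
# defined elsewhere (their values are not shown in the question).

class FileTooBigException(requests.exceptions.RequestException):
    """File over LIMIT_SIZE"""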


def grab_image(url):
    try:
        img = ''
        with closing(requests.get(url, stream=True, timeout=4)) as response:
            if response.status_code == 200:
                content_length = 0
                img_name = get_domain_name(url)
                img = IMG_DST + img_name + ".jpg"
                with open(img, 'wb') as outf:
                    for chunk in response.iter_content(chunk_size=CHUNK_SIZE):
                        outf.write(chunk)
                        # Count the bytes actually received; a chunk can be
                        # shorter than CHUNK_SIZE.
                        content_length += len(chunk)
                        if content_length > LIMIT_SIZE:
                            raise FileTooBigException(response)
    except requests.exceptions.Timeout:
        pass
    except requests.exceptions.ConnectionError:
        pass
    except socket.timeout:
        pass
    except FileTooBigException:
        os.remove(img)  # discard the partial download
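
The check that a download really is an image is not implemented above. A minimal sketch of one way it might be done, assuming only JPEG files are wanted: compare the first bytes of the saved file against the JPEG signature (FF D8 FF). The names JPEG_SIGNATURE, has_jpeg_signature and file_looks_like_jpeg are hypothetical and not part of the code above.

```python
# Hypothetical helper names; only the JPEG signature bytes are a fixed fact.
JPEG_SIGNATURE = b"\xff\xd8\xff"


def has_jpeg_signature(data):
    """Return True if the given bytes start with the JPEG signature."""
    return data[:len(JPEG_SIGNATURE)] == JPEG_SIGNATURE


def file_looks_like_jpeg(path):
    """Read only the first few bytes of the file and check the signature."""
    with open(path, "rb") as f:
        return has_jpeg_signature(f.read(len(JPEG_SIGNATURE)))
```

This only inspects the header, so it would reject HTML error pages saved as .jpg but would not catch a truncated or otherwise corrupt image.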

Any suggested improvements are welcome!

Answer 1

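There is no point in using multiple processes for I/O concurrency. In network I/O, the threads involved spend most of their time waiting, and Python threads are excellent at doing nothing. So use a thread pool instead of a process pool. Each process consumes a lot of resources, which is unnecessary for I/O-bound work, whereas threads share the process state and are exactly what you are looking for.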
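
As a rough sketch of that suggestion, assuming the standard library's multiprocessing.pool.ThreadPool is acceptable: the helper below runs a worker such as grab_image over the URLs on a pool of threads. The name download_all and the default of 8 threads are illustrative assumptions, not something from the question.

```python
from multiprocessing.pool import ThreadPool


def download_all(urls, worker, num_threads=8):
    """Call worker(url) for every URL on a pool of threads.

    download_all and num_threads=8 are hypothetical choices for this sketch.
    Returns the worker results in the same order as urls.
    """
    pool = ThreadPool(processes=num_threads)
    try:
        # map() blocks until every call has returned, like Pool.map.
        return pool.map(worker, urls)
    finally:
        # Always release the threads, even if a worker raised.
        pool.close()
        pool.join()
```

In the question's main() this would be called as download_all(urls, grab_image). As the question's edit points out, switching to threads does not by itself stop a worker from hanging on an endless stream, so the size limit in grab_image is still needed.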

Downloading many images with Python requests and multiprocessing
