Question

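I am trying to download images from a list of URLs using Python. To speed up the process, I used the multiprocessing library.

The problem is that the script often hangs or freezes on its own, and I don't know why.

Here is the code I am using: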

...
import multiprocessing as mp

def getImages(val):

    # Download images
    try:
        url = ...  # placeholder: preprocess the URL from the input val
        local = ...  # placeholder: filename generated from global variables and random values
        urllib.request.urlretrieve(url,local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url )
        return 0

if __name__ == '__main__':

    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print ("tempw")

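It often gets stuck partway through the list: it prints DONE or CAN'T DOWNLOAD for roughly the first half of the URLs, but I don't know what happens to the rest. Has anyone run into this problem? I have searched for similar questions (e.g. link) but found no answer.

Thanks in advance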

Answer 1

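It looks like you are facing a GIL issue: the Python Global Interpreter Lock basically prevents Python from doing more than one task at the same time. The multiprocessing module actually launches separate instances of Python so that the work can be done in parallel.

But in your case, urllib is called in all of these instances: each of them tries to lock the IO process. The one that succeeds (e.g. the first to arrive) gets the result, while the others (which try to lock a process that is already locked) fail.

This is a very simplified explanation, but here are some additional resources:

You can find another way to parallelize requests here: Multiprocessing useless with urllib2?

More information about the GIL is available here: What is a global interpreter lock (GIL)?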
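As an illustration of "another way to parallelize requests", here is a minimal sketch that uses a thread pool instead of separate processes. It is not necessarily the approach taken in the linked question. The names `fetch` and `fetch_all` and the default values are hypothetical. The sketch relies on the fact that CPython releases the GIL while a thread is blocked on network I/O, so threads can overlap downloads.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def fetch(url, timeout=20):
    """Return the body of url as bytes, or None if the request fails."""
    try:
        # timeout bounds each blocking socket operation (connect, read)
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return None


def fetch_all(urls, workers=4):
    """Fetch every URL concurrently; results keep the order of urls."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(fetch, urls))
```

Calling `fetch_all(lst)` on the list read from urls.txt would return one entry per URL, with `None` marking the downloads that failed.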

Answer 2

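OK, I found the answer.

The likely culprit was that the script got stuck while connecting to or downloading from a URL. So I added a socket timeout to limit how long connecting and downloading an image can take.

Now the problem no longer bothers me.

Here is my full code: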

...
import multiprocessing as mp

import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):

    # Download images
    try:
        url = ...  # placeholder: preprocess the URL from the input val
        local = ...  # placeholder: filename generated from global variables and random values
        urllib.request.urlretrieve(url,local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url )
        return 0

if __name__ == '__main__':

    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]

    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)

    print ("tempw")

I hope this solution helps others who face the same problem.
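If changing the process-wide default with `socket.setdefaulttimeout` is undesirable, the timeout can instead be passed per request. The following is a minimal sketch under that assumption, not the code from the answer above. The helper name `download` and its parameters are hypothetical. It swaps `urlretrieve` for `urllib.request.urlopen`, which accepts a `timeout` argument.

```python
import shutil
import urllib.request


def download(url, local_path, timeout=20):
    """Download url to local_path; return 1 on success, 0 on failure."""
    try:
        # timeout applies to each blocking socket operation (connect, read),
        # not to the total duration of the download
        with urllib.request.urlopen(url, timeout=timeout) as response, \
                open(local_path, "wb") as out:
            shutil.copyfileobj(response, out)
        return 1
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        print("CAN'T DOWNLOAD - " + url + " (" + str(exc) + ")")
        return 0
```

Because the timeout bounds each socket operation rather than the whole transfer, a server that keeps sending data slowly can still take longer than `timeout` seconds in total.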

request.urlretrieve gets stuck in multiprocessing Python
