Question

I have a piece of code:

for url in get_lines(file):
    visit(url, timeout=timeout)

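It reads URLs from a file and visits each one in a for loop (via urllib2).

Is it possible to do this in several threads? For example, 10 visits at the same time.

I tried: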

for url in get_lines(file):
    Thread(target=visit, args=(url,), kwargs={"timeout": timeout}).start()

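But it doesn't work: it has no effect, and the URLs are still visited one after another as before.

A simplified version of the visit function: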

def visit(url, proxy_addr=None, timeout=30):
    (...)
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    return response.read()

Answer 1

To expand on senderle's answer, you can use the Pool class in multiprocessing to do this easily:

from multiprocessing import Pool
pool = Pool(processes=5)
pages = pool.map(visit, get_lines(file))

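When map returns, pages will be a list of the URLs' contents, in the same order as the input URLs. You can adjust the number of processes to whatever suits your system.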
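Note that pool.map calls visit with a single positional argument, so the timeout from the question is not passed along. A minimal sketch of one way to bind it, assuming a current Python 3 rather than the 2.5 setup in the question: the visit below is a hypothetical stand-in that touches no network, the URLs are made up, and ThreadPool (a thread-backed pool with the same map interface) is swapped in for the process-based Pool so the snippet runs anywhere without a `__main__` guard.

```python
from functools import partial
from multiprocessing.pool import ThreadPool


def visit(url, timeout=30):
    # Hypothetical stand-in for the real visit(): returns a fake page body
    # instead of making a network request.
    return "contents of %s (timeout=%s)" % (url, timeout)


# Made-up URLs standing in for get_lines(file).
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

pool = ThreadPool(processes=2)
try:
    # partial() fixes the keyword argument; map() then supplies each URL
    # as the single positional argument.
    pages = pool.map(partial(visit, timeout=10), urls)
finally:
    pool.close()
    pool.join()
```

The same partial(...) call should work with multiprocessing.Pool as long as visit is defined at module level so it can be pickled.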

Answer 2

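I suspect you've run into the Global Interpreter Lock. Basically, threads in Python cannot run Python bytecode in parallel, so threading often does not deliver the concurrency you seem to be after. (The lock is released during blocking I/O such as network reads, so threads can still overlap waits, but CPU-bound work stays serialized.) I would try multiprocessing instead.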
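Whether the GIL is actually the bottleneck depends on the workload, and a quick way to check is to time it. A rough sketch, assuming Python 3: fake_visit is a hypothetical stand-in that uses time.sleep in place of a network wait (sleeping, like blocking socket I/O, does not hold the GIL), and the delay and thread count are arbitrary.

```python
import threading
import time


def fake_visit(delay):
    # Hypothetical stand-in for a network call: sleeping blocks the thread
    # without holding the GIL, much like waiting on a socket.
    time.sleep(delay)


delay = 0.2
threads = [threading.Thread(target=fake_visit, args=(delay,)) for _ in range(5)]

start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# What the same five calls would take if run one after another.
sequential_estimate = delay * len(threads)
print("threaded: %.2fs, sequential estimate: %.2fs" % (elapsed, sequential_estimate))
```

If the threaded time comes out close to the sequential estimate for your real visit function, the work is probably not overlapping, and a process-based approach is worth trying.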

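multiprocessing is designed to have a roughly analogous interface to threading, but it has a few quirks. I believe your visit function as written above should work correctly, because it's written in a functional style, without side effects.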

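In multiprocessing, the Process class is the equivalent of the Thread class in threading. It has all the same methods, so in this case it's a drop-in replacement. (Though I suppose you could use Pool as JoeZuntz suggests, I would test with the basic Process class first, to see whether it fixes the problem.)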
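For reference, the start/join pattern that Thread and Process share can be sketched as follows. This is only an illustration, assuming Python 3: visit is a hypothetical stand-in that returns a fake body, worker and the URLs are made up, and a queue is used because the return value of a thread's target is otherwise discarded.

```python
import queue
import threading


def visit(url, timeout=30):
    # Hypothetical stand-in for the real visit(); no network access.
    return "contents of %s" % url


def worker(url, timeout, results):
    # A thread target's return value is thrown away, so hand the result
    # back through a queue instead.
    results.put((url, visit(url, timeout=timeout)))


# Made-up URLs standing in for get_lines(file).
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

results = queue.Queue()
workers = [threading.Thread(target=worker, args=(u, 10, results)) for u in urls]

for w in workers:
    w.start()
for w in workers:
    w.join()  # wait until every visit has finished

# One (url, body) pair was queued per URL.
pages = dict(results.get() for _ in urls)
```

To try the Process variant, the likely changes are threading.Thread to multiprocessing.Process and queue.Queue to multiprocessing.Queue, with the start/join code placed under an `if __name__ == "__main__":` guard on platforms that spawn new interpreters, and the results drained from the queue before joining.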

Python 2.5 - multithreaded for loop
