Question

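I want to write a Python script that fetches URLs from a database and downloads the web pages concurrently to speed things up, instead of waiting for each page to download one after another.

According to this thread, Python doesn't allow this because something called the Global Interpreter Lock prevents launching the same script multiple times.

Before investing time in learning the Twisted framework, I'd like to make sure there isn't an easier way to do what I need.

Thanks for any tips.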

Answer 1

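Don't worry about the GIL. In your case it doesn't matter.

The easiest way to do what you want is to create a thread pool, using the threading module and one of the thread pool implementations from ASPN. Each thread in that pool can use httplib to download your web pages.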
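To make the thread pool idea concrete, here is a minimal sketch. It is not the ASPN recipe or httplib: it assumes Python 3 and swaps in the standard library's concurrent.futures.ThreadPoolExecutor and urllib.request instead. The names fetch, fetch_all, max_workers and fetch_one are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return the raw response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, max_workers=8, fetch_one=fetch):
    """Download all URLs on a pool of worker threads.

    Returns a dict mapping each URL to its body, or to the exception
    raised while fetching it, so one bad URL does not stop the rest.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the downloads overlap.
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        for url, future in futures.items():
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results
```

You would call something like fetch_all(urls_from_db, max_workers=20), where urls_from_db is whatever list your database query returns.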

Another option is the PyCURL module. It supports parallel downloads natively, so you don't have to implement them yourself.

Answer 2

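The GIL prevents you from effectively using threads for processor load balancing. Since this is not about processor load balancing but about keeping one IO wait from stalling the whole download, the GIL is not relevant here. *)

So all you need to do is create several threads or processes that download at the same time. You can do that with the threading module or the multiprocessing module.

*) Well... unless you have a gigabit connection and your problem really is that your processor gets overloaded before your network does. But that's obviously not the case here.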
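A quick way to convince yourself of this is a small experiment. The sketch below assumes Python 3 and uses time.sleep as a stand-in for waiting on the network (a blocking wait releases the GIL, just as a blocking socket read does). fake_download and run_concurrently are hypothetical names, not part of any library.

```python
import threading
import time


def fake_download(delay, results, index):
    """Pretend to download a page by waiting `delay` seconds."""
    time.sleep(delay)  # stand-in for a blocking network read
    results[index] = delay


def run_concurrently(delays):
    """Run one thread per simulated download; return (results, elapsed seconds)."""
    results = [None] * len(delays)
    threads = [
        threading.Thread(target=fake_download, args=(delay, results, i))
        for i, delay in enumerate(delays)
    ]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, time.perf_counter() - start
```

Five simulated downloads of 0.2 seconds each should finish in roughly the time of one, not the sum of all five, because the waits overlap.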

Answer 3

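I recently solved this same problem. One thing to consider is that some people don't take kindly to having their servers bogged down, and will block an IP address that does so. The standard courtesy I've heard is about 3 seconds between page requests, but this is flexible.

If you're downloading from multiple websites, you can group your URLs by domain and create one thread per domain. Then in your thread you can do something like this: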

for url in urls:
    timer = time.time()
    # ... get your content ...
    # perhaps put content in a queue to be written back to 
    # your database if it doesn't allow concurrent writes.
    while time.time() - timer < 3.0:
        time.sleep(0.5)

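Sometimes just getting your response will take the full 3 seconds, and then you don't have to worry about the delay.

Granted, this doesn't speed anything up if you're downloading from only one site, but it may keep you from getting blocked.

My machine handles about 200 threads before the overhead of managing them slows the process down. I ended up at about 40-50 pages per second.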
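Putting the grouping and the per-domain delay together, a rough Python 3 sketch could look like the following. The function names (group_by_domain, crawl_domain, crawl) and the fetch callable you pass in are assumptions for illustration. The shared dict here stands in for the queue mentioned in the comment above.

```python
import threading
import time
from collections import defaultdict
from urllib.parse import urlparse


def group_by_domain(urls):
    """Map each domain (netloc) to the list of URLs hosted on it."""
    groups = defaultdict(list)
    for url in urls:
        groups[urlparse(url).netloc].append(url)
    return dict(groups)


def crawl_domain(urls, fetch, results, lock, delay):
    """Fetch one domain's URLs in order, spacing request starts by `delay`."""
    for url in urls:
        started = time.monotonic()
        content = fetch(url)
        with lock:  # results is shared between the domain threads
            results[url] = content
        remaining = delay - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)


def crawl(urls, fetch, delay=3.0):
    """Start one thread per domain and wait for all of them to finish."""
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=crawl_domain,
                         args=(group, fetch, results, lock, delay))
        for group in group_by_domain(urls).values()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Different domains are fetched in parallel, while requests to the same domain stay at least delay seconds apart.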

Answer 4

Nowadays there are excellent Python libraries that can help you with this: urllib3 and requests.

Answer 5

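The urllib & threading (or multiprocessing) packages have everything you need for the "spider" you describe.

What you have to do is get the URLs from the DB and, for each URL, start a thread or process that grabs it.

Just as an example (the database URL retrieval is left out):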

#!/usr/bin/env python
import Queue
import threading
import urllib2
import time

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
    "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

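    def run(self):
        while True:
            # grab the next host from the queue (blocks until one is available)
            host = self.queue.get()
            try:
                # fetch the page and print its first 1024 bytes
                url = urllib2.urlopen(host)
                print url.read(1024)
            except Exception as e:
                # report the failure instead of letting the worker thread die
                print "Failed to fetch %s: %s" % (host, e)
            finally:
                # always signal completion, otherwise queue.join() blocks forever
                self.queue.task_done()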

start = time.time()
def main():

    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    #wait on the queue until everything has been processed
    queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)

Answer 6

You can look at the multiprocessing package, which avoids the GIL problem.

Answer 7

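Downloading is IO, which can be done asynchronously using non-blocking sockets or Twisted. Both of these solutions are more efficient than threading or multiprocessing.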
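For a feel of the non-blocking approach without Twisted, here is a rough sketch that assumes Python 3.7+ and uses the standard library's asyncio instead. It speaks plain HTTP/1.0 over a raw connection, so there is no TLS, no redirect handling and no header parsing. The names fetch and fetch_all are hypothetical.

```python
import asyncio


async def fetch(host, port=80, path="/"):
    """Send a bare HTTP/1.0 GET and return the raw response (headers + body)."""
    reader, writer = await asyncio.open_connection(host, port)
    try:
        request = "GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)
        writer.write(request.encode("ascii"))
        await writer.drain()
        # With HTTP/1.0 the server closes the connection when it is done,
        # so reading until EOF yields the whole response.
        return await reader.read()
    finally:
        writer.close()
        await writer.wait_closed()


async def fetch_all(targets):
    """Fetch every (host, port, path) tuple concurrently on one thread."""
    return await asyncio.gather(
        *(fetch(host, port, path) for host, port, path in targets)
    )
```

You would run it with something like asyncio.run(fetch_all([("example.com", 80, "/")])). All downloads share a single thread, and the event loop switches between them whenever one is waiting on the network.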

Answer 8

I got help from an expert on this today, and he posted some Python 2.x code for reference. http://paste.ubuntu.com/24529360/

from multiprocessing.dummy import Pool
import urllib
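def download(_id):
    # Retry until the download succeeds; give up once more than 10 attempts have failed.
    success = False
    fail_count = 0
    url = 'fuck_url/%s'%_id
    while not success:
        if fail_count>10:
            print url,'download failed'
            return
        try:
            # save the page as <_id>.html in the current directory
            urllib.urlretrieve(url,'%s.html'%_id)
            success = True
        except Exception:
            fail_count+=1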

if __name__ == '__main__':
    pool = Pool(processes=100) # 100 worker threads (multiprocessing.dummy is backed by threads, not processes)
    pool.map(download,range(30000))
    pool.close()
    pool.join()

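I prefer to use requests, because it makes it easy to add headers and a proxy.

import requests

# Hypothetical placeholders: the original snippet does not define these two names.
url = 'http://example.com/'
fullFileName = 'page.html'

hdrs = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'}
# verify=False turns off TLS certificate checks; uncomment proxies=proxyDict to go through a proxy
r = requests.get(url, headers=hdrs, verify=False)#, proxies=proxyDict)
with open(fullFileName, 'wb') as code:
    code.write(r.content)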

Downloading multiple pages at the same time?
