ImportError when using gevent and the requests async module

Time: 2012-04-22 10:14:06

Tags: python concurrency gevent

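I am writing a simple script that:

  1. Loads a big list of URLs
  2. Uses requests' async module
  3. Fetches the content of each URL with concurrent HTTP requests
  4. Parses the page content with lxml to check whether a link is on the page
  5. If the link is present on the page, saves some information about the page in a ZODB database

When I test the script with 4 or 5 URLs it works well; the only message I get when the script ends is: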

     Exception KeyError: KeyError(45989520,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored
    

However, when I try to check about 24000 URLs, it fails toward the end of the list (when about 400 URLs are left to check) with the following error:

    Traceback (most recent call last):
      File "check.py", line 95, in <module>
      File "/home/alex/code/.virtualenvs/linka/local/lib/python2.7/site-packages/requests/async.py", line 83, in map
      File "/home/alex/code/.virtualenvs/linka/local/lib/python2.7/site-packages/gevent-1.0b2-py2.7-linux-x86_64.egg/gevent/greenlet.py", line 405, in joinall
    ImportError: No module named queue
    Exception KeyError: KeyError(45989520,) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored
    

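I tried both the gevent version available on PyPI and the latest version (1.0b2), downloaded and installed from the gevent repository.

I cannot understand why this happens, or why it happens only when I check a large number of URLs. Any suggestions?

Here is the entire script: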

    from requests import async, defaults
    from lxml import html
    from urlparse import urlsplit
    from gevent import monkey
    from BeautifulSoup import UnicodeDammit
    from ZODB.FileStorage import FileStorage
    from ZODB.DB import DB
    import transaction
    import persistent
    import random
    
    storage = FileStorage('Data.fs')
    db = DB(storage)
    connection = db.open()
    root = connection.root()
    monkey.patch_all()
    defaults.defaults['base_headers']['User-Agent'] = "Mozilla/5.0 (Windows NT 5.1; rv:11.0) Gecko/20100101 Firefox/11.0"
    defaults.defaults['max_retries'] = 10
    
    
    def save_data(source, target, anchor):
        root[source] = persistent.mapping.PersistentMapping(dict(target=target, anchor=anchor))
        transaction.commit()
    
    
    def decode_html(html_string):
        converted = UnicodeDammit(html_string, isHTML=True)
        if not converted.unicode:
            raise UnicodeDecodeError(
                "Failed to detect encoding, tried [%s]",
                ', '.join(converted.triedEncodings))
        # print converted.originalEncoding
        return converted.unicode
    
    
    def find_link(html_doc, url):
        decoded = decode_html(html_doc)
        doc = html.document_fromstring(decoded.encode('utf-8'))
        for element, attribute, link, pos in doc.iterlinks():
            if attribute == "href" and link.startswith('http'):
                netloc = urlsplit(link).netloc
                if "example.org" in netloc:
                    return (url, link, element.text_content().strip())
        else:
            return False
    
    
    def check(response):
        if response.status_code == 200:
            html_doc = response.content
            result = find_link(html_doc, response.url)
            if result:
                source, target, anchor = result
                # print "Source: %s" % source
                # print "Target: %s" % target
                # print "Anchor: %s" % anchor
                # print
                save_data(source, target, anchor)
        global todo
        todo = todo -1
        print todo
    
    def load_urls(fname):
        with open(fname) as fh:
            urls = set([url.strip() for url in fh.readlines()])
            urls = list(urls)
            random.shuffle(urls)
            return urls
    
    if __name__ == "__main__":
    
        urls = load_urls('urls.txt')
        rs = []
        todo = len(urls)
        print "Ready to analyze %s pages" % len(urls)
        for url in urls:
            rs.append(async.get(url, hooks=dict(response=check), timeout=10.0))
        responses = async.map(rs, size=100)
        print "DONE."
    

3 Answers:

Answer 0 (score: 1)

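I'm not sure what the root of your problem is, but why is your monkey.patch_all() not at the top of the file?

Could you try putting

    from gevent import monkey; monkey.patch_all()

at the top of your main program and see whether it fixes anything?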

Answer 1 (score: 0)

I'm a very big n00b, but I can try anyway! I guess you could try changing your import list to the following:

    from requests import async, defaults
    import requests
    from lxml import html
    from urlparse import urlsplit
    from gevent import monkey
    import gevent
    from BeautifulSoup import UnicodeDammit
    from ZODB.FileStorage import FileStorage
    from ZODB.DB import DB
    import transaction
    import persistent
    import random

Try this and tell me whether it works. I guess it should solve your problem :)

Answer 2 (score: 0)

Good day. I think this is an open Python bug, Issue1596321: http://bugs.python.org/issue1596321
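
For context, that issue concerns the `Exception KeyError ... in <module 'threading'> ignored` message rather than the ImportError. As far as I understand it, at exit the threading module removes the main thread from a registry keyed by thread identifier, and the removal fails if the identifier seen at exit is not the one used at registration. The following is a toy model of that mechanism only (Python 3 syntax); every name in it is hypothetical and it is not the real threading source.

```python
# Toy model only: the names below are hypothetical and this is not
# the real threading module source.
active = {}  # registry of live threads, keyed by identifier


def real_ident():
    return 1001  # identifier seen when the registry was filled


def patched_ident():
    return 2002  # identifier seen later, e.g. after monkey patching


get_ident = real_ident
active[get_ident()] = "MainThread"  # registration at import time

get_ident = patched_ident  # the identifier source changes afterwards

try:
    del active[get_ident()]  # cleanup attempted at interpreter exit
    outcome = "clean exit"
except KeyError as exc:
    outcome = "KeyError(%r)" % exc.args[0]

print(outcome)  # KeyError(2002)
```

If this model applies, the message would be harmless noise at shutdown and separate from the failure seen with 24000 URLs.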