Scrapy can't handle bad headers correctly: [ScrapyHTTPPageGetter,client] Unhandled Error

Time: 2012-12-12 12:01:53

Tags: python-2.7 screen-scraping web-scraping web-crawler scrapy

Environment:

Scrapy 0.16.2, Twisted 12.2.0, Python 2.7, Mac OS X 10.6

Here is my problem:

I tried to run

scrapy shell http://aaa.17domn.com/bt9/file.php/MERH77V.html

Error:

[ScrapyHTTPPageGetter,client] Unhandled Error
    Traceback (most recent call last):
      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/selectreactor.py", line 150, in _doReadOrWrite
        why = getattr(selectable, method)()

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/tcp.py", line 202, in doRead
        return self._dataReceived(data)

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/tcp.py", line 208, in _dataReceived
        rval = self.protocol.dataReceived(data)

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/protocols/basic.py", line 564, in dataReceived
        why = self.lineReceived(line)

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.2-py2.7.egg/scrapy/core/downloader/webclient.py", line 50, in lineReceived
        return HTTPClient.lineReceived(self, line.rstrip())

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/web/http.py", line 450, in lineReceived
        self.extractHeader(self._header)

      File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/web/http.py", line 406, in extractHeader
        key, val = header.split(':',1)
    exceptions.ValueError: need more than 1 value to unpack

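I found a solution at https://groups.google.com/forum/#!msg/scrapy-users/xFKo8ggzPxs/VXDl3CZ4V4cJ, where the problem is described as being caused by Twisted. I then patched the function extractHeader in /twisted/web/http.py, following http://twistedmatrix.com/trac/ticket/2842. That works.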
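For context, the first traceback comes down to unpacking `header.split(':',1)` on a header line that contains no colon. Below is a minimal sketch of a tolerant split. Note that `split_header` is a hypothetical helper written for illustration, not the code from the Twisted ticket, and returning None for a malformed line is an assumed policy.

```python
def split_header(header):
    """Split a raw header line into (key, value).

    Hypothetical helper: returns None when the line has no colon,
    which is the case where `key, val = header.split(':', 1)` raises
    "ValueError: need more than 1 value to unpack" on Python 2.7.
    """
    key, sep, val = header.partition(':')
    if not sep:
        # Malformed header line without a colon: skip it instead of crashing.
        return None
    return key.strip(), val.strip()
```

With this sketch, `split_header("Content-Type: text/html")` gives a key/value pair, while a line with no colon gives None and can simply be ignored by the caller.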

But now I'm stuck again!

I ran it against another site

scrapy shell http://www1.wkdown.info/fs3/file.php/M994ATR.html

Error:

Traceback (most recent call last):

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Twisted-12.2.0-py2.7-macosx-10.6-intel.egg/twisted/internet/defer.py", line 551, in _runCallbacks
    current.result = callback(current.result, *args, **kw)

  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-0.16.2-py2.7.egg/scrapy/core/downloader/webclient.py", line 122, in _build_response
    status = int(self.status)

ValueError: invalid literal for int() with base 10: 'html'

I think something is wrong with the response headers, and Scrapy doesn't handle it well. Any ideas? Thanks!
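The second traceback shows `int(self.status)` receiving the string 'html'. One possible explanation, which is an assumption and not confirmed by the output above, is that the first line of the response is not a valid HTTP status line (for example, the server starts sending HTML right away) and the second whitespace-separated token of that line ends up as the status. A minimal sketch of a stricter check is below. `parse_status_line` and the sample first lines are hypothetical and not part of Scrapy or Twisted.

```python
def parse_status_line(line):
    """Parse an HTTP status line into (version, status, message).

    Hypothetical helper: returns None when the line does not look like
    "HTTP/x.y <digits> [message]", instead of letting int() fail later.
    """
    parts = line.split(None, 2)
    if len(parts) < 2:
        return None
    version, status = parts[0], parts[1]
    if not version.startswith('HTTP/') or not status.isdigit():
        # Not a status line, e.g. the response begins directly with markup.
        return None
    message = parts[2] if len(parts) > 2 else ''
    return version, int(status), message
```

Under that assumption, a well-formed line such as `HTTP/1.1 200 OK` parses normally, while a first line like `<!DOCTYPE html PUBLIC ...` is rejected up front, which would let the caller report a malformed response rather than an unhandled ValueError.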

0 answers:

No answers