Scrapy (or Selenium) freezes after being redirected to a different website

Asked: 2013-12-20 01:43:48

Tags: python selenium scrapy

I am running a Scrapy CrawlSpider together with Selenium and I am hitting a strange problem. The spider crawls for a while and then freezes: it appears to do nothing, stuck at one point. This happens every time, and the only way I have found to stop the spider is to kill the PhantomJS driver process. The spider runs beautifully on external websites, but every time I try it on my custom localhost site it freezes. Here is the log:

scrapy crawl image -o test.csv -t csv
2013-12-19 18:12:43-0700 [scrapy] INFO: Scrapy 0.20.2 started (bot: cultr)
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Optional features available: ssl, http11
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'cultr.spiders', 'FEED_URI': 'test.csv', 'SPIDER_MODULES': ['cultr.spiders'], 'BOT_NAME': 'cultr', 'USER_AGENT': 'cultr (+http://cultr.business.ualberta.ca)', 'FEED_FORMAT': 'csv'}
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled item pipelines:
2013-12-19 18:12:43-0700 [image] INFO: Spider opened
2013-12-19 18:12:43-0700 [image] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-12-19 18:12:43-0700 [image] DEBUG: Crawled (200) <GET http://lh:8000/> (referer: None)
2013-12-19 18:12:43-0700 [image] DEBUG: Visiting start of site: http://lh:8000/
2013-12-19 18:12:43-0700 [image] DEBUG: Parsing images for: http://lh:8000/
2013-12-19 18:12:44-0700 [image] DEBUG: Scraped from <200 http://lh:8000/>
{'AreaList': [36864],
 'CSSImagesList': [],
 'ImageIDList': [u':wdc:1387501964546'],
 'ImagesFileNames': [u'homepage-bcorp.png'],
 'ImagesList': [],
 'PositionList': [{'x': 8, 'y': 309}],
 'SiteUrl': u'http://localhosts:8000/',
 'WidthHeightList': [{'height': 192, 'width': 192}],
 'depth': 1,
 'domain': 'http://localhosts:8000',
 'htmlImagesList': [],
 'status': 'ok',
 'totalAreaOfImages': 36864,
 'totalNumberOfImages': 1}

2013-12-19 18:13:33-0700 [image] ERROR: Spider error processing <GET http://<domain>:8000/pages/forbidden.html>
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 602, in _tick
    taskObj._oneWorkUnit()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 479, in _oneWorkUnit
    result = self._iterator.next()
  File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
    yield next(it)
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
    for x in result:
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Python/2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 67, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/Users/eddieantonio/Work/cultr/spider/cultr/spiders/ImageSpider.py", line 164, in parse_images
    driver.get(response.url)
  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 176, in get
    self.execute(Command.GET, {'url': url})
  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
    response = self.command_executor.execute(driver_command, params)
  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
    return self._request(url, method=command_info[0], data=data)
  File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 410, in _request
    resp = opener.open(request)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1187, in do_open
    r = h.getresponse(buffering=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1045, in getresponse
    response.begin()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 409, in begin
    version, status, reason = self._read_status()
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 373, in _read_status
    raise BadStatusLine(line)
httplib.BadStatusLine: ''

1 Answer:

Answer 0 (score: 0)

httplib.BadStatusLine means:

    Raised if a server responds with an HTTP status code that we don't understand.
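To see what kind of reply produces that exception, here is a minimal, self-contained sketch using Python 3's http.client (the renamed httplib from the traceback). The throwaway socket server below is not part of the original setup; it just answers with a line that is not a valid HTTP status line, which is exactly the condition that raises BadStatusLine:

```python
import socket
import threading
import http.client  # Python 3 name for the old httplib

def bad_server(listener):
    """Accept one connection and answer with something that is not an
    HTTP status line -- exactly what makes the client raise BadStatusLine."""
    conn, _ = listener.accept()
    conn.recv(1024)  # consume the request before replying
    conn.sendall(b"this is not an HTTP status line\r\n")
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=bad_server, args=(listener,), daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", port)
conn.request("GET", "/")
err = None
try:
    conn.getresponse()
except http.client.BadStatusLine as e:
    err = e
print(type(err).__name__)  # → BadStatusLine
```

A PhantomJS process that has hung or crashed mid-response can leave Selenium's remote connection reading exactly such a truncated reply, which would match the empty `''` status line in the traceback.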

I think crawling your custom site returned some error. You should fetch http://localhosts:8000/pages/forbidden.html with scrapy shell or requests to see what actually comes back.
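The shape of that check can be sketched with only the standard library (requests would work the same way). The 403 handler below is a stand-in for the custom site, not its real code; in practice you would point the URL at your own localhost:8000 instead:

```python
import threading
import urllib.request
import urllib.error
from http.server import BaseHTTPRequestHandler, HTTPServer

class ForbiddenHandler(BaseHTTPRequestHandler):
    # Stand-in for the custom site's forbidden page.
    def do_GET(self):
        self.send_response(403)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<h1>Forbidden</h1>")

    def log_message(self, *args):  # keep the output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ForbiddenHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://127.0.0.1:%d/pages/forbidden.html" % server.server_port

try:
    urllib.request.urlopen(url, timeout=5)
    status = 200
except urllib.error.HTTPError as e:
    status = e.code  # urlopen raises for 4xx/5xx responses
print(status)  # → 403
server.shutdown()
```

If the real server returns an error status (or no valid HTTP response at all, as the BadStatusLine suggests), that is what the spider is choking on.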