I am running a Scrapy CrawlSpider with Selenium and I am facing a strange problem. The spider crawls for a while and then freezes: it seems to do nothing, stuck at one point. This happens every time, so to force-stop the spider I have to kill the PhantomJS driver. The spider runs beautifully on external websites, but every time I try it on my custom localhost site, it freezes. Here is the error log:
scrapy crawl image -o test.csv -t csv
2013-12-19 18:12:43-0700 [scrapy] INFO: Scrapy 0.20.2 started (bot: cultr)
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Optional features available: ssl, http11
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE':
'cultr.spiders', 'FEED_URI': 'test.csv', 'SPIDER_MODULES': ['cultr.spiders'], 'BOT_NAME':
'cultr', 'USER_AGENT': 'cultr (+http://cultr.business.ualberta.ca)', 'FEED_FORMAT': 'csv'}
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats,
TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled downloader middlewares:
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware,
DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware,
RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware,
OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Enabled item pipelines:
2013-12-19 18:12:43-0700 [image] INFO: Spider opened
2013-12-19 18:12:43-0700 [image] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items
(at 0 items/min)
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-12-19 18:12:43-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-12-19 18:12:43-0700 [image] DEBUG: Crawled (200) <GET http://lh:8000/>
(referer: None)
2013-12-19 18:12:43-0700 [image] DEBUG: Visiting start of site:http://lh:8000/
2013-12-19 18:12:43-0700 [image] DEBUG: Parsing images for:http://lh:8000/
2013-12-19 18:12:44-0700 [image] DEBUG: Scraped from <200 http://lh:8000/>
{'AreaList': [36864],
'CSSImagesList': [],
'ImageIDList': [u':wdc:1387501964546'],
'ImagesFileNames': [u'homepage-bcorp.png'],
'ImagesList': [],
'PositionList': [{'x': 8, 'y': 309}],
'SiteUrl': u'http://localhosts:8000/',
'WidthHeightList': [{'height': 192, 'width': 192}],
'depth': 1,
'domain': 'http://localhosts:8000',
'htmlImagesList': [],
'status': 'ok',
'totalAreaOfImages': 36864,
'totalNumberOfImages': 1}
2013-12-19 18:13:33-0700 [image] ERROR: Spider error processing <GET http://<domain>:8000/pages/forbidden.html>
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/base.py", line 800, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 602, in _tick
        taskObj._oneWorkUnit()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/task.py", line 479, in _oneWorkUnit
        result = self._iterator.next()
      File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/Library/Python/2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
        yield next(it)
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
        for x in result:
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Library/Python/2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 67, in _parse_response
        cb_res = callback(response, **cb_kwargs) or ()
      File "/Users/eddieantonio/Work/cultr/spider/cultr/spiders/ImageSpider.py", line 164, in parse_images
        driver.get(response.url)
      File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 176, in get
        self.execute(Command.GET, {'url': url})
      File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 162, in execute
        response = self.command_executor.execute(driver_command, params)
      File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 349, in execute
        return self._request(url, method=command_info[0], data=data)
      File "/Library/Python/2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 410, in _request
        resp = opener.open(request)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 404, in open
        response = self._open(req, data)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 422, in _open
        '_open', req)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
        result = func(*args)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1214, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1187, in do_open
        r = h.getresponse(buffering=True)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 1045, in getresponse
        response.begin()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 409, in begin
        version, status, reason = self._read_status()
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 373, in _read_status
        raise BadStatusLine(line)
    httplib.BadStatusLine: ''
Answer 0 (score: 0)
httplib.BadStatusLine means:

    Raised if a server responds with an HTTP status code that we don't understand.
I think some error is being returned when crawling your custom site. You should fetch http://localhosts:8000/pages/forbidden.html with scrapy shell or requests and inspect the result.
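A quick way to do that check is sketched below with the standard library only (the 403 handler here is a made-up stand-in for whatever the custom localhost site actually returns; against the real site you would point the same fetch at http://localhosts:8000/pages/forbidden.html):

```python
import http.server
import threading
import urllib.request
import urllib.error

class ForbiddenHandler(http.server.BaseHTTPRequestHandler):
    # Stand-in for the custom site: always answer 403 Forbidden.
    def do_GET(self):
        self.send_response(403)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"forbidden")

    def log_message(self, *args):
        pass                              # silence per-request logging

server = http.server.HTTPServer(("127.0.0.1", 0), ForbiddenHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/pages/forbidden.html" % server.server_address[1]
try:
    resp = urllib.request.urlopen(url, timeout=5)
    status = resp.status
except urllib.error.HTTPError as err:
    status = err.code                     # non-2xx responses raise HTTPError
print("server answered with status", status)
server.shutdown()
```

If the fetch hangs, times out, or raises instead of returning a clean status, that points at the custom server rather than the spider.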