I'm new to Python and Scrapy, and I have the following spider:
# scrapy_error.py
import scrapy

class TextScrapper(scrapy.Spider):
    name = "tripadvisor"
    start_urls = [
        "https://www.tripadvisor.com/Hotel_Review-g312741-d306930-Reviews-Holiday_Inn_Express_Puerto_Madero-Buenos_Aires_Capital_Federal_District.html"
    ]

    def parse(self, response):
        # collect the links to the full-review pages
        full_review_page_links = response.xpath('//div[@class="quote isNew"]/a/@href').extract()
        res = [detail_link for detail_link in full_review_page_links]
        if res:
            # follow the first full-review link
            yield scrapy.Request("https://www.tripadvisor.com/" + res[0])
Every time I run this spider with

> scrapy runspider scrapy_error.py

I get the following error:
2016-05-11 15:00:50 [scrapy] INFO: Spider opened
2016-05-11 15:00:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-11 15:00:50 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-11 15:00:52 [scrapy] DEBUG: Redirecting (301) to <GET https://www.tripadvisor.com/Hotel_Review-g312741-d306930-Reviews-Holiday_Inn_Express_Puerto_Madero-Buenos_Aires_Capital_Federal_District.html> from <GET https://www.tripadvisor.com/Holiday+Inn+Express+PUERTO+MADERO>
2016-05-11 15:00:54 [scrapy] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g312741-d306930-Reviews-Holiday_Inn_Express_Puerto_Madero-Buenos_Aires_Capital_Federal_District.html> (referer: None)
review_370284375
New Item is Added To The Data Collection
2016-05-11 15:00:54 [scrapy] ERROR: Spider error processing <GET https://www.tripadvisor.com/Hotel_Review-g312741-d306930-Reviews-Holiday_Inn_Express_Puerto_Madero-Buenos_Aires_Capital_Federal_District.html> (referer: None)
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
GeneratorExit
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object iter_errback at 0x040ECB48> ignored
Unhandled error in Deferred:
2016-05-11 15:00:54 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1194, in run
self.mainLoop()
File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1203, in mainLoop
self.runUntilCurrent()
File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 825, in runUntilCurrent
call.func(*call.args, **call.kw)
File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 671, in _tick
taskObj._oneWorkUnit()
--- <exception caught here> ---
File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 517, in _oneWorkUnit
result = next(self._iterator)
File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\utils\defer.py", line 63, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\core\scraper.py", line 183, in _process_spidermw_output
self.crawler.engine.crawl(request=output, spider=spider)
File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\core\engine.py", line 198, in crawl
self.schedule(request, spider)
File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\core\engine.py", line 204, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\core\scheduler.py", line 51, in enqueue_request
if not request.dont_filter and self.df.request_seen(request):
File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\dupefilters.py", line 48, in request_seen
fp = self.request_fingerprint(request)
File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\dupefilters.py", line 56, in request_fingerprint
return request_fingerprint(request)
File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\utils\request.py", line 53, in request_fingerprint
fp.update(to_bytes(canonicalize_url(request.url)))
File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\utils\url.py", line 67, in canonicalize_url
path = safe_url_string(_unquotepath(path)) or '/'
File "C:\Python27\lib\site-packages\w3lib\url.py", line 97, in safe_url_string
to_native_str(parts.netloc.encode('idna')),
File "C:\Python27\lib\encodings\idna.py", line 164, in encode
result.append(ToASCII(label))
File "C:\Python27\lib\encodings\idna.py", line 73, in ToASCII
raise UnicodeError("label empty or too long")
exceptions.UnicodeError: label empty or too long
2016-05-11 15:00:54 [twisted] CRITICAL:
2016-05-11 15:00:54 [scrapy] INFO: Closing spider (finished)
None
None
2016-05-11 15:00:54 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1014,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 104006,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 11, 9, 30, 54, 488000),
'log_count/CRITICAL': 2,
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/GeneratorExit': 1,
'start_time': datetime.datetime(2016, 5, 11, 9, 30, 50, 817000)}
2016-05-11 15:00:54 [scrapy] INFO: Spider closed (finished)
I'm using Scrapy 1.1.0rc1. I've tried a few things, including reinstalling Python and Scrapy, but nothing helped.
Answer 0 (score: 0)
This looks like a bug in Scrapy that was present up to version 1.1.0rc3.

Version 1.1.0rc4 works fine; install that specific version with the following command:
> pip install scrapy==1.1.0rc4
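
Independently of the version bug, building the follow-up URL by string concatenation is fragile: if the extracted href already starts with a slash (as site-relative links usually do), "https://www.tripadvisor.com/" + res[0] produces a double slash in the path. Here is a minimal sketch of the same parse method, assuming the same XPath, that uses response.urljoin (available since Scrapy 1.0) to resolve the relative link against the page URL instead:

import scrapy

class TripAdvisorSpider(scrapy.Spider):
    name = "tripadvisor"
    start_urls = [
        "https://www.tripadvisor.com/Hotel_Review-g312741-d306930-Reviews-Holiday_Inn_Express_Puerto_Madero-Buenos_Aires_Capital_Federal_District.html"
    ]

    def parse(self, response):
        # extract the (relative) hrefs pointing to the full-review pages
        links = response.xpath('//div[@class="quote isNew"]/a/@href').extract()
        if links:
            # urljoin resolves the href against response.url, avoiding
            # the double slash that plain string concatenation can produce
            yield scrapy.Request(response.urljoin(links[0]))

After upgrading, you can confirm which version is actually running with:

> scrapy version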