Scrapy "Spider error processing" error

Asked: 2016-05-11 09:53:25

Tags: python scrapy

I am new to Python and Scrapy, and I have the following spider:

# scrapy_error.py
import scrapy
from scrapy import Request

class TextScrapper(scrapy.Spider):
    name = "tripadvisor"
    start_urls = [
        "https://www.tripadvisor.com/Hotel_Review-g312741-d306930-Reviews-Holiday_Inn_Express_Puerto_Madero-Buenos_Aires_Capital_Federal_District.html"
    ]

    def parse(self, response):
        # Collect the hrefs of the "full review" links on the page
        full_review_page_links = response.xpath('//div[@class="quote isNew"]/a/@href').extract()
        res = [detail_link for detail_link in full_review_page_links]
        if res:
            # Follow the first extracted link by prepending the site root
            yield scrapy.Request("https://www.tripadvisor.com/" + res[0])
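Worth noting for the traceback below: the hrefs matched by that XPath are presumably root-relative, i.e. they start with "/" (an assumption, since the page markup isn't shown), so the concatenation yields a URL with a doubled slash after the host:

# Hypothetical illustration, assuming the extracted href is root-relative:
href = "/ShowUserReviews-g312741-d306930-r370284375-example"  # assumed shape
url = "https://www.tripadvisor.com/" + href
# -> "https://www.tripadvisor.com//ShowUserReviews-..." (note the doubled "//")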

Every time I run this spider with
> scrapy runspider scrapy_error.py

I get the following error:

2016-05-11 15:00:50 [scrapy] INFO: Spider opened
2016-05-11 15:00:50 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-11 15:00:50 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-11 15:00:52 [scrapy] DEBUG: Redirecting (301) to <GET https://www.tripadvisor.com/Hotel_Review-g312741-d306930-Reviews-Holiday_Inn_Express_Puerto_Madero-Buenos_Aires_Capital_Federal_District.html> from <GET https://www.tripadvisor.com/Holiday+Inn+Express+PUERTO+MADERO>
2016-05-11 15:00:54 [scrapy] DEBUG: Crawled (200) <GET https://www.tripadvisor.com/Hotel_Review-g312741-d306930-Reviews-Holiday_Inn_Express_Puerto_Madero-Buenos_Aires_Capital_Federal_District.html> (referer: None)
review_370284375
New Item is Added To The Data Collection
2016-05-11 15:00:54 [scrapy] ERROR: Spider error processing <GET https://www.tripadvisor.com/Hotel_Review-g312741-d306930-Reviews-Holiday_Inn_Express_Puerto_Madero-Buenos_Aires_Capital_Federal_District.html> (referer: None)
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
GeneratorExit
Exception RuntimeError: 'generator ignored GeneratorExit' in <generator object iter_errback at 0x040ECB48> ignored
Unhandled error in Deferred:
2016-05-11 15:00:54 [twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1194, in run
    self.mainLoop()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1203, in mainLoop
    self.runUntilCurrent()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 825, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 671, in _tick
    taskObj._oneWorkUnit()
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\twisted\internet\task.py", line 517, in _oneWorkUnit
    result = next(self._iterator)
  File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\utils\defer.py", line 63, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
  File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\core\scraper.py", line 183, in _process_spidermw_output
    self.crawler.engine.crawl(request=output, spider=spider)
  File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\core\engine.py", line 198, in crawl
    self.schedule(request, spider)
  File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\core\engine.py", line 204, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\core\scheduler.py", line 51, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
  File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\dupefilters.py", line 48, in request_seen
    fp = self.request_fingerprint(request)
  File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\dupefilters.py", line 56, in request_fingerprint
    return request_fingerprint(request)
  File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\utils\request.py", line 53, in request_fingerprint
    fp.update(to_bytes(canonicalize_url(request.url)))
  File "C:\Python27\lib\site-packages\scrapy-1.1.0rc1-py2.7.egg\scrapy\utils\url.py", line 67, in canonicalize_url
    path = safe_url_string(_unquotepath(path)) or '/'
  File "C:\Python27\lib\site-packages\w3lib\url.py", line 97, in safe_url_string
    to_native_str(parts.netloc.encode('idna')),
  File "C:\Python27\lib\encodings\idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "C:\Python27\lib\encodings\idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
exceptions.UnicodeError: label empty or too long
2016-05-11 15:00:54 [twisted] CRITICAL:
2016-05-11 15:00:54 [scrapy] INFO: Closing spider (finished)
None
None
2016-05-11 15:00:54 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1014,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 104006,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 5, 11, 9, 30, 54, 488000),
 'log_count/CRITICAL': 2,
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'spider_exceptions/GeneratorExit': 1,
 'start_time': datetime.datetime(2016, 5, 11, 9, 30, 50, 817000)}
2016-05-11 15:00:54 [scrapy] INFO: Spider closed (finished)

I am using Scrapy 1.1.0rc1. I have tried a few things, including reinstalling Python and Scrapy, but nothing helped.

1 Answer:

Answer 0 (score: 0):

This looks like a bug in Scrapy that is present up to version 1.1.0rc3.
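Consistent with that, the traceback itself hints at the mechanism: the request URL built in parse starts its path with a doubled slash (site root + root-relative href), and when canonicalize_url re-parses the unquoted path, urlsplit reads everything after the "//" as a hostname. The stdlib idna codec then rejects any DNS label that is empty or longer than 63 characters. A minimal Python 2.7 sketch that trips the same check (the path literal is a hypothetical stand-in, not the exact URL):

# Hypothetical stand-in for the review path the spider builds; the real
# one comes from "https://www.tripadvisor.com/" + a root-relative href.
from urlparse import urlsplit  # Python 2.7, as in the question's traceback

path = "//ShowUserReviews-g312741-d306930-r370284375-Some_Hotel_Name_Long_Enough_To_Exceed_Sixty_Three_Characters.html"
parts = urlsplit(path)
print(parts.netloc)          # the leading "//" makes urlsplit treat the segment as a host
parts.netloc.encode("idna")  # UnicodeError: label empty or too long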

Version 1.1.0rc4 works fine. Install this specific version with the following command:

> pip install scrapy==1.1.0rc4
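
After installing, you can confirm which version is active. Independent of the version fix, letting response.urljoin build the follow-up URL avoids the doubled slash in the first place; a sketch, assuming the same XPath as in the question:

# Confirm which Scrapy version is active:
import scrapy
print(scrapy.__version__)  # expect "1.1.0rc4"

# In the spider, resolve hrefs against the response URL instead of
# concatenating strings; urljoin handles root-relative links cleanly:
def parse(self, response):
    links = response.xpath('//div[@class="quote isNew"]/a/@href').extract()
    if links:
        yield scrapy.Request(response.urljoin(links[0]))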