Scrapy error processing URL

Date: 2017-03-28 10:55:59

Tags: python scrapy-spider

Hello, I'm new to Python and Scrapy, and I'm trying to write a spider. I can't find the error, or a solution for it, when processing the start URL. I don't know whether it's an XPath problem or something else. Most of the threads I found are about wrong indentation, but that's not my case. Code:

import scrapy
from scrapy.exceptions import CloseSpider

from scrapy_crawls.items import Vino


class BodebocaSpider(scrapy.Spider):
    name = "Bodeboca"
    allowed_domains = ["bodeboca.com"]
    start_urls = (
        'http://www.bodeboca.com/vino/espana',
    )
    counter = 1
    next_url = ""

    vino = None

    def __init__(self):
        self.next_url = self.start_urls[0]


    def parse(self, response):

        for sel in response.xpath(
                '//div[@id="venta-main-wrapper"]/div[@id="venta-main"]/div/div/div/div/div/div/span'):

            #print sel
            # HREF
            a_href = sel.xpath('.//a/@href').extract()
            the_href = a_href[0]
            print the_href
            yield scrapy.Request(the_href, callback=self.parse_item, headers={'Referer': response.url.encode('utf-8'),
                                                                              'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3'})

        # SIGUIENTE URL
        results = response.xpath(
            '//div[@id="wrapper"]/article/div[@id="article-inner"]/div[@id="default-filter-form-wrapper"]/div[@id="venta-main-wrapper"]/div[@class="bb-product-info-sort bb-sort-behavior-attached"]/div[@clsas="bb-product-info"]/span[@class="bb-product-info-count"]').extract()


        if not results:
            raise CloseSpider
        else:
            #self.next_url = self.next_url.replace(str(self.counter), str(self.counter + 1))
            #self.counter += 1
            self.next_url = response.xpath('//div[@id="venta-main-wrapper"]/div[@class="item-list"]/ul[@class="pager"]/li[@class="pager-next"]/a/@href').extract()[0]
            yield scrapy.Request(self.next_url, callback=self.parse, headers={'Referer': self.allowed_domains[0],
                                                                              'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3'})

Error:

2017-03-28 12:29:08 [scrapy.core.engine] INFO: Spider opened
2017-03-28 12:29:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-03-28 12:29:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.bodeboca.com/robots.txt> (referer: None)
2017-03-28 12:29:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.bodeboca.com/vino/espana> (referer: None)
/vino/terra-cuques-2014
2017-03-28 12:29:08 [scrapy.core.scraper] ERROR: Spider error processing <GET http://www.bodeboca.com/vino/espana> (referer: None)

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/gerardo/proyectos/vinos-diferentes-crawl/scrapy_crawls/spiders/Bodeboca.py", line 36, in parse
    'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3'})
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /vino/terra-cuques-2014
2017-03-28 12:29:08 [scrapy.core.engine] INFO: Closing spider (finished)
2017-03-28 12:29:08 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 449,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 38558,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 3, 28, 10, 29, 8, 951654),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2017, 3, 28, 10, 29, 8, 690948)}
2017-03-28 12:29:08 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer:

Answer 0 (score: 1)

Short answer: you are extracting relative URLs from the page, e.g. /vino/terra-cuques-2014

For Scrapy to make the request, the URL needs to be complete: http://www.bodeboca.com/vino/terra-cuques-2014. You can build the full URL with Scrapy's response.urljoin() method, e.g. full_url = response.urljoin(url)
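For illustration, response.urljoin() resolves a (possibly relative) href against response.url, essentially like the standard library's urljoin. A minimal sketch of that resolution (Python 2, matching the traceback above):

from urlparse import urljoin  # on Python 3: from urllib.parse import urljoin

# response.urljoin(href) is roughly urljoin(response.url, href),
# so a root-relative path gains the scheme and host of the page it came from:
base = 'http://www.bodeboca.com/vino/espana'
print urljoin(base, '/vino/terra-cuques-2014')
# -> http://www.bodeboca.com/vino/terra-cuques-2014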

Try not to use XPath expressions like /div[@id="venta-main"]/div/div/div/div/div/div/span: they are hard to read and break with the slightest change to the page. Instead, you can simply use a class-based XPath: //a[@class="verficha"]

You can rewrite part of your spider like this:

def parse(self, response):
    links = response.xpath('//a[@class="verficha"]')
    for link in links:
        url = link.xpath('@href').extract_first()
        full_url = response.urljoin(url)
        yield scrapy.Request(full_url, callback=self.parse_item)  # or whatever callback you need

If you want to extract the URL of the next page, you can use the XPath next_page = response.xpath('//li[@class="pager-next"]/a/@href').extract_first(), call response.urljoin(next_page) again, and so on.
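Putting both pieces together, a minimal sketch of the whole rewritten parse method (assuming parse_item from the question is your item callback):

def parse(self, response):
    # Follow each product link; urljoin turns the relative hrefs
    # into absolute URLs that scrapy.Request accepts.
    for link in response.xpath('//a[@class="verficha"]'):
        url = link.xpath('@href').extract_first()
        yield scrapy.Request(response.urljoin(url), callback=self.parse_item)

    # Follow the pager link, if present, and parse the next page the same way.
    next_page = response.xpath('//li[@class="pager-next"]/a/@href').extract_first()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page), callback=self.parse)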