Pagination error in Scrapy

Time: 2017-07-28 04:54:23

Tags: pagination scrapy scrapy-spider

Hi everyone, I am getting the following pagination error while scraping a website:

2017-07-27 18:30:21 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.pedidosja.com.br/restaurantes/sao-paulo?a=rua%20tenente%20negr%C3%A3o%20200&cep=04530030&doorNumber=200&bairro=Itaim%20Bibi&lat=-23.585202&lng=-46.671715199999994> (referer: None)
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python3.5/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/root/Documents/Spiders/pedidosYa/pedidosYa/spiders/pedidosya.py", line 35, in parse
    next_page_url = response.urljoin(next_page_url)
  File "/usr/local/lib/python3.5/dist-packages/scrapy/http/response/text.py", line 82, in urljoin
    return urljoin(get_base_url(self), url)
  File "/usr/lib/python3.5/urllib/parse.py", line 416, in urljoin
    base, url, _coerce_result = _coerce_args(base, url)
  File "/usr/lib/python3.5/urllib/parse.py", line 112, in _coerce_args
    raise TypeError("Cannot mix str and non-str arguments")
TypeError: Cannot mix str and non-str arguments
2017-07-27 18:30:21 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-27 18:30:21 [scrapy.extensions.feedexport] INFO: Stored csv feed (13 items) in: test3.csv
2017-07-27 18:30:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 653,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 62571,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 7, 27, 23, 30, 21, 221038),
 'item_scraped_count': 13,
 'log_count/DEBUG': 16,
 'log_count/ERROR': 1,
 'log_count/INFO': 8,
 'memusage/max': 49278976,
 'memusage/startup': 49278976,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2017, 7, 27, 23, 30, 17, 538310)}
2017-07-27 18:30:21 [scrapy.core.engine] INFO: Spider closed (finished)

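The spider is raising the TypeError "Cannot mix str and non-str arguments". I am not very experienced with Python, so I would also appreciate pointers to resources where I can learn about this type of error. Below you will find the spider's code.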

# -*- coding: utf-8 -*-
import scrapy
from pedidosYa.items import PedidosyaItem
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose


class PedidosyaSpider(scrapy.Spider):
    name = 'pedidosya'
    allowed_domains = ['www.pedidosya.com.br']
    start_urls = [
        'https://www.pedidosja.com.br/restaurantes/sao-paulo?a=rua%20tenente%20negr%C3%A3o%20200&cep=04530030&doorNumber=200&bairro=Itaim%20Bibi&lat=-23.585202&lng=-46.671715199999994']

    def parse(self, response):
        # need to define wrapper
        for wrapper in response.css('.restaurant-wrapper.peyaCard.show.with_tags'):
            l = ItemLoader(item=PedidosyaItem(), selector=wrapper)
            l.add_css('Name', 'a.arrivalName::text')
            l.add_css('Menu1', 'span.categories > span::text', MapCompose(str.strip))
            l.add_css('Menu2', 'span.categories > span + span::text', MapCompose(str.strip))
            l.add_css('Menu3', 'span.categories > span + span + span::text', MapCompose(str.strip))
            l.add_css('Address', 'span.address::text', MapCompose(str.strip))
            l.add_css('DeliveryTime', 'i.delTime::text', MapCompose(str.strip))
            l.add_css('CreditCard', 'ul.content_credit_cards > li > img::attr(alt)', MapCompose(str.strip))
            l.add_css('DeliveryCost', 'div.shipping > i::text', MapCompose(str.strip))
            l.add_css('Rankink', 'span.ranking i::text', MapCompose(str.strip))
            l.add_css('No', 'span.ranking a::text', MapCompose(str.strip))
            l.add_css('Sponsor', 'span.grey_small.not-logged::text', MapCompose(str.strip))
            l.add_css('DeliveryMinimun', 'div.minDelivery::text', MapCompose(str.strip))
            l.add_css('Distance', 'div.distance i::text', MapCompose(str.strip))
            yield l.load_item()

        next_page_url = response.css('li.arrow.next > a ::attr(href)').extract()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(url=next_page_url, callback=self.parse)

Thank you in advance, and have a nice day!

2 answers:

Answer 0 (score: 1)

next_page_url = response.css('li.arrow.next > a ::attr(href)').extract()
                                                              ^^^^^^^^^^
if next_page_url:
    next_page_url = response.urljoin(next_page_url)
                                     ^^^^^^^^^^^^^

When urljoin is called, next_page_url is a list, because the extract() method returns a list of all matched values, even when there is only one.
To fix this, use extract_first() instead:

next_page_url = response.css('li.arrow.next > a ::attr(href)').extract_first()
                                                               ^^^^^^^^^^^^^^^
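
The difference can be reproduced without Scrapy, because the traceback ends in the standard library's urllib.parse.urljoin, which response.urljoin calls. Below is a minimal sketch; the base URL and href value are made up for illustration and are not taken from the site in the question.

```python
from urllib.parse import urljoin

# Hypothetical values, for illustration only.
base = "https://example.com/restaurantes/sao-paulo"
hrefs = ["/restaurantes/sao-paulo?page=2"]  # shape of what extract() returns: a list
href = hrefs[0]                             # shape of what extract_first() returns: a str

# Passing the list reproduces the TypeError seen in the traceback.
try:
    urljoin(base, hrefs)
    error = None
except TypeError as exc:
    error = str(exc)

# Passing the string joins cleanly.
joined = urljoin(base, href)

print(error)
print(joined)
```

The list is rejected because urljoin expects both arguments to be of the same string type.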

Answer 1 (score: 0)

The problem is in this line:

next_page_url = response.css('li.arrow.next > a::attr(href)').extract()

because the extract() method always returns a list of results, even if it finds only one. Use the extract_first() method instead, which gives you only the first result:

next_page_url = response.css('li.arrow.next > a::attr(href)').extract_first()

Or take the first element of the result list yourself:

next_page_url = response.css('li.arrow.next > a::attr(href)').extract()[0]
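
One further detail, offered as an observation about the question's code rather than something either answer states: the yield scrapy.Request(...) line sits outside the if block, so on the last page, where extract_first() returns None, a request would still be attempted without a URL. Below is a minimal standard-library sketch of the guard; next_page_target is a hypothetical helper name and the URLs are made up.

```python
from typing import Optional
from urllib.parse import urljoin


def next_page_target(base_url: str, href: Optional[str]) -> Optional[str]:
    """Return the absolute next-page URL, or None when there is no next link."""
    if not href:
        # No "next" link was found: signal the caller to stop paginating.
        return None
    return urljoin(base_url, href)


# Hypothetical base URL, for illustration only.
base = "https://example.com/restaurantes/sao-paulo"
print(next_page_target(base, "?page=2"))
print(next_page_target(base, None))
```

In the spider, the equivalent change is to indent the yield scrapy.Request(url=next_page_url, callback=self.parse) line so that it runs only inside if next_page_url:.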