Scrapy spider won't go to the next page

Date: 2017-07-08 15:42:48

Tags: python scrapy web-crawler

All,

I'm writing a scrapy crawler; here is an earlier question I asked about it: Scrapy: AttributeError: 'YourCrawler' object has no attribute 'parse_following_urls'

Now I've run into another problem: it won't go to the next page:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
    'https://example.com/materias/?novedades=LC&p',
    ]
    allowed_domains = ["https://example.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('///*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        #Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
            'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }

        # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

It works and saves the data from the first page's links, but when it tries to go to the next page it fails without any error. Here is the log:

…
2017-07-08 17:17:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.com/book/?id=9780143039617>
{'book_isbn': [u'<li>Editorial: <a href="/search/avanzada/?go=1&amp;editorial=Penguin%20Books">Penguin Books</a></li>', u'<li>P\xe1ginas: 363</li>', u'<li>A\xf1o: 2206</li>', u'<li>Precio: 14.50 \u20ac</li>', u'<li>EAN: 9780143039617</li>']}
2017-07-08 17:17:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-08 17:17:25 [scrapy.extensions.feedexport] INFO: Stored json feed (10 items) in: bookstore_2.json
2017-07-08 17:17:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

I used this same next-page block in my first spider, where it works. Any idea why this is happening?

1 Answer:

Answer 0 (score: 1)

Your pagination logic should go at the end of the parse method, not the parse_following_urls method, because the pagination links are on the main listing page, not on the book detail pages. Also, I had to remove the scheme from allowed_domains. Finally, note that you import Request from scrapy.http but never import the scrapy module itself, so the end of the parse method should yield Request rather than scrapy.Request. The spider would look like this:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
    'https://lacentral.com/materias/?novedades=LC&p',
    ]
    allowed_domains = ["lacentral.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        #Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
                'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }
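
As a side note, if you are on Scrapy 1.4 or newer, the same parse method can be written a bit more compactly with response.follow, which joins relative URLs for you. This is just a sketch of the same logic, not tested against your site:

    def parse(self, response):
        # follow each book link found on the listing page
        for url in response.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract():
            yield response.follow(url, callback=self.parse_following_urls, dont_filter=True)

        # follow the pagination link, if there is one
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Everything else stays the same; the item extraction in parse_following_urls does not need to change.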