I was able to scrape all of the stories from the first page; my question is how to move on to the next page and continue scraping stories and names. Please check my code below:
# -*- coding: utf-8 -*-
import scrapy
from cancerstories.items import CancerstoriesItem

class MyItem(scrapy.Item):
    name = scrapy.Field()
    story = scrapy.Field()

class MySpider(scrapy.Spider):
    name = 'cancerstories'
    allowed_domains = ['thebreastcancersite.greatergood.com']
    start_urls = ['http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/']

    def parse(self, response):
        rows = response.xpath('//a[contains(@href,"story")]')
        # loop over all links to stories
        for row in rows:
            myItem = MyItem()  # create a new item
            myItem['name'] = row.xpath('./text()').extract()  # assign name from link
            story_url = response.urljoin(row.xpath('./@href').extract()[0])  # extract url from link
            request = scrapy.Request(url=story_url, callback=self.parse_detail)  # create request for detail page with story
            request.meta['myItem'] = myItem  # pass the item with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem']  # extract the item (with the name) from the response
        # myItem['name'] = response.xpath('//h1[@class="headline"]/text()').extract()
        text_raw = response.xpath('//div[@class="photoStoryBox"]/div/p/text()').extract()  # extract the story (text)
        myItem['story'] = ' '.join(map(unicode.strip, text_raw))  # clean up the text and assign to item
        yield myItem  # return the item
Answer 0 (score: 2)
You could change your spider to a CrawlSpider and use a Rule with a LinkExtractor to follow the links to the next pages. For this approach you have to include the following code:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    ...
    class MySpider(CrawlSpider):
        ...
        rules = (
            Rule(LinkExtractor(allow='\.\./stories;jsessionid=[0-9A-Z]+?page=[0-9]+')),
        )
        ...

This way, for every page you visit, the spider will create a request to the next page (if one exists), follow it when it finishes executing the parse method, and repeat the whole process again.
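For context, here is a minimal sketch of how those pieces could fit together with the question's spider. The class name StoriesSpider, the callback name parse_page, the simplified allow pattern, and the inline parse_detail are assumptions for illustration, not part of the original answer:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class StoriesSpider(CrawlSpider):
        name = 'cancerstories'
        allowed_domains = ['thebreastcancersite.greatergood.com']
        start_urls = ['http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/']

        rules = (
            # follow every pagination link and run parse_page on each listing page;
            # the allow pattern here is a simplified assumption, not the site's exact URLs
            Rule(LinkExtractor(allow=r'page=[0-9]+'), callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # same link extraction as the question's parse(): request each story page
            for row in response.xpath('//a[contains(@href,"story")]'):
                story_url = response.urljoin(row.xpath('./@href').extract_first())
                yield scrapy.Request(story_url, callback=self.parse_detail)

        def parse_detail(self, response):
            # minimal stand-in for the question's parse_detail
            yield {
                'name': response.xpath('//h1[@class="headline"]/text()').extract_first(),
                'story': ' '.join(
                    t.strip() for t in
                    response.xpath('//div[@class="photoStoryBox"]/div/p/text()').extract()
                ),
            }

Note that CrawlSpider reserves parse() for its own link-following logic, which is why the listing-page callback uses a different name here.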
EDIT:
The rule I wrote only follows the links to the next pages, it does not extract the stories; if your first approach works, there is no need to change it.

Also, regarding the rule in the comments: SgmlLinkExtractor is deprecated, so I'd recommend you use the default link extractor, and the rule itself is not well defined. If the attrs parameter of the extractor is not defined, it searches for links by looking at the href attributes in the body, which in this case look like ../story/mother-of-4435 and not /clickToGive/bcs/story/mother-of-4435. That's why it doesn't find any links.
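As an aside, a quick way to check what the default link extractor actually matches is to try it in the scrapy shell against the listing page. This is a sketch; the allow pattern is an assumption about the pagination URLs:

    # scrapy shell 'http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/'
    from scrapy.linkextractors import LinkExtractor

    # by default the extractor scans the href attributes of <a> (and <area>) tags
    extractor = LinkExtractor(allow=r'page=[0-9]+')  # assumed pattern
    for link in extractor.extract_links(response):
        print(link.url)  # absolute URLs, already joined against the page URL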
Answer 1 (score: 0)
If you want to use the scrapy.Spider class, you can follow the next page manually, for example:

    next_page = response.css('a.pageLink::attr(href)').extract_first()
    if next_page:
        absolute_next_page_url = response.urljoin(next_page)
        yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)

If you want to use the CrawlSpider class, don't forget to rename your parse method to parse_start_url.
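Putting that together with the question's spider, a minimal sketch of the manual approach could replace the parse method like this. The a.pageLink selector comes from this answer and is assumed to match the site's pagination links; everything else follows the question's code:

    def parse(self, response):
        # extract the story links on the current page, as in the question
        for row in response.xpath('//a[contains(@href,"story")]'):
            story_url = response.urljoin(row.xpath('./@href').extract_first())
            yield scrapy.Request(story_url, callback=self.parse_detail)

        # then queue the next listing page, if there is one
        next_page = response.css('a.pageLink::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Because the pagination request reuses self.parse as its callback, each listing page schedules the one after it until no a.pageLink match remains.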