I was able to scrape all of the stories from the first page; my question is how to move on to the next page and continue scraping stories and names. Please check my code below:
# -*- coding: utf-8 -*-
import scrapy
from cancerstories.items import CancerstoriesItem

class MyItem(scrapy.Item):
    name = scrapy.Field()
    story = scrapy.Field()

class MySpider(scrapy.Spider):
    name = 'cancerstories'
    allowed_domains = ['thebreastcancersite.greatergood.com']
    start_urls = ['http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/']

    def parse(self, response):
        rows = response.xpath('//a[contains(@href,"story")]')
        # loop over all links to stories
        for row in rows:
            myItem = MyItem()  # create a new item
            myItem['name'] = row.xpath('./text()').extract()  # assign name from link
            story_url = response.urljoin(row.xpath('./@href').extract()[0])  # extract url from link
            request = scrapy.Request(url=story_url, callback=self.parse_detail)  # create request for detail page with story
            request.meta['myItem'] = myItem  # pass the item with the request
            yield request

    def parse_detail(self, response):
        myItem = response.meta['myItem']  # extract the item (with the name) from the response
        # myItem['name'] = response.xpath('//h1[@class="headline"]/text()').extract()
        text_raw = response.xpath('//div[@class="photoStoryBox"]/div/p/text()').extract()  # extract the story (text)
        myItem['story'] = ' '.join(map(unicode.strip, text_raw))  # clean up the text and assign to item
        yield myItem  # return the item
Answer 0 (score: 2)
You could change your spider to a CrawlSpider and use a Rule with a LinkExtractor to follow the links to the next pages. For this approach you have to include the following code:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    ...
    class MySpider(CrawlSpider):
        ...
        rules = (
            Rule(LinkExtractor(allow='\.\./stories;jsessionid=[0-9A-Z]+?page=[0-9]+')),
        )
        ...

This way, for every page you visit, the spider will create a request to the next page (if one exists), follow it when it finishes executing the parse method, and repeat the whole process again.
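For context, here is a minimal sketch of how those pieces could fit together with the question's spider. The class name StoriesSpider, the callback name parse_page, the simplified allow pattern, and the inline parse_detail are assumptions for illustration, not part of the original answer:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class StoriesSpider(CrawlSpider):
        name = 'cancerstories'
        allowed_domains = ['thebreastcancersite.greatergood.com']
        start_urls = ['http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/']

        rules = (
            # follow every pagination link and run parse_page on each listing page;
            # the allow pattern here is a simplified assumption, not the site's exact URLs
            Rule(LinkExtractor(allow=r'page=[0-9]+'), callback='parse_page', follow=True),
        )

        def parse_page(self, response):
            # same link extraction as the question's parse(): request each story page
            for row in response.xpath('//a[contains(@href,"story")]'):
                story_url = response.urljoin(row.xpath('./@href').extract_first())
                yield scrapy.Request(story_url, callback=self.parse_detail)

        def parse_detail(self, response):
            # minimal stand-in for the question's parse_detail
            yield {
                'name': response.xpath('//h1[@class="headline"]/text()').extract_first(),
                'story': ' '.join(
                    t.strip() for t in
                    response.xpath('//div[@class="photoStoryBox"]/div/p/text()').extract()
                ),
            }

Note that CrawlSpider reserves parse() for its own link-following logic, which is why the listing-page callback uses a different name here.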
EDIT:
The rule I wrote only follows the links to the next pages, it does not extract the stories; if your first approach works, there is no need to change it.

Also, regarding the rule in the comments: SgmlLinkExtractor is deprecated, so I'd recommend you use the default link extractor, and the rule itself is not well defined. If the attrs parameter of the extractor is not defined, it searches for links by looking at the href attributes in the body, which in this case look like ../story/mother-of-4435 and not /clickToGive/bcs/story/mother-of-4435. That's why it doesn't find any links.
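As an aside, a quick way to check what the default link extractor actually matches is to try it in the scrapy shell against the listing page. This is a sketch; the allow pattern is an assumption about the pagination URLs:

    # scrapy shell 'http://thebreastcancersite.greatergood.com/clickToGive/bcs/stories/'
    from scrapy.linkextractors import LinkExtractor

    # by default the extractor scans the href attributes of <a> (and <area>) tags
    extractor = LinkExtractor(allow=r'page=[0-9]+')  # assumed pattern
    for link in extractor.extract_links(response):
        print(link.url)  # absolute URLs, already joined against the page URL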
Answer 1 (score: 0)
If you want to use the scrapy.Spider class, you can follow the next page manually, for example:

    next_page = response.css('a.pageLink::attr(href)').extract_first()
    if next_page:
        absolute_next_page_url = response.urljoin(next_page)
        yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)

If you want to use the CrawlSpider class, don't forget to rename your parse method to parse_start_url.
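Putting that together with the question's spider, a minimal sketch of the manual approach could replace the parse method like this. The a.pageLink selector comes from this answer and is assumed to match the site's pagination links; everything else follows the question's code:

    def parse(self, response):
        # extract the story links on the current page, as in the question
        for row in response.xpath('//a[contains(@href,"story")]'):
            story_url = response.urljoin(row.xpath('./@href').extract_first())
            yield scrapy.Request(story_url, callback=self.parse_detail)

        # then queue the next listing page, if there is one
        next_page = response.css('a.pageLink::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Because the pagination request reuses self.parse as its callback, each listing page schedules the one after it until no a.pageLink match remains.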