I am using the following spider here:
import scrapy

questions = {}

class SovSpider(scrapy.Spider):
    name = 'StackOverflow'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions']

    def parse(self, response):
        for link in response.css('a.question-hyperlink::attr(href)').extract():
            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_questions)
            yield scrapy.Request(url=response.urljoin(response.css('a[rel="next"]::attr(href)').extract_first()), callback=self.parse)

    def parse_questions(self, response):
        questions["title"] = response.css('a.question-hyperlink::text').extract_first()
        questions["user"] = response.css('.user-details a::text').extract_first()
        yield scrapy.Request(url=response.urljoin(response.css('.user-details a::attr(href)').extract_first()), callback=self.parse_user)
        yield questions

    def parse_user(self, response):
        questions["user_reputation"] = response.css('div.reputation::text').extract_first().strip()
What I'm practicing: scraping a page, then taking a URL from inside that page and scraping its page [Page1(Scraped) -[Page1[Url-Inside]]> Page2(Scrape)].
What the spider does:

1. Scrapes the Questions page for question URLs
2. Scrapes the Question Title from each of those URLs
3. Scrapes the User Reputation from the user URL scraped inside the question
So, for example, a scraped question is supposed to give me the following:
{"title": "Scraping a Page within a Page sometimes doesn't enter the second Page", "user": "Toleo", "user_reputation": 455}
The problem is that almost 3/4 of the scraped items only return the parse_questions part:
{"title": "Scraping a Page within a Page sometimes doesn't enter the second Page", "user": "Toleo"}
and sometimes nothing at all. What is the problem here?
Answer 0 (score: 1)
The problem is that you yield the questions item at the same time as you yield the request for parse_user, but items and requests are handled by different middlewares, so they are not executed one after the other.

You'd better pass the first part of the question to parse_user using meta, and yield questions in parse_user:
def parse_questions(self, response):
    questions = {}
    questions["title"] = response.css('a.question-hyperlink::text').extract_first()
    questions["user"] = response.css('.user-details a::text').extract_first()
    yield scrapy.Request(url=response.urljoin(response.css('.user-details a::attr(href)').extract_first()),
                         callback=self.parse_user,
                         meta={'questions': questions})

def parse_user(self, response):
    questions = response.meta.get('questions')
    questions["user_reputation"] = response.css('div.reputation::text').extract_first().strip()
    yield questions
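A minimal sketch of the same idea, assuming Scrapy 1.7 or later, where cb_kwargs can pass the partial item straight into the callback instead of going through meta:

def parse_questions(self, response):
    questions = {
        "title": response.css('a.question-hyperlink::text').extract_first(),
        "user": response.css('.user-details a::text').extract_first(),
    }
    user_url = response.css('.user-details a::attr(href)').extract_first()
    # cb_kwargs delivers `questions` as a keyword argument of the callback
    yield scrapy.Request(url=response.urljoin(user_url),
                         callback=self.parse_user,
                         cb_kwargs={'questions': questions})

def parse_user(self, response, questions):
    questions["user_reputation"] = response.css('div.reputation::text').extract_first().strip()
    yield questions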
You'd better create a new questions variable on every call to parse_questions, since it should not be a global variable.
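If you want more structure than a plain dict, the same fields can be declared as a scrapy.Item, so a typo in a key fails loudly instead of silently adding a new field. A minimal sketch (the QuestionItem name is only for illustration):

import scrapy

class QuestionItem(scrapy.Item):
    # declared fields: assigning to any other key raises KeyError
    title = scrapy.Field()
    user = scrapy.Field()
    user_reputation = scrapy.Field()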
Also, parse should be corrected like this:
def parse(self, response):
    for link in response.css('a.question-hyperlink::attr(href)').extract():
        yield scrapy.Request(url=response.urljoin(link), callback=self.parse_questions)
    yield scrapy.Request(url=response.urljoin(response.css('a[rel="next"]::attr(href)').extract_first()), callback=self.parse)
because you were making a request for the next page once per link on the page. That is not really a problem, since Scrapy has a dupefilter, but it is more efficient this way.
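A minimal sketch of the same parse, assuming Scrapy 1.4 or later, where response.follow resolves relative URLs itself so the urljoin calls are not needed, plus a guard for the last page where there is no "next" link:

def parse(self, response):
    # response.follow accepts relative hrefs and builds the absolute Request URL
    for link in response.css('a.question-hyperlink::attr(href)').extract():
        yield response.follow(link, callback=self.parse_questions)
    next_page = response.css('a[rel="next"]::attr(href)').extract_first()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)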