Scraping a Page within a Page sometimes doesn't enter the second Page

Date: 2018-06-12 11:08:28

Tags: python web-scraping scrapy

I am using the following spider:

import scrapy

questions = {}

class SovSpider(scrapy.Spider):
    name = 'StackOverflow'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions']

    def parse(self, response):
        for link in response.css('a.question-hyperlink::attr(href)').extract():

            yield scrapy.Request(url=response.urljoin(link), callback=self.parse_questions)

            yield scrapy.Request(url=response.urljoin(response.css('a[rel="next"]::attr(href)').extract_first()), callback=self.parse)


    def parse_questions(self, response):
        questions["title"] = response.css('a.question-hyperlink::text').extract_first()
        questions["user"] = response.css('.user-details a::text').extract_first()

        yield scrapy.Request(url=response.urljoin(response.css('.user-details a::attr(href)').extract_first()), callback=self.parse_user)

        yield questions

    def parse_user(self, response):
        questions["user_reputation"] = response.css('div.reputation::text').extract_first().strip()

I'm trying to practice scraping a page, then taking a URL from that same page and scraping the page it points to [Page1(Scraped) -[Page1[Url-Inside]]> Page2(Scrape)].

What the spider does is:

  1. Scrapes the question URLs from the Questions page

  2. Scrapes the Question Title from those URLs by entering each question's page

  3. Scrapes the User Reputation from the scraped question's user URL by entering the user's page

So, for example, my question is supposed to give me the following:

    {"title": "Scraping a Page within a Page sometimes doesn't enter the second Page", "user": "Toleo", "user_reputation": 455}
    

The problem is that almost 3/4 of the scraped items come back with only the parse_questions part:

    {"title": "Scraping a Page within a Page sometimes doesn't enter the second Page", "user": "Toleo"}
    

and sometimes nothing at all. What is the problem here?

1 Answer:

Answer 0 (score: 1):

The problem is that you yield the questions item at the same time as you yield the request to parse_user, but items and requests are processed by different middlewares, so they are not executed one after the other.

You'd better send the first part of questions to parse_user using meta, and yield questions only in parse_user:
def parse_questions(self, response):
    questions = {}

    questions["title"] = response.css('a.question-hyperlink::text').extract_first()
    questions["user"] = response.css('.user-details a::text').extract_first()

    yield scrapy.Request(url=response.urljoin(response.css('.user-details a::attr(href)').extract_first()),
                         callback=self.parse_user,
                         meta={'questions': questions})

def parse_user(self, response):
    questions = response.meta.get('questions')
    questions["user_reputation"] = response.css('div.reputation::text').extract_first().strip()
    yield questions
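
As a side note, if you are on Scrapy 1.7 or newer, cb_kwargs is a slightly cleaner way than meta to pass the partial item to the callback; this is only a minimal sketch of the same idea:

def parse_questions(self, response):
    questions = {}

    questions["title"] = response.css('a.question-hyperlink::text').extract_first()
    questions["user"] = response.css('.user-details a::text').extract_first()

    # cb_kwargs (Scrapy 1.7+) are passed straight to the callback as keyword arguments
    yield scrapy.Request(url=response.urljoin(response.css('.user-details a::attr(href)').extract_first()),
                         callback=self.parse_user,
                         cb_kwargs={'questions': questions})

def parse_user(self, response, questions):
    questions["user_reputation"] = response.css('div.reputation::text').extract_first().strip()
    yield questions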

You'd also better create a new questions variable on every call of parse_questions, since it should not be a global variable.

Also, parse should be corrected like this:

def parse(self, response):
    for link in response.css('a.question-hyperlink::attr(href)').extract():

        yield scrapy.Request(url=response.urljoin(link), callback=self.parse_questions)

    yield scrapy.Request(url=response.urljoin(response.css('a[rel="next"]::attr(href)').extract_first()), callback=self.parse)

because you were making a request for the next page once for every link on the page. That is not a real problem, since Scrapy has a dupefilter, but this way is more efficient.
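
One more small detail: on the last listing page there is no a[rel="next"] link, so extract_first() returns None; the usual Scrapy pattern is to check for that before yielding the next-page request. A minimal sketch of parse with the guard added:

def parse(self, response):
    for link in response.css('a.question-hyperlink::attr(href)').extract():
        yield scrapy.Request(url=response.urljoin(link), callback=self.parse_questions)

    # a[rel="next"] is missing on the last page; skip the follow-up request in that case
    next_page = response.css('a[rel="next"]::attr(href)').extract_first()
    if next_page is not None:
        yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)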