Reloading a page in Scrapy

Asked: 2016-01-10 22:53:30

Tags: python web-scraping scrapy

I'm very new to Scrapy and have been trying to scrape http://www.icbse.com/schools/state/maharashtra, but I've run into a problem. Out of the total number of school links the page reports as available, it only displays 50 at a time, in no particular order.

However, if the page is reloaded, it shows a fresh list of 50 school links. Some of them differ from the links shown before the refresh, while others stay the same.

What I want to do is add the links to a set(), and once len(set) reaches the total number of schools, send the set on to a parse function. There are two things I don't understand about solving this problem:

  1. Where to define a set that keeps the links and is not reset each time parse() is called.
  2. How to reload a page in Scrapy.

Below is my current code:

    import scrapy
    import re
    from icbse.items import IcbseItem
    
    
    class IcbseSpider(scrapy.Spider):
        name = "icbse"
        allowed_domains = ["www.icbse.com"]
        start_urls = [
            "http://www.icbse.com/schools/",
        ]
    
        def parse(self, response):
            for i in xrange(20):  # I thought if i iterate the start URL,
            # I could probably have the page reload. 
            # It didn't work though.
                for href in response.xpath(
                    '//div[@class="row"]/div[3]'
                    '//span[@class="list-group-item"]/a/@href').extract():
                    url = response.urljoin(href)
                    yield scrapy.Request(url, callback=self.parse_dir_contents)
    
        def parse_dir_contents(self, response):
            # total number of schools found on page
            pages = response.xpath(
                "//div[@class='container']/strong/text()").extract()[0]
    
            self.captured_schools_set = set()  # Placing the Set here doesn't work!
    
            while len(self.captured_schools_set) != int(pages):
                yield scrapy.Request(response.url, callback=self.reload_url)
    
            for school in self.captured_schools_set:
                yield scrapy.Request(school, callback=self.scrape_school_info)
    
        def reload_url(self, response):
            for school_href in response.xpath(
                    "//h4[@class='school_name']/a/@href").extract():
                self.captured_schools_set.add(response.urljoin(school_href))
    
        def scrape_school_info(self, response):
    
            item = IcbseItem()
    
            try:
                item["School_Name"] = response.xpath(
                    '//td[@class="tfield"]/strong/text()').extract()[0]
            except:
                item["School_Name"] = ''
                pass
            try:
                item["streetAddress"] = response.xpath(
                    '//td[@class="tfield"]')[1].xpath(
                    "//span[@itemprop='streetAddress']/text()").extract()[0]
            except:
                item["streetAddress"] = ''
                pass
    
            yield item
    

1 Answer:

Answer 0 (score: 2)

You are iterating over an empty set:

        self.captured_schools_set = set()  # Placing the Set here doesn't work!

        while len(self.captured_schools_set) != int(pages):
            yield scrapy.Request(response.url, callback=self.reload_url)

        for school in self.captured_schools_set:
            yield scrapy.Request(school, callback=self.scrape_school_info)

So the requests for the individual schools are never fired: the set cannot grow until a reload_url callback has actually run, so the while loop keeps yielding reload requests and execution never reaches the for loop below it.

You should reload by firing the http://www.icbse.com/schools/ request with the dont_filter=True attribute, because with the default settings Scrapy filters out duplicate requests.
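
For example, a minimal sketch (`reload_url` is the callback from the question's code):

    # Without dont_filter=True, Scrapy's duplicate filter (RFPDupeFilter by
    # default) silently drops requests for URLs it has already crawled.
    yield scrapy.Request("http://www.icbse.com/schools/",
                         callback=self.reload_url,
                         dont_filter=True)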

But it seems you are not firing http://www.icbse.com/schools/ requests; you are firing "/state/name" requests instead (e.g. http://www.icbse.com/schools/state/andaman-nicobar). On line 4 above you are firing response.url, which is the problem; change it to /schools/.
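
Putting the two fixes together, here is a minimal sketch of how the spider could look. It is an illustration, not code tested against the site: the XPaths are copied from the question, `collect_schools` replaces `reload_url`, and carrying the set and the total in `request.meta` is one possible way (my own choice, not something stated in the answer) to keep them alive across reloads instead of recreating them on every callback:

    import scrapy


    class IcbseSpider(scrapy.Spider):
        name = "icbse"
        allowed_domains = ["www.icbse.com"]
        start_urls = ["http://www.icbse.com/schools/"]

        def parse(self, response):
            for href in response.xpath(
                    '//div[@class="row"]/div[3]'
                    '//span[@class="list-group-item"]/a/@href').extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_dir_contents)

        def parse_dir_contents(self, response):
            # Read the advertised total once, then start the reload loop.
            total = int(response.xpath(
                "//div[@class='container']/strong/text()").extract()[0])
            yield scrapy.Request(
                response.url,  # or http://www.icbse.com/schools/, per the
                               # answer, if that is the page whose list rotates
                callback=self.collect_schools,
                meta={"total": total, "seen": set()},
                dont_filter=True,  # let Scrapy re-request an already-seen URL
            )

        def collect_schools(self, response):
            # The set travels in meta, so it is not recreated per callback.
            seen = response.meta["seen"]
            for school_href in response.xpath(
                    "//h4[@class='school_name']/a/@href").extract():
                seen.add(response.urljoin(school_href))
            if len(seen) < response.meta["total"]:
                # Not every school captured yet: reload the same page.
                yield scrapy.Request(response.url,
                                     callback=self.collect_schools,
                                     meta=response.meta,
                                     dont_filter=True)
            else:
                # Done collecting: fire one request per captured school.
                for school in seen:
                    yield scrapy.Request(school,
                                         callback=self.scrape_school_info)

        def scrape_school_info(self, response):
            # Same item-populating logic as in the question.
            pass

Driving the reload loop from the callback, instead of a while loop, means each new reload request is only yielded after the previous response has actually been parsed, so the termination check runs against an up-to-date set.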