Reloading a page in Scrapy

Asked: 2016-01-10 22:53:30

Tags: python web-scraping scrapy

I'm very new to Scrapy and have been trying to scrape http://www.icbse.com/schools/state/maharashtra, but I've run into a problem. Out of the total number of school links the page reports as available, it only displays 50 at a time, in no particular order.

However, if the page is reloaded, it shows a fresh list of 50 school links. Some of them differ from the links shown before the refresh, while others stay the same.

What I want to do is add the links to a set(), and once len(set) reaches the total number of schools, send the set on to a parse function. There are two things I don't understand about solving this problem:

  1. Where to define a set that keeps the links and is not reset each time parse() is called.
  2. How to reload a page in Scrapy.

Below is my current code:

    import scrapy
    import re
    from icbse.items import IcbseItem
    
    
    class IcbseSpider(scrapy.Spider):
        name = "icbse"
        allowed_domains = ["www.icbse.com"]
        start_urls = [
            "http://www.icbse.com/schools/",
        ]
    
        def parse(self, response):
            for i in xrange(20):  # I thought if i iterate the start URL,
            # I could probably have the page reload. 
            # It didn't work though.
                for href in response.xpath(
                    '//div[@class="row"]/div[3]'
                    '//span[@class="list-group-item"]/a/@href').extract():
                    url = response.urljoin(href)
                    yield scrapy.Request(url, callback=self.parse_dir_contents)
    
        def parse_dir_contents(self, response):
            # total number of schools found on page
            pages = response.xpath(
                "//div[@class='container']/strong/text()").extract()[0]
    
            self.captured_schools_set = set()  # Placing the Set here doesn't work!
    
            while len(self.captured_schools_set) != int(pages):
                yield scrapy.Request(response.url, callback=self.reload_url)
    
            for school in self.captured_schools_set:
                yield scrapy.Request(school, callback=self.scrape_school_info)
    
        def reload_url(self, response):
            for school_href in response.xpath(
                    "//h4[@class='school_name']/a/@href").extract():
                self.captured_schools_set.add(response.urljoin(school_href))
    
        def scrape_school_info(self, response):
    
            item = IcbseItem()
    
            try:
                item["School_Name"] = response.xpath(
                    '//td[@class="tfield"]/strong/text()').extract()[0]
            except:
                item["School_Name"] = ''
                pass
            try:
                item["streetAddress"] = response.xpath(
                    '//td[@class="tfield"]')[1].xpath(
                    "//span[@itemprop='streetAddress']/text()").extract()[0]
            except:
                item["streetAddress"] = ''
                pass
    
            yield item
    

1 Answer:

Answer 0 (score: 2)

You are iterating over an empty set:

        self.captured_schools_set = set()  # Placing the Set here doesn't work!

        while len(self.captured_schools_set) != int(pages):
            yield scrapy.Request(response.url, callback=self.reload_url)

        for school in self.captured_schools_set:
            yield scrapy.Request(school, callback=self.scrape_school_info)

So the requests for the individual schools are never fired: the set cannot grow until a reload_url callback has actually run, so the while loop keeps yielding reload requests and execution never reaches the for loop below it.

You should reload by firing the http://www.icbse.com/schools/ request with the dont_filter=True attribute, because with the default settings Scrapy filters out duplicate requests.
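
For example, a minimal sketch (`reload_url` is the callback from the question's code):

    # Without dont_filter=True, Scrapy's duplicate filter (RFPDupeFilter by
    # default) silently drops requests for URLs it has already crawled.
    yield scrapy.Request("http://www.icbse.com/schools/",
                         callback=self.reload_url,
                         dont_filter=True)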

But it seems you are not firing http://www.icbse.com/schools/ requests; you are firing "/state/name" requests instead (e.g. http://www.icbse.com/schools/state/andaman-nicobar). On line 4 above you are firing response.url, which is the problem; change it to /schools/.
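
Putting the two fixes together, here is a minimal sketch of how the spider could look. It is an illustration, not code tested against the site: the XPaths are copied from the question, `collect_schools` replaces `reload_url`, and carrying the set and the total in `request.meta` is one possible way (my own choice, not something stated in the answer) to keep them alive across reloads instead of recreating them on every callback:

    import scrapy


    class IcbseSpider(scrapy.Spider):
        name = "icbse"
        allowed_domains = ["www.icbse.com"]
        start_urls = ["http://www.icbse.com/schools/"]

        def parse(self, response):
            for href in response.xpath(
                    '//div[@class="row"]/div[3]'
                    '//span[@class="list-group-item"]/a/@href').extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_dir_contents)

        def parse_dir_contents(self, response):
            # Read the advertised total once, then start the reload loop.
            total = int(response.xpath(
                "//div[@class='container']/strong/text()").extract()[0])
            yield scrapy.Request(
                response.url,  # or http://www.icbse.com/schools/, per the
                               # answer, if that is the page whose list rotates
                callback=self.collect_schools,
                meta={"total": total, "seen": set()},
                dont_filter=True,  # let Scrapy re-request an already-seen URL
            )

        def collect_schools(self, response):
            # The set travels in meta, so it is not recreated per callback.
            seen = response.meta["seen"]
            for school_href in response.xpath(
                    "//h4[@class='school_name']/a/@href").extract():
                seen.add(response.urljoin(school_href))
            if len(seen) < response.meta["total"]:
                # Not every school captured yet: reload the same page.
                yield scrapy.Request(response.url,
                                     callback=self.collect_schools,
                                     meta=response.meta,
                                     dont_filter=True)
            else:
                # Done collecting: fire one request per captured school.
                for school in seen:
                    yield scrapy.Request(school,
                                         callback=self.scrape_school_info)

        def scrape_school_info(self, response):
            # Same item-populating logic as in the question.
            pass

Driving the reload loop from the callback, instead of a while loop, means each new reload request is only yielded after the previous response has actually been parsed, so the termination check runs against an up-to-date set.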