How can I go to the next page in Scrapy?

Date: 2016-07-25 18:20:07

Tags: python scrapy web-crawler

I am trying to fetch the results from here using Scrapy. The problem is that not all of the courses show up on the page until the "Load More Results" tab is clicked.

The problem can be seen here: [screenshot]

My code looks like this:

class ClassCentralSpider(CrawlSpider):
    name = "class_central"
    allowed_domains = ["www.class-central.com"]
    start_urls = (
        'https://www.class-central.com/courses/recentlyAdded',
    )
    rules = (
        Rule(
            LinkExtractor(
                # allow=("index\d00\.html",),
                restrict_xpaths=('//div[@id="show-more-courses"]',)
            ),
            callback='parse',
            follow=True
        ),
    )

    def parse(self, response):
        x = response.xpath('//span[@class="course-name-text"]/text()').extract()
        item = ClasscentralItem()
        for y in x:
            item['name'] = y
            print(item['name'])

1 Answer:

Answer 0 (score: 1)

The second page of this website appears to be generated by an AJAX call. If you look at the network tab of any browser's inspection tool, you will see something like this:

[screenshot: Firebug network tab]

In this case, it appears to retrieve a JSON file from https://www.class-central.com/maestro/courses/recentlyAdded?page=2&_=1469471093134

Now, the url parameter _=1469471093134 seems to do nothing, so you can trim it down to: https://www.class-central.com/maestro/courses/recentlyAdded?page=2
The returned JSON contains the HTML code of the next page:

# so you just need to load it up with
data = json.loads(response.body)
# and convert it to a scrapy Selector
sel = Selector(text=data['table'])
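To see that pattern in isolation, here is a minimal, self-contained sketch using only the standard library, with a made-up payload shaped like the site's response (in a real spider you would use Scrapy's Selector rather than ElementTree):

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical AJAX payload: the JSON wraps the next page's HTML
# in a "table" field, as described above.
payload = '{"table": "<div><span class=\\"course-name-text\\">Course A</span><span class=\\"course-name-text\\">Course B</span></div>"}'

# Step 1: decode the JSON envelope.
data = json.loads(payload)

# Step 2: parse the embedded HTML fragment and pull out the course names.
root = ET.fromstring(data['table'])
names = [el.text for el in root.iter('span')
         if el.get('class') == 'course-name-text']

print(names)  # ['Course A', 'Course B']
```

The two-step shape is the important part: first unwrap the JSON, then hand the HTML string to whatever HTML parser you are using.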

To replicate this in your code, try something like this:

import json

from scrapy import Request
from scrapy.selector import Selector
from w3lib.url import add_or_replace_parameter

def parse(self, response):
    # check if the response is json; if so, convert it to a selector
    if response.meta.get('is_json', False):
        # convert the json to a scrapy Selector here for parsing
        sel = Selector(text=json.loads(response.body)['table'])
    else:
        sel = Selector(response)
    # parse the page here for items
    x = sel.xpath('//span[@class="course-name-text"]/text()').extract()
    item = ClasscentralItem()
    for y in x:
        item['name'] = y
        print(item['name'])
    # do next page
    next_page_el = sel.xpath("//div[@id='show-more-courses']")
    if next_page_el:  # there is a next page
        next_page = response.meta.get('page', 1) + 1
        # build the next page url from the current one
        url = add_or_replace_parameter(response.url, 'page', next_page)
        yield Request(url, self.parse, meta={'page': next_page, 'is_json': True})
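If you prefer not to depend on w3lib, the add_or_replace_parameter call can be sketched with the standard library; this hypothetical helper mirrors its behavior of setting (or overwriting) a single query-string parameter:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def set_query_param(url, name, value):
    """Stdlib sketch of what w3lib's add_or_replace_parameter does:
    set or overwrite one query-string parameter on a URL."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query[name] = str(value)
    return urlunsplit(parts._replace(query=urlencode(query)))

base = 'https://www.class-central.com/maestro/courses/recentlyAdded'
print(set_query_param(base, 'page', 2))
# https://www.class-central.com/maestro/courses/recentlyAdded?page=2
print(set_query_param(base + '?page=2', 'page', 3))
# https://www.class-central.com/maestro/courses/recentlyAdded?page=3
```

Either way, the spider keeps incrementing the page parameter until the "show-more-courses" element disappears from the response.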