Error in the official Scrapy example?

Date: 2015-10-02 12:08:32

Tags: python scrapy

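I tried the example Scrapy usage shown on the documentation page (the example under the heading "Return multiple Requests and items from a single callback").

I only changed the domain so that it points to a real website: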

import scrapy

class MySpider(scrapy.Spider):
    name = 'huffingtonpost'
    allowed_domains = ['huffingtonpost.com/']
    start_urls = [
        'http://www.huffingtonpost.com/politics/',
        'http://www.huffingtonpost.com/entertainment/',
        'http://www.huffingtonpost.com/media/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)

But it fails with the ValueError posted in this gist. Any ideas?

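1 answer:

Answer 0 (score: 4)

Some of the extracted links are relative (for example, /news/hillary-clinton/). scrapy.Request needs an absolute URL, so you should turn each relative link into an absolute one (http://www.huffingtonpost.com/news/hillary-clinton/):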

import scrapy

class MySpider(scrapy.Spider):
    name = 'huffingtonpost'
    allowed_domains = ['huffingtonpost.com']  # domain name only, no trailing slash
    start_urls = [
        'http://www.huffingtonpost.com/politics/',
        'http://www.huffingtonpost.com/entertainment/',
        'http://www.huffingtonpost.com/media/',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            if url.startswith('#'):
                # skip fragment-only links, which point back to the same page
                continue
            if url.startswith('/'):
                # turn a root-relative link into an absolute URL
                url = 'http://www.huffingtonpost.com' + url
            yield scrapy.Request(url, callback=self.parse)
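
The prefix check above only covers links that start with / or #. As a rough sketch of a more general approach, the standard library's urljoin can resolve any href against the URL of the page it came from. The helper name absolutize and the sample hrefs below are made up for illustration, and the import assumes Python 3 (on Python 2 the same functions live in the urlparse module).

```python
from urllib.parse import urljoin, urlparse


def absolutize(base_url, hrefs):
    """Resolve hrefs against base_url and keep only http(s) links.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    absolute = []
    for href in hrefs:
        if href.startswith('#'):
            # fragment-only links point back to the same page
            continue
        # urljoin handles root-relative, path-relative and absolute hrefs
        url = urljoin(base_url, href)
        # drop schemes a crawler cannot fetch, such as mailto: or javascript:
        if urlparse(url).scheme in ('http', 'https'):
            absolute.append(url)
    return absolute


page = 'http://www.huffingtonpost.com/politics/'
sample_hrefs = [
    '/news/hillary-clinton/',                # root-relative
    'archive/',                              # relative to the current path
    '#top',                                  # fragment only
    'mailto:someone@example.com',            # not crawlable
    'http://www.huffingtonpost.com/media/',  # already absolute
]

for link in absolutize(page, sample_hrefs):
    print(link)
```

Inside a spider, the same idea would be to pass response.url as the base, or to call response.urljoin(url), which Scrapy responses provide from version 1.0 onward, before building each scrapy.Request.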