Special characters in Scrapy rules

Time: 2019-09-05 20:22:37

Tags: python scrapy

I am trying to scrape the news site https://www.larazon.es/etiquetas/noticias/meta/politica#.p:3. I first tested the response with the following script and saw that it worked:

from scrapy import Spider


class StackSpider(Spider):
    name = 'crawler_larazon'
    allowed_domains = ['larazon.es']
    start_urls = ['https://www.larazon.es/etiquetas/noticias/meta/politica#.p:3']

    def parse(self, response):
        # Drop into an interactive shell to inspect the downloaded response
        from scrapy.shell import inspect_response
        inspect_response(response, self)
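
For what it's worth, the shell that inspect_response opens exposes the downloaded response directly, so the selector used later in parse_item can be sanity-checked there; a minimal sketch, reusing the same h2 class:

# Run these inside the shell opened by inspect_response:
response.url   # the URL that was actually fetched
response.xpath('//h2[@class="news__new__title news__new__title"]').extract()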

However, when I add my selectors and rules I get no response at all. I am new to this, but I have two hypotheses about what might be going on:

  • The special characters in the LinkExtractor URL are messing up my scraper. I checked and tested a few regular expressions, but nothing seemed to help (see the standalone regex check after the second spider below):
    rules = [
        Rule(LinkExtractor(allow=r'etiquetas/noticias/meta/politica#.p:[2-3];'),
             callback='parse_item', follow=True)
    ]
  • The page takes a while to load, so I am not sure whether I need to configure a timeout (a settings sketch follows the spider code below).
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StackCrawlerSpider(CrawlSpider):
    name = 'crawler_larazon'
    allowed_domains = ['larazon.es']
    start_urls = ['https://www.larazon.es/etiquetas/noticias/meta/politica']

    rules = [
        Rule(LinkExtractor(allow=r'etiquetas/noticias/meta/politica#.p:[2-3];'),
             callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        questions = response.xpath('//h2[@class="news__new__title news__new__title"]')
        for question in questions:
            item = StackItem()  # item class defined in the project's items.py
            item['url'] = question.xpath('a/@href').extract()[0]
            item['source'] = self.allowed_domains[0]
            yield item
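
To test the first hypothesis in isolation, the allow pattern can be run against the target URLs with plain re, outside Scrapy entirely; a minimal sketch using the URLs from this question:

import re

pattern = r'etiquetas/noticias/meta/politica#.p:[2-3];'
candidates = [
    'https://www.larazon.es/etiquetas/noticias/meta/politica#.p:3',
    'https://www.larazon.es/etiquetas/noticias/meta/politica#.p:2;',
]
for url in candidates:
    # re.search returns None when the pattern does not match
    print(url, '->', bool(re.search(pattern, url)))
# Only the second URL matches: the pattern requires a literal ';'
# after the page number, and the unescaped '.' matches any character
# (a literal dot would need to be written as '\.').

Note also that everything after # is a URL fragment, which is never sent to the server; if the site builds its pagination with JavaScript, links of this shape may not exist as plain hrefs in the downloaded HTML for LinkExtractor to find.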

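As for the second hypothesis, Scrapy's DOWNLOAD_TIMEOUT setting (180 seconds by default) can be raised for a single spider via custom_settings; a minimal sketch:

class StackCrawlerSpider(CrawlSpider):
    name = 'crawler_larazon'
    # raise the per-request download timeout above Scrapy's 180 s default
    custom_settings = {'DOWNLOAD_TIMEOUT': 360}

A genuine timeout would show up as an error in the crawl log rather than as a silent lack of items, which may help decide between the two hypotheses.
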
Any ideas about what I'm missing? Thanks a lot!

0 Answers:

No answers