Question

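I am trying to write a simple scraping script that collects the Google Summer of Code organizations using the technologies I need. It is a work in progress. My parse function works fine, but whenever the callback goes to the parse_org function, it produces no output.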

# -*- coding: utf-8 -*-
import scrapy



class GsocSpider(scrapy.Spider):
    name = 'gsoc'
    allowed_domains = ['https://summerofcode.withgoogle.com/archive/2018/organizations/']
    start_urls = ['https://summerofcode.withgoogle.com/archive/2018/organizations/']
    def parse(self, response):
        for href in response.css('li.organization-card__container a.organization-card__link::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback = self.parse_org)

    def parse_org(self,response):
        tech=response.css('li.organization__tag organization__tag--technology::text').extract()
    #if 'python' in tech:
        yield
        {
        'name':response.css('title::text').extract_first()
        #'ideas_list':response.css('')
    }

Answer 1

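First of all, your allowed_domains is misconfigured. As specified in the documentation:

An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won't be followed if OffsiteMiddleware is enabled.

Let's say your target url is https://www.example.com/1.html, then add 'example.com' to the list.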

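As you can see, you only need to include the domain, not a full URL. The setting acts as a filter, so that other domains do not get crawled. Because your list contains a URL instead of a domain, the requests yielded from parse are treated as off-site and dropped, which is why parse_org never runs. The setting is also optional, so I would actually recommend leaving it out.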
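If you do want to keep the filter, here is a minimal sketch of getting the bare host name out of the start URL with the standard library. The variable name start_url is only for illustration and is not part of the original spider.

```python
from urllib.parse import urlparse

# Illustrative variable; in the spider this would be the entry in start_urls.
start_url = 'https://summerofcode.withgoogle.com/archive/2018/organizations/'

# allowed_domains expects host names only: no scheme, no path.
allowed_domains = [urlparse(start_url).netloc]

print(allowed_domains)  # ['summerofcode.withgoogle.com']
```

Writing the host name by hand works just as well; the point is only that the scheme and path must not appear in the list.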

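Also, your css for getting tech is incorrect. With a space between the two class names, the selector looks for an <organization__tag--technology> element inside the li rather than an li carrying both classes. It should be: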

li.organization__tag.organization__tag--technology

Scrapy not following the next parse function