Question

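I am trying to write a spider with Scrapy that, ideally, returns a list of URLs from a site whenever the page at that URL contains a certain class, which I would define through the selector passed to response.css(".class"). I don't know whether this is possible when the class only appears on the page if the user is logged in.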

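I have read the guides on writing a spider with Scrapy, and I have managed to return a list of selectors using other classes that I know are on the page whether or not the user is logged in, just as a test to confirm I hadn't written anything wrong. I really just want to know whether this is possible and, if so, what steps I can take to get there.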

import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['www.example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # Print the list of selectors that match the class.
        print(response.css(".class"))

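The code I have so far is obviously very basic and barely edited from the generated template, since I am still at the simple testing stage. Ideally I would get a list of selectors, if possible, which would then give me a list of URLs for every page where the class was found. All I am looking for is the URLs of the pages that contain the defined class.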

Answer 1

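Your question is not entirely clear to me. I assume you want to get the URLs that have a specific class attribute. If so, you can change the definition of your spider's parse method: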

def parse(self, response):
    for url in response.css('a[class="classname"]::attr(href)').getall():
        print(url)

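Also, if the information you want to scrape is only available when you are logged in to the target site, you should set up a form request to authenticate.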

class LoginSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['http://www.example.com/login']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'yourusername', 'password': 'yourpassword'},
            callback=self.after_login
        )

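    def after_login(self, response):
        # response.text is a str; response.body is bytes on Python 3.
        if "login failed" in response.text:
            self.logger.error("Login failed")
            return
        # Request URLs need a scheme such as http://.
        return scrapy.Request(url="http://www.webpageyouwanttoscrape.com", callback=self.get_all_urls)

    def get_all_urls(self, response):
        # Print the href of every link with the target class.
        for url in response.css('a[class="classname"]::attr(href)').getall():
            print(url)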

For more information on form requests, see the link below: https://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login

How do I scrape a class that is only shown when logged in?
