Logging in to a website with Scrapy

Time: 2017-06-03 16:07:57

Tags: scrapy

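I am implementing authorization (login) on a website, following an official video course. If the username and password are incorrect, the request reaches the callback method successfully. If the username and password are correct, the callback method is never reached. My code:

import scrapy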

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://www.darkorbit.com"]

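    def parse(self, response):
        # Read the login form's target URL from its action attribute.
        login_url = response.css('form[name="bgcdw_login_form"]::attr(action)').extract_first()
        data = {
            'username': 'testscrapy',
            'password': 'testtest',
        }
        # Submit the credentials; after_login should receive the response.
        yield scrapy.FormRequest(url=login_url, formdata=data, callback=self.after_login)

    def after_login(self, response):
        # Marker line to show that the callback was reached.
        print('----------------------------------------')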

With the correct credentials, I get the following log (long fragments are truncated):

2017-06-03 22:04:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.darkorbit.com/robots.txt> (referer: None)
2017-06-03 22:04:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.darkorbit.com> (referer: None)
2017-06-03 22:04:42 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://auth3.bpsecure.com/robots.txt> (referer: None)
2017-06-03 22:04:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.darkorbit.com/ProjectAp........>
2017-06-03 22:04:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ru4.darkorbit.com/Pro..........>
2017-06-03 22:04:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ru4.darkorbit.com/robots.txt> (referer: None)
2017-06-03 22:04:43 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://ru4.darkorbit.com/Pro......>

1 answer:

Answer 0 (score: 0)

From this line of the log:

2017-06-03 22:04:43 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://ru4.darkorbit.com/Pro......>

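I can tell that you need to change a setting in your settings.py file. After a successful login the site redirects to a URL that Scrapy's robots.txt middleware filters out, so the request is dropped and the callback is never called.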

The ROBOTSTXT_OBEY variable needs to be set to False:

ROBOTSTXT_OBEY = False
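
To illustrate what a "Forbidden by robots.txt" decision amounts to, here is a minimal sketch using the standard library's urllib.robotparser. It is not Scrapy's own middleware code. The robots.txt content, the example.com URLs, and the is_allowed helper are all hypothetical and are not taken from the site in the question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration only,
# NOT the real file served by the site in the question).
HYPOTHETICAL_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/private/page"):
        allowed = is_allowed(HYPOTHETICAL_ROBOTS_TXT, "Scrapy", url)
        print(url, "allowed" if allowed else "forbidden")
```

When ROBOTSTXT_OBEY is True, a request whose URL fails a check of this kind is dropped before it is downloaded, which matches the behavior in the log above: the post-login redirect target is never fetched, so after_login never runs.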