Scrapy returns a 401 Unauthorized response

Time: 2020-09-14 00:41:55

Tags: python web-scraping scrapy python-requests

The website is https://www.extratodebito.detran.pr.gov.br/detranextratos/geraExtrato.do?action=iniciarProcesso

        yield Request(self.url, callback=self.login_me, dont_filter=True)

This returns <html><head><title>Error</title></head><body>Unauthorized</body></html>

But if I make the same request with the requests library, it works fine!

Why does this happen?

Update:

The normal request headers sent by the browser look like this:

Host: www.extratodebito.detran.pr.gov.br
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-User: ?1
Sec-Fetch-Dest: document
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9

I added these headers to Scrapy, but I can see that an Authorization field is added during the request:

User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Authorization: Basic MTM2ZGNjNmFhOWZmNDA1Njk1YWU1MWE0ZjI1MzZlYzE6
Host: www.extratodebito.detran.pr.gov.br
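The Authorization value in the dump above uses the HTTP Basic scheme, so it is only base64-encoded and can be decoded to see what kind of credentials Scrapy attached. The following is a minimal sketch using only the standard library; the helper name decode_basic_auth is made up for illustration.

```python
import base64
from typing import Tuple

# Header value copied from the Scrapy request dump above.
header_value = "Basic MTM2ZGNjNmFhOWZmNDA1Njk1YWU1MWE0ZjI1MzZlYzE6"


def decode_basic_auth(value: str) -> Tuple[str, str]:
    """Split a Basic Authorization header value into (user, password)."""
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError(f"not a Basic auth header: {scheme!r}")
    decoded = base64.b64decode(token).decode("utf-8")
    # Basic credentials are encoded as "user:password".
    user, _, password = decoded.partition(":")
    return user, password


user, password = decode_basic_auth(header_value)
print(f"user has {len(user)} characters, password is {'set' if password else 'empty'}")
```

A long user name with an empty password is the usual shape of an API key passed as the Basic auth user, which suggests the header comes from credentials configured on the spider rather than from the headers that were added by hand.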

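Update 2:

Solved by removing http_user and http_pass from the spider. They were meant for Splash, but scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware also uses them, so the credentials were sent with ordinary requests as well.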
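Why removing the two attributes helps can be sketched with a simplified model. The code below is not Scrapy's own code: SplashSpider, PlainSpider and apply_http_auth are hypothetical stand-ins, and the model assumes the middleware adds a Basic Authorization header to every request that does not already carry one whenever the spider defines http_user or http_pass, which matches the behavior described above.

```python
import base64
from typing import Dict


def basic_auth_header(user: str, password: str) -> str:
    """Build an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


class SplashSpider:
    """Hypothetical spider that sets credentials intended only for Splash."""
    http_user = "example-api-key"  # placeholder, not a real key
    http_pass = ""


class PlainSpider:
    """The same spider after removing http_user and http_pass."""


def apply_http_auth(spider: object, headers: Dict[str, str]) -> Dict[str, str]:
    """Simplified model: add Authorization when the spider defines credentials."""
    user = getattr(spider, "http_user", "")
    password = getattr(spider, "http_pass", "")
    result = dict(headers)  # leave the caller's headers untouched
    if (user or password) and "Authorization" not in result:
        result["Authorization"] = basic_auth_header(user, password)
    return result


browser_headers = {"Accept": "text/html", "Accept-Language": "en-US,en;q=0.9"}
with_auth = apply_http_auth(SplashSpider(), browser_headers)
without_auth = apply_http_auth(PlainSpider(), browser_headers)
print("Authorization" in with_auth, "Authorization" in without_auth)
```

In this model the target site receives credentials it never asked for, which would explain the 401. Newer Scrapy releases also provide an http_auth_domain spider attribute for limiting which domain the credentials are sent to; whether that is available depends on the installed version.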

1 Answer:

Answer 0 (score: 0)

It works fine for me when the Accept, Accept-Language and Accept-Encoding headers are added. I tested it in scrapy shell:

from scrapy import Request
# Accept* headers for the test request (plain string keys and values)
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Accept-Encoding': 'gzip,deflate,br',
}
url = "https://www.extratodebito.detran.pr.gov.br/detranextratos/geraExtrato.do?action=iniciarProcesso"
req = Request(url, headers=headers)
fetch(req)  # fetch() is provided by the scrapy shell

I get a 200 response:

2020-09-14 11:16:03 [scrapy.core.engine] INFO: Spider opened
2020-09-14 11:16:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.extratodebito.detran.pr.gov.br/detranextratos/geraExtrato.do?action=iniciarProcesso> (referer: None)