Python requests: add a "Referer" header to redirected requests

Date: 2020-02-26 22:17:09

Tags: python python-requests pycurl http-referer request-headers

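I would like to know whether Python requests supports something like curl's "auto-referer" feature. Basically, with allow_redirects=True, requests should automatically set the "Referer" header on each subsequent redirected request.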

Here is what the request headers look like when using requests (no "Referer" header):

>>> import requests
>>> import logging
>>> import http.client
>>> http.client.HTTPConnection.debuglevel = 1
>>> logging.basicConfig()
>>> logging.getLogger().setLevel(logging.DEBUG)
>>> requests_log = logging.getLogger("requests.packages.urllib3")
>>> requests_log.setLevel(logging.DEBUG)
>>> requests_log.propagate = True
>>> r = requests.post('http://www.somewebsite.com', allow_redirects=True)
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): www.somewebsite.com:80
send: b'POST / HTTP/1.1\r\nHost: www.somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nContent-Length: 0\r\n\r\n'
reply: 'HTTP/1.1 307 Temporary Redirect\r\n'
DEBUG:urllib3.connectionpool:http://www.somewebsite.com:80 "POST / HTTP/1.1" 307 185
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.somewebsite.com:443
header: Server header: Date header: Content-Type header: Content-Length header: Connection header: Location header: X-Cache header: Via header: X-Amz-Cf-Pop header: X-Amz-Cf-Id
send: b'POST / HTTP/1.1\r\nHost: www.somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nContent-Length: 0\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
DEBUG:urllib3.connectionpool:https://www.somewebsite.com:443 "POST / HTTP/1.1" 302 13
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): somewebsite.com:443
header: Content-Type header: Content-Length header: Connection header: Date header: Location header: Access-Control-Allow-Origin header: X-Cache header: Via header: X-Amz-Cf-Pop header: X-Amz-Cf-Id
send: b'GET / HTTP/1.1\r\nHost: somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
DEBUG:urllib3.connectionpool:https://somewebsite.com:443 "GET / HTTP/1.1" 200 149681
header: Content-Type header: Content-Length header: Connection header: Date header: Server header: Expires header: Last-Modified header: Content-Encoding header: Via header: Vary header: Accept-Ranges header: Cache-Control header: Set-Cookie header: X-Cache header: X-Amz-Cf-Pop header: X-Amz-Cf-Id

Here is what the request headers look like when using pycurl (with the "Referer" header):

>>> import pycurl
>>> from io import BytesIO
>>> buffer = BytesIO()
>>> c = pycurl.Curl()
>>> c.setopt(c.URL, 'http://www.somewebsite.com/')
>>> c.setopt(c.WRITEDATA, buffer)
>>> c.setopt(pycurl.VERBOSE, 1)
>>> c.setopt(pycurl.AUTOREFERER, 1)
>>> c.setopt(pycurl.FOLLOWLOCATION, 1)
>>> c.perform()
>>> c.close()
*   Trying 99.84.194.56...
* Connected to www.somewebsite.com (99.84.194.56) port 80 (#0)
> GET / HTTP/1.1
Host: www.somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*

< HTTP/1.1 301 Moved Permanently
< Server: CloudFront
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Content-Type: text/html
< Content-Length: 183
< Connection: keep-alive
< Location: https://www.somewebsite.com/
< X-Cache: Redirect from cloudfront
< Via: 1.1 40ddfb9607f5d49c286c41e9afdce772.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LAX3-C3
< X-Amz-Cf-Id: Uij3cpBtl0ZJ_OwFFDSint5ab3Ayvn0okmhJekgtxI-etIN5l07sjg==
< 
* Ignoring the response-body
* Connection #0 to host www.somewebsite.com left intact
* Issue another request to this URL: 'https://www.somewebsite.com/'
* Found bundle for host www.somewebsite.com: 0x2ab53b0 [can pipeline]
*   Trying 99.84.194.113...
* Connected to www.somewebsite.com (99.84.194.113) port 443 (#1)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*    subject: CN=watchdisneyfe.com
*    start date: Dec 16 00:00:00 2019 GMT
*    expire date: Jan 16 12:00:00 2021 GMT
*    subjectAltName: www.somewebsite.com matched
*    issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*    SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*
Referer: http://www.somewebsite.com/

< HTTP/1.1 302 Moved Temporarily
< Content-Type: text/plain
< Content-Length: 13
< Connection: keep-alive
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Location: https://somewebsite.com/
< Access-Control-Allow-Origin: *
< X-Cache: Miss from cloudfront
< Via: 1.1 74d35431a23bfc97a6055173d9be2dc4.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LAX3-C3
< X-Amz-Cf-Id: Bxg1W9zPN7U4i8GqysA11vj6h2dyDZdClyMUfUMfVUqd-v_mrQXGhQ==
< 
* Ignoring the response-body
* Connection #1 to host www.somewebsite.com left intact
* Issue another request to this URL: 'https://somewebsite.com/'
*   Trying 13.225.146.93...
* Connected to somewebsite.com (13.225.146.93) port 443 (#2)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*    subject: CN=watchdisneyfe.com
*    start date: Dec 16 00:00:00 2019 GMT
*    expire date: Jan 16 12:00:00 2021 GMT
*    subjectAltName: somewebsite.com matched
*    issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*    SSL certificate verify ok.
> GET / HTTP/1.1
Host: somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*
Referer: https://www.somewebsite.com/

< HTTP/1.1 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 1218349
< Connection: keep-alive
< Vary: Accept-Encoding
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Server: nginx/1.16.1
< Expires: Wed, 26 Feb 2020 21:56:48 GMT
< Last-Modified: Wed, 26 Feb 2020 21:56:48 GMT
< Via: 1.1 varnish-v4, 1.1 a52dcb1fed052adbd58b868375961d24.cloudfront.net (CloudFront)
< Vary: Accept-Encoding
< Accept-Ranges: bytes
< Cache-Control: max-age=0, must-revalidate
< Set-Cookie: SWID=72B09DFD-D038-485C-C836-7229EB59F0B1; path=/; Expires=Sun, 26 Feb 2040 21:46:55 GMT; domain=somewebsite.com;
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Pop: LAX3-C4
< X-Amz-Cf-Id: JGF1k-OnDIZT_1DP5psnrlb9jmmp7rq69QbGNZL1CVGbjJWjORwpGQ==
< 
* Connection #2 to host somewebsite.com left intact

Is there any way to add the "Referer" header automatically, the way curl does?

Note: if you want to try this out, replace "somewebsite" with "abc".

1 answer:

Answer 0 (score: 1)

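requests has no official hook for this task. However, you can subclass requests.Session and wrap a method that is called for every redirect: Session.rebuild_auth(). Its documentation reads:

When being redirected we may want to strip authentication from the request to avoid leaking credentials. This method intelligently removes and reapplies authentication where possible to avoid credential loss.

Because it is called with both the next (prepared) request and the previous response that triggered the redirect, it is a convenient place to add the Referer header: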

import requests

class RefererSession(requests.Session):
    def rebuild_auth(self, prepared_request, response):
        super().rebuild_auth(prepared_request, response)
        prepared_request.headers["Referer"] = response.url

Then use this subclass for all of your requests:

with RefererSession() as session:
    r = session.post('http://www.somewebsite.com', allow_redirects=True)
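
One caveat worth considering: response.url is copied into the header as-is, so if the URL you requested carried credentials (user:password@host) or a fragment, those would be sent along too. RFC 7231 says a Referer must not include the userinfo or fragment components. The following is only a sketch of one way to handle that; sanitize_referer and SanitizedRefererSession are names made up for this example, not part of requests.

```python
from urllib.parse import urlsplit, urlunsplit

import requests


def sanitize_referer(url):
    """Return url without userinfo and fragment (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if ":" in host:
        # IPv6 literal: urlsplit strips the brackets, so put them back.
        host = "[" + host + "]"
    if parts.port is not None:
        host = "%s:%d" % (host, parts.port)
    # Rebuild the URL with an empty fragment and no user:password@ prefix.
    return urlunsplit((parts.scheme, host, parts.path, parts.query, ""))


class SanitizedRefererSession(requests.Session):
    """Same idea as RefererSession, but cleans the URL first."""

    def rebuild_auth(self, prepared_request, response):
        super().rebuild_auth(prepared_request, response)
        prepared_request.headers["Referer"] = sanitize_referer(response.url)
```

It is used exactly like RefererSession. Whether you need it depends on whether your URLs ever contain credentials or fragments.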

A demo using https://httpbin.org:

>>> import requests
>>> import http.client
>>> from ast import literal_eval
>>> def echo_request_lines(msg, *rest):
...     """HTTPConnection debug print handler, writes out request lines"""
...     if msg != 'send:': return
...     request_lines = literal_eval(rest[0]).replace(b'\r', b'')
...     print(request_lines.rstrip().decode('latin1'))
...     print()
...
>>> http.client.HTTPConnection.debuglevel = 1
>>> http.client.print = echo_request_lines
>>> class RefererSession(requests.Session):
...     def rebuild_auth(self, prepared_request, response):
...         super().rebuild_auth(prepared_request, response)
...         prepared_request.headers["Referer"] = response.url
...
>>> with RefererSession() as session:
...     r = session.get('https://httpbin.org/redirect/2')
...
GET /redirect/2 HTTP/1.1
Host: httpbin.org
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

GET /relative-redirect/1 HTTP/1.1
Host: httpbin.org
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Referer: https://httpbin.org/redirect/2

GET /get HTTP/1.1
Host: httpbin.org
User-Agent: python-requests/2.22.0
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Referer: https://httpbin.org/relative-redirect/1

>>> from pprint import pprint
>>> pprint(dict(r.history[1].request.headers))
{'Accept': '*/*',
 'Accept-Encoding': 'gzip, deflate',
 'Connection': 'keep-alive',
 'Referer': 'https://httpbin.org/redirect/2',
 'User-Agent': 'python-requests/2.22.0'}
>>> pprint(dict(r.request.headers))
{'Accept': '*/*',
 'Accept-Encoding': 'gzip, deflate',
 'Connection': 'keep-alive',
 'Referer': 'https://httpbin.org/relative-redirect/1',
 'User-Agent': 'python-requests/2.22.0'}
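
The httpbin demo needs network access. If you would rather check the behaviour offline, here is a rough sketch that runs the same kind of redirect chain against a throwaway server from the standard library. RedirectHandler, its /start, /middle and /end paths, and run_demo are all invented for this example; only RefererSession comes from the answer.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RefererSession(requests.Session):
    """Sets Referer on every redirected request, as in the answer."""

    def rebuild_auth(self, prepared_request, response):
        super().rebuild_auth(prepared_request, response)
        prepared_request.headers["Referer"] = response.url


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /start -> /middle -> /end."""

    hops = {"/start": "/middle", "/middle": "/end"}
    seen = []  # (path, Referer header or None), in arrival order

    def do_GET(self):
        RedirectHandler.seen.append((self.path, self.headers.get("Referer")))
        target = self.hops.get(self.path)
        if target is not None:
            self.send_response(302)
            self.send_header("Location", target)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"done"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence the default per-request logging


def run_demo():
    """Follow the redirect chain once and report what the server saw."""
    RedirectHandler.seen.clear()
    server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    try:
        with RefererSession() as session:
            session.trust_env = False  # ignore proxy and .netrc settings locally
            response = session.get(base + "/start")
    finally:
        server.shutdown()
        server.server_close()
    return base, response, list(RedirectHandler.seen)


if __name__ == "__main__":
    base, response, seen = run_demo()
    for path, referer in seen:
        print(path, "Referer:", referer)
```

The first request should arrive without a Referer, and each later hop should carry the URL of the response that redirected to it, matching what the httpbin transcript shows.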