Python Tor: The request could not be satisfied / Request blocked

Date: 2018-08-09 01:18:25

Tags: python web-scraping beautifulsoup python-requests tor

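I am trying to request the link below through Tor, but the response is an error page. The same request without Tor works fine, but I still need to use Tor or rotating random IPs.

Am I doing this correctly, or is there a better solution?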

import requests

link = 'https://www.totallylegal.com/searchjobs/'
torport = 9050
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    'accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}
proxies = {
    'http': "socks5h://localhost:{}".format(torport),
    'https': "socks5h://localhost:{}".format(torport)
}

print(requests.get(link,headers=headers, proxies=proxies).content)

This is the error page that comes back:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Request blocked.

<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: iXaDPfPtyHg0TGTFJvYuAnV86unJIpBITxdBJ2w_i_bo-ToR510p2w==
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>
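
The question prints the response body without checking the status code, so the block only becomes visible by reading the HTML. As a minimal sketch, a check like the following could flag this kind of page before any parsing starts. The function name `looks_like_cloudfront_block` and the two marker strings it looks for are assumptions taken from the error page above, not part of any library.

```python
def looks_like_cloudfront_block(status_code, body):
    """Return True if a response resembles the CloudFront 403 page shown above."""
    if status_code != 403:
        return False
    text = body.lower()
    # Both markers are copied from the error page in the question.
    return "generated by cloudfront" in text and "request blocked" in text


# Shortened copy of the error page from the question, used as sample input.
sample_body = """<H1>403 ERROR</H1>
<H2>The request could not be satisfied.</H2>
Request blocked.
<PRE>
Generated by cloudfront (CloudFront)
</PRE>"""

print(looks_like_cloudfront_block(403, sample_body))
print(looks_like_cloudfront_block(200, "<html>ok</html>"))
```

With `requests`, the inputs would be `response.status_code` and `response.text`.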

1 answer:

Answer 0 (score: 0)

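The page appears to block Tor IP addresses, so we can work around the problem by going through another site, for example the W3C validator, which shows us the page source: https://validator.w3.org/nu/?showsource=yes&doc=https%3A%2F%2Fwww.totallylegal.com%2Fsearchjobs%2F

We are still using Tor, but we let another site fetch the page for us (and its IP is not blocked):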
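
The validator URL above carries the target page as a percent-encoded `doc` parameter. Rather than encoding it by hand, it could be built with the standard library, roughly as sketched here. The helper name `validator_source_url` is hypothetical; the endpoint and the `showsource` and `doc` parameters are taken from the URL above.

```python
from urllib.parse import urlencode

# Validator endpoint, copied from the URL in the answer.
VALIDATOR = "https://validator.w3.org/nu/"


def validator_source_url(target_url):
    """Build a validator URL that asks for the source listing of target_url."""
    # urlencode percent-encodes ':' and '/' in the target URL.
    query = urlencode({"showsource": "yes", "doc": target_url})
    return f"{VALIDATOR}?{query}"


print(validator_source_url("https://www.totallylegal.com/searchjobs/"))
```

For the page in the question this produces the same URL as the one written out above.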

from bs4 import BeautifulSoup
import requests

# Placeholder proxy settings: fill in your own proxy address and port.
proxies = {
    'http': 'http://<YOUR PROXY ADDRESS>:<YOUR PROXY PORT>',
    'https': 'http://<YOUR PROXY ADDRESS>:<YOUR PROXY PORT>',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
}

# Ask the validator to fetch the page and show its source listing.
r = requests.get('https://validator.w3.org/nu/?showsource=yes&doc=https%3A%2F%2Fwww.totallylegal.com%2Fsearchjobs%2F', proxies=proxies, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

# Rebuild the original HTML from the listing: <code class="lf"> marks a line break.
source_code = ''
for code in soup.select('ol.source > li > code'):
    if 'class' in code.attrs and 'lf' in code.attrs['class']:
        source_code += '\n'
    else:
        source_code += code.text

# Parse the rebuilt HTML and print each job title.
soup2 = BeautifulSoup(source_code, 'lxml')

for heading in soup2.select('li.lister__item h3'):
    print(heading.text)
    print('-' * 80)

Prints:

Corporate Partner
--------------------------------------------------------------------------------
Personal Injury Paralegal
--------------------------------------------------------------------------------
Healthcare Regulatory Lawyer - London
--------------------------------------------------------------------------------
Company Secretary and Corporate Governance
--------------------------------------------------------------------------------
Junior FCPA/Compliance Associate, Beijing - 14612/TTL
--------------------------------------------------------------------------------
International Project Manager, Shanghai - 14611/TTL
--------------------------------------------------------------------------------
Corporate Associate (4+ PQE) Beijing - 14610/TTL
--------------------------------------------------------------------------------
Corporate Associate (5+ PQE) Shanghai - 14609/TTL
--------------------------------------------------------------------------------
Corporate or Commercial Counsel -Pharma- Surrey
--------------------------------------------------------------------------------
Corporate/Public M&A PSL, 5+ PQE
--------------------------------------------------------------------------------
Solicitor
--------------------------------------------------------------------------------
In-house Legal Counsel - Excellent opportunity to go In-House!
--------------------------------------------------------------------------------
Real Estate Partner
--------------------------------------------------------------------------------
Child Brain Injury Solicitor
--------------------------------------------------------------------------------
Corporate/Commercial In-House Lawyer, 1+
--------------------------------------------------------------------------------
In-house Regulatory Counsel, Banking/Payments, 5+
--------------------------------------------------------------------------------
In-house Property Finance/Banking Lawyer, 1-3
--------------------------------------------------------------------------------
Hybrid Legal & Compliance Data Protection Manager
--------------------------------------------------------------------------------
Hedge Fund Legal Counsel 3-5 years PQE
--------------------------------------------------------------------------------
Corporate PSL
--------------------------------------------------------------------------------
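
The reconstruction step in the answer (joining the validator's `code` elements and treating the ones with class `lf` as line breaks) can be tried offline on a small sample. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages, and it simplifies the selector by collecting every `code` element rather than only `ol.source > li > code`. The sample markup is an assumption modelled on the selectors used in the answer, not a captured validator response, and `SourceListingParser` is a hypothetical name.

```python
from html.parser import HTMLParser


class SourceListingParser(HTMLParser):
    """Collect the text of <code> elements; class "lf" marks a line break."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_code = False
        self.is_linefeed = False

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.in_code = True
            classes = (dict(attrs).get("class") or "").split()
            self.is_linefeed = "lf" in classes
            if self.is_linefeed:
                # A line-feed marker contributes a newline instead of its text.
                self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "code":
            self.in_code = False
            self.is_linefeed = False

    def handle_data(self, data):
        # Character references such as &lt; arrive already unescaped.
        if self.in_code and not self.is_linefeed:
            self.parts.append(data)


# Assumed sample of a source listing: escaped HTML plus line-feed markers.
sample = (
    '<ol class="source">'
    '<li><code>&lt;h3&gt;Corporate Partner&lt;/h3&gt;</code>'
    '<code class="lf">\u21a9</code></li>'
    '<li><code>&lt;h3&gt;Solicitor&lt;/h3&gt;</code>'
    '<code class="lf">\u21a9</code></li>'
    '</ol>'
)

parser = SourceListingParser()
parser.feed(sample)
parser.close()
source_code = "".join(parser.parts)
print(source_code)
```

The rebuilt `source_code` string is ordinary HTML again, which is why the answer can hand it to a second parser to pick out the job titles.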