I'm trying to add proxy support to my web scraper. Without a proxy, my code connects to websites with no problem, but as soon as I try to add a proxy it suddenly won't connect! I can't seem to find anyone posting about this problem with python-requests, so I'm hoping you all can help!
Background info: I'm on a Mac, using Anaconda's Python 3.4 inside a virtual environment.
Here is my code without the proxy:
import requests
from bs4 import BeautifulSoup

proxyDict = {'http': 'http://10.10.1.10:3128'}

def pmc_spider(max_pages, pmid):
    start = 1
    titles_list = []
    url_list = []
    url_keys = []
    while start <= max_pages:
        url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/' + str(pmid) + '/citedby/?page=' + str(start)
        req = requests.get(url)  # this works
        plain_text = req.text
        soup = BeautifulSoup(plain_text, "lxml")
        for items in soup.findAll('div', {'class': 'title'}):
            title = items.get_text()
            titles_list.append(title)
            for link in items.findAll('a'):
                urlkey = link.get('href')
                url_keys.append(urlkey)  # url = base + key
                url = "http://www.ncbi.nlm.nih.gov" + str(urlkey)
                url_list.append(url)
        start += 1
    return titles_list, url_list, authors_list
Based on other posts I've been looking at, I should be able to replace this:
req = requests.get(url)
with this:
req = requests.get(url, proxies=proxyDict, timeout=2)
But this doesn't work! :( If I run it with that line, the terminal gives me a timeout error:
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 578, in urlopen
chunked=chunked)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 362, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1182, in _send_request
self.endheaders(body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1133, in endheaders
self._send_output(message_body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 963, in _send_output
self.send(msg)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 898, in send
self.connect()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 167, in connect
conn = self._new_conn()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 147, in _new_conn
(self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)')
followed by a few more, different traces in the terminal, but always the same error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/adapters.py", line 403, in send
timeout=timeout
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 623, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 281, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.1.10', port=3128): Max retries exceeded with url: http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18269575/citedby/?page=1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)'))
Why does adding a proxy to my code suddenly cause a timeout? I tried it on several random URLs and the same thing happened, so it seems to be a proxy problem rather than a problem with my code. However, I have to use a proxy now, so I need to get to the root of this and fix it. I also tried several different proxy IP addresses from the VPN I use, so I know the IP addresses are valid.
Any help is much appreciated. Thanks!
Answer 0 (score: 0)
It looks like you need to use an HTTP or HTTPS proxy that actually responds to requests.
The 10.10.1.10:3128 in your code appears to come from the requests documentation.
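As a quick sanity check, here is a minimal sketch of how you could test whether a proxy even accepts TCP connections before involving requests at all (the host and port below are just the placeholder address from your code; substitute whichever proxy you actually intend to use):

import socket

def proxy_reachable(host, port, timeout=2):
    # True if the proxy accepts a TCP connection within `timeout` seconds.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(proxy_reachable('10.10.1.10', 3128))  # the placeholder address will print False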
Using a proxy from the list at http://proxylist.hidemyass.com/search-1291967 (possibly not the best source), your proxyDict should look something like this: {'http' : 'http://209.242.141.60:8080'}
Testing it on the command line, it seems to work fine:
>>> proxies = {'http' : 'http://209.242.141.60:8080'}
>>> requests.get('http://google.com', proxies=proxies)
<Response [200]>
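For completeness, a rough sketch of how such a dict would be used in your scraper once you have a live proxy. The address is just the example above and may well be dead by the time you read this; the 'https' entry is only an assumption about what you would need if you ever switch to https:// URLs:

import requests

proxyDict = {
    'http': 'http://209.242.141.60:8080',   # used for http:// URLs
    'https': 'http://209.242.141.60:8080',  # used for https:// URLs
}

url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18269575/citedby/?page=1'
try:
    req = requests.get(url, proxies=proxyDict, timeout=10)
    print(req.status_code)
except (requests.exceptions.ConnectTimeout, requests.exceptions.ProxyError) as err:
    # If this fires, the proxy itself is not answering -- the same failure you saw.
    print('Proxy did not respond:', err)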