I'm trying to add proxy support to my web scraper. Without a proxy, my code connects to websites with no problem, but as soon as I try to add a proxy it suddenly won't connect! I can't seem to find anyone posting about this problem with python-requests, so I'm hoping you all can help!
Background info: I'm on a Mac, using Anaconda's Python 3.4 inside a virtual environment.
Here is my code without the proxy:
import requests
from bs4 import BeautifulSoup

proxyDict = {'http': 'http://10.10.1.10:3128'}

def pmc_spider(max_pages, pmid):
    start = 1
    titles_list = []
    url_list = []
    url_keys = []
    while start <= max_pages:
        url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/' + str(pmid) + '/citedby/?page=' + str(start)
        req = requests.get(url)  # this works
        plain_text = req.text
        soup = BeautifulSoup(plain_text, "lxml")
        for items in soup.findAll('div', {'class': 'title'}):
            title = items.get_text()
            titles_list.append(title)
            for link in items.findAll('a'):
                urlkey = link.get('href')
                url_keys.append(urlkey)  # url = base + key
                url = "http://www.ncbi.nlm.nih.gov" + str(urlkey)
                url_list.append(url)
        start += 1
    return titles_list, url_list, authors_list
Based on other posts I've been looking at, I should be able to replace this:
req = requests.get(url)
with this:
req = requests.get(url, proxies=proxyDict, timeout=2)
But this doesn't work! :( If I run it with that line, the terminal gives me a timeout error:
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 578, in urlopen
chunked=chunked)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 362, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1182, in _send_request
self.endheaders(body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1133, in endheaders
self._send_output(message_body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 963, in _send_output
self.send(msg)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 898, in send
self.connect()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 167, in connect
conn = self._new_conn()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 147, in _new_conn
(self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)')
followed by a few more, different traces in the terminal, but always the same error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/adapters.py", line 403, in send
timeout=timeout
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 623, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 281, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.1.10', port=3128): Max retries exceeded with url: http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18269575/citedby/?page=1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)'))
Why does adding a proxy to my code suddenly cause a timeout? I tried it on several random URLs and the same thing happened, so it seems to be a proxy problem rather than a problem with my code. However, I have to use a proxy now, so I need to get to the root of this and fix it. I also tried several different proxy IP addresses from the VPN I use, so I know the IP addresses are valid.
Any help is much appreciated. Thanks!
Answer 0 (score: 0)
It looks like you need to use an HTTP or HTTPS proxy that actually responds to requests.
The 10.10.1.10:3128 in your code appears to come from the requests documentation.
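As a quick sanity check, here is a minimal sketch of how you could test whether a proxy even accepts TCP connections before involving requests at all (the host and port below are just the placeholder address from your code; substitute whichever proxy you actually intend to use):

import socket

def proxy_reachable(host, port, timeout=2):
    # True if the proxy accepts a TCP connection within `timeout` seconds.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(proxy_reachable('10.10.1.10', 3128))  # the placeholder address will print False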
Using a proxy from the list at http://proxylist.hidemyass.com/search-1291967 (possibly not the best source), your proxyDict should look something like this: {'http' : 'http://209.242.141.60:8080'}
Testing it on the command line, it seems to work fine:
>>> proxies = {'http' : 'http://209.242.141.60:8080'}
>>> requests.get('http://google.com', proxies=proxies)
<Response [200]>
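For completeness, a rough sketch of how such a dict would be used in your scraper once you have a live proxy. The address is just the example above and may well be dead by the time you read this; the 'https' entry is only an assumption about what you would need if you ever switch to https:// URLs:

import requests

proxyDict = {
    'http': 'http://209.242.141.60:8080',   # used for http:// URLs
    'https': 'http://209.242.141.60:8080',  # used for https:// URLs
}

url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18269575/citedby/?page=1'
try:
    req = requests.get(url, proxies=proxyDict, timeout=10)
    print(req.status_code)
except (requests.exceptions.ConnectTimeout, requests.exceptions.ProxyError) as err:
    # If this fires, the proxy itself is not answering -- the same failure you saw.
    print('Proxy did not respond:', err)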