I am trying to scrape all of the pages on this website. I wrote this code:
from bs4 import BeautifulSoup
import pandas as pd
import time
import urllib.request

output = open('signalpeptide.txt', 'a')
for each_page in range(1, 220000):
    if each_page % 1000 == 0:
        time.sleep(5)  # this is because of download limit
    url = 'http://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=' + str(each_page) + '&listname='
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    tabs = soup.find_all('table')
    pd_list = pd.read_html(str(tabs[0]))
    temp_list = []
    for i in range(22):
        temp_list.append(str(pd_list[0][2][i]).strip())
    output.write(temp_list[1] + '\t' + temp_list[3] + '\t' + temp_list[7] + '\t' + temp_list[15] + '\t')
    pd_list2 = pd.read_html(str(tabs[1]))
    output.write(str(pd_list2[0][0][1]) + '\t' + str(pd_list2[0][2][1]) + '\n')
My connection is being refused because of too many attempts at the URL (I know this because when I run the code with requests instead of urllib.request.urlopen, the error says "Max retries exceeded with url:"):
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.signalpeptide.de', port=80): Max retries exceeded with url: /index.php?sess=&m=listspdb_bacteria&s=details&id=1000&listname= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x11ebc0e48>: Failed to establish a new connection: [Errno 61] Connection refused'))
The other approaches suggested here did not work either, and a user in that thread suggested I write a separate post about this problem.
I have looked into Scrapy, but I don't really understand how to tie it into the script above. Can anyone tell me how to edit the script above so that I avoid errors like:
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x1114f0898>: Failed to establish a new connection: [Errno 61] Connection refused
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.signalpeptide.de', port=443): Max retries exceeded with url: /index.php?sess=&m=listspdb_bacteria&s=details&id=2&listname= (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x1114f0898>: Failed to establish a new connection: [Errno 61] Connection refused'
ConnectionRefusedError: [Errno 61] Connection refused
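One common approach (a sketch, not from the original post; the function name and parameters are my own) is to give requests a Session whose transport adapter uses urllib3's Retry, so failed connections are retried with increasing sleeps instead of failing immediately:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=5, backoff_factor=1.0):
    """Build a requests Session that retries failed requests with backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,  # sleep grows exponentially between retries
        status_forcelist=[429, 500, 502, 503, 504],  # also retry these HTTP statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

# usage in the script above:
#   session = make_session()
#   page = session.get(url).text
```

Note that retries only help if the server is refusing connections temporarily; if it has blocked the client outright, slowing the request rate down (see below) matters more than retrying.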
I also tried using urllib3:
from bs4 import BeautifulSoup
import pandas as pd
import time
import urllib3

http = urllib3.PoolManager()
output = open('signalpeptide.txt', 'a')
for each_page in range(1, 220000):
    if each_page % 1000 == 0:
        time.sleep(5)  # this is because of download limit
    url = 'http://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=' + str(each_page) + '&listname='
    page = http.request('GET', url)
    soup = BeautifulSoup(page.data, 'html.parser')  # .data holds the response body
which gives the error:
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.signalpeptide.de', port=80): Max retries exceeded with url: /index.php?sess=&m=listspdb_bacteria&s=details&id=1&listname= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x11ce6f5f8>: Failed to establish a new connection: [Errno 61] Connection refused'))
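For the urllib3 attempt, a retry policy with backoff can be configured on the PoolManager itself (a sketch; the retry counts are assumptions, not from the original post):

```python
import urllib3
from urllib3.util.retry import Retry

# Default retry policy applied to every request made through this pool:
retry = Retry(total=5, backoff_factor=1.0)
http = urllib3.PoolManager(retries=retry)

# usage in the script above:
#   page = http.request('GET', url)
#   html = page.data
```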
Note: I think if you run this script for the first time it will work; it ran a few times while I was testing/writing it. Once I had it written and knew it worked, it got through roughly the first 400 entries, then threw the error above, and now it won't run at all.
If anyone has any ideas on how to edit this script to get around the maximum number of URL retries, bearing in mind that I am already getting connection-refused errors, I would be very grateful.
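As a stdlib-only alternative (a sketch, under the assumption that the refusals are temporary rate limiting; the function name and parameters are hypothetical), the urlopen call in the script could be wrapped in a retry loop with exponential backoff:

```python
import time
import urllib.request

def fetch_with_backoff(url, max_tries=5, base_delay=1.0, opener=urllib.request.urlopen):
    """Call opener(url), retrying on connection errors with delays of 1s, 2s, 4s, ..."""
    for attempt in range(max_tries):
        try:
            return opener(url)
        except OSError:  # covers URLError and ConnectionRefusedError
            if attempt == max_tries - 1:
                raise  # out of retries: re-raise the last error
            time.sleep(base_delay * 2 ** attempt)

# usage in the script above:
#   page = fetch_with_backoff(url)
```

A small fixed delay between every request (not just every 1000th) would also keep the request rate below whatever limit triggered the block in the first place.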