I am trying to scrape all of the pages on this website. I wrote this code:
from bs4 import BeautifulSoup
import pandas as pd
import time
import urllib.request

output = open('signalpeptide.txt', 'a')
for each_page in range(1, 220000):
    if each_page % 1000 == 0:
        time.sleep(5)  # this is because of download limit
    url = 'http://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=' + str(each_page) + '&listname='
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    tabs = soup.find_all('table')
    pd_list = pd.read_html(str(tabs[0]))
    temp_list = []
    for i in range(22):
        temp_list.append(str(pd_list[0][2][i]).strip())
    output.write(temp_list[1] + '\t' + temp_list[3] + '\t' + temp_list[7] + '\t' + temp_list[15] + '\t')
    pd_list2 = pd.read_html(str(tabs[1]))
    output.write(str(pd_list2[0][0][1]) + '\t' + str(pd_list2[0][2][1]) + '\n')
My connection is being refused because of too many attempts at the URL (I know this because when I run the code with requests instead of urllib.request.urlopen, the error says "Max retries exceeded with url:"):
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.signalpeptide.de', port=80): Max retries exceeded with url: /index.php?sess=&m=listspdb_bacteria&s=details&id=1000&listname= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x11ebc0e48>: Failed to establish a new connection: [Errno 61] Connection refused'))
The other approaches suggested here did not work either, and a user in that thread suggested I write a separate post about this problem.
I have looked into Scrapy, but I don't really understand how to tie it into the script above. Can anyone tell me how to edit the script above so that I avoid errors like:
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x1114f0898>: Failed to establish a new connection: [Errno 61] Connection refused
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.signalpeptide.de', port=443): Max retries exceeded with url: /index.php?sess=&m=listspdb_bacteria&s=details&id=2&listname= (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x1114f0898>: Failed to establish a new connection: [Errno 61] Connection refused'
ConnectionRefusedError: [Errno 61] Connection refused
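One common approach (a sketch, not from the original post; the function name and parameters are my own) is to give requests a Session whose transport adapter uses urllib3's Retry, so failed connections are retried with increasing sleeps instead of failing immediately:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=5, backoff_factor=1.0):
    """Build a requests Session that retries failed requests with backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,  # sleep grows exponentially between retries
        status_forcelist=[429, 500, 502, 503, 504],  # also retry these HTTP statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session

# usage in the script above:
#   session = make_session()
#   page = session.get(url).text
```

Note that retries only help if the server is refusing connections temporarily; if it has blocked the client outright, slowing the request rate down (see below) matters more than retrying.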
I also tried using urllib3:
from bs4 import BeautifulSoup
import pandas as pd
import time
import urllib3

http = urllib3.PoolManager()
output = open('signalpeptide.txt', 'a')
for each_page in range(1, 220000):
    if each_page % 1000 == 0:
        time.sleep(5)  # this is because of download limit
    url = 'http://www.signalpeptide.de/index.php?sess=&m=listspdb_bacteria&s=details&id=' + str(each_page) + '&listname='
    page = http.request('GET', url)
    soup = BeautifulSoup(page.data, 'html.parser')  # .data holds the response body
which gives the error:
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='www.signalpeptide.de', port=80): Max retries exceeded with url: /index.php?sess=&m=listspdb_bacteria&s=details&id=1&listname= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x11ce6f5f8>: Failed to establish a new connection: [Errno 61] Connection refused'))
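For the urllib3 attempt, a retry policy with backoff can be configured on the PoolManager itself (a sketch; the retry counts are assumptions, not from the original post):

```python
import urllib3
from urllib3.util.retry import Retry

# Default retry policy applied to every request made through this pool:
retry = Retry(total=5, backoff_factor=1.0)
http = urllib3.PoolManager(retries=retry)

# usage in the script above:
#   page = http.request('GET', url)
#   html = page.data
```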
Note: I think if you run this script for the first time it will work; it ran a few times while I was testing/writing it. Once I had it written and knew it worked, it got through roughly the first 400 entries, then threw the error above, and now it won't run at all.
If anyone has any ideas on how to edit this script to get around the maximum number of URL retries, bearing in mind that I am already getting connection-refused errors, I would be very grateful.
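As a stdlib-only alternative (a sketch, under the assumption that the refusals are temporary rate limiting; the function name and parameters are hypothetical), the urlopen call in the script could be wrapped in a retry loop with exponential backoff:

```python
import time
import urllib.request

def fetch_with_backoff(url, max_tries=5, base_delay=1.0, opener=urllib.request.urlopen):
    """Call opener(url), retrying on connection errors with delays of 1s, 2s, 4s, ..."""
    for attempt in range(max_tries):
        try:
            return opener(url)
        except OSError:  # covers URLError and ConnectionRefusedError
            if attempt == max_tries - 1:
                raise  # out of retries: re-raise the last error
            time.sleep(base_delay * 2 ** attempt)

# usage in the script above:
#   page = fetch_with_backoff(url)
```

A small fixed delay between every request (not just every 1000th) would also keep the request rate below whatever limit triggered the block in the first place.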