Max retries exceeded with url

Asked: 2019-10-09 07:36:17

Tags: web-crawler

Whenever I try to access the site, the following error is shown:

ConnectionError: HTTPSConnectionPool(host='blogs.deloitte.com', port=443): Max retries exceeded with url: /centerforhealthsolutions/category/pharma/ (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x0000020FE5A43278>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

I have already tried adding a delay between requests using timestamps.
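A fixed delay alone usually does not help with WinError 10060, which means the TCP connection itself could not be established. If the host is only intermittently reachable, configuring automatic retries with exponential backoff on a requests session is a more robust approach. The sketch below is a minimal example; the retry count, backoff factor, and status codes are assumed values, not taken from the original post:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Session that retries failed connections with exponential backoff
session = requests.Session()
retries = Retry(
    total=5,                # up to 5 attempts in total (assumed value)
    backoff_factor=1,       # sleep 1s, 2s, 4s, ... between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # also retry on these codes
)
session.mount('https://', HTTPAdapter(max_retries=retries))
session.mount('http://', HTTPAdapter(max_retries=retries))

# A timeout avoids hanging indefinitely on an unreachable host
page = session.get('https://www.raps.org/news-and-articles?rss=Regulatory-Focus',
                   timeout=30)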

Code

import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

# Lists for the feed links and titles
url1 = []
title = []

# verify=False skips TLS certificate verification
page = requests.get('https://www.raps.org/news-and-articles?rss=Regulatory-Focus',
                    verify=False)
soup = BeautifulSoup(page.content, 'html.parser')

# Extract every URL from the feed's <channel> element with a regex
for data in soup.find_all('channel'):
    url1 = re.findall(
        'http[s]?://(?:[a-zA-Z]|[0-9]|[$-@.&+]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
        data.text)
print(url1)

# Collect the text of every <title> in the feed
for data in soup.find_all('title'):
    title.append(data.text)

# Build a DataFrame with one column per list
df = pd.DataFrame()
df['Href_link'] = url1
df['Title'] = title
print(df)

# Export the result to Excel
df.to_excel('C:/Users/Chetna/Documents/Raps_regulatory_focus.xlsx')
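As a side note, html.parser treats <link> as an empty HTML element, so the text inside RSS <link> tags is lost, which is presumably why the code falls back to a regex for the URLs. If lxml is installed, parsing the feed as XML lets you read the links directly; a rough sketch under that assumption (the row layout is my own choice, not from the original code):

import requests
import pandas as pd
from bs4 import BeautifulSoup

page = requests.get('https://www.raps.org/news-and-articles?rss=Regulatory-Focus',
                    verify=False)
# 'xml' selects lxml's XML parser, which keeps the text inside <link> tags
soup = BeautifulSoup(page.content, 'xml')

# One row per feed item, pairing each link with its title
rows = [{'Href_link': item.link.text, 'Title': item.title.text}
        for item in soup.find_all('item')]
df = pd.DataFrame(rows)
print(df)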

It runs fine outside the corporate network.
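Since it works outside the corporate network, the blocked connection is most likely the company firewall or proxy. If the network requires an outbound proxy, requests can be pointed at it via the proxies argument (or the HTTP_PROXY/HTTPS_PROXY environment variables). The proxy address below is a placeholder, not a real host:

import requests

# Placeholder proxy address; substitute your company's actual proxy host/port
proxies = {
    'http': 'http://proxy.mycompany.com:8080',
    'https': 'http://proxy.mycompany.com:8080',
}

page = requests.get('https://www.raps.org/news-and-articles?rss=Regulatory-Focus',
                    proxies=proxies, verify=False, timeout=30)
print(page.status_code)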

0 Answers