Question

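I am trying to scrape a web page with BeautifulSoup. It worked fine at first, but when I ran the same code again, an error occurred.

I then switched to pd.read_html instead of BeautifulSoup, but the same connection error still occurs occasionally.

The code I tried: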

import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

link = 'https://www.twse.com.tw/block/BFIAUU?response=html&date=20190702&selectType=S'
f = urllib.request.urlopen(link)
soup = BeautifulSoup(f, 'html.parser')
pf = pd.read_html(link)[0]

Error message:

[Errno 10061] No connection could be made because the target machine actively refused it

Answer 1

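If the site you are trying to reach is not among the sites your network connection is allowed to access, the connection will be refused. In that case you can still reach it through a VPN.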
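Error 10061 means the TCP connection itself was refused, before any HTTP request was sent. As a rough way to tell whether the refusal comes from the network path rather than from the scraping code, here is a minimal sketch using only the standard library. The helper name can_connect and the default port and timeout are assumptions made for illustration, not part of the original answer.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection raises an OSError subclass on failure,
        # e.g. ConnectionRefusedError for errno 10061 on Windows.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"connection to {host}:{port} failed: {exc}")
        return False
```

Calling can_connect("www.twse.com.tw") from the same machine that runs the scraper shows whether port 443 is reachable at that moment. If it returns False while the site opens fine from another network, the block is probably on the network side rather than in the Python code.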

Use requests instead of urllib. Install it with pip install requests.

import requests
from bs4 import BeautifulSoup as bs

# Send a browser-like User-Agent header with the request.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
link = 'https://www.twse.com.tw/block/BFIAUU?response=html&date=20190702&selectType=S'
f = requests.get(link, headers=headers)
soup = bs(f.text, 'html.parser')
# Collect the text of every header cell and every data cell on the page.
th = [i.text.strip() for i in soup.find_all('th')]
td = [i.text for i in soup.find_all('td')]
print(th, td)

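Your pandas code is fine; just don't combine it with urllib. If you still get the same error, add a delay between successive requests to this page using the sleep function from Python's time module.

For example: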

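import time
import pandas as pd

# Note: this loop runs until interrupted (Ctrl+C).
while True:
    link = 'https://www.twse.com.tw/block/BFIAUU?response=html&date=20190702&selectType=S'
    # read_html returns a list of DataFrames; keep at most the first ten tables.
    pf = pd.read_html(link)[0:10]
    print(pf)
    time.sleep(1)  # wait 1 second before the next request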
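A fixed one-second pause may not be enough if the server keeps refusing connections for a while. One common alternative is to retry a failed request with a growing delay. The following is a minimal sketch of that idea, assuming the refusal is transient. The helper name call_with_retry, the number of attempts, and the delays are illustrative assumptions and do not come from the original answer.

```python
import time


def call_with_retry(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure.

    fetch is a zero-argument callable that performs the request.
    Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as exc:
            # urllib's URLError, requests' exceptions and
            # ConnectionRefusedError all derive from OSError.
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the error
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

With pandas imported as pd and link defined as in the question, the call from the question could be wrapped as call_with_retry(lambda: pd.read_html(link)[0]). The sleep parameter exists only so the waiting can be replaced in tests; in normal use the default time.sleep applies.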

Occasional connection error (10061) when web scraping
