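I am trying to scrape a table using urllib and BeautifulSoup, and I get this error:
"urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Found"
I have read that this is related to sites that require cookies, but I still get the error after my second attempt: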
import urllib.request
from bs4 import BeautifulSoup
import re
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
file = opener.open(testURL).read().decode()
soup = BeautifulSoup(file)
tables = soup.find_all('tr',{'style': re.compile("color:#4A3C8C")})
print(tables)
Answer 0 (score: 1)
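A few suggestions:
Use HTTPCookieProcessor if the site requires cookies.
If you set a custom user agent, send a full browser string. This site appears not to accept the bare 'Mozilla/5.0' and will keep redirecting.
Catch HTTPError to handle exceptions like this one.
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())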
user_agent = 'Mozilla/5.0 (Windows NT 6.1; rv:54.0) Gecko/20100101 Firefox/54.0'
opener.addheaders = [('user-agent', user_agent)]
try:
    response = opener.open(testURL)
except urllib.error.HTTPError as e:
    print(e)
except Exception as e:
    print(e)
else:
    file = response.read().decode()
    soup = BeautifulSoup(file, 'html.parser')
    ... etc ...
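For reference, the three suggestions can be bundled into two small helpers. This is only a sketch under assumptions: the names make_cookie_opener, fetch_text and FULL_USER_AGENT are made up for illustration, the 10-second timeout is arbitrary, and the user-agent string is just the one from the answer. Whether a given site accepts it is something you would have to check.

```python
import urllib.error
import urllib.request

# Assumed example value: the full Firefox string used in the answer above.
FULL_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 6.1; rv:54.0) Gecko/20100101 Firefox/54.0"
)


def make_cookie_opener(user_agent=FULL_USER_AGENT):
    """Build an opener that keeps cookies across redirects (hypothetical helper)."""
    # HTTPCookieProcessor stores cookies set by a response and sends them
    # back on the follow-up request, which is what cookie-gated redirects need.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def fetch_text(opener, url, timeout=10):
    """Return the decoded body, or None if the request fails (hypothetical helper)."""
    try:
        with opener.open(url, timeout=timeout) as response:
            # Fall back to UTF-8 when the server does not declare a charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as e:
        # Raised for HTTP error statuses, including the 302 redirect loop.
        print(f"HTTP error {e.code}: {e.reason}")
    except urllib.error.URLError as e:
        # Raised for connection-level failures (DNS, refused connection, ...).
        print(f"Could not reach server: {e.reason}")
    return None
```

With these in place, the page would be fetched with html = fetch_text(make_cookie_opener(), testURL) and, if the result is not None, passed on as BeautifulSoup(html, 'html.parser').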