Question

I am trying to parse the HTML of a website. I wrote this code:

import urllib.request

def parse(url):
    response = urllib.request.urlopen(url)
    html = response.read()
    strHTML = html.decode()
    return strHTML

website = "http://www.manarat.ac.bd/"
string = parse(website)

But it raises this error:

Traceback (most recent call last):
  File "C:\Users\pupewekate\Videos\RAW\2.py", line 11, in <module>
    string = parse(website)
  File "C:\Users\pupewekate\Videos\RAW\2.py", line 5, in parse
    response = urllib.request.urlopen(url)
  File "C:\Users\pupewekate\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\pupewekate\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 532, in open
    response = meth(req, response)
  File "C:\Users\pupewekate\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Users\pupewekate\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 570, in error
    return self._call_chain(*args)
  File "C:\Users\pupewekate\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Users\pupewekate\AppData\Local\Programs\Python\Python36-32\lib\urllib\request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 412: Precondition Failed

Is there any solution?

Answer 1

This website checks the User-Agent header. If it does not recognize the value, it returns status code 412:

import requests

print(requests.get('http://www.manarat.ac.bd/'))
# <Response [412]>

print(requests.get('http://www.manarat.ac.bd/', headers={'User-Agent': 'Chrome'}))
# <Response [200]>

For how to set the User-Agent in urllib, see this answer.
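The same check can be done with urllib alone. The following is a minimal sketch, not taken from the linked answer: fetch_status and its opener parameter are hypothetical names introduced here, and the 'Chrome' value is simply the one used in the requests example above. urllib raises urllib.error.HTTPError for a 412 response, so the status code is read from the exception.

```python
import urllib.error
import urllib.request


def fetch_status(url, user_agent=None, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, optionally sending a User-Agent.

    `opener` is a hypothetical hook (defaults to urlopen) so the function
    can be exercised without touching the network.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    # urlopen() has no headers argument; headers are attached to a Request.
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is on the exception.
        return err.code
```

Calling fetch_status('http://www.manarat.ac.bd/') and then fetch_status('http://www.manarat.ac.bd/', user_agent='Chrome') should show whether the header is what changes the response, assuming the site still behaves as described above.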

Answer 2

You can use the requests module, since it is easier to work with. Otherwise, if you decide to stay with urllib, you can use:

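import urllib.request

def parse(url):
    # Send a browser-like User-Agent so the site does not answer with 412.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
    # urlopen() takes no headers argument; attach them to a Request instead.
    request = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(request)
    return response.read().decode()

website = "http://www.manarat.ac.bd/"
string = parse(website)
print(string)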
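If every request in a script should carry the header, another option is to give the opener a default User-Agent instead of passing headers on each call. This is a sketch under assumptions: make_opener is a hypothetical helper name, and the 'Chrome' value is only borrowed from Answer 1; which values the site accepts is not known beyond that.

```python
import urllib.request


def make_opener(user_agent):
    """Build an opener whose requests carry the given User-Agent by default."""
    opener = urllib.request.build_opener()
    # addheaders replaces the default "Python-urllib/x.y" User-Agent entry.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener
```

After urllib.request.install_opener(make_opener('Chrome')), plain urllib.request.urlopen(url) calls use that opener, so the parse function from the question would not need to change.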

Python 3 urllib HTTP Error 412: Precondition Failed
