Question

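I'm trying to scrape the following page with Python 3, but I keep getting HTTP Error 400: Bad Request. I've seen earlier answers that suggest urllib.quote, but that is Python 2 and doesn't work for me. I also tried the following code, suggested in another post, and it still fails.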

import urllib.request
from requests.utils import requote_uri

url = requote_uri('http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01')
with urllib.request.urlopen(url) as response:
    html = response.read()

Answer 1

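The server rejects requests whose User-Agent HTTP header does not look like a browser's. By default, urllib identifies itself as Python-urllib/3.x, so the URL quoting is not the problem here.

Just pick a browser's User-Agent string and set it as a header on the request: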

import urllib.request

url = 'http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01'
# Any current browser User-Agent string should do; this one is from Firefox on Ubuntu.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"
}

# Wrap the URL in a Request so the custom headers are sent with it.
request = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(request) as response:
    html = response.read()
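
If you want to confirm what is actually being sent without hitting the server, something like the following sketch may help. It only inspects urllib's own objects and makes no network call; the variable names (default_user_agent, browser_user_agent, overridden) are just illustrative, not part of any API.

```python
import urllib.request

# The default opener carries the headers urllib adds when you set none yourself.
default_headers = dict(urllib.request.build_opener().addheaders)
default_user_agent = default_headers["User-agent"]

url = 'http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01'
browser_user_agent = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) "
    "Gecko/20100101 Firefox/61.0"
)

# Building a Request does not send anything; it only stores the URL and headers.
request = urllib.request.Request(url, headers={"User-Agent": browser_user_agent})

# Request stores header names via str.capitalize(), so the lookup key is "User-agent".
overridden = request.get_header("User-agent")

print("default:   ", default_user_agent)
print("overridden:", overridden)
```

The first line printed should start with Python-urllib/, which is presumably what this server is rejecting; the second should be the browser string you supplied.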
