Error 410 ("resource no longer available") when fetching the HTML of a URL in Python

Time: 2018-03-28 09:10:47

Tags: python url html-parsing

I am trying to get the HTML of the following link:

http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html

To do this, I run the following:

import requests

# Fall back to bs4 when the legacy BeautifulSoup 3 package is not installed
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

url = 'http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html'
html = requests.get(url)

The HTML I get back (print(html.text)) is:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head> 
<title>410 Gone</title>
</head><body>
<h1>Gone</h1>
<p>The requested resource
<br />/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html<br />
is no longer available on this server and there is no forwarding address.
Please remove all references to this resource.</p>
</body></html>

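I really don't understand why, since the link does exist and has content. In fact, if I open the link in a browser and inspect the HTML there, it is different from what I get. How can I get the actual text content?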

Thanks in advance.

1 answer:

Answer 0: (score: 1)

The server seems to be picky about which user agent is accessing the resource. You can set your own user agent with the headers parameter of requests.get():

>>> import requests
>>> url = 'http://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html'
>>> headers = {'User-Agent': 'whatever'}
>>> r = requests.get(url)
>>> r
<Response [410]>
>>> r = requests.get(url, headers=headers)
>>> r
<Response [200]>

The server rejects requests whose User-Agent header contains substrings such as "curl", "python", "wget", and so on.
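
If requests is not available, the same idea can be sketched with only the standard library. This is a minimal sketch, not a tested fix for this particular server: the names build_request and fetch_html and the User-Agent string are made up for illustration. urllib's default user agent also contains "Python", so it would presumably be rejected in the same way.

```python
import urllib.request

URL = "http://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html"

# Assumption: any value without "python", "curl" or "wget" should do;
# this browser-like string is only an illustration.
USER_AGENT = "Mozilla/5.0 (compatible; example-fetcher)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=USER_AGENT, timeout=10):
    """Download the page and return its text (this performs a network call)."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html(URL) would then return the page text. Unlike requests.get(), urlopen() raises urllib.error.HTTPError on a 410 instead of returning a response object, so a rejected request shows up as an exception.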