I am trying to fetch the HTML of the following link:
http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html
To do so, I do the following:
import requests
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
url = 'http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html'
html = requests.get(url)
The HTML I get back (print(html.text)) is as follows:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>410 Gone</title>
</head><body>
<h1>Gone</h1>
<p>The requested resource
<br />/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html<br />
is no longer available on this server and there is no forwarding address.
Please remove all references to this resource.</p>
</body></html>
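I really don't understand why, since the link does exist and has content. In fact, if I open the link in a browser and inspect the page, the HTML there is different from what I get. How can I get the actual text content?
Thanks in advance.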
Answer 0 (score: 1)
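The server seems to be picky about which user agent is accessing the resource. You can set your own user agent with the headers parameter of requests.get():
import requests
url = 'http://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html'
headers = {'User-Agent': 'whatever'}
>>> r = requests.get(url)
>>> r
<Response [410]>
>>> r = requests.get(url, headers=headers)
>>> r
<Response [200]>
The server rejects requests whose User-Agent header contains substrings such as "curl", "python", or "wget". The default user agent sent by requests starts with "python-requests", which is why the plain requests.get(url) call gets the 410 response.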
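To make the cause visible and avoid silently parsing an error page, something like the following sketch may help. It prints the default User-Agent that requests sends, and wraps the fetch in a helper that raises on a 4xx/5xx status. The helper names build_headers and fetch_html and the User-Agent string are made up for illustration; which strings the server actually accepts is an assumption based on the behaviour observed above.

```python
import requests

URL = 'http://www.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html'


def build_headers(user_agent='Mozilla/5.0 (compatible; example-fetcher)'):
    """Return request headers with a custom User-Agent.

    The default value is a hypothetical string chosen only because it
    avoids the substrings the server appears to reject.
    """
    return {'User-Agent': user_agent}


def fetch_html(url, headers=None, timeout=10):
    """Fetch a page and return its HTML, raising on HTTP error statuses."""
    response = requests.get(url, headers=headers or build_headers(), timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses such as the 410 above,
    # so an error page is never mistaken for the real document.
    response.raise_for_status()
    return response.text


# The User-Agent that requests sends when none is supplied; no network needed.
print(requests.utils.default_user_agent())
```

Calling fetch_html(URL) should then return the page's HTML as a string, which can be passed to BeautifulSoup to pull out the text.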