Although the script I wrote works, not every website returns its title (which is what I'm after: fetching a site's title and printing it back). Google works, but this site, StackOverflow, and some others produce errors.
Here is my code:
import urllib2
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen("http://lxml.de"))
print soup.title.string
It would be great if you could help me out with this :)
BTW, here is the error given (exactly as it spat it out):
Traceback (most recent call last):
  File "C:\Users\NAME\Desktop\NETWORK\personal work\PROGRAMMING\Python\bibliography PYTHON\TEMP.py", line 5, in <module>
    soup = BeautifulSoup(urllib2.urlopen("http://stackoverflow.com/questions/36496222/beautiful-soup-4-not-working-consistent"))
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 437, in open
    response = meth(req, response)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
Press any key to continue . . .
Answer 0 (score: 1)
I was able to get this working by specifying a User-Agent header. I have a feeling it has something to do with https vs. http, but I'm afraid I'm not entirely sure of the cause.
import urllib2
from bs4 import BeautifulSoup

site = "https://stackoverflow.com"
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}
req = urllib2.Request(site, headers=hdr)

try:
    soup = BeautifulSoup(urllib2.urlopen(req), "html.parser")
except urllib2.HTTPError, e:
    print e.fp.read()

print soup.title.string
Influenced by this answer to another question.
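For reference, here is a minimal sketch of my own (not part of the answer above), assuming the same urllib2 + BeautifulSoup setup, that wraps the User-Agent trick in a small helper so the title can be fetched for any URL:

import urllib2
from bs4 import BeautifulSoup

def fetch_title(url):
    # Send a browser-like User-Agent so sites that reject the default
    # Python-urllib agent (the 403 above) still respond.
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(url, headers=hdr)
    soup = BeautifulSoup(urllib2.urlopen(req), "html.parser")
    return soup.title.string

print fetch_title("https://stackoverflow.com")
print fetch_title("http://lxml.de")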
Answer 1 (score: 0)
pip install requests
The following code works for me:
import requests
from bs4 import BeautifulSoup
htmlresponse = requests.get("http://lxml.de/")
print htmlresponse.content
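Since the original goal was the title rather than the full page, here is a small follow-up sketch (my addition, assuming requests and bs4 are installed) that feeds the response into BeautifulSoup and prints only the title:

import requests
from bs4 import BeautifulSoup

htmlresponse = requests.get("http://lxml.de/")
# Parse the downloaded HTML and print just the <title> text.
soup = BeautifulSoup(htmlresponse.content, "html.parser")
print soup.title.string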