Question

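The script I wrote works, but not every website returns its title (which is what I'm after: fetch a site's title and print it back). Sites such as Google work, but others, including this one (Stack Overflow), produce an error.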

Here is my code:

    import urllib2
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen("http://lxml.de"))
    print soup.title.string

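It would be great if you could help me with the following :)

Any improvements that could be made to the code (and the handling of variables).
How to fix the cases where it does not return a title (and handle any errors).
When the code does work, it routinely emits a UserWarning saying I should explicitly pass "html.parser" as the parser, but it did not help after I put it in.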

BTW, the error given (exactly as it was printed):

Traceback (most recent call last):
  File "C:\Users\NAME\Desktop\NETWORK\personal work\PROGRAMMING\Python\bibliography PYTHON\TEMP.py", line 5, in <module>
    soup = BeautifulSoup(urllib2.urlopen("http://stackoverflow.com/questions/36496222/beautiful-soup-4-not-working-consistent"))
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 437, in open
    response = meth(req, response)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 550, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 475, in error
    return self._call_chain(*args)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 558, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
Press any key to continue . . .

Answer 1

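I was able to get this working by specifying a User-Agent header. I have a feeling it has something to do with https vs http, but I'm afraid I'm not entirely sure what the cause is.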

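import urllib2
from bs4 import BeautifulSoup

site = "https://stackoverflow.com"
# Browser-like User-Agent header sent with the request.
hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

req = urllib2.Request(site, headers=hdr)

try:
    soup = BeautifulSoup(urllib2.urlopen(req), "html.parser")
except urllib2.HTTPError as e:
    # Print the body of the error page returned by the server.
    print e.fp.read()
else:
    # Only reached when the request succeeded, so soup is defined here.
    print soup.title.string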

Influenced by this answer to another question.
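
For anyone on Python 3, where urllib2 was split into urllib.request and urllib.error, the same idea might look roughly like the sketch below. A 403 like the one in the question often means the server rejects the default Python-urllib User-Agent, though that is an assumption about this particular site rather than something confirmed here. The names USER_AGENT, build_request, title_from_html and fetch_title are made up for illustration, and the shortened User-Agent string is only a placeholder.

```python
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

# Placeholder browser-like value (an assumption); any realistic string may do.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def title_from_html(html):
    """Return the <title> text, or None when the page has no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None


def fetch_title(url, timeout=10):
    """Fetch url and return its title; return None on HTTP or network errors."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return title_from_html(resp.read())
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 403.
        print("HTTP error %d for %s" % (e.code, url))
    except urllib.error.URLError as e:
        # DNS failure, refused connection, and similar network problems.
        print("Could not reach %s: %s" % (url, e.reason))
    return None
```

Calling fetch_title("https://stackoverflow.com") would then either return the title or print the error and return None, instead of crashing with a traceback.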

Answer 2

Try this URL library, requests:

pip install requests

The following code works for me:

import requests
from bs4 import BeautifulSoup
htmlresponse = requests.get("http://lxml.de/")
print htmlresponse.content
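
The snippet above prints the raw page rather than the title. A possible way to go one step further and get the title with requests is sketched below in Python 3. The get_title helper, its session parameter and the HEADERS value are hypothetical names for illustration; whether a given site needs the extra User-Agent header is an assumption.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder browser-like header (an assumption); it may not be needed.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def get_title(url, session=None, timeout=10):
    """Return the page title, or None when the page has no <title>."""
    # Use the given session if there is one, else the requests module itself.
    http = session or requests
    response = http.get(url, headers=HEADERS, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx answers such as 403.
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.title.string if soup.title else None
```

Usage would be something like print(get_title("http://lxml.de/")), wrapped in try/except requests.RequestException if failures should not stop the script.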

Beautiful Soup 4 not working/consistent
