Beautiful Soup 4 not working consistently

Time: 2016-04-08 09:34:16

Tags: beautifulsoup code-formatting

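The script I wrote works, but not every website returns its title (which is what I am after: fetching a site's title and printing it back). Google works, but other sites, including this one (Stack Overflow), produce an error.

Here is my code: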

    import urllib2
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(urllib2.urlopen("http://lxml.de"))
    print soup.title.string

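It would be great if you could help me with the following :)

  1. Any improvements that could be made to the code (and to how it handles variables).
  2. How to fix the cases where no title is returned (and how to handle any errors).
  3. The code routinely emits a UserWarning (when it actually works) saying I should add a specific "html.parser" after the script, but it did not work after I put that in.
  4. BTW, the error given (exactly as it was printed):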

    Traceback (most recent call last):
      File "C:\Users\NAME\Desktop\NETWORK\personal work\PROGRAMMING\Python\bibli
    ography PYTHON\TEMP.py", line 5, in <module>
        soup = BeautifulSoup(urllib2.urlopen("http://stackoverflow.com/questions/364
    96222/beautiful-soup-4-not-working-consistent"))
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 154, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 437, in open
        response = meth(req, response)
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 550, in http_resp
    onse
        'http', request, response, code, msg, hdrs)
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 475, in error
        return self._call_chain(*args)
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 409, in _call_cha
    in
        result = func(*args)
      File "C:\Program Files (x86)\PYTHON 27\lib\urllib2.py", line 558, in http_erro
    r_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 403: Forbidden
    Press any key to continue . . .
    

2 answers:

Answer 0: (score: 1)

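I was able to get this to work by specifying a User-Agent header. I have a feeling it has something to do with https vs. http, but I'm afraid I'm not entirely sure what the cause is.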

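    import urllib2
    from bs4 import BeautifulSoup

    site = "https://stackoverflow.com"
    # Send a browser-like User-Agent header with the request
    hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'}

    req = urllib2.Request(site, headers=hdr)

    try:
        # Name the parser explicitly to avoid the UserWarning
        soup = BeautifulSoup(urllib2.urlopen(req), "html.parser")
    except urllib2.HTTPError as e:
        # Print the body of the error response
        print e.fp.read()
    else:
        # Only reached when the request succeeded, so soup is defined
        print soup.title.string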

Influenced by this answer to another question.
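
For readers on Python 3, where `urllib2` was split into `urllib.request` and `urllib.error`, the same idea can be sketched roughly as follows. The helper names `build_request` and `fetch_html` and the shortened User-Agent string are assumptions made for illustration; they are not part of the original answer.

```python
import urllib.error
import urllib.request

# Hypothetical browser-like value; the exact string is an assumption.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def build_request(url):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 403.
        # HTTPError must be caught first because it subclasses URLError.
        print("HTTP error %d: %s" % (exc.code, exc.reason))
    except urllib.error.URLError as exc:
        # The server could not be reached at all (DNS failure, refused connection).
        print("Could not reach server: %s" % exc.reason)
    return None
```

The bytes returned by `fetch_html` can then be passed to `BeautifulSoup(html, "html.parser")`; a `None` result means the request failed and there is nothing to parse.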

Answer 1: (score: 0)

Try this URL library (requests):

    pip install requests

The following code works for me:

    import requests
    from bs4 import BeautifulSoup
    htmlresponse = requests.get("http://lxml.de/")
    print htmlresponse.content
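
To get from the raw response to the title the question asks for, the downloaded content can be handed to BeautifulSoup. Below is a minimal Python 3 sketch, assuming `requests` and `beautifulsoup4` are installed; `title_from_html`, `fetch_title` and the header value are hypothetical names chosen for illustration. Passing `"html.parser"` as the second argument to `BeautifulSoup` is what addresses the parser warning mentioned in the question.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical header value; any browser-like string is an assumption here.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; title-fetcher-example)"}


def title_from_html(html):
    """Return the <title> text of an HTML document, or None if it has none."""
    # Naming the parser explicitly avoids the "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None or soup.title.string is None:
        return None
    return soup.title.string.strip()


def fetch_title(url):
    """Download url and return its title, or None if the request fails."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
        response.raise_for_status()  # turn 4xx/5xx replies into exceptions
    except requests.RequestException as exc:
        print("Request failed: %s" % exc)
        return None
    return title_from_html(response.content)
```

Calling `fetch_title("http://lxml.de/")` would then return the page title as a string, or `None` when the request fails or the page has no title, instead of raising an `AttributeError` on `soup.title.string`.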