Error when printing a scraped web page with bs4

Date: 2015-01-07 10:24:24

Tags: python python-3.x web-scraping beautifulsoup web-crawler

Code:

import requests
import urllib
from bs4 import BeautifulSoup

page1 = urllib.request.urlopen("http://en.wikipedia.org/wiki/List_of_human_stampedes")
soup = BeautifulSoup(page1)
print(soup.get_text())
print(soup.prettify())

Error:

 Traceback (most recent call last):
  File "C:\Users\sony\Desktop\Trash\Crawler Try\try2.py", line 9, in <module>
    print(soup.get_text())
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u014d' in position 10487: character maps to <undefined>

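I think the problem lies mainly with the urllib package. I am using the urllib3 package here. The urlopen syntax changed between versions 2 and 3, which may be the cause of the error. That said, I have used only the latest syntax. Python version: 3.4.

3 Answers:

Answer 0 (score: 2)

Since you are already importing requests, you can use it instead of urllib, like this: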

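import requests
from bs4 import BeautifulSoup

page1 = requests.get("http://en.wikipedia.org/wiki/List_of_human_stampedes")
# page1.text is the response body already decoded to str by requests.
# Naming the parser explicitly avoids depending on whichever one is installed.
soup = BeautifulSoup(page1.text, "html.parser")
print(soup.get_text())
print(soup.prettify())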

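Your problem is that Python cannot encode some of the characters in the page you are scraping when it prints them: the traceback shows print() trying to encode '\u014d' with the cp1252 codec. For more information, see: https://stackoverflow.com/a/16347188/2638310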
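The failure can be reproduced without any network access, which suggests it is independent of urllib and BeautifulSoup. The following is a minimal sketch, assuming the console encoding is cp1252 as the traceback indicates; the sample string is made up for illustration.

```python
# Hypothetical sample text containing U+014D, the character named in the traceback.
text = "T\u014dky\u014d"

# cp1252 has no mapping for U+014D, so strict encoding raises the same
# UnicodeEncodeError that print() hits on a cp1252 console.
try:
    text.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Error handlers trade fidelity for robustness.
replaced = text.encode("cp1252", errors="replace")           # '?' per bad character
escaped = text.encode("cp1252", errors="backslashreplace")   # \uXXXX escapes

# UTF-8 can represent every Unicode character, so this never fails.
utf8_bytes = text.encode("utf-8")

print(strict_ok, replaced, escaped, utf8_bytes)
```

If this reading is right, the fix belongs on the output side (the console or stdout encoding) rather than in the code that downloads or parses the page.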

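Since the Wikipedia page is UTF-8, it may be that BeautifulSoup is guessing the encoding incorrectly. Try passing the from_encoding argument. Note that it only applies to byte input (it is ignored for an already decoded str), so pass page1.content rather than page1.text:

soup = BeautifulSoup(page1.content, from_encoding="UTF-8")

For more on encodings in BeautifulSoup, see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings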

Answer 1 (score: 0)

I am using Python 2.7, so my urllib module has no request submodule.

#!/usr/bin/python3
# coding: utf-8

import requests
from bs4 import BeautifulSoup

URL = "http://en.wikipedia.org/wiki/List_of_human_stampedes"
soup = BeautifulSoup(requests.get(URL).text)
print(soup.get_text())
print(soup.prettify())

https://www.python.org/dev/peps/pep-0263/

Answer 2 (score: 0)

Put the print lines in a try/except block so that characters the console cannot encode no longer abort the script.

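try:
    print(soup.get_text())
    print(soup.prettify())
except UnicodeEncodeError:
    # Fall back to the UTF-8 bytes; print() shows them as a b'...' literal
    # with non-ASCII bytes escaped, which any console can display.
    print(soup.get_text().encode("utf-8"))
    print(soup.prettify().encode("utf-8"))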
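
A variation on the same idea keeps the output as readable text instead of a bytes literal. This is only a sketch using the standard library; `safe_print` is a hypothetical helper name, not part of BeautifulSoup or requests, and it assumes the target stream reports its encoding through an `encoding` attribute, as `sys.stdout` normally does.

```python
import sys


def safe_print(text, stream=None):
    """Print text, escaping characters the stream's encoding cannot represent."""
    stream = stream if stream is not None else sys.stdout
    # Fall back to UTF-8 if the stream does not report an encoding.
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the target encoding so unencodable characters
    # become \uXXXX escapes instead of raising UnicodeEncodeError.
    printable = text.encode(encoding, errors="backslashreplace").decode(encoding)
    print(printable, file=stream)
    return printable


if __name__ == "__main__":
    # Hypothetical sample; in the question this would be soup.get_text().
    safe_print("T\u014dky\u014d")
```

On Python 3.7 and later, `sys.stdout.reconfigure(errors="backslashreplace")` has a similar effect for every print call, and setting the `PYTHONIOENCODING` environment variable (for example to `utf-8`) changes the encoding Python uses for standard output.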