UnicodeEncodeError when reading tags from HTML data

Time: 2016-03-04 12:08:04

Tags: python html python-3.x beautifulsoup

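I am trying to read an HTML file from disk, and I am using BeautifulSoup to process its tags. My file can be downloaded here. When I try to print a <p> tag, I get the Unicode error below. I cannot even store the <p> tags in another file: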

Traceback (most recent call last):
  File "C:\Users\admin\Desktop\HTMLdownload\HTMLdownload\src\Extract images and caption.py", line 61, in <module>
    print(img_data)
  Traceback (most recent call last):
  File "C:\Users\admin\Desktop\HTMLdownload\HTMLdownload\src\Extract images and caption.py", line 57, in <module>
    print(cap_data)
  File "c:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2212' in position 330: character maps to <undefined>

Here is my code:

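from bs4 import BeautifulSoup

loc = "..."  # some location on disk (path elided in the original)
soup = BeautifulSoup(open(loc, 'r'), "html.parser")
# Each figure is a <dl class="figure"> holding an image and its caption
fig_data = soup.select("dl.figure")
for i in fig_data:
    img_data = i.select("img.figure")
    print(img_data)
    cap_data = i.select(".caption p")
    print(cap_data)  # this print raises the UnicodeEncodeError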

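In this code I am trying to get the image tags and their respective captions. From those image tags I will then extract the image links.

To work around the problem I have tried encoding to utf-8 and other options such as repr(cap_data), but I still get this error with Python 3.

The problematic text is: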

<p id="">Weight change of <em>A. caliginosa</em> in pots containing Springmount soil kept at 15<sup>&deg;</sup>C (&plusmn;1&deg;C) for 10 weeks. Vertical bars represent standard errors of the means, <em>n</em>=10. Bars with same letters are not significantly different at the 5% level. C6=<em>A. caliginosa</em> at a density of 6 worms pot<sup>&minus;1</sup>. C12=<em>A. caliginosa</em> at a density of 12 worms pot<sup>&minus;1</sup>. CL=<em>A. caliginosa</em> and <em>A. longa</em> at a density of 6 worms pot<sup>&minus;1</sup> each.</p>
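For reference, the character named in the traceback can be traced back to this caption with the standard library alone. The following is a minimal sketch, not part of my script: the shortened `fragment` string and the helper `can_encode` are made up for illustration. It suggests that the `&minus;` entity decodes to U+2212, which cp1252 cannot represent, while `&deg;` and `&plusmn;` are fine.

```python
import html

# Shortened copy of the caption; &minus; is the entity of interest.
fragment = "6 worms pot<sup>&minus;1</sup>"
decoded = html.unescape(fragment)  # &minus; becomes U+2212 (MINUS SIGN)


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 is the codec named in the traceback (the Windows console default here).
print("U+2212 present:", "\u2212" in decoded)
print("cp1252 can encode it:", can_encode(decoded, "cp1252"))
print("utf-8 can encode it:", can_encode(decoded, "utf-8"))
print("cp1252 handles &deg; and &plusmn;:",
      can_encode(html.unescape("&deg;&plusmn;"), "cp1252"))
```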

1 Answer:

Answer 0: (score: 0)

As the traceback shows, you should use repr(img_data). See also the other questions on this topic. Alas, there is no canonical target for them so far.
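A rough sketch of what escaping before printing can look like, using only the standard library. One caveat, stated as my understanding rather than something the question confirms: in Python 3, repr() leaves printable non-ASCII characters such as U+2212 unescaped, so the builtin that actually escapes them is ascii(). The `caption` string and the output file name below are made up for illustration; "cp1252" is taken from the traceback.

```python
import os
import tempfile

# Stand-in for the caption text; contains the same U+2212 as the traceback.
caption = "6 worms pot\u22121"

# Option 1: ascii() escapes every non-ASCII character, so the result is
# safe to print on any console.
safe = ascii(caption)
print(safe)

# Option 2: encode for the console codec yourself and escape whatever
# does not fit instead of raising UnicodeEncodeError.
console_bytes = caption.encode("cp1252", errors="backslashreplace")
print(console_bytes.decode("cp1252"))

# Option 3: when storing the text, pass an explicit encoding to open()
# rather than relying on the platform default.
out_path = os.path.join(tempfile.mkdtemp(), "captions.txt")  # hypothetical file
with open(out_path, "w", encoding="utf-8") as fh:
    fh.write(caption + "\n")
with open(out_path, encoding="utf-8") as fh:
    round_trip = fh.read().strip()
```

With the code from the question, the first option would presumably become print(ascii(cap_data)), and passing encoding="utf-8" to open() should likewise cover both reading the HTML file and writing the <p> tags to another file, assuming the source file is in fact UTF-8.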