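I am trying to read an HTML file from disk, and I am using BeautifulSoup to process its tags. My file can be downloaded here. When I try to print the <p> tags, I get the Unicode error below. I cannot even store the <p> tags in another file: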
Traceback (most recent call last):
  File "C:\Users\admin\Desktop\HTMLdownload\HTMLdownload\src\Extract images and caption.py", line 61, in <module>
    print(img_data)
Traceback (most recent call last):
  File "C:\Users\admin\Desktop\HTMLdownload\HTMLdownload\src\Extract images and caption.py", line 57, in <module>
    print(cap_data)
  File "c:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2212' in position 330: character maps to <undefined>
Here is my code:
from bs4 import BeautifulSoup

loc = "..."  # some location on disk
soup = BeautifulSoup(open(loc, 'r'), "html.parser")
fig_data = soup.select("dl.figure")
for i in fig_data:
    img_data = i.select("img.figure")
    print(img_data)
    cap_data = i.select(".caption p")
    print(cap_data)  # this print raises the UnicodeEncodeError shown above
In this code I am trying to get the image tags and their respective captions. From these image tags I will then get the image links.
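To make the goal concrete, here is a minimal sketch of the extraction I have in mind. The `SAMPLE` markup and the `extract_figures` helper are hypothetical stand-ins (the real file's structure may differ), and I am assuming the image link lives in the `src` attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real file's structure may differ.
SAMPLE = """
<dl class="figure">
  <dt><img class="figure" src="fig1.jpg" alt="Fig. 1"></dt>
  <dd><div class="caption"><p>Weight change of <em>A. caliginosa</em></p></div></dd>
</dl>
"""

def extract_figures(html):
    """Return a list of (image link, caption text) pairs, one per dl.figure."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for fig in soup.select("dl.figure"):
        img = fig.select_one("img.figure")
        cap = fig.select_one(".caption p")
        # Assumption: the link is stored in the img tag's src attribute.
        src = img.get("src") if img is not None else None
        caption = cap.get_text() if cap is not None else None
        pairs.append((src, caption))
    return pairs

figures = extract_figures(SAMPLE)
print(figures)
```

Using `select_one` instead of `select` returns a single tag rather than a list, which makes it easier to read the attribute and the caption text directly.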
To solve this, I have tried encoding to utf-8 and other options such as repr(cap_data), but I still get this error with Python 3.
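For context, this is roughly the direction I think a workaround would take. It is only a sketch under assumptions: `safe_for_console` and `save_utf8` are hypothetical helper names, the caption string is a shortened stand-in, and I am assuming the console encoding is cp1252, as the traceback suggests:

```python
import os
import sys
import tempfile

def safe_for_console(text, encoding="cp1252"):
    """Swap characters the encoding cannot represent for backslash escapes."""
    return text.encode(encoding, errors="backslashreplace").decode(encoding)

def save_utf8(path, text):
    """Write text as UTF-8 instead of relying on the platform default encoding."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)

# Shortened stand-in for the caption; U+2212 is the character from the traceback.
caption = "15\u00b0C, 6 worms pot\u2212" + "1"

# What the text would look like once made safe for a cp1252 console.
printable = safe_for_console(caption, "cp1252")

# Printing against whatever encoding the current console actually uses.
print(safe_for_console(caption, sys.stdout.encoding or "utf-8"))

# Storing the caption in a file with an explicit UTF-8 encoding.
out_path = os.path.join(tempfile.mkdtemp(), "captions.txt")
save_utf8(out_path, caption)
```

The idea is that `print` and a plain `open(path, "w")` both fall back to the console or locale encoding on Windows, so escaping for the console and naming the file encoding explicitly might avoid the error.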
The problematic text is:
<p id="">Weight change of <em>A. caliginosa</em> in pots containing Springmount soil kept at 15<sup>°</sup>C (±1°C) for 10 weeks. Vertical bars represent standard errors of the means, <em>n</em>=10. Bars with same letters are not significantly different at the 5% level. C6=<em>A. caliginosa</em> at a density of 6 worms pot<sup>−1</sup>. C12=<em>A. caliginosa</em> at a density of 12 worms pot<sup>−1</sup>. CL=<em>A. caliginosa</em> and <em>A. longa</em> at a density of 6 worms pot<sup>−1</sup> each.</p>
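The character named in the traceback, '\u2212' (MINUS SIGN), appears in the <sup>−1</sup> parts of this text. A minimal sketch that seems to reproduce the failure without the HTML file, assuming the console encoding is cp1252 as in the traceback (`try_encode` is a hypothetical helper):

```python
# U+2212 (MINUS SIGN) followed by "1", as in the <sup> elements of the caption.
text = "pot\u2212" + "1"

def try_encode(s, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return s.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)

# cp1252 has no mapping for U+2212, so this yields the error message.
cp1252_result = try_encode(text, "cp1252")
# UTF-8 can represent every Unicode character, so this yields bytes.
utf8_result = try_encode(text, "utf-8")

print(cp1252_result)
print(utf8_result)
```

So the failure does not seem to come from BeautifulSoup itself but from encoding the output for a cp1252 console.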