Python: converting HTML entities to Unicode

Time: 2013-10-08 02:22:34

Tags: python unicode

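(Edit: I am using Python 2.7.) (Edit 2: I have already checked "Convert XML/HTML Entities into Unicode String in Python", and the solutions there do not work for me. Please do not mark this as already answered.)

I have been unable to find a Python package that reliably converts text containing HTML entities. HTMLParser works for some inputs but breaks on many others, and BeautifulSoup never seems to convert anything to Unicode. How can I get the Unicode representation of strings a-d using a single method?

I think the problem is that some of my text contains both Unicode characters and HTML entities (such as example string d).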

import HTMLParser
from bs4 import BeautifulSoup

astring = "P&amp;O."
bstring = "&amp; "
cstring = "&gt;"
dstring = "&gt; 150ÎC"

pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup(astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup(bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup(cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup(dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2

This gives the following output:

a1: P&O.
a2: P&O.
b1: & 
b2: & 
c1: >
c2: >
d1: > 150ÎC
d2: HTML Parse Broke!

Edit 3: kalhartt's suggestion led me to the solution. To avoid strings with mixed character encodings, I use .decode('utf-8').
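The fix described in Edit 3 can be sketched as a small helper. This is only an illustration under stated assumptions: the name `entities_to_unicode` is hypothetical, the input bytes are assumed to be UTF-8, and the import fallback is there so the same sketch runs on Python 2.7 and on Python 3.4+. The likely reason string d failed is that `unescape` produces unicode characters for the entities, and Python 2 cannot combine those with a byte string containing non-ASCII bytes, so decoding first avoids the mix.

```python
try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7
    unescape = HTMLParser().unescape


def entities_to_unicode(raw, encoding="utf-8"):
    """Return text with HTML entities replaced by the characters they name.

    Hypothetical helper: byte strings are decoded first (UTF-8 is an
    assumption about the input), so the unescaping step only ever sees text.
    """
    if isinstance(raw, bytes):  # on Python 2, bytes is the plain str type
        raw = raw.decode(encoding)
    return unescape(raw)


# Bytes holding both an entity and a UTF-8 encoded non-ASCII character
result = entities_to_unicode(b"&gt; 150\xc3\x8eC")  # u"> 150\xceC"
```

Decoding once at the boundary means the rest of the program only handles unicode text, which is the same idea as using `u"..."` literals for strings written directly in the source.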

1 answer:

Answer 0 (score: 1)

If you are dealing with Unicode, use unicode strings. With your examples, everything then works as expected.

# -*- coding: utf-8 -*-
import HTMLParser
from bs4 import BeautifulSoup

astring = u"P&amp;O."
bstring = u"&amp; "
cstring = u"&gt;"
dstring = u"&gt; 150ÎC"

pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup('<span>%s</span>' % astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup('<span>%s</span>' % bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup('<span>%s</span>' % cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup('<span>%s</span>' % dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2

This gives the following output:

a1: <span>P&amp;O.</span>
a2: P&O.
b1: <span>&amp; </span>
b2: & 
c1: <span>&gt;</span>
c2: >
d1: <span>&gt; 150ÎC</span>
d2: > 150ÎC

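BeautifulSoup encodes the special characters as entities when it prints the markup, while HTMLParser decodes entities back into characters.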
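The two directions can be illustrated with the standard library alone. This sketch assumes Python 3, where `html.escape` stands in for the escaping BeautifulSoup applies when printing markup and `html.unescape` plays the role of `HTMLParser.unescape`. It is not what BeautifulSoup does internally, only the same encode/decode relationship.

```python
# Python 3 standard library only (an assumption; the thread itself uses 2.7).
from html import escape, unescape

text = "> 150\u00ceC"          # plain text with a special and a non-ASCII character

encoded = escape(text)         # characters -> entities, as when serialising markup
decoded = unescape(encoded)    # entities -> characters, as when reading text back

print(encoded)
print(decoded)
```

Only the characters that are special in HTML are turned into entities. The non-ASCII character passes through unchanged in both directions.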