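(Edit: I am using Python 2.7.) (Edit 2: I have already checked "Convert XML/HTML Entities into Unicode String in Python", and the solutions there do not work for me. Please do not mark this as already answered.)
I have been unable to find a Python package that reliably converts text containing HTML entities. HTMLParser works for some inputs but breaks on many others, and BeautifulSoup never seems to convert to unicode. How can I use a single method to return the unicode representation of strings a through d?
I think the problem is that some of my text contains both unicode characters and HTML entities (like example string d).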
import HTMLParser
from bs4 import BeautifulSoup
astring = "P&amp;O."
bstring = "&amp; "
cstring = "&gt;"
dstring = "&gt; 150ÎC"
pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup(astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup(bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup(cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup(dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2
This gives the following output:
a1: P&amp;O.
a2: P&O.
b1: &amp;
b2: &
c1: &gt;
c2: >
d1: &gt; 150ÎC
d2: HTML Parse Broke!
Edit 3: kalhartt's suggestion led me to the solution. To avoid strings with mixed character encodings, I decode the byte strings with .decode('utf-8').
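A minimal sketch of that fix, for anyone hitting the same error. In Python 2.7, unescape() substitutes unicode characters for the entities, so a byte string that also contains non-ASCII bytes goes through an implicit ASCII decode and raises. Decoding first avoids that. The helper name to_unicode_text and the sample bytes (a UTF-8 degree sign) are my own illustrations, not part of the original data, and the Python 3 import is only there so the snippet runs on either version.

```python
# -*- coding: utf-8 -*-
try:
    from html import unescape  # Python 3.4+
except ImportError:
    import HTMLParser  # Python 2.7, as in the question
    unescape = HTMLParser.HTMLParser().unescape


def to_unicode_text(raw, encoding="utf-8"):
    """Hypothetical helper: decode bytes if needed, then resolve HTML entities."""
    if isinstance(raw, bytes):  # on Python 2, bytes is the plain str type
        raw = raw.decode(encoding)
    return unescape(raw)


# Illustrative inputs (assumed, not taken from the real data set).
print(repr(to_unicode_text(b"P&amp;O.")))
print(repr(to_unicode_text(b"&gt; 150\xc2\xb0C")))
```

This assumes the input really is UTF-8; for text in another encoding, pass that encoding instead.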
Answer 0 (score: 1)
If you want to handle unicode, use unicode strings. With unicode literals, everything in your example works as expected.
# -*- coding: utf-8 -*-
import HTMLParser
from bs4 import BeautifulSoup
astring = u"P&amp;O."
bstring = u"&amp; "
cstring = u"&gt;"
dstring = u"&gt; 150ÎC"
pars = HTMLParser.HTMLParser()
a1 = BeautifulSoup('<span>%s</span>' % astring)
a2 = pars.unescape(astring)
print "a1:", a1
print "a2:", a2
b1 = BeautifulSoup('<span>%s</span>' % bstring)
b2 = pars.unescape(bstring)
print "b1:", b1
print "b2:", b2
c1 = BeautifulSoup('<span>%s</span>' % cstring)
c2 = pars.unescape(cstring)
print "c1:", c1
print "c2:", c2
d1 = BeautifulSoup('<span>%s</span>' % dstring)
try:
    d2 = pars.unescape(dstring)
except Exception:
    d2 = "HTML Parse Broke!"
print "d1:", d1
print "d2:", d2
This gives the following output:
a1: <span>P&amp;O.</span>
a2: P&O.
b1: <span>&amp; </span>
b2: & 
c1: <span>&gt;</span>
c2: >
d1: <span>&gt; 150ÎC</span>
d2: > 150ÎC
BeautifulSoup encodes the entities when it prints the markup, and HTMLParser decodes them.
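The two directions can be seen side by side in a small round trip. This is only a sketch: it uses the standard library's escape function as a stand-in for the escaping BeautifulSoup applies when printing, which is an assumption for illustration rather than what BeautifulSoup calls internally. The function name round_trip is hypothetical, and the import fallback is there so it runs on Python 2.7 or 3.

```python
try:
    from html import escape, unescape  # Python 3.4+
except ImportError:
    from cgi import escape  # Python 2.7
    import HTMLParser
    unescape = HTMLParser.HTMLParser().unescape


def round_trip(text):
    """Hypothetical helper: escape text for markup, then unescape it again."""
    encoded = escape(text)      # encoding direction: & becomes &amp;, > becomes &gt;
    decoded = unescape(encoded)  # decoding direction: entities back to characters
    return encoded, decoded


for sample in (u"P&O.", u"& ", u">"):
    encoded, decoded = round_trip(sample)
    print("%r -> %r -> %r" % (sample, encoded, decoded))
```

Because both steps work on unicode strings, the decoded value matches the original text for each sample.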