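I am running into a problem with special characters such as ° and ®, which represent the degree symbol (as in degrees Fahrenheit) and the registered trademark symbol. When I print a string containing these characters, I get output like this:

    Preheat oven to 350&deg; F
    Welcome to Lorem Ipsum Inc&reg;

Is there a way to output the actual characters instead of the entity codes? Please let me know.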
Answer 0 (score: 8)
    $ python -c'from BeautifulSoup import BeautifulSoup
    > print BeautifulSoup("""<html>Preheat oven to 350&deg; F
    > Welcome to Lorem Ipsum Inc&reg;""",
    > convertEntities=BeautifulSoup.HTML_ENTITIES).contents[0].string'
    Preheat oven to 350° F
    Welcome to Lorem Ipsum Inc®
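On Python 3, where the old BeautifulSoup 3 package is not available, the standard library can do the same conversion. A minimal sketch, assuming the input is a plain string that contains the entity references (the variable names `text` and `decoded` are only for illustration):

```python
import html

# Illustrative input; the entity references are written out literally.
text = "Preheat oven to 350&deg; F\nWelcome to Lorem Ipsum Inc&reg;"

# html.unescape converts named and numeric character references to Unicode.
decoded = html.unescape(text)
print(decoded)
```

This needs no third-party package, which may be preferable when the string is not a full HTML document.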
Answer 1 (score: 2)
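Here is a script for tolerantly unescaping HTML references from web pages. It assumes the references are in a form such as &deg;, followed by a semicolon (for example, Preheat oven to 350&deg; F):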
    from htmlentitydefs import name2codepoint  # Python 2 module

    # Get the whitespace characters
    nums_dict = {0: ' ', 1: '\t', 2: '\r', 3: '\n'}
    chars_dict = dict((x, y) for y, x in nums_dict.items())
    nums_dict2XML = {0: '&#32;', 1: '&#09;', 2: '&#13;', 3: '&#10;'}
    chars_dict2XML = dict((nums_dict[i], nums_dict2XML[i]) for i in nums_dict2XML)

    s = '1234567890ABCDEF'
    hex_dict = {}
    for i in s:
        hex_dict[i.lower()] = None
        hex_dict[i.upper()] = None
    del s

    def is_hex(s):
        if not s:
            return False
        for i in s:
            if i not in hex_dict:
                return False
        return True

    class Unescape:
        def __init__(self, s, ignore_whitespace=False):
            # Converts HTML character references into a unicode string to allow manipulation
            self.s = s
            self.ignore_whitespace = ignore_whitespace
            self.lst = self.process(ignore_whitespace)

        def process(self, ignore_whitespace):
            def get_char(c):
                if ignore_whitespace:
                    return c
                else:
                    if c in chars_dict:
                        return chars_dict[c]
                    else: return c

            r = []
            lst = self.s.split('&')
            xx = 0
            yy = 0
            for item in lst:
                if xx:
                    split = item.split(';')
                    if split[0].lower() in name2codepoint:
                        # A character reference, e.g. '&amp;'
                        a = unichr(name2codepoint[split[0].lower()])
                        r.append(get_char(a)) # TOKEN CHECK?
                        r.append(';'.join(split[1:]))
                    elif split[0] and split[0][0] == '#' and split[0][1:].isdigit():
                        # A character number e.g. '&#52;'
                        a = unichr(int(split[0][1:]))
                        r.append(get_char(a))
                        r.append(';'.join(split[1:]))
                    elif split[0] and split[0][0] == '#' and split[0][1:2].lower() == 'x' and is_hex(split[0][2:]):
                        # A hexadecimal encoded character
                        a = unichr(int(split[0][2:].lower(), 16)) # Hex -> base 16
                        r.append(get_char(a))
                        r.append(';'.join(split[1:]))
                    else:
                        r.append('&%s' % ';'.join(split))
                else:
                    r.append(item)
                xx += 1
                yy += len(r[-1])
            return r

        def get_value(self):
            # Convert back into HTML, preserving
            # whitespace if self.ignore_whitespace is `False`
            r = []
            for i in self.lst:
                if type(i) == int:
                    r.append(nums_dict2XML[i])
                else:
                    r.append(i)
            return ''.join(r)

    def unescape(s):
        # Get the string value from escaped HTML `s`, ignoring
        # explicit whitespace like tabs/spaces etc
        inst = Unescape(s, ignore_whitespace=True)
        return ''.join(inst.lst)

    if __name__ == '__main__':
        print unescape('Preheat oven to 350&deg; F')
        print unescape('Welcome to Lorem Ipsum Inc&reg;')
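The script above distinguishes three kinds of reference: named (&deg;), decimal (&#176;) and hexadecimal (&#xB0;). The same three cases can be sketched more compactly in Python 3 with a regular expression. This is only an illustration, not a port of the script: the names `_REF` and `unescape_refs` are made up here, the semicolon is required, entity names are matched case-sensitively (the script lowercases them), and the whitespace bookkeeping is left out.

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#123; and &#x7B; style references (semicolon required).
_REF = re.compile(r'&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);')

def unescape_refs(text):
    """Replace recognised character references; leave anything else untouched."""
    def replace(match):
        body = match.group(1)
        try:
            if body[:2] in ('#x', '#X'):
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith('#'):
                return chr(int(body[1:]))       # decimal reference
        except (ValueError, OverflowError):
            return match.group(0)               # code point out of range
        if body in name2codepoint:
            return chr(name2codepoint[body])    # named reference
        return match.group(0)                   # unknown name: keep as-is
    return _REF.sub(replace, text)

print(unescape_refs('Preheat oven to 350&deg; F'))
print(unescape_refs('Welcome to Lorem Ipsum Inc&#174;'))
```

Because the whole string is scanned once, text that only looks like a reference (a bare & or an unknown name) passes through unchanged.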
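Edit: here is a much simpler solution that only replaces named character references with the corresponding characters; it does not handle numeric &#xx; references: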
    from htmlentitydefs import name2codepoint  # Python 2 module

    def unescape(s):
        # Replace every known named reference with its character
        for name in name2codepoint:
            s = s.replace('&%s;' % name, unichr(name2codepoint[name]))
        return s

    print unescape('Preheat oven to 350&deg; F')
    print unescape('Welcome to Lorem Ipsum Inc&reg;')
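One thing to watch with the replace loop: if &amp; happens to be replaced before other names, text such as &amp;deg; can be unescaped twice. A hedged Python 3 sketch of the same idea that handles &amp; last (the function name `unescape_named` is just for illustration):

```python
from html.entities import name2codepoint

def unescape_named(text):
    """Replace named references only, handling '&amp;' after all the others."""
    for name, codepoint in name2codepoint.items():
        if name != 'amp':
            text = text.replace('&%s;' % name, chr(codepoint))
    # Doing '&amp;' last keeps '&amp;deg;' from turning into a degree sign.
    return text.replace('&amp;', '&')

print(unescape_named('Preheat oven to 350&deg; F'))
print(unescape_named('Welcome to Lorem Ipsum Inc&reg;'))
```

Like the original, this still ignores numeric references.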
Answer 2 (score: 1)
With Beautiful Soup 4:

    from bs4 import BeautifulSoup

    my_text = """Preheat oven to 350&deg; F
    Welcome to Lorem Ipsum Inc&reg; """
    soup = BeautifulSoup(my_text, 'html.parser')
    print(soup)

Result:

    Preheat oven to 350° F
    Welcome to Lorem Ipsum Inc®
Answer 3 (score: 0)
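My guess is that somewhere a program is emitting &deg and &reg without the trailing semicolon. Try using "&deg" + ";" and "&reg" + ";" in the HTML file, if it really is an HTML file. Please explain the context.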
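If the input really does contain references without the semicolon, one option (an assumption on my part, since the question does not say which Python version is in use) is Python 3's html.unescape, which follows the HTML5 parsing rules and accepts a number of legacy names such as deg and reg without it. A small sketch with made-up variable names:

```python
import html

# '&deg' and '&reg' here deliberately lack the trailing semicolon.
broken = "Preheat oven to 350&deg F\nWelcome to Lorem Ipsum Inc&reg"

# html.unescape applies the HTML5 rules for references missing the semicolon.
fixed = html.unescape(broken)
print(fixed)
```

This only works for the names that HTML5 allows without a semicolon; other names are left as they are.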