Convert HTML entities to Unicode and vice versa

Time: 2009-03-31 15:54:25

Tags: python html html-entities

  

Possible duplicate:

     

How do I convert HTML entities to Unicode and vice versa in Python?

9 answers:

Answer 0 (score: 88)

Regarding the "vice versa" (which I needed myself, and which led me to this question; it did not help, and I later found another site which had the answer):

u'some string'.encode('ascii', 'xmlcharrefreplace')

This returns a plain (byte) string in which every non-ASCII character has been converted to an XML (HTML) numeric character reference.
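The same idea can be sketched for Python 3, where `str.encode` returns `bytes` and a final `decode` is needed to get text back. The helper name `to_char_refs` is hypothetical, chosen only for this illustration.

```python
def to_char_refs(text):
    """Replace every non-ASCII character with a numeric character reference."""
    # 'xmlcharrefreplace' is a standard codec error handler: characters that
    # cannot be encoded as ASCII become '&#NNN;' instead of raising an error.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_char_refs("café €5"))  # caf&#233; &#8364;5
```

Note that this does not touch `&`, `<` or `>`; those need a separate escaping step.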

Answer 1 (score: 28)

You need BeautifulSoup.

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

Answer 2 (score: 19)

Update for Python 2.7 and BeautifulSoup4

Unescape: HTML entities to unicode with HTMLParser (Python 2.7 standard library):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Unescape: HTML entities to unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape: unicode to HTML entities with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
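If bs4 is not installed, a comparable named-entity escape can be sketched in Python 3 with the standard library table `html.entities.codepoint2name`. The function name `escape_named` is hypothetical, and the output may differ from `EntitySubstitution.substitute_html` in edge cases; treat it as an illustration rather than a drop-in replacement.

```python
from html.entities import codepoint2name


def escape_named(text):
    """Escape text using named entities where the HTML 4 table has one."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if name is not None:
            # Covers &, <, >, " and Latin-1 letters such as é (eacute).
            out.append("&%s;" % name)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # No named entity known: fall back to a numeric reference.
            out.append("&#%d;" % ord(ch))
    return "".join(out)


print(escape_named("Curé «Grâce»"))  # Cur&eacute; &laquo;Gr&acirc;ce&raquo;
```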

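Answer 3 (score: 8)

As hekevintran's answer suggests, you can use cgi.escape(s) to encode strings, but note that quote defaults to False in that function, so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, however, the function does not escape single quotes ("'"). Because of these issues the function has been deprecated since version 3.2.

It is recommended to use html.escape(s) instead of cgi.escape(s). (New in version 3.2.)

In addition, html.unescape(s) was introduced in version 3.4.

So in Python 3.4 you can:

  • Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
  • Use html.unescape(text) to convert HTML entities back to their plain-text representation.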
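The quoting behaviour of `html.escape` can be checked with a short Python 3 snippet. The sample string and variable names are made up for illustration.

```python
import html

sample = "<a href='x'>\"R&D\"</a>"

# quote=True is the default for html.escape: both " and ' are escaped.
escaped_all = html.escape(sample)

# quote=False leaves both kinds of quotes untouched.
escaped_no_quotes = html.escape(sample, quote=False)

print(escaped_all)        # &lt;a href=&#x27;x&#x27;&gt;&quot;R&amp;D&quot;&lt;/a&gt;
print(escaped_no_quotes)  # &lt;a href='x'&gt;"R&amp;D"&lt;/a&gt;
```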
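Put together, the two steps above form a round trip. This is a minimal Python 3.4+ sketch; the wrapper name `to_entities` and the sample text are assumptions for the example.

```python
import html


def to_entities(text):
    """Escape markup characters, then replace non-ASCII with &#NNN; references."""
    escaped = html.escape(text)  # handles &, <, >, " and '
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


original = "<p>Curé & «Grâce»</p>"
encoded = to_entities(original)
print(encoded)  # &lt;p&gt;Cur&#233; &amp; &#171;Gr&#226;ce&#187;&lt;/p&gt;

# html.unescape reverses both the markup escaping and the numeric references.
decoded = html.unescape(encoded)
print(decoded == original)  # True
```

The order matters: escaping first keeps the `&` of the generated `&#NNN;` references from being escaped a second time.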

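Answer 4 (score: 1)

I use the following function to convert unicode ripped from an xls file into an html file while preserving the special characters found in the xls file: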

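import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary format
        . if dat is empty it is replaced with a non-breaking space
        . non-ascii characters are translated to xml character references
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # pure-ASCII data is written unchanged (no HTML escaping)
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # otherwise escape markup and replace non-ASCII characters
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))

Hope this is useful to somebody.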

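Answer 5 (score: 1)

If someone out there, like me, is wondering why numeric entities such as &#153; (for trademark symbol), &#128; (for euro symbol) are not encoded properly: the reason is that those code points (128 to 159) are not defined as printable characters in ISO-8859-1. They map to the trademark and euro symbols only in Windows-1252, which is often confused with ISO-8859-1.

Also note that the default character set is utf-8 for HTML5 and ISO-8859-1 for HTML4.

So we have to work around this somehow (find and replace those references first).

Reference (starting point) from the Mozilla documentation: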
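One possible find-and-replace workaround is sketched below in Python 3: numeric references in the range 128 to 159 are reinterpreted as Windows-1252 bytes and rewritten to the matching Unicode code point. The function name `fix_cp1252_refs` is hypothetical, and the sketch only handles decimal references.

```python
import re

# Decimal references from 120 to 159; the range check below narrows it down.
_C1_REF = re.compile(r"&#(1[2-5][0-9]);")


def fix_cp1252_refs(text):
    """Rewrite &#128;..&#159; to the code points Windows-1252 assigns them."""
    def repl(match):
        code = int(match.group(1))
        if not 128 <= code <= 159:
            return match.group(0)
        try:
            char = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (for example 129) have no Windows-1252 mapping.
            return match.group(0)
        return "&#%d;" % ord(char)

    return _C1_REF.sub(repl, text)


print(fix_cp1252_refs("&#153; &#128; &#169;"))  # &#8482; &#8364; &#169;
```

When the goal is decoding rather than rewriting, `html.unescape` in Python 3.4+ follows the HTML5 rules and already applies this mapping to such references.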

https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings

Answer 6 (score: 1)

For Python 3, use html.unescape():

import html
s = "&amp;"
decoded = html.unescape(s)
# &

Answer 7 (score: 0)

$ python3 -c "
> import html
> print(
>     html.unescape('&amp;&#169;&#x2014;')
> )"
&©—

$ python2 -c "
> from HTMLParser import HTMLParser
> print(
>     HTMLParser().unescape('&amp;&#169;&#x2014;')
> )"
&©—

Answer 8 (score: 0)

windows-build-tools