Question

I've looked all over and only found solutions for Python 2.6 and earlier, nothing on how to do this in Python 3.X. (I only have access to a Win7 box.)

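I have to be able to do this in 3.1, preferably without external libraries. Currently I have httplib2 installed and access to curl from the command prompt (that's how I'm getting the page source). Unfortunately, curl does not decode HTML entities, and as far as I can tell there is no option for decoding them in its documentation.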

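Yes, I've tried to get Beautiful Soup to work on 3.X, many times and without success. If you can provide EXPLICIT instructions for getting it to work in Python 3 in an MS Windows environment, I would be very grateful.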

So, to be clear, I need to turn a string like this: "Suzy &amp; John" into a string like this: "Suzy & John".

Answer 1

You can use the function html.unescape:

In Python 3.4+ (thanks to J.F. Sebastian for the update):

import html
html.unescape('Suzy &amp; John')
# 'Suzy & John'

html.unescape('&quot;')
# '"'

In Python 3.3 or older:

import html.parser
# Note: HTMLParser.unescape was deprecated in 3.4 and removed in 3.9
html.parser.HTMLParser().unescape('Suzy &amp; John')

In Python 2:

import HTMLParser
HTMLParser.HTMLParser().unescape('Suzy &amp; John')
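
If the same script has to run on several of the versions listed above, the three variants can be folded into one import shim. This is only a sketch: it tries each import location in turn and binds whichever works to the name unescape. Nothing here goes beyond the calls already shown in this answer.

```python
# Pick the first available implementation, newest first.
try:
    from html import unescape  # Python 3.4+
except ImportError:
    try:
        from html.parser import HTMLParser  # Python 3.0 to 3.3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2
    # On these older versions unescape is a method of a parser instance.
    unescape = HTMLParser().unescape

decoded = unescape('Suzy &amp; John')
print(decoded)
```

On Python 3.1 the second branch is the one that gets used, so this stays within the standard library.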

Answer 2

You can use xml.sax.saxutils.unescape for this. The module is part of the Python standard library and is portable between Python 2.x and Python 3.x.

>>> import xml.sax.saxutils as saxutils
>>> saxutils.unescape("Suzy &amp; John")
'Suzy & John'
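
One thing to be aware of: by default saxutils.unescape only handles &amp;, &lt; and &gt;. Other entities, such as &quot;, have to be passed in through the optional entities argument. A small sketch of the difference (the variable names are just for illustration):

```python
from xml.sax.saxutils import unescape

text = "&quot;Suzy &amp; John&quot;"

# Default behaviour: only &amp;, &lt; and &gt; are replaced.
default_result = unescape(text)

# Extra entities are supplied as a mapping from entity text to replacement.
extra_entities = {"&quot;": '"', "&apos;": "'"}
full_result = unescape(text, extra_entities)

print(default_result)
print(full_result)
```

This works for a handful of known entities, but it is not a general HTML entity decoder, since every additional entity must be listed by hand.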

Answer 3

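Apparently I don't have enough reputation to do anything but post this. unutbu's answer doesn't unescape quotation marks. The only thing I found that did was this function: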

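import re
# Python 3 location; in Python 2 this was: from htmlentitydefs import name2codepoint
from html.entities import name2codepoint as n2cp

def decodeHtmlentities(string):
    def substitute_entity(match):
        ent = match.group(2)
        if match.group(1) == "#":
            # Decimal character reference such as &#38; (chr replaces Python 2's unichr)
            try:
                return chr(int(ent))
            except ValueError:
                # Not a decimal number (e.g. hex &#x26;): leave it unchanged
                return match.group()
        else:
            # Named entity such as &amp;; unknown names are left unchanged
            cp = n2cp.get(ent)
            if cp:
                return chr(cp)
            else:
                return match.group()
    entity_re = re.compile(r"&(#?)(\d{1,5}|\w{1,8});")
    return entity_re.subn(substitute_entity, string)[0]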

I got it from this page.

Answer 4

Python 3.x also has html.entities.
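
For illustration, html.entities exposes lookup tables rather than a decoding function. A minimal sketch of how the tables can be used (the variable names are made up for this example):

```python
from html.entities import name2codepoint, codepoint2name

# name2codepoint maps an entity name (without & and ;) to its code point.
amp = chr(name2codepoint["amp"])
eacute = chr(name2codepoint["eacute"])

# codepoint2name is the reverse mapping.
name = codepoint2name[ord("<")]

print(amp, eacute, name)
```

These tables only cover named entities, so numeric references such as &#38; still need to be handled separately.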

Answer 5

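In my case I had an HTML string that had been escaped with the AS3 escape function. After an hour of googling I hadn't found anything useful, so I wrote this recursive function to meet my needs. Here it is: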

def unescape(string):
    # Python 3 port: str.decode('hex') and str.decode('unicode_escape') exist only in Python 2
    index = string.find("%")
    if index == -1:
        return string
    else:
        # An escaped Unicode character (%uXXXX) needs different decoding
        if string[index+1:index+2] == 'u':
            replace_with = chr(int(string[index+2:index+6], 16))
            string = string.replace(string[index:index+6], replace_with)
        else:
            # %XX encodes a single Latin-1 code point
            replace_with = chr(int(string[index+1:index+3], 16))
            string = string.replace(string[index:index+3], replace_with)
        # Caveat: a '%' produced by decoding %25 is scanned again on the next call
        return unescape(string)

Edit-1: Added handling of Unicode characters.
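
Because the recursive function above rescans the whole string after every replacement, a '%' that results from decoding %25 is treated as the start of another escape. A single-pass alternative using re.sub avoids that. This is a sketch, not the answer author's code: the name unescape_as3 is hypothetical, and it assumes %XX means a Latin-1 code point and %uXXXX a 16-bit code unit. Characters outside the Basic Multilingual Plane (encoded as two %uXXXX units) are not recombined.

```python
import re

# Matches %uXXXX (group 1) or %XX (group 2); hex digits in either case.
_ESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")

def unescape_as3(text):
    """Decode %XX and %uXXXX sequences in a single left-to-right pass."""
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE_RE.sub(replace, text)

print(unescape_as3("Suzy%20%26%20John"))
```

Since each match is replaced exactly once, decoded output is never interpreted as a new escape sequence.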

Answer 6

I'm not sure whether this is a built-in library, but it looks like what you need and it supports 3.1.

From: http://docs.python.org/3.1/library/xml.sax.utils.html?highlight=html%20unescape

xml.sax.saxutils.unescape(data, entities={}) Unescape '&amp;', '&lt;', and '&gt;' in a string of data.

How do I unescape HTML entities in a string in Python 3.1?

6 answers: