Question

I have a string that can contain special characters such as ' or " or & (...). In the string:

string = """ Hello "XYZ" this 'is' a test & so on """

how can I automatically escape every special character, so that I get this:

string = " Hello &quot;XYZ&quot; this &#39;is&#39; a test &amp; so on "

Answer 1

In Python 3.2 and later, you can use the html.escape function, e.g.

>>> string = """ Hello "XYZ" this 'is' a test & so on """
>>> import html
>>> html.escape(string)
' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '
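As a small sketch of the options around html.escape (the variable names here are only illustrative): the quote parameter controls whether " and ' are escaped, and html.unescape, available from Python 3.4, reverses the conversion.

```python
import html

s = """ Hello "XYZ" this 'is' a test & so on """

# quote=True is the default, so both kinds of quotes are escaped.
escaped = html.escape(s)

# With quote=False only &, < and > are escaped.
no_quotes = html.escape(s, quote=False)

# html.unescape (Python 3.4+) turns the entities back into characters.
restored = html.unescape(escaped)

print(escaped)
print(no_quotes)
print(restored == s)
```

Leaving quote at its default is the safer choice when the result may end up inside an HTML attribute value.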

For earlier versions of Python, check http://wiki.python.org/moin/EscapingHtml:

The cgi module that comes with Python has an escape() function (deprecated since Python 3.2 and removed in Python 3.8):
import cgi

s = cgi.escape( """& < >""" )   # s = "&amp; &lt; &gt;"
However, it doesn't escape characters beyond &, < and >. If it is used as cgi.escape(string_to_escape, quote=True), it also escapes ".


Here's a small snippet that will let you escape quotes and apostrophes as well:
 html_escape_table = {
     "&": "&amp;",
     '"': "&quot;",
     "'": "&apos;",
     ">": "&gt;",
     "<": "&lt;",
     }

 def html_escape(text):
     """Produce entities within text."""
     return "".join(html_escape_table.get(c,c) for c in text)
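The same table-driven idea can also be sketched with str.translate, which makes a single pass over the text. This is an alternative, not what the wiki page uses, and the names html_escape_map and html_escape_translate are made up for this example. It keeps the wiki table's &apos; for the apostrophe.

```python
# Python 3 sketch: str.maketrans accepts a dict that maps single
# characters to replacement strings of any length.
html_escape_map = str.maketrans({
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
})

def html_escape_translate(text):
    """Escape the five characters in one pass over `text`."""
    return text.translate(html_escape_map)

print(html_escape_translate("""<a href="x">Tom & 'Jerry'</a>"""))
```

Because each character is looked up exactly once, the & inside an entity that was just produced is never escaped a second time.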
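You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster. The unescape() function of the same module can be passed the same arguments to decode a string.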
from xml.sax.saxutils import escape, unescape
# escape() and unescape() take care of &, < and >.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;"
}
html_unescape_table = {v:k for k, v in html_escape_table.items()}

def html_escape(text):
    return escape(text, html_escape_table)

def html_unescape(text):
    return unescape(text, html_unescape_table)
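A short round-trip sketch of the saxutils approach above. It redefines the two tables under illustrative names (quote_entities, quote_entities_reversed) so that it runs on its own.

```python
from xml.sax.saxutils import escape, unescape

# Extra entities on top of the &, < and > that saxutils handles itself.
quote_entities = {'"': "&quot;", "'": "&apos;"}
quote_entities_reversed = {v: k for k, v in quote_entities.items()}

original = """a < b & "c" isn't d"""
encoded = escape(original, quote_entities)
decoded = unescape(encoded, quote_entities_reversed)

print(encoded)
print(decoded == original)
```

escape() replaces & before anything else and unescape() restores &amp; last, so the round trip does not double-escape or double-decode.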

Answer 2

The cgi.escape method will convert special characters to valid HTML entities:

 import cgi
 original_string = 'Hello "XYZ" this \'is\' a test & so on '
 escaped_string = cgi.escape(original_string, True)
 print original_string
 print escaped_string

will result in

Hello "XYZ" this 'is' a test & so on 
Hello &quot;XYZ&quot; this 'is' a test &amp; so on

The optional second parameter of cgi.escape escapes double quotes. By default, they are not escaped.
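Since cgi.escape was removed in Python 3.8, code that has to run on both old and new interpreters could use a small shim like the sketch below. The name escape_html is hypothetical, and the extra replace in the fallback branch is an assumption made so that both branches return the same output.

```python
try:
    # Python 3.2+: html.escape escapes &, <, >, " and '.
    from html import escape as _escape

    def escape_html(s):
        return _escape(s, quote=True)
except ImportError:
    # Older Pythons: cgi.escape never touches single quotes,
    # so they are replaced by hand to match html.escape.
    from cgi import escape as _escape

    def escape_html(s):
        return _escape(s, True).replace("'", "&#x27;")

print(escape_html('Hello "XYZ" this \'is\' a test & so on'))
```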

Answer 3

A simple string function will do it:

def escape(t):
    """HTML-escape the text in `t`."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        )

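Other answers in this thread have minor problems: the cgi.escape method for some reason ignores single quotes, and you need to explicitly ask it to do double quotes. The linked wiki page does all five, but uses the XML entity &apos;, which is not an HTML 4 entity (it was only added in HTML5).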

This function always handles all five, using HTML-standard entities.
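One detail of the replace chain is worth spelling out: & has to be replaced first. The sketch below uses two throwaway helpers (escape_amp_first and escape_amp_last, both invented for this illustration) to show what happens when the order is reversed.

```python
def escape_amp_first(t):
    # Correct order: escape & before introducing any new & characters.
    return t.replace("&", "&amp;").replace("<", "&lt;")

def escape_amp_last(t):
    # Wrong order: the & in the freshly produced "&lt;" gets escaped again.
    return t.replace("<", "&lt;").replace("&", "&amp;")

sample = "a < b & c"
print(escape_amp_first(sample))  # a &lt; b &amp; c
print(escape_amp_last(sample))   # a &amp;lt; b &amp; c
```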

Answer 4

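The other answers here will help with the characters you listed and a few others. However, if you also want to convert everything else to entity names, you'll have to do something else. For instance, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you there. You'll want to do something like the following, which uses html.entities.entitydefs; that is just a dictionary. (The following code is made for Python 3.x, but there's a partial attempt at making it compatible with 2.x to give you an idea):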

# -*- coding: utf-8 -*-

import sys

if sys.version_info[0]>2:
    from html.entities import entitydefs
else:
    from htmlentitydefs import entitydefs

text=";\"áèïøæỳ" #This is your string variable containing the stuff you want to convert
text=text.replace(";", "$ஸ$") #$ஸ$ is just something random the user isn't likely to have in the document. We're converting it so it doesn't convert the semi-colons in the entity name into entity names.
text=text.replace("$ஸ$", "&semi;") #Converting semi-colons to entity names

if sys.version_info[0]>2: #Using appropriate code for each Python version.
    for k,v in entitydefs.items():
        if k not in {"semi", "amp"}:
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually.
else:
    for k,v in entitydefs.iteritems():
        if k not in {"semi", "amp"}:
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually.

#The above code doesn't cover every single entity name, although I believe it covers everything in the Latin-1 character set. So, I'm manually doing some common ones I like hereafter:
text=text.replace("ŷ", "&ycirc;")
text=text.replace("Ŷ", "&Ycirc;")
text=text.replace("ŵ", "&wcirc;")
text=text.replace("Ŵ", "&Wcirc;")
text=text.replace("ỳ", "&#7923;")
text=text.replace("Ỳ", "&#7922;")
text=text.replace("ẃ", "&wacute;")
text=text.replace("Ẃ", "&Wacute;")
text=text.replace("ẁ", "&#7809;")
text=text.replace("Ẁ", "&#7808;")

print(text)
#Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923;
#The Python 2.x version outputs the wrong stuff. So, clearly you'll have to adjust the code somehow for it.
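As an alternative sketch to the chain of str.replace calls above (Python 3 only), the string can be walked once and each character looked up in html.entities.codepoint2name. The function name escape_to_named_entities is hypothetical. The behaviour differs slightly from the code above: & becomes &amp;, semicolons are left alone, and any non-ASCII character without an HTML 4 name falls back to a numeric reference.

```python
from html.entities import codepoint2name

def escape_to_named_entities(text):
    """Replace characters with named entities where HTML 4 defines one."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in codepoint2name:
            # Covers &, <, >, " and the named Latin-1, Greek and symbol entities.
            out.append("&" + codepoint2name[cp] + ";")
        elif cp > 127:
            # No HTML 4 name: fall back to a decimal character reference.
            out.append("&#%d;" % cp)
        else:
            # Plain ASCII (including the apostrophe) passes through unchanged.
            out.append(ch)
    return "".join(out)

print(escape_to_named_entities('"áèïøæỳ'))
```

Working one character at a time avoids the placeholder trick for semicolons, because text that has already been converted is never scanned again.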

Escaping special HTML characters in Python
