I have a string that can contain special characters such as ' or " or & (...). For the string:
string = """ Hello "XYZ" this 'is' a test & so on """
how can I automatically escape every special character, so that I get this:
string = " Hello &quot;XYZ&quot; this &#39;is&#39; a test &amp; so on "
Answer 0 (score: 37)
In Python 3.2 and later, you can use the html.escape function, for example:
>>> string = """ Hello "XYZ" this 'is' a test & so on """
>>> import html
>>> html.escape(string)
' Hello &quot;XYZ&quot; this &#x27;is&#x27; a test &amp; so on '
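As a small sketch of how the `quote` parameter behaves (the variable names are only illustrative): by default `html.escape` escapes both kinds of quotes, passing `quote=False` leaves them alone, and `html.unescape` (available since Python 3.4) reverses the conversion.

```python
import html

s = """ Hello "XYZ" this 'is' a test & so on """

# Default: &, <, > and both quote characters are escaped.
escaped = html.escape(s)

# quote=False: only &, < and > are escaped.
escaped_no_quotes = html.escape(s, quote=False)

# html.unescape() turns the entities back into characters.
round_trip = html.unescape(escaped)

print(escaped)
print(escaped_no_quotes)
print(round_trip == s)
```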
For earlier versions of Python, check http://wiki.python.org/moin/EscapingHtml:
The cgi module that comes with Python has an escape() function:
import cgi
s = cgi.escape("""& < >""")  # s = "&amp; &lt; &gt;"
However, it does not escape any characters other than &, < and >. If it is called as cgi.escape(string_to_escape, quote=True), it also escapes ".
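Note that `cgi.escape` was deprecated in Python 3.2 and removed in Python 3.8, so code that has to run on both old and new interpreters needs a fallback. A minimal sketch, assuming a hypothetical helper name `escape_html` that is not part of any library:

```python
try:
    # Python 3.2+: escapes &, <, > and, by default, both quote characters.
    from html import escape as _escape

    def escape_html(text):
        return _escape(text, quote=True)
except ImportError:
    # Older Pythons: cgi.escape handles &, <, > and, with quote=True, only ".
    from cgi import escape as _escape

    def escape_html(text):
        # Escape the single quote by hand, since cgi.escape leaves it alone.
        return _escape(text, quote=True).replace("'", "&#x27;")


print(escape_html("""<a href="x">Tom & 'Jerry'</a>"""))
```

The fallback branch adds the single-quote replacement itself so that both branches produce the same output.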
Here is a small snippet that also escapes quotes and apostrophes:
html_escape_table = {
    "&": "&amp;",
    '"': "&quot;",
    "'": "&apos;",
    ">": "&gt;",
    "<": "&lt;",
}

def html_escape(text):
    """Produce entities within text."""
    return "".join(html_escape_table.get(c, c) for c in text)
You can also use escape() from xml.sax.saxutils to escape HTML. This function should execute faster. The unescape() function from the same module can be passed the same arguments to decode a string.
from xml.sax.saxutils import escape, unescape

# escape() and unescape() take care of &, < and >.
html_escape_table = {
    '"': "&quot;",
    "'": "&apos;",
}
html_unescape_table = {v: k for k, v in html_escape_table.items()}

def html_escape(text):
    return escape(text, html_escape_table)

def html_unescape(text):
    return unescape(text, html_unescape_table)
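To illustrate the round trip, here is a short self-contained sketch that applies the same two-entry table to the string from the question (the names `extra`, `reverse`, `encoded` and `decoded` are chosen only for this example):

```python
from xml.sax.saxutils import escape, unescape

# Extra replacements applied on top of the built-in &, < and > handling.
extra = {'"': "&quot;", "'": "&apos;"}
reverse = {v: k for k, v in extra.items()}

s = """ Hello "XYZ" this 'is' a test & so on """

encoded = escape(s, extra)           # characters -> entities
decoded = unescape(encoded, reverse)  # entities -> characters

print(encoded)
print(decoded == s)
```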
Answer 1 (score: 5)
The cgi.escape function converts special characters to valid HTML entities:
import cgi
original_string = 'Hello "XYZ" this \'is\' a test & so on '
escaped_string = cgi.escape(original_string, True)
print original_string
print escaped_string
This results in:
Hello "XYZ" this 'is' a test & so on
Hello &quot;XYZ&quot; this 'is' a test &amp; so on
The optional second argument to cgi.escape escapes double quotes. By default, they are not escaped.
Answer 2 (score: 4)
A simple string function will do it:
def escape(t):
    """HTML-escape the text in `t`."""
    return (t
        .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
        .replace("'", "&#39;").replace('"', "&quot;")
        )
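The other answers in this thread have minor problems: the cgi.escape method for some reason ignores single quotes, and you have to ask it explicitly to escape double quotes. The linked wiki page handles all five characters, but uses the XML entity &apos;, which is not an HTML 4 entity.
This function always handles all five, using HTML-standard entities.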
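One detail in the function above is easy to miss: & has to be replaced first. The following sketch (the name `escape_wrong_order` is made up for illustration) shows what happens otherwise, because the ampersands introduced by the earlier replacements get escaped a second time.

```python
def escape(t):
    """HTML-escape the text in `t`, replacing & before anything else."""
    return (t
            .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
            .replace("'", "&#39;").replace('"', "&quot;"))


def escape_wrong_order(t):
    """Same idea, but & is handled last (deliberately buggy)."""
    return (t
            .replace("<", "&lt;").replace(">", "&gt;")
            .replace("&", "&amp;"))


print(escape("a < b & c"))              # a &lt; b &amp; c
print(escape_wrong_order("a < b & c"))  # a &amp;lt; b &amp; c
```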
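Answer 3 (score: 0)
The other answers here will help with the characters you listed and a few others. However, if you also want to convert everything else to entity names, you will have to do something else. For instance, if á needs to be converted to &aacute;, neither cgi.escape nor html.escape will help you. You will want to do something that uses html.entities.entitydefs, which is just a dictionary. (The following code is written for Python 3.x, with a partial attempt at making it compatible with 2.x to give you the idea):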
# -*- coding: utf-8 -*-
import sys
if sys.version_info[0]>2:
    from html.entities import entitydefs
else:
    from htmlentitydefs import entitydefs
text=";\"áèïøæỳ" #This is your string variable containing the stuff you want to convert
text=text.replace(";", "$ஸ$") #$ஸ$ is just something random the user isn't likely to have in the document. We're converting it so it doesn't convert the semi-colons in the entity name into entity names.
text=text.replace("$ஸ$", "&semi;") #Converting semi-colons to entity names
if sys.version_info[0]>2: #Using appropriate code for each Python version.
    for k,v in entitydefs.items():
        if k not in {"semi", "amp"}:
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually.
else:
    for k,v in entitydefs.iteritems():
        if k not in {"semi", "amp"}:
            text=text.replace(v, "&"+k+";") #You have to add the & and ; manually.
#The above code doesn't cover every single entity name, although I believe it covers everything in the Latin-1 character set. So, I'm manually doing some common ones I like hereafter:
text=text.replace("ŷ", "&ycirc;")
text=text.replace("Ŷ", "&Ycirc;")
text=text.replace("ŵ", "&wcirc;")
text=text.replace("Ŵ", "&Wcirc;")
text=text.replace("ỳ", "&#7923;")
text=text.replace("Ỳ", "&#7922;")
text=text.replace("ẃ", "&wacute;")
text=text.replace("Ẃ", "&Wacute;")
text=text.replace("ẁ", "&#7809;")
text=text.replace("Ẁ", "&#7808;")
print(text)
#Python 3.x outputs: &semi;&quot;&aacute;&egrave;&iuml;&oslash;&aelig;&#7923;
#The Python 2.x version outputs the wrong stuff. So, clearly you'll have to adjust the code somehow for it.
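For Python 3 only, a shorter route to a similar result is to walk the string once and look each character up in html.entities.codepoint2name, which maps code points to HTML 4 entity names. This is a sketch rather than a drop-in replacement for the code above: the function name `to_named_entities` is hypothetical, it leaves semicolons untouched instead of producing &semi;, and characters without an HTML 4 name fall back to decimal numeric references.

```python
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace markup-significant and non-ASCII characters with entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in codepoint2name:
            # Named entity, for example á -> &aacute; and " -> &quot;
            out.append("&" + codepoint2name[cp] + ";")
        elif cp > 127:
            # No HTML 4 name: use a decimal numeric character reference.
            out.append("&#%d;" % cp)
        else:
            out.append(ch)
    return "".join(out)


print(to_named_entities(";\"áèïøæỳ"))
```

Because every character is visited exactly once, an ampersand or semicolon that belongs to an already-emitted entity is never escaped again, so no placeholder string is needed.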