Question

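I need to convert any HTML entity into its ASCII equivalent using Python. My use case is cleaning up some HTML that is used to build emails, so that I can create plaintext emails from the HTML.

Right now, I only really know how to create Unicode from these entities, when what I need (I think) is ASCII, so that the plaintext email reads correctly with things like accented characters. I think a basic example would be the HTML entity "&aacute;", or á, being encoded into ASCII.

Furthermore, I'm not even 100% sure that ASCII is what I need for a plaintext email. As you can tell, I'm completely lost on this encoding stuff.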

Answer 1

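Here is a complete implementation that also handles numeric (Unicode) HTML entities. You might find it useful.

It returns a Unicode string, which is not ASCII, but if you want plain ASCII you can modify the replace operations so that they replace each entity with an empty string.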

import re
import htmlentitydefs  # Python 2 module; renamed html.entities in Python 3


def convert_html_entities(s):
    # Decimal character references, e.g. &#225;
    matches = re.findall(r"&#\d+;", s)
    if len(matches) > 0:
        hits = set(matches)
        for hit in hits:
            name = hit[2:-1]
            try:
                entnum = int(name)
                s = s.replace(hit, unichr(entnum))
            except ValueError:
                # Code point out of range for unichr: leave the entity as-is
                pass

    # Hexadecimal character references, e.g. &#xE1;
    matches = re.findall(r"&#[xX][0-9a-fA-F]+;", s)
    if len(matches) > 0:
        hits = set(matches)
        for hit in hits:
            hex_digits = hit[3:-1]
            try:
                entnum = int(hex_digits, 16)
                s = s.replace(hit, unichr(entnum))
            except ValueError:
                pass

    # Named entities, e.g. &aacute;
    matches = re.findall(r"&\w+;", s)
    hits = set(matches)
    amp = "&amp;"
    if amp in hits:
        hits.remove(amp)
    for hit in hits:
        name = hit[1:-1]
        if name in htmlentitydefs.name2codepoint:
            s = s.replace(hit, unichr(htmlentitydefs.name2codepoint[name]))
    # Replace &amp; last so that text like "&amp;lt;" is not decoded twice
    s = s.replace(amp, "&")
    return s

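EDIT: added matching for hex codes. I've been using this for a while now, and ran into my first situation with &#x27;, which is a single quote/apostrophe.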
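For anyone on Python 3, where `unichr` and `htmlentitydefs` no longer exist, here is a minimal sketch of the same idea. It assumes Python 3.4 or later and relies on the standard library's `html.unescape`, which decodes named, decimal, and hex entities. The names `entities_to_text` and `to_ascii` are hypothetical. Note that `to_ascii` swaps in accent stripping via `unicodedata` rather than the empty-string replacement mentioned above, and it silently drops any character that has no ASCII base letter.

```python
import html
import unicodedata


def entities_to_text(s):
    """Decode named, decimal, and hex HTML entities into a Unicode string."""
    return html.unescape(s)


def to_ascii(text):
    """Hypothetical helper: approximate text in ASCII by stripping accents."""
    # NFKD splits "á" into "a" plus a combining accent; encoding with
    # errors="ignore" then drops everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


decoded = entities_to_text("caf&eacute; &amp; cr&#232;me &#x27;brul&eacute;e&#x27;")
print(decoded)            # café & crème 'brulée'
print(to_ascii(decoded))  # cafe & creme 'brulee'
```

Whether the lossy ASCII step is needed at all depends on the mail clients involved; the Unicode string on its own is usually the more faithful result.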

Answer 2

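ASCII is the American Standard Code for Information Interchange, and it does not include any accented letters. Your best bet is to get Unicode (as you say you can) and encode it as UTF-8 (or perhaps ISO-8859-1 or some odd code page if you are dealing with seriously miscoded user agents/clients, sigh). The Content-Type header of that part, together with text/plain, can state which encoding you chose. I recommend trying UTF-8 unless you have positively demonstrated that it cannot work: it is almost universally supported these days and much more flexible than any ISO-8859 or "code page" hack!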
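As a rough illustration of that advice, the sketch below builds a text/plain part whose Content-Type header declares UTF-8. It assumes Python 3 and the standard library's `email.mime.text.MIMEText`; the function name `build_plaintext_message`, the subject line, and the sample body are made up for the example.

```python
import html
from email.mime.text import MIMEText


def build_plaintext_message(html_fragment, subject):
    """Hypothetical helper: decode entities and wrap the text as UTF-8 text/plain."""
    body = html.unescape(html_fragment)
    # The third argument sets the charset parameter of the Content-Type header
    msg = MIMEText(body, "plain", "utf-8")
    msg["Subject"] = subject
    return msg


msg = build_plaintext_message("Ol&aacute;, your order is ready.", "Order ready")
print(msg["Content-Type"])
print(msg.get_payload(decode=True).decode("utf-8"))
```

The body is transfer-encoded inside the message, so `get_payload(decode=True)` is used to get the UTF-8 bytes back when inspecting it.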

Answer 3

You can use the htmlentitydefs package:

import htmlentitydefs
print htmlentitydefs.entitydefs['aacute']

Basically, entitydefs is just a dictionary, and you can see this by printing it at the Python prompt:

from pprint import pprint
pprint(htmlentitydefs.entitydefs)
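The snippets above are Python 2. On Python 3 the module is called `html.entities`, and a comparable lookup might look like the sketch below, which uses `name2codepoint` (entity name to integer code point) so that the result can be passed straight to `chr`. The variable names are just for illustration.

```python
from html.entities import name2codepoint

# Look up the code point for the named entity &aacute; and turn it into a character
codepoint = name2codepoint["aacute"]
char = chr(codepoint)
print(codepoint, char)  # 225 á
```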

Answer 4

We put together a small module with agazso's function:

http://github.com/ARTFL/util/blob/master/ents.py

We found agazso's function to be faster than the alternatives for entity conversion. Thanks for posting it.

Convert HTML entities to ASCII in Python
