Question

I am using this function to unescape HTML entities:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    # Raw string so that \w reaches the regex engine unchanged.
    return re.sub(r"&#?\w+;", fixup, text)

Most of the text I process works fine, but for some input Python throws this error:

File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
  return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xae' in position 348: character maps to <undefined>

I have tried a million different ways of encoding the text string and so far nothing has worked: ascii, utf, unicode ... all that stuff I don't really understand.

Answer 1

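Based on the error message, you are probably trying to convert a unicode string to CP 437 (the IBM PC character set). This does not appear to happen inside your function; it most likely happens when you print the resulting string to the console. I ran a quick test with the input string "&#xae; some text" and was able to reproduce the failure when printing the result: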

print unescape("&#xae; some text")
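The same failure can be reproduced without a console at all. This is only a minimal sketch, assuming the console encoding is cp437 as the traceback suggests. The literal u"\xae some text" stands in for the value unescape returns for the input above, and the variable names are illustrative.

```python
# Stand-in for the unicode string returned by unescape("&#xae; some text").
text = u"\xae some text"

# cp437 has no slot for U+00AE, so encoding to it fails.
try:
    text.encode("cp437")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(repr(exc.reason))

# UTF-8 can represent every Unicode character, so this succeeds.
utf8_bytes = text.encode("utf-8")
```

This suggests the function itself is fine and that the error comes from the implicit encode that print performs for the console.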

You can avoid this by specifying the encoding that the unicode string should be converted to:

print unescape("&#xae; some text").encode('utf-8')

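If you print this string to the console you will see non-ASCII characters, but if you write it to a file and open it in a viewer that supports UTF-8 encoded documents, you should see the characters you expect.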
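Writing the result to a UTF-8 file could look roughly like the sketch below. It uses io.open, which accepts an encoding argument on both Python 2.7 and Python 3. The file name and the temporary directory are assumptions made for illustration only.

```python
import io
import os
import tempfile

# Stand-in for the unicode string returned by unescape().
text = u"\xae some text"

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "unescaped.txt")

# io.open encodes on write, so no manual .encode() call is needed.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Reading back with the same encoding yields the original unicode string.
with io.open(path, "r", encoding="utf-8") as handle:
    round_tripped = handle.read()
```

Letting the file object do the encoding keeps the data as unicode inside the program and converts it to bytes only at the output boundary.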

Answer 2

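You need to post the FULL traceback so that we can see where in your code the error occurs. You also need to show us repr(a small piece of data that has this problem) - your data is at least 348 bytes long.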

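Based on the information initially supplied:

You are crashing while trying to encode a unicode character as cp437 ...

Either (1) the error happens somewhere in the code you showed, and someone has set your default encoding to cp437 (don't do that)

or (2) the error does not happen anywhere in the code you showed us; it happens when you try to print some of the results of your function. You are running in a Windows "Command Prompt" window, so your sys.stdout.encoding is set to some legacy MS-DOS encoding that does not support the U+00AE character.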
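Case (2) can be checked by inspecting sys.stdout.encoding. The sketch below also shows one possible workaround: replacing characters the console cannot display before printing. The helper name console_safe is made up for this example, and the "replace" error handler is a design choice rather than the only option.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding lacks replaced by '?'."""
    if encoding is None:
        # sys.stdout.encoding can be None (e.g. when output is piped on Python 2).
        encoding = getattr(sys.stdout, "encoding", None) or "ascii"
    # Encode with "replace" so unsupported characters become '?', then decode
    # back so the caller still gets a unicode string.
    return text.encode(encoding, "replace").decode(encoding)

# Shows which encoding print will use for this console.
print(repr(getattr(sys.stdout, "encoding", None)))

# Passing "cp437" explicitly mimics the console from the traceback.
print(console_safe(u"\xae some text", "cp437"))
```

This loses the unsupported character, so it is only suitable for display. For storage, encoding to UTF-8 keeps the data intact.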

Answer 3

You need to convert the result with the encode method, using an encoding such as 'utf-8', e.g.:

strdata = result.encode('utf-8')

print strdata

Problem converting HTML entities and encoding
