Question

When BeautifulSoup or PyQuery parses HTML, it relies on an underlying parser such as lxml or html5lib. Suppose I have a file containing the following:

<span>  é    and    ’  </span>

In my environment, PyQuery seems to decode it incorrectly:

>>> doc = pq(filename=PATH, parser="xml")
>>> doc.text()
'Ã© and â\u20ac\u2122'
>>> doc = pq(filename=PATH, parser="html")
>>> doc.text()
'Ã\x83Â© and Ã¢â\x82¬â\x84¢'
>>> doc = pq(filename=PATH, parser="soup")
>>> doc.text()
'Ã© and â\u20ac\u2122'
>>> doc = pq(filename=PATH, parser="html5")
>>> doc.text()
'Ã© and â\u20ac\u2122'

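Besides the encoding apparently being wrong, one of the main problems is that doc.text() returns an instance of str rather than bytes, according to a question I asked yesterday.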
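
For context, a small sketch of the str/bytes distinction in Python 3, independent of PyQuery: str holds decoded code points, and bytes only appear once an encoding is chosen. The variable names are illustrative only.

```python
text = "\u00e9 and \u2019"      # decoded text: 'é and ’'
data = text.encode("utf-8")     # the same text as UTF-8 bytes

print(type(text).__name__, len(text))   # number of code points
print(type(data).__name__, len(data))   # number of bytes

# Decoding with the encoding that was used to encode round-trips cleanly.
round_trip = data.decode("utf-8")
```

So a parser that hands back str has already decoded the input; if the result is garbled, the wrong encoding was picked at that decoding step.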

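Moreover, passing encoding='utf-8' to PyQuery seems to have no effect, and trying 'latin1' changed nothing either. I also tried adding some metadata, because I read that lxml reads it to work out which encoding to use, but that made no difference: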

<!DOCTYPE html>
<html lang="fr" dir="ltr">
<head>
<meta http-equiv="content-type" content="text/html;charset=latin1"/>
<span>  é    and    ’  </span>
</head>
</html>

If I use lxml directly, the result looks a bit different:

>>> from lxml import etree
>>> tree = etree.parse(PATH)
>>> tree.docinfo.encoding
'UTF-8'

>>> result = etree.tostring(tree.getroot(), pretty_print=False)
>>> result
b'<span>  &#233;    and    &#8217;  </span>'

>>> import html
>>> html.unescape(result.decode('utf-8'))
'<span>  é    and    \u2019  </span>\n'
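
The numeric references in the bytes output suggest the file itself was parsed correctly and that only the serialization step targeted ASCII. The sketch below uses the standard library's xml.etree.ElementTree rather than lxml (a deliberate substitution so that it runs without third-party packages); lxml's etree.tostring also accepts encoding="unicode".

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<span>  \u00e9    and    \u2019  </span>")

# Default serialization produces ASCII bytes, so non-ASCII characters
# are written as numeric character references such as &#233;.
as_ascii = ET.tostring(root)

# Requesting "unicode" returns a str with the characters left intact,
# so no html.unescape() pass is needed afterwards.
as_text = ET.tostring(root, encoding="unicode")

print(as_ascii)
print(ascii(as_text))
```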

Erf, this is driving me a bit crazy. Any help would be appreciated.

Answer 1

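I think I figured it out. It seems that, even though BeautifulSoup and PyQuery both allow it, directly opening a file that contains certain special UTF-8 characters is a bad idea. What confused me most was the '’' symbol, which the Windows terminal does not seem to handle correctly. So the solution is to pre-process the file before parsing it: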

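# pq is assumed to be the usual PyQuery alias used in the question.
from pyquery import PyQuery as pq


def pre_process_html_content(html_content, encoding='utf-8'):
    """Decode bytes coming from a file or a request and normalize them."""
    if not isinstance(html_content, bytes):
        raise TypeError("html_content must be bytes, not " + type(html_content).__name__)

    # Decode explicitly; fall back to UTF-8 if the caller passes None,
    # since bytes.decode(None) would raise a TypeError.
    html_content = html_content.decode(encoding or 'utf-8')

    # Handle weird symbols here: replace the right single quotation
    # mark (U+2019) with a plain ASCII apostrophe.
    html_content = html_content.replace('\u2019', "'")

    return html_content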


def sanitize_html_file(path, encoding=None):
    with open(path, 'rb') as f:
        content = f.read()
    encoding = encoding or 'utf-8'

    return pre_process_html_content(content, encoding)


def open_pq(path, parser=None, encoding=None):
    """Macro for open HTML file with PyQuery."""
    content = sanitize_html_file(path, encoding)
    parser = parser or 'xml'

    return pq(content, parser=parser)


doc = open_pq(PATH)
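
The part of this approach that seems to matter is reading the file as bytes and decoding it with an explicit encoding instead of relying on the platform default. A stdlib-only sketch of that step, without PyQuery; read_text and the temporary file are hypothetical names introduced for illustration:

```python
import os
import tempfile

SAMPLE = "<span>  \u00e9    and    \u2019  </span>"


def read_text(path, encoding="utf-8"):
    """Read a file as bytes and decode it with an explicit encoding."""
    with open(path, "rb") as f:
        return f.read().decode(encoding)


# Write the sample as UTF-8 to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(SAMPLE.encode("utf-8"))
    path = tmp.name

try:
    good = read_text(path)                     # explicit UTF-8: correct text
    bad = read_text(path, encoding="cp1252")   # wrong codec: mojibake
finally:
    os.remove(path)

# Text garbled by a single cp1252 decode can be recovered by reversing it.
repaired = bad.encode("cp1252").decode("utf-8")
```

With the decode done explicitly, replacing '\u2019' with an ASCII apostrophe is only needed if the typographic apostrophe is unwanted in the output (for example, on a terminal that cannot display it).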

Python 3: encoding problems with the html and lxml parsers
