Question

I'm using urllib2.urlopen to fetch a URL and read header information such as 'charset' and 'content-length'.

But some pages set their charset with something like

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

urllib2 doesn't parse this for me.

Is there any built-in tool I can use to get the http-equiv information?

EDIT:

This is what I did to parse the charset from a page:

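import lxml.html

def get_meta_charset(page_source):
    # Return the charset declared in a <meta http-equiv="Content-Type"> tag, or None.
    elem = lxml.html.fromstring(page_source)
    # translate() lower-cases the attribute value so the match is case-insensitive
    content_type = elem.xpath(
        ".//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")
    if content_type:
        content_type = content_type[0]
        for frag in content_type.split(';'):
            frag = frag.strip().lower()
            i = frag.find('charset=')
            if i > -1:
                return frag[i+8:] # 8 == len('charset=')
    return None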

How can I improve this? Can I precompile the XPath query?

Answer 1

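Yes! Any HTML parsing library will help.

BeautifulSoup is a pure-Python library based on sgmllib; lxml is a more efficient alternative Python library written in C.

Try either of them. They will solve your problem.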
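
If a third-party dependency is not an option, the standard library can also pull out the http-equiv declarations. The following is a minimal sketch, assuming Python 3 (where the parser lives in `html.parser`); the class name `HttpEquivParser` and the helper `get_http_equiv` are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class HttpEquivParser(HTMLParser):
    """Collect <meta http-equiv="..." content="..."> pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.http_equiv = {}  # lower-cased http-equiv name -> content value

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names.
        if tag != "meta":
            return
        attributes = dict(attrs)
        equiv = attributes.get("http-equiv")
        if equiv:
            # Keep the first declaration if a name is repeated.
            self.http_equiv.setdefault(equiv.lower(), attributes.get("content"))


def get_http_equiv(page_source):
    """Return a dict of the http-equiv declarations found in page_source (a str)."""
    parser = HttpEquivParser()
    parser.feed(page_source)
    parser.close()
    return parser.http_equiv


page = ('<html><head><meta HTTP-EQUIV="Content-Type" '
        'content="text/html; charset=utf-8" /></head><body></body></html>')
print(get_http_equiv(page))
```

This parser is far less forgiving of broken markup than BeautifulSoup or lxml, so it is probably only suitable for simple pages.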

Answer 2

I needed to parse this out (among other things) for my online http fetcher. I use lxml to parse the pages and get the meta equiv headers, roughly as follows:

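    from lxml.html import parse

    doc = parse(url)  # url: address of the page to parse
    nodes = doc.findall(".//meta")
    for node in nodes:
        name = node.attrib.get('name')
        node_id = node.attrib.get('id')
        equiv = node.attrib.get('http-equiv')
        # most meta tags have no http-equiv attribute, so guard against None
        if equiv and equiv.lower() == 'content-type':
            pass  # ... do your thing ...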

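You could do a much fancier query to get the appropriate tag directly (by specifying the name= in the query), but in my case I'm parsing all the meta tags. I'll leave that as an exercise for you; see the relevant lxml documentation.

BeautifulSoup is considered somewhat deprecated and is no longer actively developed.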
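
On the question of precompiling: lxml provides `etree.XPath`, which compiles an expression once into a reusable callable. The sketch below is one possible way to combine that with the case-insensitive query from the question; it assumes lxml is installed, and the names `FIND_CONTENT_TYPE` and `meta_charset` are invented for this example.

```python
import lxml.html
from lxml import etree

# Compiled once at import time and reused for every page.
FIND_CONTENT_TYPE = etree.XPath(
    "//meta[translate(@http-equiv, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', "
    "'abcdefghijklmnopqrstuvwxyz')='content-type']/@content")


def meta_charset(page_source):
    """Return the charset from a Content-Type meta tag, or None if absent."""
    root = lxml.html.fromstring(page_source)
    for content in FIND_CONTENT_TYPE(root):
        # content looks like "text/html; charset=utf-8"
        for frag in content.split(';'):
            frag = frag.strip().lower()
            if frag.startswith('charset='):
                return frag[len('charset='):]
    return None
```

Whether compiling ahead of time makes a measurable difference depends on how many pages are processed; for a single page the gain is likely negligible.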

Answer 3

Find 'http-equiv' using BeautifulSoup:

import urllib2
from BeautifulSoup import BeautifulSoup

f  = urllib2.urlopen("http://example.com")
soup = BeautifulSoup(f) # trust BeautifulSoup to parse the encoding
for meta in soup.findAll('meta', attrs={
    'http-equiv': lambda x: x and x.lower() == 'content-type'}):
    print("content-type: %r" % meta['content'])
    break
else:
    print('no content-type found')

#NOTE: strings in the soup are Unicode, but we can ask about charset
#      declared in the html 
print("encoding: %s" % (soup.declaredHTMLEncoding,))

Answer 4

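Building your own HTML parser is much harder than you think, and like the earlier answers I suggest using a library for it. But instead of BeautifulSoup and lxml, I recommend html5lib. It is the parser that best mimics how a browser parses a page, for example with regard to encoding:

The parse tree is always Unicode. However, a wide variety of input encodings are supported. The encoding of the document is determined in the following way:

The encoding may be explicitly specified by passing the name of the encoding as the encoding parameter to HTMLParser.parse

If no encoding is specified, the parser will attempt to detect the encoding from a <meta> element in the first 512 bytes of the document (this is only a partial implementation of the current HTML 5 specification)

If no encoding can be found and the chardet library is available, an attempt will be made to sniff the encoding from the byte pattern

If all else fails, the default encoding will be used (usually Windows-1252)

From: http://code.google.com/p/html5lib/wiki/UserDocumentation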
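
To make the order of those four steps concrete, here is a simplified sketch in plain Python 3. It is not html5lib's actual implementation: the function `detect_encoding`, its parameters, and the regular expression are assumptions made for illustration, and the pattern is much cruder than the HTML 5 prescan algorithm.

```python
import re

# Simplified pattern: finds charset=... inside the bytes of a <meta> tag.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([a-zA-Z0-9_.:-]+)", re.IGNORECASE)


def detect_encoding(data, explicit=None, sniffer=None, default="windows-1252"):
    """Pick an encoding for raw HTML bytes, following the four quoted steps."""
    # 1. An explicitly specified encoding always wins.
    if explicit:
        return explicit
    # 2. Look for a <meta> charset declaration in the first 512 bytes only.
    match = META_CHARSET.search(data[:512])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Optionally sniff the byte pattern (sniffer is a caller-supplied function).
    if sniffer is not None:
        guess = sniffer(data)
        if guess:
            return guess
    # 4. Fall back to the default encoding.
    return default


print(detect_encoding(b'<html><head><meta charset="UTF-8"></head></html>'))
print(detect_encoding(b'<html><body>no declaration</body></html>'))
```

If chardet is installed, something like `sniffer=lambda data: chardet.detect(data)["encoding"]` could be passed for step 3; keeping the sniffer as a parameter means the sketch still runs without that library.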

How do I get `http-equiv` in Python?
