Question

I am trying to decode HTML pages that are NOT UTF-8 encoded.

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

Is there a library that can do this? I could not find one online.

P.S. Of course, I could extract the charset with goquery and decode the HTML page with iconv-go, but I am trying not to reinvent the wheel.

Answer 1

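Go officially provides extension packages for this: charset (golang.org/x/net/html/charset) and encoding (golang.org/x/text/encoding).

The code below makes sure the html package parses the document with the right encoding: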

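import (
    "bufio"
    "io"

    "golang.org/x/net/html"
    "golang.org/x/net/html/charset"
    "golang.org/x/text/encoding/htmlindex"
)

// detectContentCharset sniffs up to the first 1024 bytes of r without
// consuming them and returns the name of the detected charset.
func detectContentCharset(r *bufio.Reader) string {
    // Peek may return usable bytes together with an error (for example
    // io.EOF on a short document), so only an empty result is a failure.
    if data, _ := r.Peek(1024); len(data) > 0 {
        // The "certain" result is ignored: an uncertain guess is still a
        // better starting point than assuming UTF-8.
        _, name, _ := charset.DetermineEncoding(data, "")
        return name
    }
    return "utf-8"
}

// Decode parses the HTML body using the given encoding label and
// returns the HTML document. An empty label means "detect it".
func Decode(body io.Reader, label string) (*html.Node, error) {
    if label == "" {
        // Keep reading through the bufio.Reader so the sniffed bytes
        // are not lost.
        r := bufio.NewReader(body)
        label = detectContentCharset(r)
        body = r
    }
    e, err := htmlindex.Get(label)
    if err != nil {
        return nil, err
    }

    // UTF-8 input needs no decoding; everything else is converted first.
    if name, _ := htmlindex.Name(e); name != "utf-8" {
        body = e.NewDecoder().Reader(body)
    }

    return html.Parse(body)
}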
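
One detail worth checking in any sniff-then-parse approach like the one above: bytes that a bufio.Reader pulls in during Peek stay available only through that same bufio.Reader, not through the reader it wraps. Below is a minimal, standard-library-only sketch of that behaviour. The helper name sniffAndRead and the sample page are made up for illustration and are not part of any package mentioned here.

```go
package main

import (
    "bufio"
    "fmt"
    "io"
    "strings"
)

// sniffAndRead peeks at up to n leading bytes of input, then reads everything
// that is still available, either through the bufio.Reader that did the
// peeking (viaWrapper == true) or from the underlying reader.
func sniffAndRead(input string, n int, viaWrapper bool) (head, rest string, err error) {
    src := strings.NewReader(input)
    br := bufio.NewReader(src)

    // Peek can return fewer than n bytes together with an error such as
    // io.EOF when the input is short; the returned bytes are still usable.
    peeked, _ := br.Peek(n)
    head = string(peeked) // copy: the peeked slice is only valid until the next read

    var r io.Reader = src
    if viaWrapper {
        r = br
    }
    data, err := io.ReadAll(r)
    return head, string(data), err
}

func main() {
    const page = `<meta charset="gb2312"><p>hello</p>`
    for _, viaWrapper := range []bool{true, false} {
        head, rest, err := sniffAndRead(page, 1024, viaWrapper)
        if err != nil {
            panic(err)
        }
        fmt.Printf("viaWrapper=%v peeked=%d bytes, read afterwards=%d bytes\n",
            viaWrapper, len(head), len(rest))
    }
}
```

Reading through the wrapper returns the whole page, including the peeked part. Reading from the underlying reader returns nothing here, because the buffer already drained this small input. So whatever sniffs the charset should hand the same bufio.Reader on to the parser.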

Answer 2

goquery may meet your needs. For example:

import (
    "fmt"
    "log"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    d, err := goquery.NewDocument("http://www.google.com")
    if err != nil {
        log.Fatal(err)
    }
    dh := d.Find("head")
    dc := dh.Find("meta[http-equiv]")
    c, ok := dc.Attr("content") // content value (holds the charset) and whether it exists
    fmt.Println(c, ok)
    // ...
}

More operations are available on the Document struct.
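
The content attribute holds a full media type such as "text/html; charset=gb2312", so the charset still has to be pulled out of it. A minimal sketch using only the standard library's mime.ParseMediaType follows. The helper name charsetFromContent is made up for illustration, and the sketch assumes the page declares its encoding through http-equiv="Content-Type" rather than the shorter <meta charset="..."> form.

```go
package main

import (
    "fmt"
    "mime"
)

// charsetFromContent extracts the charset parameter from the value of a
// <meta http-equiv="Content-Type"> content attribute. It returns "" when
// the value cannot be parsed or carries no charset parameter.
func charsetFromContent(content string) string {
    _, params, err := mime.ParseMediaType(content)
    if err != nil {
        return ""
    }
    return params["charset"]
}

func main() {
    fmt.Println(charsetFromContent("text/html; charset=gb2312"))
}
```

The returned label can then be passed to whatever decoder you use. An empty result means the encoding has to be determined some other way.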
